Combining purposive sampling and k-means algorithm for qualitative research sampling

Although many authors have applied, and continue to apply, mixed methods in their investigations, quantitative and qualitative approaches have long been seen as two distinct and separate ways of conducting research. In recent decades, the gap between them has widened with the arrival of the Big Data era, which amounts to a vast availability of data together with the computational power to process it (Grimmer et al., 2021). One consequence is the dehumanization and mythification of quantitative machine learning techniques (Ziewitz, 2016), which makes it difficult to explore calmly the potential that arises from combining qualitative techniques with machine learning. In this context, this article adopts the position of Glaser and Strauss (1967, pp.17-18), who argue that there is no fundamental clash between the purposes and capacities of qualitative and quantitative methods and that each form of data is useful for both the verification and the generation of theory. In other words, by combining “big data” with “thick data” (Wang, 2016), investigations can gain rich and nuanced insights and reach more robust conclusions. This mixed methods approach holds significant promise for enhancing the rigor and efficiency of research across various fields.

UX and mixed methods

User Experience (UX) research is precisely a discipline that is firmly committed to mixing qualitative and quantitative methods. To understand users’ needs, behaviors, and attitudes towards products or services, UX research uses surveys and a wide range of quantitative experiments to gather large amounts of data on user practices and preferences. At the same time, qualitative methods such as interviews or user testing provide rich insights into users’ experiences, motivations, and attitudes. The UX field recognizes that a mixed methods approach is necessary to truly understand users’ experiences and perceptions and, consequently, to design products and services that meet their needs.

For this reason, I consider UX research a particularly appropriate field in which to propose a rigorous, transparent, and functional combination of qualitative and quantitative techniques. This two-phase article focuses specifically on combining qualitative purposive sampling (Patton, 2002; Suri, 2011) with the k-means algorithm (De Soete and Carroll, 1994; Jain, 2010; Tomar, 2022) in the user recruitment phase. Sampling and recruitment are crucial moments in research, as they determine the validity of the collected data. Because the goal is to uncover user needs, expectations, and behaviors, the selection of participants can significantly affect the insights gained from the research. Ultimately, the challenge at this stage is to achieve a sample that is diverse enough to account for individual differences and for any potential biases.

Synergies between purposive sampling and k-means

First, I will describe the two fundamental pillars on which this article is based: purposive sampling and k-means.

Purposive sampling

Purposive sampling refers to the intentional selection of participants or cases based on their relevance to the research questions or preliminary hypothesis. In purposive sampling, as in qualitative methods in general, the goal is not to obtain a representative sample of a population, but rather to select informants who can provide rich and diverse data that will be used to generate grounded and significant insights (Charmaz, 2006: 14). In other words, the selection of informants should respond to theoretical purpose and relevance for furthering the development of the research goals (Glaser and Strauss, 1967, p.48). On this point, Small (2009) reflects on a common but complex concern in qualitative investigation in his article aptly titled “How many cases do I need?”. The question of what sample size qualitative research needs is asked by individual researchers more often than it is discussed in the literature (Roland, 2016; Gill, 2020). Along the same lines, few authors make explicit that the final number of users recruited largely depends on exogenous variables such as available funding, timelines, the desired degree of depth, and other contextual factors.

I agree with Small (2009, p.10) when he states that, on the matter of sample size, the frameworks of qualitative methods often wrongly imitate the language of classical statistics. On this basis, I reject the existence of an objective and general formula or criterion that can prescribe a specific number of informants for a set of interviews, focus groups, and so on. Hence, it is the qualitative researcher’s responsibility to balance the scope of the investigation, the available resources, and the appropriate methodological technique in order to determine a sample size.

Returning to Glaser and Strauss (1967, p.60), the researcher should answer some fundamental questions about the target population in order to determine the sample size and, ultimately, achieve theoretical relevance. These questions include: What are the sociological subgroups that compose our target population? How many subgroups should the research focus on? And to what degree of depth should each subgroup be studied?

The answers to all of these questions are closely related to the concept of theoretical saturation. In the qualitative field, theoretical saturation is generally understood as the moment when no new issues or insights emerge as new cases are added and data are analyzed. Authors like Roland (2016, p.1) and Low (2019, p.1) show how problematic this definition is. They both state that it provides no practical guidance on how a researcher can determine such a point. This is the reason why many researchers use theoretical saturation as a justificatory tool, that is to say, they simply proclaim that they have reached it without demonstrating how they achieved it (Charmaz, 2006, p.114).

I do not intend to reflect exhaustively on how to achieve theoretical saturation in qualitative research, but rather to argue that an analytically grounded recruitment is the first indispensable step towards it. For that exercise, as previously explained, it is crucial to identify the main sociological variables that cut across our target population and, therefore, to have a solid understanding of which subgroups compose it. Building on this foundation, the investigator will be able to formulate robust hypotheses and research questions about each subgroup. This knowledge can also shape the direction of the research, for example when deciding to adapt specific methodologies to each subgroup. Thus, taking it one step further, we should use the concept of stratified purposive sampling (Patton, 2002, p.240; Suri, 2011, p.70) rather than just purposive sampling. Drawing inspiration from these two authors, I define stratified purposive sampling as a recruitment strategy for examining variations in the manifestation of a phenomenon, specifically when the stratified target population is made up of samples, or subgroups, that are internally homogeneous and mutually distinguishable.

Bringing these ideas together, in the second phase of this article I will propose a data-driven, step-by-step methodological guide for carrying out an informed participant selection based on the stratified purposive sampling technique. This procedure will guide the investigator in deciding the sample size and attributes, which is a key factor in achieving theoretical saturation and obtaining relevant answers to the research questions. The k-means algorithm is the tool I will use to explore in depth the subgroups within our target population.

K-means algorithm

K-means clustering is an unsupervised learning algorithm whose function is to uncover inherent structures, or clusters, within the data and to identify patterns that may not be immediately apparent. Specifically, k-means is a distance-based algorithm that aims to group similar data points together based on their attributes (Tomar, 2022). The procedure starts by randomly initializing k centroids, which act as representative points for each cluster. It then iteratively assigns each data point to the nearest centroid and updates each centroid to the mean of the points assigned to it. This process continues until the centroids no longer change significantly, indicating that the clusters have stabilized. The result is a partition of the data into k distinct groups, in which each data point is closer to the centroid of its own group than to the centroid of any other group.
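
As a minimal sketch of the iterative procedure just described, and not the implementation used in the guide, the assignment and update steps can be written with NumPy alone. The function name kmeans, its parameters (max_iter, tol, seed), and the toy data are assumptions introduced here purely for illustration.

```python
import numpy as np


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Illustrative k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialise the k centroids as k distinct, randomly chosen data points.
    chosen = rng.choice(len(points), size=k, replace=False)
    centroids = points[chosen].astype(float)
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid.
        diffs = points[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids


# Hypothetical toy data: six users described by two numeric attributes,
# forming two visibly separated groups.
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(toy, k=2)
```

On data this clearly separated, the first three users should end up in one cluster and the last three in the other, although which cluster receives the label 0 or 1 depends on the random initialization.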

The mathematical measure on which this general definition relies is the Euclidean distance (Singh et al., 2013: 14), which quantifies the (dis)similarity between data points, or cases. That is to say, it measures the straight-line distance between two points in a multi-dimensional space. In k-means clustering, the Euclidean distance is used to measure how far each data point is from the cluster centroids. The closer the points are to their respective centroids, the more similar they are considered to be. By minimizing the total sum of squared Euclidean distances within each cluster, k-means seeks the best cluster assignments and centroid positions, although the solution it converges to depends on the initial centroids and may be only locally optimal. Among all clustering algorithms, k-means stands out as one of the most prominent due to its ease of implementation, simplicity, efficiency, and empirical success (Jain, 2010). The guide will implement k-means in Python. As Raschka et al. (2020: 2) explain, in recent decades the Python language has seen tremendous growth in popularity within the scientific community as a means of enabling machine learning research and application development. The libraries pandas, scikit-learn, and matplotlib (through its pyplot module) are well-established tools that will allow us to prepare and process the data, implement the algorithm, and plot graphical results. Accordingly, the stratified purposive sampling guide will be based on those libraries.
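
To illustrate the quantity that k-means minimizes, the following sketch fits scikit-learn's KMeans to a small pandas DataFrame and recomputes the within-cluster sum of squared Euclidean distances by hand. The column names (age, weekly_sessions) and all values are invented for illustration, and the choices of n_clusters, n_init, and random_state are assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical survey data: column names and values are invented for illustration.
users = pd.DataFrame({
    "age": [22, 25, 24, 51, 55, 49],
    "weekly_sessions": [14, 12, 15, 3, 2, 4],
})

# Fit k-means with k=2; n_init and random_state are fixed for reproducibility.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
user_labels = model.labels_

# Squared Euclidean distance of every user to the centroid of their own cluster.
own_centroids = model.cluster_centers_[user_labels]
squared_distances = ((users.to_numpy() - own_centroids) ** 2).sum(axis=1)

# Summing these distances gives the objective that scikit-learn exposes as `inertia_`.
within_cluster_ss = squared_distances.sum()
```

Because the Euclidean distance is sensitive to the scale of each variable, attributes measured in very different units are commonly standardized before clustering; that step is omitted here to keep the sketch short.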

To bring these arguments together and restate the main goal, our k-means guide should allow the researcher to develop an initial sense of the possible underlying patterns in the data, as well as to generate hypotheses, detect anomalies, and identify salient features (Muller, 2016, p.5). All these factors are crucial when determining the sample size and characteristics. From a broader point of view, the combination of digital humanities and computer science holds great potential. In other words, and following Nelson (2017, p.1), the dialogue between expert human interpretation and the computational power to detect patterns in data can produce methodological approaches that are distinctly rigorous and adaptable to each particular investigation.