October 23, 2025
Artificial intelligence (AI) holds great promise for transforming healthcare. However, despite significant advances, the integration of AI solutions into real-world clinical practice remains limited. A major barrier is the quality and fairness of training data, which is often compromised by biased data collection practices. This paper draws on insights from the AI4HealthyAging project, part of Spain’s national R&D initiative, where our task was to detect biases during clinical data collection. We identify several types of bias across multiple use cases, including historical, representation, and measurement biases. These biases manifest in variables such as sex, gender, age, habitat, socioeconomic status, equipment, and labeling. We conclude with practical recommendations for improving the fairness and robustness of clinical problem design and data collection. We hope that our findings and experience contribute to guiding future projects in the development of fairer AI systems in healthcare.
The application of artificial intelligence (AI) in healthcare has grown rapidly in recent years, offering new possibilities for medical tasks such as diagnosis and clinical decision-making [1]–[3]. Among the most transformative developments is the adoption of machine learning (ML) and deep learning (DL), with the recent emergence of generative AI marking a significant new frontier. However, only a small fraction of these AI solutions are ultimately integrated into real-world healthcare systems [4]. This limited adoption can be attributed to different factors, including a mismatch with practical clinical needs or non-compliance with the European Union’s AI Act [5], which emphasizes transparency, safety, and accountability in AI systems.
As a result, many hospitals are increasingly investing in the development of smaller, in-house models tailored to their specific clinical needs. However, training such models remains a significant challenge. Medical data is highly sensitive and typically subject to strict GDPR regulations [6], which limit access and use. Moreover, collecting data is particularly hard in terms of the time required for protocols to pass ethical committee review, the constraints of inclusion/exclusion criteria, and the difficulty of recruiting enough participants willing to take part in the study. Even so, the amount of collected data is often not enough to train AI models. The situation is no better with publicly available datasets, which often lack essential metadata, undermining both the generalizability and clinical relevance of trained models [7].
Consequently, the successful implementation of in-house AI solutions often depends not on model architecture or performance metrics, but on a more fundamental, and frequently overlooked, factor: how data collection for model training is defined and planned. The success, reliability, and safety of AI systems in healthcare are deeply rooted in the quality, structure, and contextual appropriateness of the data on which they are built.
This paper draws on lessons learned from AI4HealthyAging, a national AI research project under Spain’s 2021 “R&D Missions in Artificial Intelligence” Program, focused on developing AI solutions for age-related diseases. The project addressed a range of clinical use cases, including cardiovascular conditions, sarcopenia, sleep disorders, Parkinson’s disease, mental health, colorectal and prostate cancer, and hearing loss. Our work centered on identifying bias in data through stakeholder interviews and limited metadata analysis. Based on these insights, we present a set of recommendations to guide data planning and collection in clinical AI projects, aiming to improve fairness, quality, and reliability.
Before presenting the specific biases identified in our study, we begin in Section 2 by examining how bias is defined in the existing literature. Section 3 explores how bias has been categorized in previous work. In Section 4, we present the biases we identified, organized according to these categories. Finally, Section 5 offers a set of recommendations: measures that could have mitigated the biases observed and that we hope will support future clinical data collection efforts in AI development.
Despite its widespread use, the term bias lacks a standardized definition in the AI field. Different subfields and applications interpret and operationalize bias in different ways depending on their specific context [8]–[12]. However, definitions across the literature tend to converge around three core elements: the source of the bias (e.g., algorithms, systems, or errors), how the bias manifests (e.g., through discrimination, unequal impacts, or prediction errors), and the entities affected by the bias (e.g., individuals, patients, or underrepresented groups). Table 1 presents a selection of definitions from the literature, spanning general AI to healthcare-specific contexts, organized according to these three analytical dimensions.
These definitions of bias imply a normative judgment: that the outcomes it produces are undesirable, unjust, or detrimental to certain individuals or groups. This aligns closely with the concept of health equity, which is defined as the absence of systematic disparities in health (or in the major social determinants of health) between groups with different levels of underlying social advantage/disadvantage, that is, wealth, power, or prestige [13]. Understanding bias in AI as a normative issue highlights that biased outcomes are more than technical errors; they can reinforce social inequities, particularly in healthcare. Addressing bias is therefore necessary to achieve health equity by preventing unfair disparities.
| Context | Term | Definition |
|---|---|---|
| General | Algorithmic bias | It occurs when the outputs of an algorithm benefit or disadvantage certain individuals or groups more than others without a justified reason for such unequal impacts [8]. |
| General | Bias | It refers to systematic and unfair favoritism or prejudice in AI systems, which can lead to discriminatory outcomes [9]. |
| Health | Algorithmic bias | The instances when the application of an algorithm compounds existing inequities in socioeconomic status, race, ethnic background, religion, gender, disability or sexual orientation to amplify them and adversely impact inequities in health systems [11]. |
| Health | Bias | It refers to systematic errors leading to a distance between prediction and truth, to the potential detriment of all or some patients [10]. |
Many studies in the literature have proposed different categorizations of the sources of bias in AI systems [10], [14]–[22]. While terminology may vary across articles, most reflect a general consensus around three stages in the AI development pipeline where biases can originate: data, model development, and system implementation. Some works also propose additional stages that are crucial for identifying possible biases: (i) an initial stage to formulate the research problem, where the purpose, requirements, and impact need to be evaluated [10], [19]–[22], and (ii) a final phase of monitoring after deployment, which should be maintained as long as the AI system is in use [19], [21]. Table 2 illustrates this alignment by mapping the terminology used in nine different papers to these five stages.
| Problem Formulation | Data | Model Development | Implementation | Monitoring | Source |
|---|---|---|---|---|---|
| Design | Data | Modeling | Deployment | - | [10] |
| - | Data Generation | Model Building | Implementation | - | [14] |
| - | Data | Algorithm | User Interaction | - | [15] |
| - | Training Data / Publication | Model Development & Evaluation | Model Implementation | - | [16] |
| - | Pre-processing | In-processing | Post-processing | - | [17] |
| Conception & Design | Development | Validation | Access | Monitoring | [19] |
| Formulating Research Problem | Data Collection & Pre-processing | Model Development & Validation | Model Implementation | - | [20] |
| Conception | Data Collection & Pre-processing | In-processing | Post-processing | Post-deployment Surveillance | [21] |
| Problem scope | Data used | Model building | Decisions supported by analytical tool | - | [22] |
Although there is broad agreement on the stages at which biases are introduced, the types of biases identified within these stages vary in both nomenclature and granularity across sources. Some categories are broad, such as selection bias [17], while others are more fine-grained, like validity of the research question [10]. Certain labels serve as umbrella terms, for example representation bias [14], under which more specific biases fall. For instance, demographic bias [10] can be considered a subcategory of representation bias. Furthermore, there is often overlap between categories, as certain biases span multiple dimensions. For example, institutional bias [10] can be understood as a combination of historical bias [14], which reflects systemic inequalities, and aggregation bias [10], where institutional practices rely on generalized data that may overlook the specific needs of marginalized groups.
In this section, we highlight several biases identified in our work that may affect the performance and generalizability of AI models. These biases arise from the first two stages detailed in the previous section: problem design and data collection. To make these issues more tangible, we present concrete examples illustrating how such biases can be inadvertently introduced into training data, potentially compromising model fairness and validity.
For clarity, we categorize these biases according to the three sources of harm in data generation proposed by Suresh and Guttag [14]: historical bias, representation bias, and measurement bias (see Table 3). As noted earlier, these categories are not mutually exclusive. For example, we classify gender bias as historical bias due to its roots in societal norms and systemic inequalities. However, it could also be considered a form of measurement bias if gender is inferred using subjective scoring methods, as the methodology itself can introduce additional bias.
Problem design and data collection:

| Historical | Representation | Measurement |
|---|---|---|
| Sex | Age | Equipment |
| Gender | Habitat | Labeling |
| | Socioeconomic | |
In the Parkinson’s study, the distribution of participants by age group and sex was generally balanced, except in the 40–49 and 80–89 age groups. The notably lower representation of females in the 80–89 group may be due to higher female mortality rates, which make recruiting females in this age range more difficult. These sex-based differences in participant distribution reflect important biological and disease-related factors. For example, research by Cerri et al. [23] shows that males have about twice the risk of developing Parkinson’s disease compared to females, yet females tend to experience faster disease progression and higher mortality. Such sex differences in disease risk, progression, and survival underscore the importance of carefully considering sex as a key variable to avoid bias in data collection and analysis, ensuring predictive models accurately capture these nuances.
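To make this kind of balance check concrete, the following Python sketch (illustrative only, not the project’s actual pipeline) cross-tabulates sex against age decade in a toy recruitment table and flags strata where either sex falls below a chosen share; the `age` and `sex` column names, the example values, and the 30% threshold are all assumptions for the example.

```python
import pandas as pd

# Hypothetical recruitment log; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [45, 52, 55, 63, 67, 71, 74, 78, 81, 83, 85, 87],
    "sex": ["F", "M", "F", "F", "M", "F", "M", "F", "M", "M", "M", "F"],
})

# Bin ages into decades (40-49 ... 80-89), as in the study's age groups.
df["age_group"] = pd.cut(
    df["age"],
    bins=range(40, 100, 10),
    right=False,
    labels=[f"{b}-{b + 9}" for b in range(40, 90, 10)],
)

# Sex counts and within-stratum shares for each age decade.
counts = pd.crosstab(df["age_group"], df["sex"])
shares = counts.div(counts.sum(axis=1), axis=0)

# Flag strata where either sex falls below an illustrative 30% threshold.
flagged = shares[(shares < 0.30).any(axis=1)]
print(counts)
print("Imbalanced age groups:\n", flagged)
```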
Although gender information was not directly available in the data we analyzed, a Gender Score could be derived as in [24]. Gender bias is particularly important to consider in healthcare contexts. For instance, Samulowitz et al. [25] demonstrated that gender norms influence pain treatment: women with pain received less effective relief, fewer opioid prescriptions, more antidepressants, and more mental health referrals compared to men. Neglecting gender data and its proper analysis can exacerbate existing inequalities and lead to biased health outcomes.
Because the project focuses on age-related conditions, the control group, composed of participants without the disease, tends to be younger on average, while the disease groups include older individuals. This difference arises because these diseases primarily affect older adults, making it easier to recruit younger healthy controls but harder to find older participants without the condition. Another example of age bias is found in the Parkinson’s study, where the majority of subjects were between 60 and 79 years old. This aligns well with the known prevalence of Parkinson’s, which affects approximately 3% of people at age 65 and up to 5% of those over 85 [26]. Additionally, the median age increased with disease severity: 64 years for severity 1, 71.5 years for severity 2, and 75 years for severity 3, with no participants younger than 60 in the most severe category. This further reflects the strong association between age and disease progression. However, such uneven age distributions can introduce age bias in AI models trained on these data: models may learn to associate age-related features with disease presence or severity rather than true disease-specific markers.
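As a minimal sketch of how such an age gap can be surfaced before training, the snippet below summarizes age by cohort in a hypothetical table; the `group` and `age` columns, the toy values, and the stratified-reporting suggestion in the comments are assumptions rather than the project’s actual protocol.

```python
import pandas as pd

# Hypothetical cohort table; group labels, ages, and sizes are illustrative only.
df = pd.DataFrame({
    "group": ["control"] * 4 + ["severity_1"] * 3 + ["severity_2"] * 2 + ["severity_3"] * 2,
    "age":   [58, 61, 63, 66, 62, 64, 68, 70, 73, 74, 76],
})

# Summarize the age distribution of each cohort before any model is trained.
summary = df.groupby("group")["age"].agg(["count", "median", "min", "max"])
print(summary)

# A large median-age gap between controls and the most severe group suggests a
# model could learn age as a proxy for disease status; one simple mitigation is
# to report performance within age strata rather than only on the pooled sample.
gap = summary.loc["severity_3", "median"] - summary.loc["control", "median"]
print(f"Median age gap (severity 3 vs. control): {gap:.1f} years")
```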
This bias arises when geographic or environmental context affects participant representation. In this project, most participants came from urban areas, largely because the hospitals conducting the studies were located in urban settings. Travel distance and accessibility can be significant barriers for individuals living in rural areas, making it less likely for them to participate or remain involved in long-term studies. Even within urban areas, hospitals usually concentrate patients from specific districts of the city, shaped by socioeconomic factors or environmental exposures that affect their quality of life [27], which leads to the following type of bias.
It occurs when participants’ social and economic factors, such as income, education, occupation, or access to healthcare, influence who is included in a study. For example, in one of the studies, data was collected from a private hospital. Because private hospitals typically serve patients with higher income levels or better insurance coverage, this creates a socioeconomic bias by primarily including individuals from wealthier backgrounds.
In the hearing loss study, control group participants tended to have higher education levels than other groups. Higher education levels often correlate with quieter work environments, while lower education levels may correspond to noisier jobs (e.g., factory work). Ignoring these factors could lead to misleading conclusions about the causes of hearing loss.
It occurs when variations in the devices used for data collection, such as different models, calibration settings, or software, affect measurement consistency. This can lead to results that are not comparable across participants or sites. For example, in the hearing loss study, most participants had cochlear implants from the same manufacturer. As a result, findings on quality of life and cognitive improvement may not generalize to users of other implant types, potentially biasing the model toward the characteristics of one specific device.
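As an illustration only, the sketch below computes the share of each device type in a hypothetical cohort and warns when one manufacturer dominates; the column name, vendor labels, and 80% cut-off are invented for the example.

```python
import pandas as pd

# Hypothetical device metadata; vendor names and counts are illustrative only.
devices = pd.Series(["VendorA"] * 7 + ["VendorB"], name="implant_manufacturer")

# Share of each device type in the training cohort.
shares = devices.value_counts(normalize=True)
print(shares)

# Warn when a single manufacturer dominates the sample (illustrative 80% cut-off),
# since findings may not transfer to users of other implant types.
if shares.iloc[0] > 0.8:
    print(f"Warning: {shares.index[0]} accounts for {shares.iloc[0]:.0%} of participants.")
```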
This bias occurs when data labels, such as diagnoses or classifications, are influenced by human judgment or local context. It can arise when different people use inconsistent criteria, or even when the same team labels all the data but follows specific institutional practices. For example, labels from one hospital with its unique diagnostic style may not generalize well elsewhere, reducing model accuracy and fairness in real-world settings.
An example of labeling bias was found in the hearing loss study, specifically in the classification of participants’ occupations. Initially, the dataset used standardized occupational categories that did not include a significant group in society: homemakers. After this category was added, it was revealed that 35% of women fell into this group, with no male representation. This initial omission and subsequent reclassification demonstrate how labeling categories influenced by human decisions can misrepresent certain groups.
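A simple audit of this kind can be expressed in a few lines of pandas, as sketched below with hypothetical `sex` and `occupation_recoded` columns and made-up labels; the point is only that a sex-conditioned breakdown of the recoded categories makes the previously hidden subgroup visible.

```python
import pandas as pd

# Hypothetical table after recoding occupations; labels and values are invented.
df = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "F", "M", "F"],
    "occupation_recoded": ["homemaker", "clerical", "homemaker", "factory",
                           "clerical", "homemaker", "factory", "clerical"],
})

# Share of each recoded category within each sex: a category concentrated in a
# single sex (here "homemaker") signals that the original label set was hiding
# a structured subgroup of the population.
table = pd.crosstab(df["occupation_recoded"], df["sex"], normalize="columns")
print(table.round(2))
```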
It occurs when two or more demographic variables interact in a way that affects the fairness or validity of a model. In the Alzheimer’s study, a potential intersectional bias involving age and sex was observed across three diagnostic groups: control, mild cognitive impairment (MCI), and Alzheimer’s. On average, females were younger than males across all groups: the age difference was two years in both the control and Alzheimer’s groups, and four years in the MCI group. If the interaction between age and sex is not properly controlled, models may misattribute these normative age-related sex differences to disease-specific changes, compromising the validity and fairness of diagnostic predictions.
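One way to check for this entanglement, sketched below under the assumption of a simple table with `diagnosis`, `sex`, and `age` columns (names and values are illustrative), is to tabulate mean age by diagnosis and sex and inspect the within-group offset.

```python
import pandas as pd

# Hypothetical diagnostic table; diagnoses, sexes, and ages are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["control"] * 4 + ["MCI"] * 4 + ["alzheimer"] * 4,
    "sex":       ["F", "M", "F", "M"] * 3,
    "age":       [68, 70, 69, 71, 70, 74, 71, 75, 74, 76, 75, 77],
})

# Mean age broken down by diagnosis and sex; a systematic offset between the
# sexes within every group suggests that age and sex are entangled in the cohort
# and should be controlled for (e.g., by age-matching or covariate adjustment).
pivot = df.pivot_table(values="age", index="diagnosis", columns="sex", aggfunc="mean")
pivot["F_minus_M"] = pivot["F"] - pivot["M"]
print(pivot)
```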
In this section, we present recommendations for mitigating bias in medical data collection. We organize them using the same three categories as before: historical, representation, and measurement biases. These categories are not strictly separated; some recommendations may address multiple types of bias.
Involve a diverse, interdisciplinary group in planning the experiment. Collection design may be influenced by the implicit biases of those responsible for data collection. As research has consistently shown, healthcare providers often exhibit biases toward historically excluded groups [28]–[30], and these biases are likely to persist in the absence of curricula specifically focused on minority health [31]. Furthermore, stakeholders hold divergent views on the nature, significance, and mitigation of bias in healthcare AI [32]. As such, assembling a diverse and interdisciplinary team is important to incorporate multiple perspectives, minimize bias, and ensure that data collection strategies are equitable and inclusive.
Ensure that data is collected in an aggregated or disaggregated manner when appropriate. As Cirillo et al. [33] explain, bias can be desirable or undesirable. Including sex and gender, for example, may improve prediction accuracy in cardiovascular diseases [34], but may also reinforce harmful assumptions, such as higher reported depression rates among women [35]. It is therefore essential to review existing literature and carefully plan what metadata to collect and use, to avoid unintended harm.
Define clear and balanced inclusion and exclusion criteria. Criteria should be specific but not overly restrictive, to maintain sample diversity and enable the formation of appropriate control groups. In this project, the study population was older, which made it challenging to find age-matched control groups for those with the comorbidity. As a result, the control groups had a lower mean age, potentially introducing spurious correlations and biasing the results.
Analyse the need to include an intersectional benchmark to better represent the targeted population [36]. This will help refine the evaluation metrics and better understand the health condition.
Ensure the sample size is feasible and sustainable. This involves assessing recruitment and retention potential within time and resource constraints. Engage experts with experience in similar studies to identify potential challenges, such as high dropout rates or participant burden. For instance, in this study, some protocols had to be shortened, as their extended duration was too demanding for participants.
Evaluate the data labeling process. Review how data has been labeled to ensure that categories are clear and consistent. Well-defined labeling reduces ambiguity and improves data quality, an essential step toward alignment with the FAIR principles [37]. Depending on the type of data, labeling should be approached differently: for example, socioeconomic variables may benefit from input by an interdisciplinary team, while clinical data should not be labeled by a single professional alone, in order to minimize personal bias.
Consider equipment and deployment context. It is important to account for the equipment used during data collection and where the model will ultimately be deployed. If both data collection and deployment occur within the same hospital using the same equipment, consistency is maintained. However, if a model is trained on data from one type of equipment and then applied to data from another, equipment-related bias may arise. This can compromise the model’s performance and limit its generalizability.
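A minimal sketch of such a consistency check, assuming hypothetical device labels for the training cohort and a deployment site (vendor names, counts, and thresholds are invented), is shown below.

```python
import pandas as pd

# Hypothetical device labels for the training cohort and a deployment site.
train = pd.Series(["ScannerA"] * 90 + ["ScannerB"] * 10, name="device")
deploy = pd.Series(["ScannerB"] * 75 + ["ScannerC"] * 25, name="device")

# Compare the device mix seen during training with the mix at the deployment site.
comparison = pd.concat(
    [train.value_counts(normalize=True), deploy.value_counts(normalize=True)],
    axis=1,
    keys=["train", "deploy"],
).fillna(0.0)
print(comparison)

# Devices that are common at deployment but rare (or absent) in training signal
# a risk of equipment-related bias and degraded generalization.
at_risk = comparison[(comparison["deploy"] > 0.2) & (comparison["train"] < 0.05)]
print("Devices at risk:\n", at_risk)
```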
To successfully incorporate AI systems into healthcare, clinical AI projects must address bias not only as a technical issue but also as a matter of governance. Our work highlights how different forms of bias can emerge during data collection, illustrated with real cases from our project. We provide a list of recommendations to avoid these biases and emphasize the importance of interdisciplinary collaboration, balanced cohort design, and thoughtful inclusion of metadata. We hope that these lessons learned from our experience will inform and support future healthcare AI projects in building more equitable and effective systems that are both legally compliant and socially responsible.
We would like to thank all the collaborators involved in the project who participated in the interviews, and Amparo Callejón-Leblic for her insightful feedback. This research has been funded by the Artificial Intelligence for Healthy Aging (AI4HA, MIA.2021.M02.0007.E03) project from the Programa Misiones de I+D en Inteligencia Artificial 2021 and by the European Union-NextGenerationEU, Ministry of Universities and Recovery, Transformation and Resilience Plan, through a call from Universitat Politècnica de Catalunya and Barcelona Supercomputing Center (Grant Ref. 2021UPC-MS-67461/2021BSC-MS-67461). Anna Arias Duart acknowledges her AI4S fellowship within the “Generación D” initiative by Red.es, Ministerio para la Transformación Digital y de la Función Pública, for talent attraction (C005/24-ED CV1), funded by NextGenerationEU through PRTR. Additional funding from the European Union through the Marie Skłodowska-Curie project AHEAD (grant agreement No 101183031). We would also like to thank Nardine Osman and Mark d’Inverno for Figure 2 in their work [38], which inspired our Table 1.
During the preparation of this work, the authors used ChatGPT (GPT-4) and DeepSeek Chat for grammar and spelling checks. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.