April 07, 2025
Detecting biases in structured data is a complex and time-consuming task. Existing automated techniques are limited in the diversity of data types they handle and rely heavily on case-by-case human handling, resulting in a lack of generalizability. Recently, large language model (LLM)-based agents have made significant progress in data science, but their ability to detect data biases remains insufficiently explored. To address this gap, we introduce the first end-to-end, multi-agent synergy framework, BiasInspector, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements the plan with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations. To address the lack of a standardized framework for evaluating the capability of LLM agents to detect biases in data, we further propose a comprehensive benchmark comprising multiple evaluation metrics and a large set of test cases. Extensive experiments demonstrate that our framework achieves exceptional overall performance in structured data bias detection, setting a new milestone for fairer data applications.
Data have been extensively utilized for various purposes, including model training, decision support, and personalized recommendations [1]. However, inevitable biases in data are the root causes of downstream biased behaviors in models [2], including discriminatory decision-making [3] and the perpetuation of social inequality [4]. For example, biases in the MIMIC-IV dataset have been shown to result in predictive models unevenly distributing medical resources across different demographic and socioeconomic groups [5]. The detection of bias in structured data is defined as the process of analyzing and quantifying attribute distributions and correlations to identify unfairness or inaccuracies for specific subsets [6], [7]. It is an important step for improving model prediction reliability and fairness in fields such as healthcare [8], education [9], and finance [10].
Biases in structured data vary in form and degree, making it hard to develop a universal automated detection method. Despite advancements in automated bias detection, significant limitations remain. Limited detection metrics struggle to identify certain types of bias or accurately measure their extent, making them ineffective across diverse datasets [11], [12]. The requirement for specialized expertise forces users to understand bias detection concepts and possess programming skills, posing a significant barrier for non-expert users [13]. Furthermore, detection results offer insufficient interpretability: they primarily provide fixed numerical outputs while lacking intuitive visualization options, natural language descriptions of bias severity, and improvement suggestions [14].
Large Language Models (LLMs) have significantly enhanced the capabilities of autonomous agents to effectively address domain-specific tasks [15], [16]. Excelling in task planning [17] and workflow execution [18], LLMs are particularly well-suited for tackling data analysis tasks. Recently, LLM-based agents have achieved remarkable progress in data analysis [19], [20]. However, due to the lack of flexible task planning for bias detection [21], comprehensive bias detection tools [22] and specific bias level descriptions [23], existing data science agents fail to meet the requirements for bias detection. Thus, an automated agent capable of comprehensive bias detection in structured data is urgently needed to address these challenges.
In this work, we introduce BiasInspector, the first multi-agent synergy framework for detecting bias in structured data. As a fully end-to-end agent, it supports users in posing questions across all levels, ranging from generalized abstraction (e.g., Does insurance type influence the allocation of medical resources?) to precise specifics (e.g., Using Cohen's d and Causal Effect to assess the strength of the correlation between insurance_type and icu_stay). Our multi-agent collaborative framework executes iterative interactions across multiple stages, including data preprocessing, detection analysis, and visualization summarization. It continuously optimizes the formulation of plans and the application of tools, providing users with a comprehensive results report. The report not only includes professional metric results but also presents bias levels, visualizations, and dataset optimization recommendations in an easy-to-understand manner. It effectively addresses the needs of both technical experts and general users.
Faced with diverse data, previous techniques have been kept from developing into general-purpose bias detection methods by the scarcity of detection metrics, the limited scalability of toolsets, and the complexity of operations. To address these long-standing challenges in the field of bias detection, we retrieved metrics related to bias detection in structured data from the extensive literature. These metrics were refined into 46 predefined tools and 100 generatable tools, organized separately into the toolkit and the method library, as shown in Fig. 1. These two modules not only encompass a diverse range of detection tools but also offer high extensibility: data scientists and engineers can easily add new tools in the appropriate format to either module as needed. Additionally, the LLM's ability to directly invoke these tools eliminates the need for users to consult tool usage guides or possess programming skills.
Existing data analysis benchmarks [24] fail to evaluate the ability of LLM agents in detecting biases within structured data. Specifically, they lack metrics for assessing LLM agents’ capability to detect multiple types of biases and their accuracy in measuring bias severity. To address this gap, we propose a new benchmark. We selected features with potential demographic biases from five commonly used datasets and designed a task set comprising 100 tasks for detecting various types of biases. Additionally, we developed an evaluation framework and designed an evaluation agent to separately assess performance from the end-result and intermediate-process perspectives.
We conducted extensive experiments using the newly proposed benchmark, including both End Result Evaluation and Intermediate Process Evaluation. The experimental results demonstrate that BiasInspector achieves an accuracy of up to 78% in bias degree detection tasks and exhibits outstanding performance in six critical aspects such as planning and tooling. These findings underscore BiasInspector’s significant contributions toward ensuring fairness in structured data applications.
In summary, our main contributions are:
We introduce BiasInspector, the first LLM agent designed for detecting biases in structured data. It addresses the diverse needs of both professional and non-professional users in bias detection tasks.
We propose a benchmark that evaluates LLM agents' performance in detecting biases in structured data from both end-result and intermediate-process perspectives.
Extensive experiments demonstrate that BiasInspector achieves superior performance in structured data bias detection tasks, significantly outperforming existing agent baselines. Additionally, it consistently exhibits outstanding capabilities across multiple intermediate process dimensions.
Figure 1: Overview of the multi-agent architecture with a Primary and an Advisor Agent collaborating and invoking tools from the Toolset and Bias Detection Method Library.
Recently, some studies have leveraged LLM Agents in data science tasks, including data analysis and machine learning. In the field of data analysis, Majumder et al. [19] proposed using large generative models to automate iterative hypothesis generation, validation, and analysis for identifying data correlations. They further proposed a data-driven discovery benchmark [24] to evaluate the capabilities of current LLMs in performing data discovery tasks. TaskWeaver [23] supports data analysis across diverse structures via code generation and plugins. Hassan et al. [20] employed LLM Agents to facilitate user engagement with data. Bordt et al. [22] combined interpretable models with LLMs to generate detailed dataset summaries. MatPlotAgent [21] focuses on automating data visualization using LLM Agents. In addition, some studies utilizing large language models to address machine learning tasks in data science also involve data cleaning and analysis [25], [26]. Our work also involves correlation analysis and dataset visualization, but differs by providing comprehensive bias detection tools, autonomous planning, and detailed bias-level descriptions and recommendations.
Recent studies have focused on developing automated methods to detect biases in structured datasets. Prior works addressed bias quantification in synthetic data [14], examined data bias effects on model fairness [12], and aimed to mitigate bias propagation in classifier training [27]. However, these studies rely on limited fixed metrics, restricting applicability across diverse datasets. FairerML [11], an extensible platform integrating various fairness metrics, offers an interactive interface that allows users to select tools for bias analysis and visualization. AIF360 [13], an open-source toolkit providing comprehensive bias measurement tools, requires extensive documentation navigation and advanced programming skills, thus posing barriers for non-technical users. In contrast, our method leverages an LLM Agent to automate the entire bias detection pipeline, offering richer and scalable toolsets, enhanced interpretability through natural language descriptions, and visualizations.
Figure 2: Workflow overview: User Input, Data Preprocessing, Bias Detection and Analysis, Visualization and Summarization, and User Feedback. It is iterative rather than strictly sequential, allowing returns to previous stages based on user input or updated plans.
In this section, we introduce BiasInspector, a multi-agent system driven by LLM for detecting bias in structured data. We first discuss how bias manifests in structured data, then introduce the diverse set of tools supported by BiasInspector, its framework architecture, and detection process. This section illustrates how BiasInspector effectively achieves accurate and comprehensive bias detection in diverse bias representations.
Bias in structured data includes distribution bias in individual features and correlation bias among multiple features, occurring in both numerical and categorical data. Numerical data exhibit varied distributions, while categorical data differ significantly in category counts. Such diversity prevents a single method from effectively addressing all bias forms, requiring a variety of specialized approaches tailored to specific bias manifestations.
Previous studies proposed numerous methods targeting specific data types or biases but lacked comprehensive coverage. Nevertheless, these methods offer valuable insights. Therefore, we systematically integrate relevant existing methods into a unified toolkit, enabling our Agent to comprehensively detect all forms of bias.
To leverage previous research, we manually reviewed the literature and employed large language models to extract structured data bias detection methods. Each method was decomposed into explicit steps and stored uniformly in a JSON file. To facilitate accurate and efficient retrieval by the agent, we systematically annotated each method in the JSON file according to its corresponding bias type (distribution bias, correlation bias), data type (numerical, categorical), detection methodology, and application domain (e.g., medical, social sciences). This annotated resource, termed the Bias Detection Method Library, is detailed in Appendix 7.
From this library, we compiled 100 methods covering all bias and data types and further categorized them into five representative scenarios, each corresponding to a specific combination of bias type and data type. For each scenario, we selected the five most representative methods and encapsulated them as callable Python functions. These constitute our Toolset, comprising 25 distinct bias detection tools.
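To make the encapsulation concrete, the following is a minimal sketch of what one such predefined detection tool could look like; the function name, the use of normalized Shannon entropy as the balance metric, and the level thresholds are illustrative assumptions, not the actual Toolset implementation.

```python
# Illustrative sketch only: the function name, the use of normalized Shannon
# entropy as the balance metric, and the level thresholds are assumptions,
# not the actual Toolset implementation.
import numpy as np
import pandas as pd


def categorical_distribution_bias(df: pd.DataFrame, feature: str) -> dict:
    """Estimate distribution bias of a categorical feature on a five-level scale."""
    counts = df[feature].value_counts(dropna=True)
    probs = counts / counts.sum()
    # Normalized Shannon entropy: 1.0 = perfectly balanced, 0.0 = fully concentrated.
    entropy = float(-(probs * np.log(probs)).sum() / np.log(len(probs))) if len(probs) > 1 else 0.0
    # Map the metric to five bias levels (1 = most balanced, 5 = most biased); cut-offs are hypothetical.
    thresholds = [0.9, 0.75, 0.5, 0.25]
    level = 1 + sum(entropy < t for t in thresholds)
    return {"feature": feature, "metric": "normalized_entropy",
            "value": round(entropy, 4), "bias_level": level}
```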
Upon receiving user input, our Agent system automatically determines the bias type and data type scenario relevant to the current task, based on the user's instructions and the results of data loading and preprocessing. Using Retrieval-Augmented Generation (RAG), the Agent selects suitable bias detection tools in two steps: it first retrieves applicable methods from the Bias Detection Method Library, then directly invokes the predefined Python functions for methods available in the Toolset; for methods not predefined in the Toolset, the Agent automatically generates executable code to perform the detection.
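This two-step selection can be pictured roughly as follows; the retriever interface, the record annotation "tool_name", and the code-execution helper are hypothetical placeholders rather than BiasInspector's actual API.

```python
# Minimal sketch of the two-step selection described above. The retriever
# interface, the record annotation "tool_name", and the code-execution helper
# are hypothetical placeholders rather than BiasInspector's actual API.
PREDEFINED_TOOLS = {
    "categorical_distribution_bias": categorical_distribution_bias,  # e.g., the function sketched earlier
    # ... the remaining predefined detection functions would be registered here
}


def select_and_run_methods(retriever, llm, df, features, user_request, scenario, top_k=3):
    # Step 1: retrieve applicable methods from the Bias Detection Method Library
    # (RAG over the annotated JSON records, filtered by the detected scenario).
    methods = retriever.search(query=user_request, filters={"scenario": scenario}, k=top_k)
    results = []
    for record in methods:
        tool_name = record.get("tool_name")  # assumed annotation linking a record to a Toolset function
        if tool_name in PREDEFINED_TOOLS:
            # Step 2a: invoke the predefined Python function directly.
            results.append(PREDEFINED_TOOLS[tool_name](df, *features))
        else:
            # Step 2b: have the LLM turn the record's procedural steps into executable
            # code, then run it with a sandboxed helper (assumed to exist).
            code = llm.generate_code(steps=record["method"], data_preview=df.head().to_dict())
            results.append(run_generated_code(code, df))
    return results
```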
The Bias Detection Method Library is designed for extensibility, enabling users to easily integrate new methods using large language models through a consistent process. This ensures the Agent continuously benefits from increasingly comprehensive and advanced bias detection techniques in future applications.
Compared to traditional manual approaches, where domain experts typically select limited tools—potentially lacking sensitivity to certain biases—and must manually write or invoke detection code, our Agent system swiftly and automatically identifies multiple suitable tools from an extensive library, significantly enhancing comprehensiveness. Additionally, automating tool invocation and code generation markedly improves detection efficiency. Thus, the automated and expandable toolset enables our Agent to fully leverage its advantages in structured data bias detection.
To construct a comprehensive agent system for structured data bias detection, we developed a complete suite of functionalities, including data preprocessing, bias analysis, result visualization, and reporting. Since practical tasks require dataset-specific strategies, we implemented numerous functional tools suitable for various scenarios. These tools are encapsulated as predefined functions in a unified Toolset for direct Agent invocation. Currently, the Toolset contains 46 distinct tools, detailed in Appendix 8.
Bias detection is typically an iterative reasoning-action process, where the Agent continuously updates execution plans based on user interactions and tool feedback. Detection tasks involve multiple stages, including data preprocessing, bias analysis, and result visualization, each potentially requiring single or combined tool usage. Such complexity imposes significant demands on large language models for planning, outcome interpretation, feedback, and decision-making.
To address potential limitations of large language models and enhance task completion rate and quality, we designed a multi-agent framework comprising a Primary Agent and an Advisor Agent. The Primary Agent directly interacts with users, formulates execution plans, invokes necessary tools, and consults the Advisor Agent at key stages. The Advisor Agent reviews and refines these plans, identifies oversights, recommends tool invocations, analyzes results, and provides strategies for anomalies or errors. Detailed roles of both agents are provided in Appendix 9.
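Schematically, the collaboration between the two agents can be sketched as below; the agent interfaces and stage names are illustrative rather than the framework's actual implementation.

```python
# Schematic sketch of the Primary/Advisor collaboration; the agent interfaces
# and stage names are illustrative, not the framework's actual implementation.
STAGES = ["data_preprocessing", "bias_detection", "visualization_summary"]


def run_detection_task(primary, advisor, user_request, dataset_path):
    plan = primary.draft_plan(user_request, dataset_path)
    plan = advisor.review_plan(plan)                    # refine the plan, flag oversights
    context = {"dataset": dataset_path, "results": {}}
    for stage in STAGES:
        outcome = primary.execute_stage(stage, plan, context)
        feedback = advisor.review_outcome(stage, outcome)   # recommend tools, handle anomalies
        if feedback.requires_revision:
            plan = primary.revise_plan(plan, feedback)
            outcome = primary.execute_stage(stage, plan, context)
        context["results"][stage] = outcome
    report = primary.summarize(context)
    return advisor.refine_summary(report)               # final review before returning to the user
```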
This section describes the operational workflow for bias detection tasks, as illustrated in Figure 2, comprising five key steps: user input, data preprocessing, bias detection and analysis, result visualization and summarization, and user feedback.
Users provide a structured dataset and bias-related questions or task instructions. BiasInspector supports varying expertise levels, accommodating general bias inquiries or explicit tool-specific instructions.
Upon receiving data, the Agent formulates and executes a preprocessing plan, including reading data, identifying features and types, extracting relevant columns, handling missing values and outliers, and constructing a processed subset for downstream analysis.
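A minimal preprocessing sketch, assuming pandas, might look as follows; the concrete policies for missing values and outliers are example choices, not the Agent's fixed behavior.

```python
# Illustrative preprocessing sketch (pandas); the concrete policies for missing
# values and outliers are example choices, not the Agent's fixed behavior.
import pandas as pd


def preprocess(path: str, relevant_columns: list[str]) -> pd.DataFrame:
    df = pd.read_csv(path)
    subset = df[relevant_columns].copy()
    for col in subset.columns:
        if pd.api.types.is_numeric_dtype(subset[col]):
            subset[col] = subset[col].fillna(subset[col].median())
            # Clip extreme outliers to the 1st/99th percentiles (example policy).
            lo, hi = subset[col].quantile([0.01, 0.99])
            subset[col] = subset[col].clip(lo, hi)
        else:
            subset[col] = subset[col].fillna("Unknown").astype("category")
    return subset
```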
The Agent then develops and executes a bias detection plan, selecting suitable methods from the Toolset and Bias Detection Method Library, sequentially executing them, and obtaining numerical metrics indicating bias severity.
The Agent formulates and executes a visualization strategy, selecting suitable representations to clearly illustrate results. It analyzes metrics, summarizes bias severity, and formulates targeted recommendations. All relevant information—including bias types, associated features, detection tools and metrics, bias severity levels, visualizations, and recommendations—is consolidated into a comprehensive PDF detection report.
Finally, the Agent presents detection results through natural language explanations and the PDF report, proactively engaging users for additional comments or requirements, ensuring alignment with user expectations.
Effective evaluation of agent capability in structured data bias detection remains underexplored. Existing data analysis benchmarks [24] inadequately assess diverse bias scenarios and lack accuracy evaluation for bias severity estimation. To address these limitations, we propose a novel benchmark framework. In this section, we first describe the benchmark construction and then discuss two evaluation dimensions: end-result evaluation and intermediate-process evaluation.
We selected five structured datasets widely used in prior bias mitigation research [28]: Adult [29], COMPAS [30], Statlog [31], MIMIC-IV [32], and Student Performance [33]. These datasets cover diverse domains (socioeconomic, criminal justice, finance, healthcare, education), vary from hundreds to tens of thousands of instances, and include 10 to 33 categorical and numerical features. Thus, they are highly suitable for evaluating an agent’s bias detection capability in structured data. Dataset details are provided in Appendix 10.
We selected 100 demographic-related features or feature combinations from the datasets, covering categorical and numerical data. Individual features were classified as potentially exhibiting distribution bias, while feature combinations were classified as involving correlation bias. Using a large language model, we created diverse bias detection queries for each case, varying linguistic expressions to simulate human questioning styles. Queries include explicitly stated bias types (distribution or correlation) and intentionally ambiguous forms, termed implication bias, reflecting real-world user scenarios. Task set details and examples are provided in Appendix 11.
Bias detection in structured datasets involves not only identifying the presence of bias but also accurately quantifying its severity, which is typically challenging. To enable quantitative severity assessment, we defined five bias levels, ranging from 'most balanced' to 'most biased.'
We categorized bias manifestations into five representative scenarios based on combinations of data types (categorical, numerical) and bias types (distribution, correlation). For each scenario, we selected five widely-used bias detection metrics, developed corresponding detection scripts, and mapped metric values to the predefined bias severity levels.
To ensure mapping accuracy, we manually constructed synthetic datasets exhibiting varying degrees of bias, data scales, and feature counts. Through iterative testing and refinement, we verified that our scripts precisely map different bias severities to correct levels. These verified mappings serve as the ground truth for outcome evaluation.
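As an illustration of how such synthetic data could be constructed, the sketch below generates a categorical column with a controllable degree of imbalance; the parameterization is an assumption for illustration only.

```python
# Sketch of constructing a synthetic categorical column with a controllable
# degree of imbalance, as one might use to verify metric-to-level mappings.
# The parameterization is illustrative only.
import numpy as np
import pandas as pd


def make_synthetic_categorical(n_rows: int, n_categories: int, imbalance: float, seed: int = 0) -> pd.DataFrame:
    """imbalance = 0 gives a uniform distribution; values near 1 concentrate mass on one category."""
    rng = np.random.default_rng(seed)
    weights = np.ones(n_categories)
    weights[0] += imbalance * n_categories * 10  # skew probability mass toward the first category
    probs = weights / weights.sum()
    values = rng.choice([f"cat_{i}" for i in range(n_categories)], size=n_rows, p=probs)
    return pd.DataFrame({"feature": values})


# Example check: sweep imbalance levels and confirm each maps to the expected
# bias level, e.g. with a detection tool such as the one sketched earlier.
# for b in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(b, make_synthetic_categorical(5000, 5, b)["feature"].value_counts(normalize=True).round(2).to_dict())
```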
We developed an automated evaluation framework to assess the agent’s bias detection performance from an outcome perspective. For each task, the framework identifies the involved feature or feature combination, maps it to the corresponding bias scenario, and applies five scenario-specific detection tools, each producing a predicted bias level. We select the maximum predicted level as the reference value, assuming it represents the most sensitive bias detection.
We then compare this reference with the agent’s predicted bias level. The absolute difference quantitatively measures the agent’s outcome-level detection accuracy.
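A minimal sketch of this outcome evaluation for a single task, assuming each detection tool returns a bias level as in the earlier sketches:

```python
# Sketch of the outcome evaluation for a single task: run the five
# scenario-specific detection tools, take the maximum predicted level as the
# reference, and score the agent by the absolute difference. Tool signatures
# follow the earlier sketches and are assumptions.
def evaluate_task(scenario_tools, df, features, agent_level: int) -> dict:
    predicted_levels = [tool(df, *features)["bias_level"] for tool in scenario_tools]
    reference = max(predicted_levels)  # the most sensitive detection serves as the reference
    return {
        "reference_level": reference,
        "agent_level": agent_level,
        "absolute_error": abs(agent_level - reference),
    }
```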
When evaluating an agent’s bias detection capability, it is essential to consider not only final outcomes but also the reasoning, planning, and decision-making processes throughout task execution. To achieve a comprehensive assessment, we further evaluate the agent from a process-oriented perspective, focusing on five aspects: user communication, task planning, tool invocation, dynamic plan adjustment, and result analysis.
We developed an agent-based automated evaluation system, defining five performance rating levels (Excellent, Proficient, Adequate, Mediocre, Unsatisfactory) with explicit criteria across these dimensions. This system automatically analyzes the agent’s operational logs and generates a PDF report detailing ratings and supporting rationale for each evaluated task. Further details are provided in Appendix 12.
To validate the automated evaluation’s reliability, we recruited human evaluators to independently review execution logs and assess the agent’s performance along the same five dimensions. Correlation analysis comparing automated ratings with human assessments confirmed the effectiveness and reliability of our evaluation method. Additional details are provided in Appendix 13.
To systematically evaluate BiasInspector, we designed several framework-model combinations and conducted experiments using the benchmark from Section 4.
We evaluated four agent frameworks: (1) the full BiasInspector (Primary and Advisor Agents); (2) single-agent BiasInspector (Primary Agent only); (3) ReAct-based agent [34] equipped with the same toolset as BiasInspector; and (4) Self-reflection agent, relying solely on model reasoning with tool access limited to basic data reading.
We conducted experiments with two widely-used large language models: GPT-4o and Llama 3.3 70B. Decoding temperature was set to 0 for output stability.
Each framework-model combination is evaluated from two perspectives:
End-result evaluation. To measure accuracy, we compute the Mean Absolute Error (MAE) between agent-predicted and ground-truth bias levels across tasks. For intuitive interpretation, we convert MAE into an Average Similarity Score \(S_{\text{avg}}\), defined as \(S_{\text{avg}} = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{|x_i - y_i|}{4}\right)\times 100\%\), where \(x_i\) and \(y_i\) denote the predicted and ground-truth bias levels for task \(i\). Higher scores indicate greater accuracy.
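The score can be computed directly from this definition, as in the brief sketch below.

```python
# Direct translation of the Average Similarity Score defined above.
def average_similarity(predicted: list[int], ground_truth: list[int]) -> float:
    assert len(predicted) == len(ground_truth) and predicted
    scores = [1 - abs(x - y) / 4 for x, y in zip(predicted, ground_truth)]
    return 100 * sum(scores) / len(scores)


# Example: predictions off by one level on two of four tasks yield 87.5%.
# average_similarity([3, 5, 2, 4], [3, 4, 2, 3])
```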
Intermediate-process evaluation. Using our automated GPT o3-mini-based evaluation system, we assess each task across five dimensions: user interaction, planning, tool invocation, dynamic adaptation, and result summarization. We report the average score over all tasks, reflecting overall workflow execution capability.
Table 1: Average Similarity Scores (%) for each framework-model combination.

Model | Task Set | BiasInspector (Multi-Agent) | BiasInspector (Single-Agent) | ReAct-Based Agent | Self-Reflection Agent
---|---|---|---|---|---
GPT-4o | Overall | 77.53 | 76.52 | 67.68 | 60.61
GPT-4o | Distribution | 84.46 | 86.49 | 72.97 | 60.14
GPT-4o | Correlation | 72.70 | 64.80 | 60.86 | 59.54
GPT-4o | Implication | 76.04 | 81.25 | 72.92 | 65.62
Llama 3.3 70B | Overall | 71.97 | 71.46 | 63.13 | 66.67
Llama 3.3 70B | Distribution | 74.32 | 72.30 | 77.03 | 66.89
Llama 3.3 70B | Correlation | 71.79 | 70.51 | 55.13 | 63.46
Llama 3.3 70B | Implication | 68.75 | 70.83 | 55.21 | 71.88
In Table 1, we present Average Similarity Scores for model-framework combinations evaluated on the full task set and subsets (distribution bias, correlation bias, implication bias). This metric effectively reflects bias detection accuracy for each combination.
Our analysis shows BiasInspector consistently achieves 70%-80% similarity scores, significantly outperforming ReAct and Self-Reflection agents. The ReAct Agent shows instability with notable accuracy fluctuations across bias types. While Multi-Agent and Single-Agent BiasInspector perform similarly overall, the Multi-Agent variant achieves 8 points higher accuracy on complex correlation bias tasks with GPT-4o, highlighting collaborative agent benefits.
The GPT series consistently outperforms Llama 3.3 70B when powering BiasInspector. Notably, GPT-4o achieves over 10 percentage points higher accuracy than Llama 3.3 70B on distribution bias tasks in both Multi-Agent and Single-Agent architectures. This underscores that powerful LLMs like GPT-4o better leverage BiasInspector’s planning and tooling capabilities, substantially enhancing structured data bias detection effectiveness.
Figure 3: Intermediate process performance of four agent frameworks.
Figure 3 presents Intermediate Process Performance of all model-agent architecture combinations across six dimensions: Integration, Communication, Planning, Tooling, Adaptivity, and Summarization, clearly illustrating BiasInspector’s superior performance.
BiasInspector notably excels in Communication, Planning, and Summarization, achieving scores over 90 (GPT-4o) and over 80 (Llama 3.3 70B). It also demonstrates strong Tooling and Adaptivity, with minor gaps (1–2 points) from top scores, confirming its consistent excellence across dimensions.
BiasInspector consistently surpasses ReAct agents by 5–15 points in key dimensions like Planning and Tooling, and significantly outperforms Self-Reflection agents. This highlights BiasInspector’s superior design and effective utilization of large language models, explaining its higher practical bias detection accuracy. Additionally, the Multi-Agent BiasInspector slightly outperforms the Single-Agent variant across dimensions, validating the benefits of multi-agent collaboration.
Figure 3 (a) shows GPT-4o-powered agents outperform Llama 3.3 70B agents (Figure 3 (b)) by about 10 points in all dimensions. This confirms advanced LLMs like GPT-4o significantly boost agent performance, effectively leveraged by BiasInspector.
In this work, we introduce BiasInspector, an LLM-driven multi-agent framework for automated bias detection in structured data. Collaborative agents formulate multi-stage task plans based on user requirements, using diverse analytical tools to deliver detailed bias detection results. To address the lack of standardized evaluation for LLM-based bias detection, we propose a comprehensive benchmark with multiple metrics and extensive test cases. Experiments show that BiasInspector achieves exceptional bias detection performance, ensuring robust data fairness and setting a benchmark for future research.
To support flexible and reproducible bias detection, we developed a structured library of detection methods encoded in JSON format. Each method entry includes metadata and a set of structured procedural steps. Specifically, each record contains:
id: A unique identifier for the method entry.
intention: A high-level description of the goal, specifying the data type, analytical approach, and bias context.
method: A dictionary of ordered steps detailing the procedure for detecting bias using tools or algorithms.
title: The title of the reference publication.
article link: A URL linking to the original article.
field: The academic or application domain.
year: The year of publication.
Table 2 presents a sample excerpt from this library.
Table 2: Excerpt from the JSON-based bias detection method library.
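For illustration, a new record following this schema could be appended to the library as in the following sketch; the entry's field values and the file name are hypothetical, not actual library contents.

```python
# Hypothetical example of appending one new record to the library; the field
# values and the file name below are illustrative, not actual library contents.
import json

new_method = {
    "id": "corr_cat_num_example",
    "intention": "Measure correlation bias between a categorical demographic feature "
                 "and a numerical outcome using a standardized effect size.",
    "method": {
        "step_1": "Group the numerical outcome by the categories of the sensitive feature.",
        "step_2": "Compute Cohen's d between the two largest groups.",
        "step_3": "Map the effect size to one of the five bias severity levels.",
    },
    "title": "Example reference publication title (placeholder)",
    "article link": "https://example.org/placeholder",
    "field": "healthcare",
    "year": 2024,
}

with open("bias_detection_method_library.json", "r+", encoding="utf-8") as f:
    library = json.load(f)          # the library is assumed to be a JSON list of records
    library.append(new_method)
    f.seek(0)
    json.dump(library, f, indent=2)
    f.truncate()
```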
We have created four categories of tools, totaling 46 tools in the toolkit, including:
7 commonly used data loading and preprocessing tools, designed to meet various preprocessing needs for structured datasets.
25 bias detection tools, with 5 commonly used metric detection tools provided for each type of bias manifestation.
9 data visualization tools, offering comprehensive chart representations for different data types.
5 miscellaneous tools, designed for user interaction, reading the reference methods library, executing dynamically generated tool code, and generating PDF detection reports.
Table 3 presents each tool and its functional description.
Table 3: Description of tools in the toolset.
Figures 4 and 5 show the excerpted prompt templates for the Primary Agent and the Advisor Agent, respectively.
In the Primary Agent’s prompt template, we explicitly list its task requirements in bullet points, including communicating with the user, developing a detection plan, selecting appropriate tools, and providing the user with a detailed summary of the results. Each task requirement is further elaborated. For example, in the "Develop a detection plan" section, the prompt specifies that the plan should cover data loading and preprocessing, detection and analysis methods, visualization schemes, and result summarization. In the "Select appropriate tools" section, the prompt directs the agent to choose suitable tools from the available toolset and bias detection method library.
In the Advisor Agent’s prompt template, we also clearly outline its task requirements in bullet points, including assessing whether the Primary Agent’s actions align with the user’s intent, optimizing the execution plan, improving tool selection and providing feedback on the execution results, as well as revising the result summary provided to the user. Each task requirement is further explained in detail.
Compared to the excerpted prompt templates shown here, the actual prompt templates further elaborate on each task description. They are continuously refined and optimized by adding examples, emphasizing key sections, and incorporating feedback from runtime debugging and execution results.
Figure 4: Primary Agent prompt template (excerpt).
Figure 5: Advisor Agent prompt template (excerpt).
In our study, we selected five widely used datasets for bias detection tasks across different domains, including socioeconomic status, criminal justice, finance, healthcare, and education. These datasets vary in sample size, number of features, and data characteristics. Each dataset contains both categorical and numerical features, making them suitable for comprehensive bias analysis. The features of each dataset are listed below.
Adult features: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income.
COMPAS features: Person_ID, AssessmentID, Case_ID, Agency_Text, LastName, FirstName, MiddleName, Sex_Code_Text, Ethnic_Code_Text, DateOfBirth, ScaleSet_ID, ScaleSet, AssessmentReason, Language, LegalStatus, CustodyStatus, MaritalStatus, Screening_Date, RecSupervisionLevel, RecSupervisionLevelText, Scale_ID, DisplayText, RawScore, DecileScore, ScoreText, AssessmentType, IsCompleted, IsDeleted.
Statlog features: Status of existing checking account, Duration in months, Credit history, Purpose, Credit amount, Savings account/bonds, Present employment since, Installment rate, Personal status and sex, Other debtors/guarantors, Present residence since, Property, Age in years, Other installment plans, Housing, Number of existing credits at this bank, Job, Number of people liable for maintenance, Telephone, Foreign worker, Credit risk.
MIMIC-IV features: admission_type, hospital_expire_flag, admission_location, discharge_location, patient_insurance, patient_lang, patient_marital, patient_race, patient_gender, patient_age.
Student Performance features: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3.
To systematically evaluate the capabilities of our bias detection agent, we constructed a taskset of 100 diverse prompts derived from the five datasets used in our study. The taskset covers three major categories of bias: Distribution Bias, Correlation Bias, and Implication Bias. These categories reflect common forms of bias that occur in real-world datasets and are essential for evaluating fairness.
For each bias category, we designed prompts that vary in structure and language to reflect human-like diversity in expression. We utilized large language models (LLMs) to paraphrase and diversify the formulations of each prompt, ensuring naturalness and realism in phrasing. The distribution of tasks across datasets and bias types is shown in Table 4.
Dataset | Distribution | Correlation | Implication | Total |
---|---|---|---|---|
Adult.csv | 11 | 9 | 6 | 26 |
COMPAS.csv | 8 | 8 | 5 | 21 |
Statlog.csv | 7 | 8 | 5 | 20 |
MIMIC-IV.csv | 6 | 6 | 6 | 18 |
Student Performance.csv | 5 | 8 | 2 | 15 |
Total | 37 | 39 | 24 | 100 |
Table 5 provides a sample of five representative questions from the taskset. These examples illustrate how the questions are designed to probe distributional characteristics of specific features while maintaining natural and varied phrasing.
Question | Type | Feature(s) | Bias Significance |
---|---|---|---|
Can you check if the age distribution across individuals is balanced, or do certain age groups dominate? | Distribution | age (Numerical) | Imbalanced age groups may skew income-related insights. |
How does the distribution of work class look to you? Are there any work classes that appear to be overrepresented? | Distribution | workclass (Categorical) | Uneven representation can bias analysis of income across sectors. |
From your perspective, is the distribution of education levels spread fairly, or do certain levels dominate? | Distribution | education (Categorical) | Dominance in education levels may affect fairness in outcome evaluation. |
In your view, how does the marital status distribution appear? Are any marital statuses overrepresented? | Distribution | marital-status (Categorical) | Skewed marital status can distort socio-economic impact assessments. |
Do certain occupations dominate the dataset, or is the distribution of occupations relatively even? | Distribution | occupation (Categorical) | Occupation imbalance may lead to biased interpretation of income disparities. |
In this appendix, we provide detailed information regarding the automated evaluation system used to assess the agent’s performance from the intermediate process perspective.
The evaluation agent leverages the GPT-o3-mini model, specifically chosen due to its superior analytical reasoning capability in comparison to GPT-4o and GPT-4o mini, providing more accurate and nuanced evaluation results at a lower API usage cost.
We evaluate the agent’s intermediate processes along five comprehensive dimensions, carefully selected to ensure thorough and holistic performance assessment:
Effective Communication with the User to Clarify Tasks
Comprehensiveness and Thoroughness of Planning
Efficiency in Tool Execution and Dynamic Adjustment
Ability to Dynamically Adjust Plans Based on Execution Results
Clarity and Depth of Results Analysis and Summary
These dimensions collectively ensure a complete assessment of the agent’s capability in performing bias detection tasks.
To facilitate thorough evaluation, we constructed a dedicated toolset enabling the evaluation agent to systematically analyze and rate agent performance. Details of the specific tools provided in the toolset are summarized in Table 6.
Table 6: Automated agent evaluation tools.

Tool | Description
---|---
get_user_input_tool | Captures user input dynamically during an interaction and formats it as a dictionary to be added to the agent’s conversation. |
read_json_log | Reads a JSON formatted log file, extracting and organizing log entries. |
read_markdown_log | Reads a Markdown formatted log file, extracting headers, bold text, and regular text as structured log entries. |
read_bias_report_pdf | Reads a bias detection report in PDF format, extracting both textual and graphical information. |
generate_evaluation_report_pdf | Generates a flexible evaluation report in PDF format, combining text narratives and visual charts as specified in the input. |
Figure 6 presents an example of a refined prompt snippet used by the automated evaluation agent to standardize the evaluation process.
Figure 6: Evaluation Agent Prompt
We provided human evaluators with the same evaluation criteria as those used in the automated agent evaluation system's prompt (an excerpt of these criteria is shown in Figure 6). Specifically, we selected three detection tasks from each of the four agent frameworks (BiasInspector Multi-Agent, BiasInspector Single-Agent, ReAct-Based Agent, and Self-Reflection Agent), each under GPT-4o and Llama 3.3 70B, across various combinations of data types and bias types, resulting in a total of 120 evaluation instances. We then compared the scores assigned by human evaluators and the automated agent evaluation system along the five distinct evaluation dimensions in these instances.
Table 7: Scoring differences between human evaluators and the automated agent evaluation system.

Model | Dimension | BiasInspector (Multi-Agent) | BiasInspector (Single-Agent) | ReAct-Based Agent | Self-Reflection Agent
---|---|---|---|---|---
GPT-4o | Integration | 3.53 | 3.27 | 4.20 | 5.60
GPT-4o | Communication | 4.20 | 4.07 | 3.86 | 4.13
GPT-4o | Planning | 5.67 | 6.13 | 5.53 | 6.27
GPT-4o | Tooling | 2.93 | 3.20 | 9.87 | 31.80
GPT-4o | Adaptivity | 3.80 | 4.13 | 4.47 | 4.53
GPT-4o | Summarization | 4.87 | 5.07 | 5.27 | 4.80
Llama 3.3 70B | Integration | 3.73 | 3.87 | 4.40 | 5.40
Llama 3.3 70B | Communication | 3.80 | 3.93 | 4.07 | 3.60
Llama 3.3 70B | Planning | 6.07 | 6.40 | 5.80 | 5.40
Llama 3.3 70B | Tooling | 2.80 | 2.60 | 10.20 | 24.60
Llama 3.3 70B | Adaptivity | 4.20 | 3.73 | 3.93 | 3.67
Llama 3.3 70B | Summarization | 4.93 | 4.73 | 5.20 | 5.13
As shown in Table 7, we compared the scoring differences between human evaluators and the automated agent evaluation system across all evaluation dimensions. The results demonstrate that the scoring discrepancies between the two evaluation methods generally remained within 5 points across the Integration, Communication, Planning, Adaptivity, and Summarization dimensions, indicating strong reliability and practicality of the automated agent evaluation system. However, in the Tooling dimension, human evaluators assigned lower scores to the ReAct-Based Agent due to its limited quantity and variety of tool usage, and assigned the lowest scores to the Self-Reflection Agent, which lacks tool invocation capability entirely. Although the automated evaluation system also significantly lowered the scores for these two agents, human evaluators applied a more substantial reduction, resulting in more noticeable discrepancies in this dimension. Nevertheless, the automated evaluation system successfully captured these critical differences, as further reflected by the significantly lower scores illustrated in Figure 3. This underscores the effectiveness and sensitivity of the automated agent evaluation system in the Tooling dimension.
Code and data are available at: https://github.com/uscnlp-lime/BiasInspector