April 07, 2025
Detecting biases in structured data is a complex and time-consuming task. Existing automated techniques are limited in the diversity of data types they handle and rely heavily on case-by-case human handling, resulting in a lack of generalizability. Recently, large language model (LLM)-based agents have made significant progress in data science, but their ability to detect data biases remains insufficiently explored. To address this gap, we introduce the first end-to-end, multi-agent synergy framework, BiasInspector, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements the plan with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations. To address the lack of a standardized framework for evaluating the capability of LLM agents to detect biases in data, we further propose a comprehensive benchmark comprising multiple evaluation metrics and a large set of test cases. Extensive experiments demonstrate that our framework achieves exceptional overall performance in structured data bias detection, setting a new milestone for fairer data applications.
Data have been extensively utilized for various purposes, including model training, decision support, and personalized recommendations [1]. However, inevitable biases in data are the root causes of downstream biased behaviors in models [2], including discriminatory decision-making [3] and the perpetuation of social inequality [4]. For example, biases in the MIMIC-IV dataset have been shown to result in predictive models unevenly distributing medical resources across different demographic and socioeconomic groups [5]. The detection of bias in structured data is defined as the process of analyzing and quantifying attribute distributions and correlations to identify unfairness or inaccuracies for specific subsets [6], [7]. It is an important step for improving model prediction reliability and fairness in fields such as healthcare [8], education [9], and finance [10].
Biases in structured data vary in form and degree, making it hard to develop a universal automated detection method. Despite advancements in automated bias detection, significant limitations remain. Limited detection metrics struggle to identify certain types of bias or accurately measure their extent, making them ineffective across diverse datasets [11], [12]. The requirement for specialized expertise forces users to understand bias detection concepts and possess programming skills, posing a significant barrier for non-expert users [13]. Furthermore, detection results offer insufficient interpretability: they primarily provide fixed numerical outputs while lacking intuitive visualization options, natural language descriptions of bias severity, and improvement suggestions [14].
Large Language Models (LLMs) have significantly enhanced the capabilities of autonomous agents to effectively address domain-specific tasks [15], [16]. Excelling in task planning [17] and workflow execution [18], LLMs are particularly well-suited for tackling data analysis tasks. Recently, LLM-based agents have achieved remarkable progress in data analysis [19], [20]. However, due to the lack of flexible task planning for bias detection [21], comprehensive bias detection tools [22] and specific bias level descriptions [23], existing data science agents fail to meet the requirements for bias detection. Thus, an automated agent capable of comprehensive bias detection in structured data is urgently needed to address these challenges.
In this work, we introduce BiasInspector, the first multi-agent synergy framework for detecting bias in structured data. As a fully end-to-end agent, it supports users in posing questions across all levels, ranging from generalized abstraction (e.g., Does insurance type influence the allocation of medical resources?) to precise specifics (e.g., Using Cohen's d and Causal Effect to assess the strength of the correlation between insurance_type and icu_stay). Our multi-agent collaborative framework executes iterative interactions across multiple stages, including data preprocessing, detection analysis, and visualization summarization. It continuously optimizes the formulation of plans and the application of tools, providing users with a comprehensive results report. The report not only includes professional metric results but also presents bias levels, visualizations, and dataset optimization recommendations in an easy-to-understand manner. It effectively addresses the needs of both technical experts and general users.
Faced with diverse data, previous techniques have been kept from developing into general-purpose bias detection methods by the scarcity of detection metrics, the limited scalability of toolsets, and the complexity of operations. To address these long-standing challenges in the field of bias detection, we retrieved metrics related to bias detection in structured data from the extensive literature. These metrics were refined into 46 predefined tools and 100 generatable tools, organized separately into the toolkit and the method library, as shown in Fig. 1. These two modules not only encompass a diverse range of detection tools but also offer high extensibility: data scientists and engineers can easily add new tools in the appropriate format to either module as needed. Additionally, the LLM's ability to directly invoke these tools eliminates the need for users to consult tool usage guides or possess programming skills.
Existing data analysis benchmarks [24] fail to evaluate the ability of LLM agents in detecting biases within structured data. Specifically, they lack metrics for assessing LLM agents’ capability to detect multiple types of biases and their accuracy in measuring bias severity. To address this gap, we propose a new benchmark. We selected features with potential demographic biases from five commonly used datasets and designed a task set comprising 100 tasks for detecting various types of biases. Additionally, we developed an evaluation framework and designed an evaluation agent to separately assess performance from the end-result and intermediate-process perspectives.
We conducted extensive experiments using the newly proposed benchmark, including both End Result Evaluation and Intermediate Process Evaluation. The experimental results demonstrate that BiasInspector achieves an accuracy of up to 78% in bias degree detection tasks and exhibits outstanding performance in six critical aspects such as planning and tooling. These findings underscore BiasInspector’s significant contributions toward ensuring fairness in structured data applications.
In summary, our main contributions are:
We introduce BiasInspector, the first LLM agent designed for detecting biases in structured data. It addresses the diverse needs of both professional and non-professional users in bias detection tasks.
We propose a benchmark that evaluates LLM agents' performance in detecting biases in structured data from both end-result and intermediate-process perspectives.
Extensive experiments demonstrate that BiasInspector achieves superior performance in structured data bias detection tasks, significantly outperforming existing agent baselines. Additionally, it consistently exhibits outstanding capabilities across multiple intermediate process dimensions.
Figure 1: Overview of the multi-agent architecture with a Primary and an Advisor Agent collaborating and invoking tools from the Toolset and Bias Detection Method Library.
Recently, some studies have leveraged LLM Agents in data science tasks, including data analysis and machine learning. In the field of data analysis, Majumder et al. [19] proposed using large generative models to automate iterative hypothesis generation, validation, and analysis for identifying data correlations. They further proposed a data-driven discovery benchmark [24] to evaluate the capabilities of current LLMs in performing data discovery tasks. TaskWeaver [23] supports data analysis across diverse structures via code generation and plugins. Hassan et al. [20] employed LLM Agents to facilitate user engagement with data. Bordt et al. [22] combined interpretable models with LLMs to generate detailed dataset summaries. MatPlotAgent [21] focuses on automating data visualization using LLM Agents. In addition, some studies utilizing large language models to address machine learning tasks in data science also involve data cleaning and analysis [25], [26]. Our work also involves correlation analysis and dataset visualization, but differs by providing comprehensive bias detection tools, autonomous planning, and detailed bias-level descriptions and recommendations.
Recent studies have focused on developing automated methods to detect biases in structured datasets. Prior works addressed bias quantification in synthetic data [14], examined data bias effects on model fairness [12], and aimed to mitigate bias propagation in classifier training [27]. However, these studies rely on limited fixed metrics, restricting applicability across diverse datasets. FairerML [11], an extensible platform integrating various fairness metrics, offers an interactive interface that allows users to select tools for bias analysis and visualization. AIF360 [13], an open-source toolkit providing comprehensive bias measurement tools, requires extensive documentation navigation and advanced programming skills, thus posing barriers for non-technical users. In contrast, our method leverages an LLM Agent to automate the entire bias detection pipeline, offering richer and scalable toolsets, enhanced interpretability through natural language descriptions, and visualizations.
Figure 2: Workflow overview: User Input, Data Preprocessing, Bias Detection and Analysis, Visualization and Summarization, and User Feedback. It is iterative rather than strictly sequential, allowing returns to previous stages based on user input or updated plans.
In this section, we introduce BiasInspector, a multi-agent system driven by LLM for detecting bias in structured data. We first discuss how bias manifests in structured data, then introduce the diverse set of tools supported by BiasInspector, its framework architecture, and detection process. This section illustrates how BiasInspector effectively achieves accurate and comprehensive bias detection in diverse bias representations.
Bias in structured data includes distribution bias in individual features and correlation bias among multiple features, occurring in both numerical and categorical data. Numerical data exhibit varied distributions, while categorical data differ significantly in category counts. Such diversity prevents a single method from effectively addressing all bias forms, requiring a variety of specialized approaches tailored to specific bias manifestations.
Previous studies proposed numerous methods targeting specific data types or biases but lacked comprehensive coverage. Nevertheless, these methods offer valuable insights. Therefore, we systematically integrate relevant existing methods into a unified toolkit, enabling our Agent to comprehensively detect all forms of bias.
To leverage previous research, we manually reviewed the literature and employed large language models to extract structured data bias detection methods. Each method was decomposed into explicit steps and stored uniformly in a JSON file. To facilitate accurate and efficient retrieval by the agent, we systematically annotated each method in the JSON file according to its corresponding bias type (distribution bias, correlation bias), data type (numerical, categorical), detection methodology, and application domain (e.g., medical, social sciences). This annotated resource, termed the Bias Detection Method Library, is detailed in Appendix 7.
From this library, we compiled 100 methods covering all bias and data types and further categorized them into five representative scenarios, each corresponding to a specific combination of bias type and data type. For each scenario, we selected the five most representative methods and encapsulated them as callable Python functions. These constitute our Toolset, comprising 25 distinct bias detection tools.
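To make the encapsulation concrete, the following is a minimal sketch of what one such predefined detection tool could look like; the function name, the use of normalized Shannon entropy as the balance metric, and the level thresholds are illustrative assumptions, not the actual Toolset implementation.

```python
# Illustrative sketch only: the function name, the use of normalized Shannon
# entropy as the balance metric, and the level thresholds are assumptions,
# not the actual Toolset implementation.
import numpy as np
import pandas as pd


def categorical_distribution_bias(df: pd.DataFrame, feature: str) -> dict:
    """Estimate distribution bias of a categorical feature on a five-level scale."""
    counts = df[feature].value_counts(dropna=True)
    probs = counts / counts.sum()
    # Normalized Shannon entropy: 1.0 = perfectly balanced, 0.0 = fully concentrated.
    entropy = float(-(probs * np.log(probs)).sum() / np.log(len(probs))) if len(probs) > 1 else 0.0
    # Map the metric to five bias levels (1 = most balanced, 5 = most biased); cut-offs are hypothetical.
    thresholds = [0.9, 0.75, 0.5, 0.25]
    level = 1 + sum(entropy < t for t in thresholds)
    return {"feature": feature, "metric": "normalized_entropy",
            "value": round(entropy, 4), "bias_level": level}
```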
Upon receiving user input, our Agent system automatically determines the bias type and data type scenario relevant to the current task, based on the user's instructions and the results of data loading and preprocessing. Using Retrieval-Augmented Generation (RAG), the Agent selects suitable bias detection tools in two steps: it first retrieves applicable methods from the Bias Detection Method Library, then directly invokes the predefined Python functions for methods available in the Toolset; for methods not predefined in the Toolset, the Agent automatically generates executable code to perform the detection.
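This two-step selection can be pictured roughly as follows; the retriever interface, the record annotation "tool_name", and the code-execution helper are hypothetical placeholders rather than BiasInspector's actual API.

```python
# Minimal sketch of the two-step selection described above. The retriever
# interface, the record annotation "tool_name", and the code-execution helper
# are hypothetical placeholders rather than BiasInspector's actual API.
PREDEFINED_TOOLS = {
    "categorical_distribution_bias": categorical_distribution_bias,  # e.g., the function sketched earlier
    # ... the remaining predefined detection functions would be registered here
}


def select_and_run_methods(retriever, llm, df, features, user_request, scenario, top_k=3):
    # Step 1: retrieve applicable methods from the Bias Detection Method Library
    # (RAG over the annotated JSON records, filtered by the detected scenario).
    methods = retriever.search(query=user_request, filters={"scenario": scenario}, k=top_k)
    results = []
    for record in methods:
        tool_name = record.get("tool_name")  # assumed annotation linking a record to a Toolset function
        if tool_name in PREDEFINED_TOOLS:
            # Step 2a: invoke the predefined Python function directly.
            results.append(PREDEFINED_TOOLS[tool_name](df, *features))
        else:
            # Step 2b: have the LLM turn the record's procedural steps into executable
            # code, then run it with a sandboxed helper (assumed to exist).
            code = llm.generate_code(steps=record["method"], data_preview=df.head().to_dict())
            results.append(run_generated_code(code, df))
    return results
```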
The Bias Detection Method Library is designed for extensibility, enabling users to easily integrate new methods using large language models through a consistent process. This ensures the Agent continuously benefits from increasingly comprehensive and advanced bias detection techniques in future applications.
Compared to traditional manual approaches, where domain experts typically select limited tools—potentially lacking sensitivity to certain biases—and must manually write or invoke detection code, our Agent system swiftly and automatically identifies multiple suitable tools from an extensive library, significantly enhancing comprehensiveness. Additionally, automating tool invocation and code generation markedly improves detection efficiency. Thus, the automated and expandable toolset enables our Agent to fully leverage its advantages in structured data bias detection.
To construct a comprehensive agent system for structured data bias detection, we developed a complete suite of functionalities, including data preprocessing, bias analysis, result visualization, and reporting. Since practical tasks require dataset-specific strategies, we implemented numerous functional tools suitable for various scenarios. These tools are encapsulated as predefined functions in a unified Toolset for direct Agent invocation. Currently, the Toolset contains 46 distinct tools, detailed in Appendix 8.
Bias detection is typically an iterative reasoning-action process, where the Agent continuously updates execution plans based on user interactions and tool feedback. Detection tasks involve multiple stages, including data preprocessing, bias analysis, and result visualization, each potentially requiring single or combined tool usage. Such complexity imposes significant demands on large language models for planning, outcome interpretation, feedback, and decision-making.
To address potential limitations of large language models and enhance task completion rate and quality, we designed a multi-agent framework comprising a Primary Agent and an Advisor Agent. The Primary Agent directly interacts with users, formulates execution plans, invokes necessary tools, and consults the Advisor Agent at key stages. The Advisor Agent reviews and refines these plans, identifies oversights, recommends tool invocations, analyzes results, and provides strategies for anomalies or errors. Detailed roles of both agents are provided in Appendix 9.
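Schematically, the collaboration between the two agents can be sketched as below; the agent interfaces and stage names are illustrative rather than the framework's actual implementation.

```python
# Schematic sketch of the Primary/Advisor collaboration; the agent interfaces
# and stage names are illustrative, not the framework's actual implementation.
STAGES = ["data_preprocessing", "bias_detection", "visualization_summary"]


def run_detection_task(primary, advisor, user_request, dataset_path):
    plan = primary.draft_plan(user_request, dataset_path)
    plan = advisor.review_plan(plan)                    # refine the plan, flag oversights
    context = {"dataset": dataset_path, "results": {}}
    for stage in STAGES:
        outcome = primary.execute_stage(stage, plan, context)
        feedback = advisor.review_outcome(stage, outcome)   # recommend tools, handle anomalies
        if feedback.requires_revision:
            plan = primary.revise_plan(plan, feedback)
            outcome = primary.execute_stage(stage, plan, context)
        context["results"][stage] = outcome
    report = primary.summarize(context)
    return advisor.refine_summary(report)               # final review before returning to the user
```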
This section describes the operational workflow for bias detection tasks, as illustrated in Figure 2, comprising five key steps: user input, data preprocessing, bias detection and analysis, result visualization and summarization, and user feedback.
Users provide a structured dataset and bias-related questions or task instructions. BiasInspector supports varying expertise levels, accommodating general bias inquiries or explicit tool-specific instructions.
Upon receiving data, the Agent formulates and executes a preprocessing plan, including reading data, identifying features and types, extracting relevant columns, handling missing values and outliers, and constructing a processed subset for downstream analysis.
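A minimal preprocessing sketch, assuming pandas, might look as follows; the concrete policies for missing values and outliers are example choices, not the Agent's fixed behavior.

```python
# Illustrative preprocessing sketch (pandas); the concrete policies for missing
# values and outliers are example choices, not the Agent's fixed behavior.
import pandas as pd


def preprocess(path: str, relevant_columns: list[str]) -> pd.DataFrame:
    df = pd.read_csv(path)
    subset = df[relevant_columns].copy()
    for col in subset.columns:
        if pd.api.types.is_numeric_dtype(subset[col]):
            subset[col] = subset[col].fillna(subset[col].median())
            # Clip extreme outliers to the 1st/99th percentiles (example policy).
            lo, hi = subset[col].quantile([0.01, 0.99])
            subset[col] = subset[col].clip(lo, hi)
        else:
            subset[col] = subset[col].fillna("Unknown").astype("category")
    return subset
```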
The Agent then develops and executes a bias detection plan, selecting suitable methods from the Toolset and Bias Detection Method Library, sequentially executing them, and obtaining numerical metrics indicating bias severity.
The Agent formulates and executes a visualization strategy, selecting suitable representations to clearly illustrate results. It analyzes metrics, summarizes bias severity, and formulates targeted recommendations. All relevant information—including bias types, associated features, detection tools and metrics, bias severity levels, visualizations, and recommendations—is consolidated into a comprehensive PDF detection report.
Finally, the Agent presents detection results through natural language explanations and the PDF report, proactively engaging users for additional comments or requirements, ensuring alignment with user expectations.
Effective evaluation of agent capability in structured data bias detection remains underexplored. Existing data analysis benchmarks [24] inadequately assess diverse bias scenarios and lack accuracy evaluation for bias severity estimation. To address these limitations, we propose a novel benchmark framework. In this section, we first describe the benchmark construction and then discuss two evaluation dimensions: end-result evaluation and intermediate-process evaluation.
We selected five structured datasets widely used in prior bias mitigation research [28]: Adult [29], COMPAS [30], Statlog [31], MIMIC-IV [32], and Student Performance [33]. These datasets cover diverse domains (socioeconomic, criminal justice, finance, healthcare, education), vary from hundreds to tens of thousands of instances, and include 10 to 33 categorical and numerical features. Thus, they are highly suitable for evaluating an agent’s bias detection capability in structured data. Dataset details are provided in Appendix 10.
We selected 100 demographic-related features or feature combinations from the datasets, covering categorical and numerical data. Individual features were classified as potentially exhibiting distribution bias, while feature combinations were classified as involving correlation bias. Using a large language model, we created diverse bias detection queries for each case, varying linguistic expressions to simulate human questioning styles. Queries include explicitly stated bias types (distribution or correlation) and intentionally ambiguous forms, termed implication bias, reflecting real-world user scenarios. Task set details and examples are provided in Appendix 11.
Bias detection in structured datasets involves not only identifying the presence of bias but also accurately quantifying its severity, which is typically challenging. To enable quantitative severity assessment, we defined five bias levels, ranging from 'most balanced' to 'most biased.'
We categorized bias manifestations into five representative scenarios based on combinations of data types (categorical, numerical) and bias types (distribution, correlation). For each scenario, we selected five widely-used bias detection metrics, developed corresponding detection scripts, and mapped metric values to the predefined bias severity levels.
To ensure mapping accuracy, we manually constructed synthetic datasets exhibiting varying degrees of bias, data scales, and feature counts. Through iterative testing and refinement, we verified that our scripts precisely map different bias severities to correct levels. These verified mappings serve as the ground truth for outcome evaluation.
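As an illustration of how such synthetic data could be constructed, the sketch below generates a categorical column with a controllable degree of imbalance; the parameterization is an assumption for illustration only.

```python
# Sketch of constructing a synthetic categorical column with a controllable
# degree of imbalance, as one might use to verify metric-to-level mappings.
# The parameterization is illustrative only.
import numpy as np
import pandas as pd


def make_synthetic_categorical(n_rows: int, n_categories: int, imbalance: float, seed: int = 0) -> pd.DataFrame:
    """imbalance = 0 gives a uniform distribution; values near 1 concentrate mass on one category."""
    rng = np.random.default_rng(seed)
    weights = np.ones(n_categories)
    weights[0] += imbalance * n_categories * 10  # skew probability mass toward the first category
    probs = weights / weights.sum()
    values = rng.choice([f"cat_{i}" for i in range(n_categories)], size=n_rows, p=probs)
    return pd.DataFrame({"feature": values})


# Example check: sweep imbalance levels and confirm each maps to the expected
# bias level, e.g. with a detection tool such as the one sketched earlier.
# for b in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(b, make_synthetic_categorical(5000, 5, b)["feature"].value_counts(normalize=True).round(2).to_dict())
```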
We developed an automated evaluation framework to assess the agent’s bias detection performance from an outcome perspective. For each task, the framework identifies the involved feature or feature combination, maps it to the corresponding bias scenario, and applies five scenario-specific detection tools, each producing a predicted bias level. We select the maximum predicted level as the reference value, assuming it represents the most sensitive bias detection.
We then compare this reference with the agent’s predicted bias level. The absolute difference quantitatively measures the agent’s outcome-level detection accuracy.
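A minimal sketch of this outcome evaluation for a single task, assuming each detection tool returns a bias level as in the earlier sketches:

```python
# Sketch of the outcome evaluation for a single task: run the five
# scenario-specific detection tools, take the maximum predicted level as the
# reference, and score the agent by the absolute difference. Tool signatures
# follow the earlier sketches and are assumptions.
def evaluate_task(scenario_tools, df, features, agent_level: int) -> dict:
    predicted_levels = [tool(df, *features)["bias_level"] for tool in scenario_tools]
    reference = max(predicted_levels)  # the most sensitive detection serves as the reference
    return {
        "reference_level": reference,
        "agent_level": agent_level,
        "absolute_error": abs(agent_level - reference),
    }
```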
When evaluating an agent’s bias detection capability, it is essential to consider not only final outcomes but also the reasoning, planning, and decision-making processes throughout task execution. To achieve a comprehensive assessment, we further evaluate the agent from a process-oriented perspective, focusing on five aspects: user communication, task planning, tool invocation, dynamic plan adjustment, and result analysis.
We developed an agent-based automated evaluation system, defining five performance rating levels (Excellent, Proficient, Adequate, Mediocre, Unsatisfactory) with explicit criteria across these dimensions. This system automatically analyzes the agent’s operational logs and generates a PDF report detailing ratings and supporting rationale for each evaluated task. Further details are provided in Appendix 12.
To validate the automated evaluation’s reliability, we recruited human evaluators to independently review execution logs and assess the agent’s performance along the same five dimensions. Correlation analysis comparing automated ratings with human assessments confirmed the effectiveness and reliability of our evaluation method. Additional details are provided in Appendix 13.
To systematically evaluate BiasInspector, we designed several framework-model combinations and conducted experiments using the benchmark from Section 4.
We evaluated four agent frameworks: (1) the full BiasInspector (Primary and Advisor Agents); (2) single-agent BiasInspector (Primary Agent only); (3) ReAct-based agent [34] equipped with the same toolset as BiasInspector; and (4) Self-reflection agent, relying solely on model reasoning with tool access limited to basic data reading.
We conducted experiments with two widely-used large language models: GPT-4o and Llama 3.3 70B. Decoding temperature was set to 0 for output stability.
Each framework-model combination is evaluated from two perspectives:
End-result evaluation. To measure accuracy, we compute the Mean Absolute Error (MAE) between agent-predicted and ground-truth bias levels across tasks. For intuitive interpretation, we convert MAE into an Average Similarity Score \(S_{\text{avg}}\), defined as \(S_{\text{avg}} = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{|x_i - y_i|}{4}\right)\times 100\%\), where \(x_i\) and \(y_i\) denote the predicted and ground-truth bias levels for task \(i\). Higher scores indicate greater accuracy.
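The score can be computed directly from this definition, as in the brief sketch below.

```python
# Direct translation of the Average Similarity Score defined above.
def average_similarity(predicted: list[int], ground_truth: list[int]) -> float:
    assert len(predicted) == len(ground_truth) and predicted
    scores = [1 - abs(x - y) / 4 for x, y in zip(predicted, ground_truth)]
    return 100 * sum(scores) / len(scores)


# Example: predictions off by one level on two of four tasks yield 87.5%.
# average_similarity([3, 5, 2, 4], [3, 4, 2, 3])
```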
Intermediate-process evaluation. Using our automated GPT o3-mini-based evaluation system, we assess each task across five dimensions: user interaction, planning, tool invocation, dynamic adaptation, and result summarization. We report the average score over all tasks, reflecting overall workflow execution capability.
Table 1: Average Similarity Scores (%) for each framework-model combination.

Model | Task Set | BiasInspector (Multi-Agent) | BiasInspector (Single-Agent) | ReAct-Based Agent | Self-Reflection Agent
---|---|---|---|---|---
GPT-4o | Overall | 77.53 | 76.52 | 67.68 | 60.61
GPT-4o | Distribution | 84.46 | 86.49 | 72.97 | 60.14
GPT-4o | Correlation | 72.70 | 64.80 | 60.86 | 59.54
GPT-4o | Implication | 76.04 | 81.25 | 72.92 | 65.62
Llama 3.3 70B | Overall | 71.97 | 71.46 | 63.13 | 66.67
Llama 3.3 70B | Distribution | 74.32 | 72.30 | 77.03 | 66.89
Llama 3.3 70B | Correlation | 71.79 | 70.51 | 55.13 | 63.46
Llama 3.3 70B | Implication | 68.75 | 70.83 | 55.21 | 71.88
In Table 1, we present Average Similarity Scores for model-framework combinations evaluated on the full task set and subsets (distribution bias, correlation bias, implication bias). This metric effectively reflects bias detection accuracy for each combination.
Our analysis shows BiasInspector consistently achieves 70%-80% similarity scores, significantly outperforming ReAct and Self-Reflection agents. The ReAct Agent shows instability with notable accuracy fluctuations across bias types. While Multi-Agent and Single-Agent BiasInspector perform similarly overall, the Multi-Agent variant achieves 8 points higher accuracy on complex correlation bias tasks with GPT-4o, highlighting collaborative agent benefits.
The GPT series consistently outperforms Llama 3.3 70B when powering BiasInspector. Notably, GPT-4o achieves over 10 percentage points higher accuracy than Llama 3.3 70B on distribution bias tasks in both Multi-Agent and Single-Agent architectures. This underscores that powerful LLMs like GPT-4o better leverage BiasInspector’s planning and tooling capabilities, substantially enhancing structured data bias detection effectiveness.
Figure 3: Intermediate process performance of four agent frameworks.
Figure 3 presents Intermediate Process Performance of all model-agent architecture combinations across six dimensions: Integration, Communication, Planning, Tooling, Adaptivity, and Summarization, clearly illustrating BiasInspector’s superior performance.
BiasInspector notably excels in Communication, Planning, and Summarization, achieving scores over 90 (GPT-4o) and over 80 (Llama 3.3 70B). It also demonstrates strong Tooling and Adaptivity, with minor gaps (1–2 points) from top scores, confirming its consistent excellence across dimensions.
BiasInspector consistently surpasses ReAct agents by 5–15 points in key dimensions like Planning and Tooling, and significantly outperforms Self-Reflection agents. This highlights BiasInspector’s superior design and effective utilization of large language models, explaining its higher practical bias detection accuracy. Additionally, the Multi-Agent BiasInspector slightly outperforms the Single-Agent variant across dimensions, validating the benefits of multi-agent collaboration.
Figure 3 (a) shows GPT-4o-powered agents outperform Llama 3.3 70B agents (Figure 3 (b)) by about 10 points in all dimensions. This confirms advanced LLMs like GPT-4o significantly boost agent performance, effectively leveraged by BiasInspector.
In this work, we introduce BiasInspector, an LLM-driven multi-agent framework for automated bias detection in structured data. Collaborative agents formulate multi-stage task plans based on user requirements, using diverse analytical tools to deliver detailed bias detection results. To address the lack of standardized evaluation for LLM-based bias detection, we propose a comprehensive benchmark with multiple metrics and extensive test cases. Experiments show that BiasInspector achieves exceptional bias detection performance, ensuring robust data fairness and setting a benchmark for future research.
To support flexible and reproducible bias detection, we developed a structured library of detection methods encoded in JSON format. Each method entry includes metadata and a set of structured procedural steps. Specifically, each record contains:
id: A unique identifier for the method entry.
intention: A high-level description of the goal, specifying the data type, analytical approach, and bias context.
method: A dictionary of ordered steps detailing the procedure for detecting bias using tools or algorithms.
title: The title of the reference publication.
article link: A URL linking to the original article.
field: The academic or application domain.
year: The year of publication.
Table 2 presents a sample excerpt from this library.
Table 2: Excerpt from the JSON-based bias detection method library.
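For illustration, a new record following this schema could be appended to the library as in the following sketch; the entry's field values and the file name are hypothetical, not actual library contents.

```python
# Hypothetical example of appending one new record to the library; the field
# values and the file name below are illustrative, not actual library contents.
import json

new_method = {
    "id": "corr_cat_num_example",
    "intention": "Measure correlation bias between a categorical demographic feature "
                 "and a numerical outcome using a standardized effect size.",
    "method": {
        "step_1": "Group the numerical outcome by the categories of the sensitive feature.",
        "step_2": "Compute Cohen's d between the two largest groups.",
        "step_3": "Map the effect size to one of the five bias severity levels.",
    },
    "title": "Example reference publication title (placeholder)",
    "article link": "https://example.org/placeholder",
    "field": "healthcare",
    "year": 2024,
}

with open("bias_detection_method_library.json", "r+", encoding="utf-8") as f:
    library = json.load(f)          # the library is assumed to be a JSON list of records
    library.append(new_method)
    f.seek(0)
    json.dump(library, f, indent=2)
    f.truncate()
```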
We have created four categories of tools, totaling 46 tools in the toolkit, including:
7 commonly used data loading and preprocessing tools, designed to meet various preprocessing needs for structured datasets.
25 bias detection tools, with 5 commonly used metric detection tools provided for each type of bias manifestation.
9 data visualization tools, offering comprehensive chart representations for different data types.
5 miscellaneous tools, designed for user interaction, reading the reference methods library, executing dynamically generated tool code, and generating PDF detection reports.
Table 3 presents each tool and its functional description.
Table 3: Description of tools in the toolset.
Figures 4 and 5 show the excerpted prompt templates for the Primary Agent and the Advisor Agent, respectively.
In the Primary Agent’s prompt template, we explicitly list its task requirements in bullet points, including communicating with the user, developing a detection plan, selecting appropriate tools, and providing the user with a detailed summary of the results. Each task requirement is further elaborated. For example, in the "Develop a detection plan" section, the prompt specifies that the plan should cover data loading and preprocessing, detection and analysis methods, visualization schemes, and result summarization. In the "Select appropriate tools" section, the prompt directs the agent to choose suitable tools from the available toolset and bias detection method library.
In the Advisor Agent’s prompt template, we also clearly outline its task requirements in bullet points, including assessing whether the Primary Agent’s actions align with the user’s intent, optimizing the execution plan, improving tool selection and providing feedback on the execution results, as well as revising the result summary provided to the user. Each task requirement is further explained in detail.
Compared to the excerpted prompt templates shown here, the actual prompt templates further elaborate on each task description. They are continuously refined and optimized by adding examples, emphasizing key sections, and incorporating feedback from runtime debugging and execution results.
Figure 4: Primary Agent prompt template (excerpt).
Figure 5: Advisor Agent prompt template (excerpt).
In our study, we selected five widely used datasets for bias detection tasks across different domains, including socioeconomic status, criminal justice, finance, healthcare, and education. These datasets vary in sample size, number of features, and data characteristics. Each dataset contains both categorical and numerical features, making them suitable for comprehensive bias analysis. The features of each dataset are listed below.
Adult features: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income.
COMPAS features: Person_ID, AssessmentID, Case_ID, Agency_Text, LastName, FirstName, MiddleName, Sex_Code_Text, Ethnic_Code_Text, DateOfBirth, ScaleSet_ID, ScaleSet, AssessmentReason, Language, LegalStatus, CustodyStatus, MaritalStatus, Screening_Date, RecSupervisionLevel, RecSupervisionLevelText, Scale_ID, DisplayText, RawScore, DecileScore, ScoreText, AssessmentType, IsCompleted, IsDeleted.
Statlog features: Status of existing checking account, Duration in months, Credit history, Purpose, Credit amount, Savings account/bonds, Present employment since, Installment rate, Personal status and sex, Other debtors/guarantors, Present residence since, Property, Age in years, Other installment plans, Housing, Number of existing credits at this bank, Job, Number of people liable for maintenance, Telephone, Foreign worker, Credit risk.
MIMIC-IV features: admission_type, hospital_expire_flag, admission_location, discharge_location, patient_insurance, patient_lang, patient_marital, patient_race, patient_gender, patient_age.
Student Performance features: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3.
To systematically evaluate the capabilities of our bias detection agent, we constructed a taskset of 100 diverse prompts derived from the five datasets used in our study. The taskset covers three major categories of bias: Distribution Bias, Correlation Bias, and Implication Bias. These categories reflect common forms of bias that occur in real-world datasets and are essential for evaluating fairness.
For each bias category, we designed prompts that vary in structure and language to reflect human-like diversity in expression. We utilized large language models (LLMs) to paraphrase and diversify the formulations of each prompt, ensuring naturalness and realism in phrasing. The distribution of tasks across datasets and bias types is shown in Table 4.
Dataset | Distribution | Correlation | Implication | Total |
---|---|---|---|---|
Adult.csv | 11 | 9 | 6 | 26 |
COMPAS.csv | 8 | 8 | 5 | 21 |
Statlog.csv | 7 | 8 | 5 | 20 |
MIMIC-IV.csv | 6 | 6 | 6 | 18 |
Student Performance.csv | 5 | 8 | 2 | 15 |
Total | 37 | 39 | 24 | 100 |
Table 5 provides a sample of five representative questions from the taskset. These examples illustrate how the questions are designed to probe distributional characteristics of specific features while maintaining natural and varied phrasing.
Question | Type | Feature(s) | Bias Significance |
---|---|---|---|
Can you check if the age distribution across individuals is balanced, or do certain age groups dominate? | Distribution | age (Numerical) | Imbalanced age groups may skew income-related insights. |
How does the distribution of work class look to you? Are there any work classes that appear to be overrepresented? | Distribution | workclass (Categorical) | Uneven representation can bias analysis of income across sectors. |
From your perspective, is the distribution of education levels spread fairly, or do certain levels dominate? | Distribution | education (Categorical) | Dominance in education levels may affect fairness in outcome evaluation. |
In your view, how does the marital status distribution appear? Are any marital statuses overrepresented? | Distribution | marital-status (Categorical) | Skewed marital status can distort socio-economic impact assessments. |
Do certain occupations dominate the dataset, or is the distribution of occupations relatively even? | Distribution | occupation (Categorical) | Occupation imbalance may lead to biased interpretation of income disparities. |
In this appendix, we provide detailed information regarding the automated evaluation system used to assess the agent’s performance from the intermediate process perspective.
The evaluation agent leverages the GPT-o3-mini model, specifically chosen due to its superior analytical reasoning capability in comparison to GPT-4o and GPT-4o mini, providing more accurate and nuanced evaluation results at a lower API usage cost.
We evaluate the agent’s intermediate processes along five comprehensive dimensions, carefully selected to ensure thorough and holistic performance assessment:
Effective Communication with the User to Clarify Tasks
Comprehensiveness and Thoroughness of Planning
Efficiency in Tool Execution and Dynamic Adjustment
Ability to Dynamically Adjust Plans Based on Execution Results
Clarity and Depth of Results Analysis and Summary
These dimensions collectively ensure a complete assessment of the agent’s capability in performing bias detection tasks.
To facilitate thorough evaluation, we constructed a dedicated toolset enabling the evaluation agent to systematically analyze and rate agent performance. Details of the specific tools provided in the toolset are summarized in Table 6.
Table 6: Automated agent evaluation tools.

Tool | Description
---|---
get_user_input_tool | Captures user input dynamically during an interaction and formats it as a dictionary to be added to the agent’s conversation. |
read_json_log | Reads a JSON formatted log file, extracting and organizing log entries. |
read_markdown_log | Reads a Markdown formatted log file, extracting headers, bold text, and regular text as structured log entries. |
read_bias_report_pdf | Reads a bias detection report in PDF format, extracting both textual and graphical information. |
generate_evaluation_report_pdf | Generates a flexible evaluation report in PDF format, combining text narratives and visual charts as specified in the input. |
Figure 6 presents an example of a refined prompt snippet used by the automated evaluation agent to standardize the evaluation process.
Figure 6: Evaluation Agent Prompt
We provided human evaluators with the same evaluation criteria as those used in the automated agent evaluation system's prompt (an excerpt of these criteria is shown in Figure 6). Specifically, we selected three detection tasks from each of the four agent frameworks (BiasInspector Multi-Agent, BiasInspector Single-Agent, ReAct-Based Agent, and Self-Reflection Agent), each under GPT-4o and Llama 3.3 70B, across various combinations of data types and bias types, resulting in a total of 120 evaluation instances. We then compared the scores assigned by human evaluators and the automated agent evaluation system along the five distinct evaluation dimensions in these instances.
Table 7: Scoring differences between human evaluators and the automated agent evaluation system.

Model | Dimension | BiasInspector (Multi-Agent) | BiasInspector (Single-Agent) | ReAct-Based Agent | Self-Reflection Agent
---|---|---|---|---|---
GPT-4o | Integration | 3.53 | 3.27 | 4.20 | 5.60
GPT-4o | Communication | 4.20 | 4.07 | 3.86 | 4.13
GPT-4o | Planning | 5.67 | 6.13 | 5.53 | 6.27
GPT-4o | Tooling | 2.93 | 3.20 | 9.87 | 31.80
GPT-4o | Adaptivity | 3.80 | 4.13 | 4.47 | 4.53
GPT-4o | Summarization | 4.87 | 5.07 | 5.27 | 4.80
Llama 3.3 70B | Integration | 3.73 | 3.87 | 4.40 | 5.40
Llama 3.3 70B | Communication | 3.80 | 3.93 | 4.07 | 3.60
Llama 3.3 70B | Planning | 6.07 | 6.40 | 5.80 | 5.40
Llama 3.3 70B | Tooling | 2.80 | 2.60 | 10.20 | 24.60
Llama 3.3 70B | Adaptivity | 4.20 | 3.73 | 3.93 | 3.67
Llama 3.3 70B | Summarization | 4.93 | 4.73 | 5.20 | 5.13
As shown in Table 7, we compared the scoring differences between human evaluators and the automated agent evaluation system across all evaluation dimensions. The results demonstrate that the scoring discrepancies between the two evaluation methods generally remained within 5 points across the Integration, Communication, Planning, Adaptivity, and Summarization dimensions, indicating strong reliability and practicality of the automated agent evaluation system. However, in the Tooling dimension, human evaluators assigned lower scores to the ReAct-Based Agent due to its limited quantity and variety of tool usage, and assigned the lowest scores to the Self-Reflection Agent, which lacks tool invocation capability entirely. Although the automated evaluation system also significantly lowered the scores for these two agents, human evaluators applied a more substantial reduction, resulting in more noticeable discrepancies in this dimension. Nevertheless, the automated evaluation system successfully captured these critical differences, as further reflected by the significantly lower scores illustrated in Figure 3. This underscores the effectiveness and sensitivity of the automated agent evaluation system in the Tooling dimension.
Code and data are available at: https://github.com/uscnlp-lime/BiasInspector