LAMBDA: A Large Model Based Data Agent


Houduo Qi, Defeng Sun, Yancheng Yuan\(^*\), Jian Huang\(^1\)


Abstract

We introduce “LAMBDA,” a novel open-source, code-free multi-agent data analysis system that harnesses the power of large models. LAMBDA is designed to address data analysis challenges in complex data-driven applications through innovatively designed data agents that operate iteratively and generatively using natural language. At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user’s instructions and domain-specific knowledge, enhanced by advanced models, while the inspector debugs the code when necessary. To ensure robustness and handle adverse scenarios, LAMBDA features a user interface that allows direct user intervention in the operational loop. Additionally, LAMBDA can flexibly integrate external models and algorithms through our knowledge integration mechanism, catering to the needs of customized data analysis. LAMBDA has demonstrated strong performance on various machine learning datasets. It has the potential to enhance data science practice and the analysis paradigm by seamlessly integrating human and artificial intelligence, making data analysis more accessible, effective, and efficient for individuals from diverse backgrounds. The strong performance of LAMBDA in solving data science problems is further demonstrated in several case studies, presented at https://www.polyu.edu.hk/ama/cmfai/lambda.html.

1 Introduction↩︎

Figure 1: Overview. LAMBDA facilitates efficient data analysis through natural language interactions, bridging the gap between domain experts and the field of data science, which demands extensive coding knowledge.

Over the past decade, the data-driven approach utilizing deep neural networks has motivated the success of artificial intelligence across an extensive range of challenging applications in various fields [1]. Despite these advancements, the current paradigm encounters challenges and limitations in data science applications, particularly in domains that demand extensive expertise and advanced coding knowledge, such as biology [2], healthcare [3], and business [4]. A notable barrier is the lack of effective communication channels between domain experts and sophisticated AI models [5]. To address this issue, we introduce “LAMBDA,” a new open-source, code-free multi-agent data analysis system designed to overcome this dilemma. LAMBDA aims to create a much-needed medium that fosters seamless interaction between domain knowledge and the capabilities of AI in data science. Our main objectives in developing LAMBDA are as follows.

(a) Crossing the coding barrier: Coding has long been recognized as a significant barrier preventing domain experts without a computer science background from leveraging powerful AI tools effectively [3]. LAMBDA addresses this challenge by enabling users to interact with data agents through natural language instructions, thereby offering a coding-free experience. This approach significantly lowers the barriers to entry for tasks in data science, such as data analysis and data mining, while simultaneously enhancing efficiency, making them more accessible to professionals across various disciplines.

(b) Integrating human intelligence and AI: The existing paradigm of data analysis is confronted with a challenge due to the lack of an efficient intermediary that connects human intelligence with artificial intelligence [5]. On one hand, AI models often lack an understanding of the unlearned domain knowledge required for specific tasks. On the other hand, domain experts find it challenging to integrate their expertise into AI models to enhance their performance [6]. LAMBDA provides a solution to this problem. With well-designed interface templates, the agents can access external resources like algorithms or models. This integration ensures that domain-specific knowledge is effectively leveraged, meets the need for customized data analysis, and enhances the agent’s ability to perform complex tasks with higher accuracy and relevance.

(c) Reshaping data science education: LAMBDA has the potential to become an interactive platform that can transform data science education. It offers educators the flexibility to tailor their teaching plans and seamlessly integrate the latest research findings. This adaptability makes LAMBDA an invaluable tool for educators seeking to provide cutting-edge, personalized learning experiences. Such an approach stands in contrast to the direct application of models like GPT-4 [7], [8], offering a unique and innovative educational platform.

Beyond these, the design of LAMBDA also emphasizes reliability and portability. Reliability refers to LAMBDA’s ability to handle data analysis tasks stably and properly. Portability means that LAMBDA is compatible with various LLMs, which ensures that LAMBDA can always be enhanced by the latest state-of-the-art models. To free users from routine tasks such as document writing, LAMBDA also provides automatic analysis report generation.

While GPT-4 has demonstrated state-of-the-art performance in advanced data analysis, its closed-source nature constrains its adaptability to the rapidly expanding needs of data science applications and specialized educational fields. Additionally, concerns regarding data privacy [9] and security risks are inherent in the present configuration of GPT-4. In contrast, by utilizing the open-sourced LAMBDA, users can eliminate concerns regarding data privacy while benefiting from enhanced flexibility and convenience in integrating domain knowledge, installing packages, and utilizing computational resources.

LAMBDA demonstrates superior performance on various machine learning (ML) datasets, with notable accuracies of 100%, 98.07%, and 98.89% on the NHANES, Breast Cancer, and Wine datasets, respectively. To sum up, the main characteristics of LAMBDA are as follows: (1) a coding-free, natural language interface; (2) integration of human intelligence and AI; (3) reliability; and (4) automatic analysis report generation.

This paper begins with the background and foundational works in Section 2. Section 3 provides a detailed method of the proposed LAMBDA. To evaluate its effectiveness, we present our experiments and results in Section 4. Additionally, Section 5 illustrates the case study of LAMBDA in various scenarios, including data analysis, integrating human intelligence, and interactive education. The paper concludes with a summary in Section 6. Supplementary materials, including an initial idea, prompts, case studies, and experimental settings, are provided in the Appendix.

2 Background and Foundations↩︎

In recent years, the rapid progress in Large Language Models (LLMs) has brought boundless possibilities to the field of artificial intelligence. Notable examples of LLMs include GPT-3 [10], GPT-4 [7], PaLM [11], LLaMA [12], and Qwen [13]. LLMs demonstrate an outstanding ability to understand, generate, and apply natural language. Benefiting from this revolution, LLM-powered agents (LLM agents) have been developed to automatically solve problems in various domains such as search engines, software engineering, gaming, recommender systems, and scientific experiments [14]–[16]. LLM agents are usually guided by a chain of thought (CoT) framework such as ReAct [17] and are capable of using tools such as APIs, code interpreters, and retrievers. These works motivated the design of LAMBDA in two ways: function calling and the code interpreter.

2.1 Enhancing LLMs with Function Calling↩︎

The integration of external APIs or tools into LLMs is known as function calling, which means that LLMs can use tools, selected by their functional capabilities, to handle tasks [18], [19]. The process can be summarized as follows: first, the LLM maps the user’s instruction to functions based on the function annotations; then, the program executes the selected functions, and the LLM composes a final answer from the execution results [19], [20]. [21] surveys the current paradigm of tool learning with LLMs and shows its advances in diverse applications such as programming, calculators, and weather inquiry. However, in the data science scenario, we conjecture that the function calling method cannot perform well, owing to the following drawbacks (a minimal sketch of the function-calling round trip follows the list):

  • Conjecture A: In the data science scenario, numerous APIs have complex interrelationships and extensive annotations. We hypothesize that these lengthy API annotations may produce sequences that exceed the maximum context length of LLMs, leading to the truncation of both API details and user messages.

  • Conjecture B: We speculate that the model’s ability to select the correct API diminishes as the number of APIs grows, owing to the increased complexity of the choice. A wrong tool choice leads directly to a wrong result and answer.
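For illustration, the round trip described above looks roughly as follows with an OpenAI-compatible chat API; the `describe_data` tool, its schema, and the returned statistics are hypothetical stand-ins rather than APIs from our system.

```python
# A minimal sketch of a generic function-calling round trip.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "describe_data",  # hypothetical data science API
        "description": "Return summary statistics for one dataset column.",
        "parameters": {
            "type": "object",
            "properties": {"column": {"type": "string"}},
            "required": ["column"],
        },
    },
}]  # every additional API adds annotation tokens to the context (Conjecture A)

messages = [{"role": "user", "content": "Summarize the 'age' column."}]
first = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)

# The model picks a tool (selection gets harder as the tool list grows,
# Conjecture B); the host program executes it and returns the result so the
# model can compose the final answer.
call = first.choices[0].message.tool_calls[0]
result = {"mean": 46.2}  # stand-in for actually running the selected API
messages += [first.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}]
final = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
print(final.choices[0].message.content)
```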

2.2 Powering LLMs by Code Interpreter↩︎

Equipping LLMs with code interpreters enables them to execute code and obtain the execution results [20], [22]. [23]–[25] demonstrate the code interpreter capabilities in ChatGLM, making it capable of programming tasks such as complex calculations and figure drawing. [26] ingeniously utilized the code interpreter to solve mathematics problems, achieving significant progress. However, the code in data science problems is more complex and challenging than in the aforementioned domains. If the correctness of the code cannot be guaranteed, the reliability of the code interpreter approach is compromised. To address this issue, we propose a self-correction mechanism that enhances reliability by enabling our agents to learn from their failures.

2.3 Multi-agent Collaboration↩︎

A multi-agent system consists of numerous autonomous agents that collaboratively engage in planning, discussion, and decision-making, mirroring the cooperative nature of human group work in problem-solving tasks [14]. Each agent plays a role with unique capabilities, objectives, and perceptions, operating either independently or collectively to tackle complex tasks or resolve problems [27]. [28]–[30] proposed agent systems organized as a software company, consisting of agents such as programmers, code reviewers, and test engineers, which can complete the entire software development process in under seven minutes. [31] introduced an agent hospital that simulates the entire process of treating illness and achieves state-of-the-art results on respiratory diseases. LAMBDA likewise simulates real-world data science workflows through the collaboration of multiple agents, thereby enhancing the stability of the system.

2.4 Retrieval Augmented Generation↩︎

Retrieval Augmented Generation (RAG) is a technique that retrieves from external sources to improve the accuracy and reliability of LLM responses [20], [32]–[34]. Resources are split into fragments, embedded as vectors, and stored in a vector database. RAG first queries the vector database, matching document fragments relevant to the user’s query by computing similarity; these fragments are then used to improve the accuracy of the responses generated by the LLM [32]. This approach is often used to infuse external knowledge and reduce hallucinations [35]. However, the general RAG workflow does not work well in some data science scenarios. First, the user’s instructions and the relevant code fragments may not exhibit significant similarity in the representation space, leading to inaccurate retrieval. Second, the varying lengths of code fragments can also affect the final search results; for example, the code of a customized optimization algorithm can be quite long, which makes it challenging to match the whole correct code fragment. In LAMBDA, we design a KV knowledge base to solve this issue in our scenario and integrate human intelligence into AI.

3 Methodology↩︎

Our proposed multi-agent data analysis system (LAMBDA) consists of two agents that cooperate seamlessly to solve data analysis tasks through natural language, as shown in Figure 2. The macro workflow involves writing code based on user instructions and subsequently executing that code. An alternative design based on function calling is presented in Section 7.1; through a comparative study, we discuss the dilemma of the function calling method and the merits of the multi-agent collaboration system in Section 4.1.

Figure 2: Framework of the proposed LAMBDA. We utilize cloud storage to cache the dataset uploaded by the user and the files generated by the system. This enables LAMBDA to persist the whole conversation, including data, figures, and models.

3.1 Overview↩︎

LAMBDA is structured around two core agent roles: the "programmer" and the "inspector", who are tasked with code generation and evaluation respectively. When a user submits an instruction, the programmer agent writes code based on the provided instructions and dataset. This code is then executed within the kernel environment of the host system. Should any errors arise during execution, the inspector intervenes, offering suggestions for code refinement. The programmer takes these suggestions into account, revises the code, and resubmits it for re-evaluation. This iterative cycle continues until the code runs error-free or a preset maximum number of attempts is reached. In order to cope with unfavorable situations and enhance its reliability and flexibility, a human intervention mechanism is integrated into the workflow. This feature allows users to directly modify the code and intervene when deemed necessary. The collaboration algorithm is demonstrated in Algorithm 3.
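This loop can be summarized in pseudocode as follows; helper names such as `programmer`, `inspector`, and `kernel_execute` are stand-ins for the corresponding components rather than LAMBDA's actual implementation.

```python
# Sketch of the programmer-inspector collaboration loop (cf. Figure 3).
MAX_ATTEMPTS = 5

def solve(instruction: str) -> str:
    answer = programmer(instruction)          # A_n: programmer's reply
    code = extract_code(answer)               # C_n: extracted code block
    if code is None:
        return answer                         # no code: reply goes straight to the user
    for _ in range(MAX_ATTEMPTS):
        result, error = kernel_execute(code)  # run in the host kernel
        if error is None:
            return programmer_summarize(instruction, result)  # final response R
        suggestion = inspector(code, error)   # S_n: inspector's revision suggestions
        code = extract_code(programmer_revise(code, error, suggestion))
    return request_human_intervention(code)   # C_h: the user edits the code directly
```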

Figure 3: Multi-agent collaboration. Definitions: \(A_n\) and \(C_n\) are the answer and the extracted code produced by the programmer in iteration \(n\). We assume each \(A_n\) contains \(C_n\); otherwise, the programmer’s reply is returned to the user directly. \(r\) is the execution result, with \(E\) indicating an error; \(S_n\) are the suggestions provided by the inspector in iteration \(n\); \(C_h\) is code written by a human. The final response is denoted as \(R\).

3.2 Programmer↩︎

The primary responsibility of the programmer is to write code and respond to the user. Upon the user’s dataset upload, the programmer receives a tailored system prompt that outlines the programmer’s role, environmental context, and the I/O formats. This prompt is augmented with examples to facilitate few-shot learning for the programmer. Specifically, the system prompt encompasses the user’s working directory, the storage path of the dataset, the dimensions of the dataset, the name of each column, the type of each column, information on missing values, and statistical description.

The programmer’s workflow can be summarized as follows: initially, the programmer writes code based on instructions from the user or the inspector; subsequently, the program extracts code blocks from the programmer’s output and executes them in the kernel; finally, the programmer generates a final response based on the execution results and communicates it to the user. This final response consists of a summary and suggestions for the next steps.
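As an illustration, the dataset context described above could be assembled roughly as follows with pandas; this is a sketch of the idea, not LAMBDA's exact prompt template (the actual prompts appear in Section 7.3).

```python
# Sketch: building the dataset portion of the programmer's system prompt.
import pandas as pd

def dataset_context(path: str) -> str:
    df = pd.read_csv(path)
    return "\n".join([
        f"Dataset path: {path}",
        f"Shape: {df.shape[0]} rows x {df.shape[1]} columns",
        f"Columns and types:\n{df.dtypes.to_string()}",
        f"Missing values per column:\n{df.isna().sum().to_string()}",
        f"Statistical description:\n{df.describe().to_string()}",
    ])
```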

3.3 Inspector and Self-correcting Mechanism↩︎

The inspector’s primary role is to provide reasonable modification suggestions when errors occur in code execution. The prompt of the inspector includes the code written by the programmer during the current dialogue round and the error messages from the kernel. The inspector then offers actionable revision suggestions to the programmer; the resulting suggestion prompt contains the erroneous code, the kernel error messages, and the inspector’s suggestions. This collaborative process between the two agents iterates for several rounds until the code executes successfully or the maximum number of attempts is reached. The self-correcting mechanism thus enables the programmer and the inspector to make multiple attempts in case of errors. A case of the self-correcting mechanism is demonstrated in Figure 16. Our experimental results in Table 2 demonstrate that incorporating the inspector significantly enhances reliability.
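For concreteness, the suggestion prompt described above might be assembled along these lines; the wording here is illustrative, with LAMBDA's actual prompts given in Section 7.3.

```python
# Sketch: assembling the inspector's prompt from a failed round.
def inspector_prompt(code: str, error_message: str) -> str:
    return (
        "The following code failed during execution.\n"
        f"Code:\n```python\n{code}\n```\n"
        f"Kernel error:\n{error_message}\n"
        "Provide concrete, actionable suggestions to fix the code. "
        "Do not rewrite the full solution; identify the cause and the change needed."
    )
```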

3.4 Integrating Human Intelligence and AI↩︎

Beyond leveraging the inherent knowledge of LLMs, LAMBDA is further enhanced to integrate human intelligence through external resources such as customized algorithms and models from users. Because general RAG methods face challenges in data science scenarios, stemming from the potential lack of clear correlation between user instructions and code fragments in the representation space as well as from the varying lengths of code fragments, we design a key-value (KV) knowledge base to store the code resources.

Figure 4: Knowledge matching in LAMBDA. The process selects the codes from the knowledge base by calculating the similarity between descriptions and the instruction.

Formally, the knowledge base is defined as \[\mathcal{K} = \{(d_i, c_i) \mid i = 1, 2, \ldots, n\}\] where \(\mathcal{K}\) is the knowledge base, \(d_i\) represents the description of the \(i\)-th piece of knowledge, and \(c_i\) represents the corresponding source code.

When the user issues an instruction \(ins\), an embedding model (denoted as \(\mathcal{F}\)) encodes all descriptions in the knowledge base as well as \(ins\). The embedding vectors for the descriptions and the instruction are denoted by \(\mathbf{e}_{d_i}\) and \(\mathbf{e}_{ins}\), respectively. The cosine similarity between them is computed, and the entry with the highest similarity is selected as the matched knowledge, provided its score exceeds a threshold \(\theta\).

The embeddings are computed as \[\mathbf{e}_{d_i} = \mathcal{F}(d_i) \quad \forall i \in \{1, 2, \ldots, n\}, \quad \mathbf{e}_{ins} = \mathcal{F}(ins)\]

The similarity \(S_i\) between description and instruction can be computed by cosine similarity as \[S_i(\mathbf{e}_{d_i}, \mathbf{e}_{ins}) = \frac{\mathbf{e}_{d_i} \cdot \mathbf{e}_{ins}}{\|\mathbf{e}_{d_i}\| \|\mathbf{e}_{ins}\|} \quad \forall i \in \{1, 2, \ldots, n\}\]

We set the matching threshold \(\theta = 0.5\). The matched knowledge \(k\) with the highest \(S_i\) is selected, provided it satisfies \(S_i > \theta\): \[k = c_{i^*}, \quad i^* = \arg\max_{i} \left( S_i(\mathbf{e}_{d_i}, \mathbf{e}_{ins}) \cdot \mathbf{1}_{\{S_i(\mathbf{e}_{d_i}, \mathbf{e}_{ins}) > \theta\}} \right)\]
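A minimal sketch of this matching step might look as follows; `embed` stands in for the embedding model \(\mathcal{F}\), and the helper names are ours rather than LAMBDA's actual implementation.

```python
# Sketch of KV knowledge matching: cosine similarity over description embeddings.
import numpy as np

THETA = 0.5  # matching threshold used in the paper

def match_knowledge(instruction, knowledge, embed):
    """knowledge is a list of (description d_i, code c_i) pairs."""
    e_ins = embed(instruction)                           # e_ins
    e_desc = np.stack([embed(d) for d, _ in knowledge])  # e_{d_i}, shape (n, dim)
    sims = e_desc @ e_ins / (
        np.linalg.norm(e_desc, axis=1) * np.linalg.norm(e_ins)
    )                                                    # cosine similarities S_i
    best = int(np.argmax(sims))
    if sims[best] > THETA:
        return knowledge[best][1]                        # matched code c_{i*}
    return None                                          # no knowledge injected
```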

The knowledge \(k\) is then injected via in-context learning (ICL) for the LLM to generate the answer \(\hat{A}\). Formally, given a query \(q\), matched knowledge \(k\), a set of demonstrations \(D = \{(q_1, k_1, a_1), (q_2, k_2, a_2), \ldots, (q_n, k_n, a_n)\}\), and the LLM \(\mathcal{M}\), the model estimates the probability \(\mathcal{P}(a \mid q, k, D)\) and outputs the answer \(\hat{A}\) that maximizes this probability: \[\hat{A} \gets \mathcal{M}(q, k, D)\]

By integrating \(k\) through ICL, the model effectively combines retrieved knowledge with contextual learning to provide responses that are both informed and contextually appropriate. To sum up, the knowledge integration mechanism enables LAMBDA to perform customized data science tasks. It also brings flexibility to meet the demands of some complex data analysis problems.

3.5 Report Generation↩︎

LAMBDA can generate analysis reports based on the dialogue history. The report typically includes data processing steps, data visualizations, model descriptions, and evaluation results. We offer various report templates for users to select from, and LAMBDA produces reports in the desired format via ICL. This feature enables users to concentrate on high-value tasks rather than spending time and resources on report writing and formatting. A sample case can be found in Figure 22.

Overall, the programmer agent, the inspector agent, the self-correcting mechanism, and the human-in-the-loop design provide LAMBDA with reliability. Knowledge integration makes LAMBDA scalable and flexible. Besides, to make LAMBDA portable, we provide an OpenAI-style interface. This implies that most LLMs, once deployed via open-source frameworks such as vLLM [36] and LLaMA-Factory [37], are compatible with our system. Prompts and cases are provided in Section 7.3.
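For instance, assuming a locally served model (the endpoint, API key, and model name below are placeholders), swapping the backing LLM reduces to pointing an OpenAI-style client at the new endpoint:

```python
# Sketch: switching the backing LLM via an OpenAI-compatible endpoint.
# A model served by vLLM (e.g. `vllm serve Qwen/Qwen1.5-32B-Chat`) exposes /v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-32B-Chat",
    messages=[{"role": "system", "content": "You are the programmer agent."},
              {"role": "user", "content": "Load data.csv and report missing values."}],
)
print(resp.choices[0].message.content)
```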

4 Experiments and Results↩︎

We first verify the conjectures about the function calling method (Section 2.1) in Section 4.1. We then conduct an ablation study on the proposed LAMBDA to show the impact and performance of each agent. Lastly, we evaluate LAMBDA on several machine learning (ML) datasets, recording its performance and whether human intervention is required in the process.

4.1 Challenges of the Function Calling Method↩︎

We estimate the maximum number of APIs that several open-source LLMs can handle in the data science scenario, using the average length of the APIs we pre-defined. Figure 5 illustrates the results. Qwen1.5 and Mistral-v0.1 [38] are specifically designed to handle lengthy sequences and can manage 400 and 372 APIs, respectively. However, general-purpose LLMs such as LLaMA2, LLaMA3, Mistral-v0.2, Qwen1, ChatGLM2, and ChatGLM3 can process fewer than 100 APIs, posing a challenge for applications that require a larger number of APIs, such as data science tasks.

Figure 5: Average token length for one API and maximum number of APIs each LLM can process.
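The estimate itself is a simple budget calculation; the sketch below illustrates the idea, with the window size, reserved budget, and average annotation length as placeholder values rather than the paper's measured figures.

```python
# Back-of-the-envelope estimate of how many API annotations fit in a context
# window. All numbers below are illustrative placeholders, not measured values.
def max_apis(context_window: int, avg_api_tokens: int, reserved: int = 1024) -> int:
    """APIs that fit once room is reserved for the user message and reply."""
    return (context_window - reserved) // avg_api_tokens

# e.g. a 32k-token window with ~80-token annotations fits roughly 396 APIs
print(max_apis(32768, 80))  # -> 396
```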

To investigate the impact of the number of APIs on the accuracy with which LLMs select them, we constructed a dataset comprising 100 commonly used APIs in the data science scenario. Through few-shot learning, we generated 880 testing instructions aligned with these APIs using Qwen1.5-110B. We then segmented both the APIs and the testing instructions into intervals of 10 functions for analysis. Details of the evaluation dataset are shown in Table 1, and the results are shown in Figure 6.

Table 1: Number of APIs and corresponding instructions in the evaluation dataset.

| APIs | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Instructions | 74 | 163 | 268 | 352 | 446 | 525 | 614 | 684 | 806 | 880 |

Figure 6: The accuracy of API selection by Qwen1.5, which is used for these experiments because it can process the largest number of APIs in our setting.

In summary, the function calling method exhibits several significant drawbacks. Firstly, the labor-intensive process of defining numerous APIs proves inefficient. Secondly, the static nature of APIs hinders adaptability to diverse and evolving user demands. Thirdly, extensive API annotations can occupy a substantial portion of the input sequence, potentially leading to truncation risks. Lastly, as the number of APIs increases, the model’s accuracy in correct selection decreases, thereby introducing potential inaccuracies in responses.

4.2 Ablation Study on LAMBDA↩︎

To assess the reliability and performance of each agent in LAMBDA, we designed an ablation study based on the heart disease dataset [39], which naturally poses challenges because it contains missing values. We utilized Qwen1.5-110B to generate instructions for related tasks, yielding 454 instructions after filtering. We evaluated the execution pass rate of a single programmer agent and of multiple agents (programmer and inspector), respectively. The results are summarized in Table 2.

Table 2: Single agent versus multiple agents. The percentage in brackets is the improvement rate over the single agent. Both the programmer and the inspector are implemented with Qwen1.5-32B in this experiment.

| Agents | Passing Rate (%) |
|---|---|
| Programmer agent only | 68.06 |
| Programmer + inspector | 95.37 (+40.13%) |

The results show a significant gap in passing rate between using the programmer agent alone and incorporating the inspector. The programmer agent alone achieved a passing rate of 68.06%, while adding the inspector increased the passing rate to 95.37%, an improvement of (95.37 − 68.06) / 68.06 ≈ 40.13% over the single-agent setup. This experiment verifies the crucial role of collaborative agents in enhancing the reliability and robustness of LAMBDA. By leveraging complementary strengths in error suggestion and correction, the multi-agent collaboration approach not only improves the code passing rate but also reduces the frequency of human intervention in complex data analysis tasks.

4.3 Experiments on Machine Learning Datasets↩︎

To assess practical performance on real-world data science tasks, we evaluated LAMBDA on several ML datasets (listed in Section 7.2), recording its performance and noting whether human involvement was needed. Accuracy was used as the evaluation metric for classification tasks, while mean squared error (MSE) was used for regression tasks. The results are presented in Table 3.

  • AIDS Clinical Trials Group Study 175 provides healthcare statistics and categorical data on AIDS patients. Released in 1996, it includes 2139 instances with 23 features, primarily used for predicting patient mortality within a specified time frame [40].

  • National Health and Nutrition Examination Survey 2013-2014 Age Prediction Subset (NHANES) is derived from the CDC’s comprehensive health and nutrition survey. This subset, with 6287 instances and 7 features, focuses on predicting respondents’ age groups (senior or non-senior) using physiological, lifestyle, and biochemical data [41].

  • Breast Cancer Wisconsin (Diagnostic) comprises 569 instances and 30 features. It classifies patients into categories of malignant or benign [42].

  • Wine contains the results of a chemical analysis of wines from a specific region in Italy, derived from three different cultivars. It includes 178 instances and 13 features, focusing on the quantities of various constituents found in the wines [43].

  • Concrete Compressive Strength contains 1030 instances with 8 features, examining the highly nonlinear relationship between concrete compressive strength, age, and ingredients [44].

  • Combined Cycle Power Plant features 9568 data points collected over six years, with 4 features, to analyze the performance of a power plant under full load conditions [45].

  • Abalone concerns predicting the age of abalone from physical measurements; it comprises 4177 instances with 8 features [46].

  • Airfoil Self-Noise aims to predict scaled sound pressure levels; it includes 1503 instances with 5 features derived from aerodynamic and acoustic tests of airfoil blade sections in an anechoic wind tunnel [47].

Table 3: Performance of LAMBDA on the ML datasets. Classification results are reported as accuracy; regression results are reported as MSE (lower is better).

Classification (accuracy):

| Model | AIDS | NHANES | Breast Cancer | Wine |
|---|---|---|---|---|
| Logistic Regression | 86.54% | 99.43% | 98.07% | 98.89% |
| SVM | 88.45% | 98.82% | 97.72% | 98.89% |
| Neural Network | 88.82% | 99.91% | 97.82% | 82.60% |
| Decision Tree | 87.70% | 100.00% | 94.26% | 92.14% |
| Random Forest | 89.29% | 100.00% | 96.84% | 98.33% |
| Bagging | 89.62% | 100.00% | 96.49% | 96.65% |
| Gradient Boost | 89.20% | 100.00% | 96.84% | 96.65% |
| XGBoost | 89.67% | 100.00% | 97.54% | 95.54% |
| AdaBoost | 88.92% | 100.00% | 97.72% | 93.89% |
| Best Score | 89.67% | 100.00% | 98.07% | 98.89% |
| Human involved | No | No | No | No |

Regression (MSE):

| Model | Concrete | Power Plant | Abalone | Airfoil |
|---|---|---|---|---|
| Linear Regression | 0.4596 | 0.0714 | 0.5086 | 0.5717 |
| Lasso | 0.5609 | 0.0718 | 0.8042 | 0.5738 |
| SVR | 0.4012 | 0.0534 | 0.4542 | 0.3854 |
| Neural Network | 0.2749 | 0.0612 | 0.4551 | 0.4292 |
| Decision Tree | 0.5242 | 0.0551 | 0.5566 | 0.3823 |
| Random Forest | 0.4211 | 0.0375 | 0.4749 | 0.2655 |
| Gradient Boost | 0.3414 | 0.0315 | 0.4778 | 0.2528 |
| XGBoost | 0.3221 | 0.0319 | 0.4778 | 0.2741 |
| CatBoost | 0.2876 | 0.0325 | 0.4795 | 0.2529 |
| Best Score | 0.2749 | 0.0315 | 0.4542 | 0.2528 |
| Human involved | No | No | No | No |

The results in Table 3 demonstrate superior performance in executing machine learning tasks. For the classification tasks, LAMBDA achieved the highest accuracies of 89.67%, 100%, 98.07%, and 98.89% on the AIDS, NHANES, Breast Cancer, and Wine datasets, respectively. For the regression tasks, it achieved the lowest MSEs of 0.2749, 0.0315, 0.4542, and 0.2528, respectively. This performance showcases its comprehensiveness in applying various models across diverse data science scenarios. Moreover, no human involvement was required in the entire process of these experiments, which validates that LAMBDA effectively overcomes the coding barrier and bridges the gap between data science and human experts lacking coding knowledge. To sum up, the above experimental results indicate that LAMBDA can serve as an efficient and reliable data agent, assisting individuals across various industries in handling data science tasks.

5 Examples↩︎

We provide three examples to demonstrate the abilities of LAMBDA in data analysis, integrating human intelligence, and education, respectively.

Data Analysis: We simulate scenarios where the user asks LAMBDA to conduct different tasks, including data preprocessing, data visualization, and model training, on the provided Iris dataset [48]. LAMBDA consistently provides accurate responses. More importantly, LAMBDA further generates an analysis report based on the full set of user instructions. The demo is presented at https://www.polyu.edu.hk/ama/cmfai/files/lambda/lambda.mp4.

Integrating Human Intelligence: We demonstrate knowledge integration in LAMBDA by computing the nearest correlation matrix using the Quadratically Convergent Newton Method [49]. We first show the limitations of GPT-4 on this task, highlighting the value of LAMBDA by comparison. The demo is presented at https://www.polyu.edu.hk/ama/cmfai/files/lambda/knw.mp4.

Interactive Education: We consider an educational situation where the teacher uses LAMBDA to design the curriculum and the students use LAMBDA to complete an exercise based on the Abalone dataset. This educational support improves the efficiency of both teaching and learning. The demo is presented at https://www.polyu.edu.hk/ama/cmfai/files/lambda/LAMBDA_education.mp4.

6 Conclusion↩︎

In this work, we introduce LAMBDA, an open-source multi-agent data analysis system that integrates human intelligence with AI. Experimental results demonstrate that LAMBDA achieves satisfactory performance in handling data analysis tasks. In the future, it can be further enhanced with planning and reasoning techniques. Our results and examples highlight the great potential of LAMBDA to enhance data science practice and education. By bridging the gap between human expertise and AI capabilities, LAMBDA aims to democratize data science and analysis, fostering a more inclusive environment for innovation and discovery.

7 Appendix↩︎

7.1 Initial Idea of Function Calling Based Agent System↩︎

The first idea that came to our mind was function calling. We developed extensive APIs that encompass a wide range of data processing and machine learning functionalities, including statistical descriptions (e.g., mean, median, standard deviation), encoding schemes (e.g., one-hot encoding, ordinal encoding), data partitioning, and model training (e.g., logistic regression, decision tree). We utilized five function libraries to build these APIs, each tailored for different purposes: the Data Description Library, Data Visualization Library, Data Processing Library, Modeling Library, and Evaluation Library. Each library caches variables such as processed data and models throughout the program’s lifecycle. The framework and workflow are illustrated in Figure 7.

Figure 7: Agent system design by the function calling method. The FCL means function calling library.

We implemented the function calling service with ReAct. Specifically, the LLM is prompted to generate text only up to the "Observation" section and to halt generation at that point. This is essential because the "Observation" section must contain the outcome of the API execution, which prevents the LLM from fabricating results autonomously. The details are depicted in Figure 8.

Figure 8: Workflow of function calling service, demonstrated by Qwen1.5 and ReAct.

7.2 Datasets and Metrics in the Study↩︎

Accuracy is defined as the ratio of the number of correctly classified instances to the total number of instances. It is given by the formula:

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

where \(TP\) is True Positives, \(TN\) is True Negatives, \(FP\) is False Positives, and \(FN\) is False Negatives.

Mean Squared Error (MSE) is defined as the average of the squared differences between the predicted values and the actual values. It is given by the formula:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

where \(n\) is the number of data points, \(y_i\) is the actual value, and \(\hat{y}_i\) is the predicted value.
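Both metrics can be computed directly with scikit-learn; the toy values below are illustrative only.

```python
# Sketch: computing the two evaluation metrics with scikit-learn.
from sklearn.metrics import accuracy_score, mean_squared_error

y_true_cls, y_pred_cls = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))        # 0.75 = (TP + TN) / total

y_true_reg, y_pred_reg = [3.0, 2.5, 4.0], [2.8, 2.7, 3.6]
print(mean_squared_error(y_true_reg, y_pred_reg))    # 0.08 = mean squared residual
```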

| Dataset | Usage |
|---|---|
| AIDS Clinical Trials Group Study 175 | Classification - ML |
| NHANES | Classification - ML |
| Breast Cancer Wisconsin | Classification - ML |
| Wine | Classification - ML |
| Concrete Compressive Strength | Regression - ML |
| Combined Cycle Power Plant | Regression - ML |
| Abalone | Regression - ML, Education Case Study |
| Airfoil Self-Noise | Regression - ML |
| Iris | Classification - Data Analysis Case Study |
| Heart Disease | Ablation Study |

7.3 Prompt in LAMBDA↩︎

Figure 9: Prompt for programmer.

Figure 10: Prompt for self-correcting mechanisms.

Figure 11: Prompt for report generation.

Figure 12: Prompt for knowledge integration.

7.4 Case Study↩︎

Figure 13: Case study of data analysis.

Figure 14: Cont. Case study of data analysis.

Figure 15: Cont. Case study of data analysis.

Figure 16: A case of self-correcting mechanism in LAMBDA.

Figure 17: A case of integrating human intelligence.

Figure 18: Cont. A case of integrating human intelligence.

Figure 19: Case study of using LAMBDA in education. The teacher will give a lecture on Lasso.

Figure 20: Cont. Case study of using LAMBDA in education. The student learns from the lecture and completes the assignment on Lasso.

Figure 21: Cont. Using LAMBDA in education. The student learns from the lecture and completes the assignment on Lasso.

Figure 22: Sample case of report generation.

7.5 Experiment Settings↩︎

References↩︎

[1]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[2]
Tracey L Weissgerber, Vesna D Garovic, Jelena S Milin-Lazovic, Stacey J Winham, Zoran Obradovic, Jerome P Trzeciakowski, and Natasa M Milic. Reinventing biostatistics education for basic scientists. PLoS Biology, 14(4):e1002430, 2016.
[3]
Bentley James Oakes, Michalis Famelis, and Houari Sahraoui. Building domain-specific machine learning workflows: A conceptual framework for the state of the practice. ACM Transactions on Software Engineering and Methodology, 33(4):1–50, 2024.
[4]
Claus Weihs and Katja Ickstadt. Data science: the impact of statistics. International Journal of Data Science and Analytics, 6:189–194, 2018.
[5]
Soya Park, April Yi Wang, Ban Kawas, Q Vera Liao, David Piorkowski, and Marina Danilevsky. Facilitating knowledge sharing from domain experts to data scientists for building nlp models. In Proceedings of the 26th International Conference on Intelligent User Interfaces, pages 585–596, 2021.
[6]
Tirtharaj Dash, Sharad Chitlangia, Aditya Ahuja, and Ashwin Srinivasan. A review of some techniques for inclusion of domain-knowledge into deep neural networks. Scientific Reports, 12(1):1040, 2022.
[7]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[8]
Xinming Tu, James Zou, Weijie Su, and Linjun Zhang. What should data science education do with large language models? Harvard Data Science Review, 6(1), 2024.
[9]
Itai Bavli, Anita Ho, Ravneet Mahal, and Martin J McKeown. Ethical concerns around privacy and data security in ai health monitoring for parkinson’s disease: Insights from patients, family members, and healthcare professionals. AI & SOCIETY, pages 1–11, 2024.
[10]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[11]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[12]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[13]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[14]
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.
[15]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.
[16]
Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Xiangru Tang, Ningyu Zhang, Huajun Chen, Peng Cui, and Mrinmaya Sachan. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023.
[17]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023.
[18]
Wei Chen, Zhiyuan Li, and Mingyuan Ma. Octopus: On-device language model for function calling of software apis. arXiv preprint arXiv:2404.01549, 2024.
[19]
Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. An llm compiler for parallel function calling. arXiv preprint arXiv:2312.04511, 2023.
[20]
Grégoire Mialon, Roberto Dessı̀, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
[21]
Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. arXiv preprint arXiv:2405.17935, 2024.
[22]
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023.
[23]
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[24]
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
[25]
Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. Opencodeinterpreter: Integrating code generation with execution and refinement. arXiv preprint arXiv:2402.14658, 2024.
[26]
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
[27]
Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023.
[28]
Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[29]
Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Experiential co-learning of software-developing agents. arXiv preprint arXiv:2312.17025, 2023.
[30]
Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, and Maosong Sun. Iterative experience refinement of software-developing agents. arXiv preprint arXiv:2405.04219, 2024.
[31]
Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024.
[32]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[33]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
[34]
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2022.
[35]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
[36]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
[37]
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics.
[38]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[39]
Andras Janosi, William Steinbrunn, Matthias Pfisterer, and Robert Detrano. Heart Disease. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C52P4X.
[40]
Scott M. Hammer, David A. Katzenstein, Michael D. Hughes, Holly Gundacker, Robert T. Schooley, Richard H. Haubrich, W. Keith Henry, Michael M. Lederman, John P. Phair, Manette Niu, Martin S. Hirsch, and Thomas C. Merigan. A trial comparing nucleoside monotherapy with combination therapy in hiv-infected adults with cd4 cell counts from 200 to 500 per cubic millimeter. aids clinical trials group study 175 study team. The New England journal of medicine, 335 15:1081–90, 1996.
[41]
Clifford Leroy Johnson and Sylvia M. National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset. UCI Machine Learning Repository, 2023. DOI: https://doi.org/10.24432/C5BS66.
[42]
William Wolberg, Olvi Mangasarian, Nick Street, and W. Street. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository, 1995. DOI: https://doi.org/10.24432/C5DW2B.
[43]
Stefan Aeberhard and M. Forina. Wine. UCI Machine Learning Repository, 1991. DOI: https://doi.org/10.24432/C5PC7J.
[44]
I-Cheng Yeh. Concrete Compressive Strength. UCI Machine Learning Repository, 2007. DOI: https://doi.org/10.24432/C5PK67.
[45]
Pınar Tüfekci and Heysem Kaya. Combined Cycle Power Plant. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5002N.
[46]
Warwick Nash, Tracy Sellers, Simon Talbot, Andrew Cawthorn, and Wes Ford. Abalone. UCI Machine Learning Repository, 1995. DOI: https://doi.org/10.24432/C55C7W.
[47]
Thomas Brooks, D. Pope, and Michael Marcolini. Airfoil Self-Noise. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5VW2C.
[48]
R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
[49]
Houduo Qi and Defeng Sun. A quadratically convergent newton method for computing the nearest correlation matrix. SIAM Journal on Matrix Analysis and Applications, 28(2):360–385, 2006.

  1. Corresponding authors.↩︎