MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents

Arya Bulusu\(^{1}\), Brandon Man\(^{2}\), Ashish Jagmohan\(^{1}\), Aditya Vempaty\(^{1}\), Jennifer Mari-Wyka\(^{1}\), Deepak Akkil\(^{1}\)

\(^{1}\) Emergence AI
\(^{2}\) Massachusetts Institute of Technology


Abstract

There has been significant recent interest in harnessing LLMs to control software systems through multi-step reasoning, planning and tool-usage. While some promising results have been obtained, application to specific domains raises several general issues, including the control of specialized domain tools, the lack of existing datasets for training and evaluation, and the non-triviality of automated system evaluation and improvement. In this paper, we present a case study that examines these issues in the context of a specific domain. Specifically, we present an automated math visualizer and solver system for mathematical pedagogy. The system orchestrates mathematical solvers and math graphing tools to produce accurate visualizations from simple natural language commands. We describe the creation of specialized datasets, and develop an auto-evaluator to easily evaluate the outputs of our system by comparing them to ground-truth expressions. We have open-sourced the datasets and code for the proposed system.

1 Introduction

Large Language Models (LLMs) and Large Multimodal Models (LMMs) have had extraordinary recent success in tasks involving natural language generation and code generation. Spurred by this, there has been significant interest in harnessing LLMs to control software systems and embodied agents through multi-step reasoning, planning and leveraging tools and APIs [1]–[5]. Promising results have been obtained in many tasks, from device and web-control to game-playing and robotics [6]–[10].

The creation of AI-driven automated systems for specialized domains holds great economic promise, estimated by a recent study at more than a trillion dollars [11]. While there exist several general multi-agent frameworks that can be built on, such as Autogen [12], [13], the development of LLM-driven agentic workflows in specialized domains requires overcoming several additional challenges:

  • Firstly, such domains come with specialized tools and problems, different from the general-purpose problems that past work has often focused on.

  • Secondly, there is often a paucity of datasets for training or benchmarking. For example, common math benchmarks like [14], while important, are of limited value for settings like math pedagogy.

  • Thirdly, automated evaluation of such systems is hard, and human evaluation does not scale. This makes it hard to create continuous improvement loops, which are essential for robustness.

In this paper, we investigate the above issues in the context of a specialized domain: math pedagogy. Teachers use a variety of in-classroom technological tools in day-to-day instruction. The variety and complexity of operating these tools imposes a cognitive and time burden, time that teachers would rather spend with their students. Generative AI has significant potential to simplify the tools available in the classroom, allowing teachers to spend more time interacting with their students instead of their technology [15]. The combination of specialized tool-use and the potential benefits of automation make classroom pedagogy a well-suited use-case for our exploration.

Reflecting the challenges described above, previous LLM-based math research [16]–[19] has focused on solving math problems and theorem-proving which, while important, are tangential to in-classroom teaching. Existing math benchmarks like GSM8K [14] and MATH [20] are also of limited value for understanding how LLMs can be applied to the classroom setting. There are no comprehensive datasets that are purposefully aligned with educational standards for middle and high school, nor are there datasets for math visualization pedagogy.

We present MathViz-E, an automated math graphing system for mathematical pedagogy. Graphs are an essential tool in the classroom, allowing students to visualize and interact with mathematical concepts [21]. Our automated graphing system takes in voice utterances, converts them to mathematical expressions, and graphs them with the Desmos graphing calculator [22]. This simplifies the process of creating graphs in the classroom, allowing teachers to more easily incorporate math visualization techniques into their lessons without disrupting classroom flow.

Our contributions are:

  • We present a voice-driven automated graphing system, combining an LLM with a mathematical solver and a visual graphing calculator, for the domain of in-classroom math pedagogy.

  • We design new domain-specific datasets for graphing problems, representative of the Common Core Math standards [23], focused on a range of learning objectives that teachers currently use visualization tools to teach.

  • Evaluating the accuracy of an automated visual graphing system is non-trivial, given that the output of the system is a set of math visualizations. We create an auto-evaluation pipeline to simplify the evaluation of different versions of our system.

  • We present results demonstrating that our proposed system achieves high accuracy on a wide variety of learning objectives, and show that incorporating multiple tools significantly outperforms an LLM-only system.

The incorporation of multiple tools, including a solver, provides a foundation of accuracy, as LLMs alone are incapable of reliably solving several types of math problems (as we will see in Section 4). This allows the system to produce accurate graphs even for difficult, multi-step problems requiring complex reasoning. On the other hand, while mathematical solvers such as Wolfram Alpha [24] can provide accurate answers for many categories of problems, they are not capable of understanding all types of natural language. An example of this is anaphora in multi-turn conversations, where a query refers to objects already graphed in the calculator (e.g. "Move the function 5 units horizontally"). By using an LLM to orchestrate across the solver and the visual calculator, we create a robust voice- and dialog-based system. Thus the combination of an LLM with specialized tools produces an automated system with strong potential for domain use. We have open-sourced our code and datasets at https://github.com/EmergenceAI/MathViz-E.

2 Related Work

Our work in this paper is related to a large body of recent literature in using LLMs for tool-usage, multi-step reasoning, and plan execution by agents. There is also related work in the use of LLMs for mathematical reasoning, and for pedagogy. In this section, we briefly review this literature.

There has been considerable work in the last couple of years on using LLMs to orchestrate tools and APIs, motivated by the desire to augment language models’ strong text generation capabilities with other, more specialized abilities [1]. Systems like NL2API [2], ToolkenGPT [3], Toolformer [4] and Gorilla [5] use language models that learn how to invoke parameterized APIs. Combining multi-step planning and tool-usage enables the creation of embodied agents that can plan and act in virtual or real environments. Examples of this include agents in virtual game environments [8], [9], [25] and real-world robotic agents [10], [26]–[28]. Beyond single-agent systems, there has also been much interest in multi-agent systems and frameworks [12], [13]. In comparison to the above, our work focuses on a narrow set of tools, but for domain-specific capabilities rather than general-purpose usage; we investigate math solvers and visual calculators specifically for mathematical pedagogy.

There has also been significant work in multi-step reasoning and sequential decision making. Chain-of-thought (CoT) [29] and its many variants [30] have shown gains on several types of reasoning tasks. While initially considered an emergent ability of large models like PaLM [31], subsequent work has used techniques like distillation [32], [33] to specialize smaller models for specific reasoning tasks. CoT-style step-by-step reasoning can be further combined with self-critique-driven refinement [34], [35]. Another type of approach is to use LLMs to generate code that can then be executed by tools such as Python interpreters, e.g. [36]. Also related are approaches like ReAct [37], Reflexion [38] and many others (e.g. [39]) that employ LLMs for sequential decision making. In the system we describe in the next section, we use chain-of-thought with a general instruction-tuned LLM for multiple purposes, including query reformulation and tool control. While in this paper we use vanilla CoT and a large model, the use of more specialized CoT variants and smaller distilled models is an intriguing direction for future investigation.

Also related is the literature on training LLMs for mathematical reasoning. A popular dataset (among many) is the GSM-8K dataset [14] of grade-school math word problems. Recent work has explored the training of small, parameter-efficient models, generally through fine-tuning on augmented data; examples include [16]–[18]. Finally, the use of LLMs for various pedagogical tasks includes work on assessment generation [40]–[42], learning-content generation [43], [44], and the use of commercial models like ChatGPT through prompt engineering [45], [46]. In contrast to these, we focus on in-classroom math pedagogy, wherein a teacher controls math tools through voice and language.

3 Methodology

3.1 Dataset Construction

The Common Core standards [23] are a set of national educational standards describing what students are expected to know at each grade level, and they have been widely adopted in the United States. Based on the math Common Core standards, we identify a set of learning objectives that teachers use visualization tools to teach in the classroom. We use these categories as the basis of our dataset, creating approximately 10 questions per category to evaluate our system on.

The categories included in our datasets and the style of utterances were refined through teacher feedback, to reflect language that is commonly used by teachers in math pedagogy (e.g., "Graph a unit circle" rather than "Draw a circle of radius 1"). This feedback also informed the type of problems we chose and the learning objectives we targeted in our datasets. From the identified categories, we create three datasets. The first (referred to as the utterance-focused dataset) is focused on use cases a teacher might want to have available in the classroom. The utterances in this dataset are written as commands a teacher might say, as opposed to written-out problems for a student to solve. The dataset is mainly comprised of simpler, single-step problems that a teacher might use to demonstrate intermediate steps in the process of solving a problem. To ensure robust evaluation, we created variants for utterances in each category.

Table 1: Example row from the utterance-focused dataset
  Processed Utterance:        Reflect y = 5x - 4 across the y-axis
  Natural Language Utterance: Reflect y equals five x minus four across the y-axis
  Graph Input:                \(y = -5x - 4\)

Table 2: Example multi-turn sequence (processed utterances and expected graph inputs)
  Turn 1: Plot a line that goes through (1, 3) and (4, 8) → Graph Input: \(y = \frac{5x}{3} + \frac{4}{3}\)
  Turn 2: Plot a parallel line through the origin → Graph Input: \(y = \frac{5x}{3}\), \(y = \frac{5x}{3} + \frac{4}{3}\)

Our second dataset (referred to as the textbook-focused dataset) is focused on multi-step, complicated problems that require tool use to solve. The main topics in this dataset are a superset of those in the utterance-focused dataset, but the problems are geared towards demonstrating the utility of tool use in LLMs. In contrast to the utterance-focused dataset, we include word problems. The problems in this dataset are based on representative problems explicitly written for Common Core standards.

Our third dataset, referred to as the multi-turn dataset, is similar to the utterance-focused dataset but includes multiple turns in each question. This requires the system to incorporate an understanding of the current calculator state into its response.

The datasets include a column for the problems and a column for the graph input associated with each problem. As the system is meant to be used through natural language commands, we also include a column with the spoken-utterance version of the problem (e.g. “Graph \(y = 5x^2 + 3\)” vs. “Graph y equals five x squared plus three”). This column was automatically generated with GPT-4 [47] based on the original problem column and manually checked for accuracy. The utterance-focused dataset contains 70 queries across seven categories, the textbook-focused dataset contains 147 queries across fourteen categories, and the multi-turn dataset contains 95 utterances across seven categories.
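For illustration, the sketch below shows one way the spoken-utterance column could be generated from a processed utterance; the prompt wording, model name, and use of the openai Python client are assumptions made for this sketch, not our exact pipeline.

```python
# Sketch (not our exact pipeline): generate the spoken-utterance column from
# the processed-utterance column with GPT-4, then check the output manually.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_spoken_utterance(processed: str) -> str:
    prompt = (
        "Rewrite the following math command exactly as a teacher would say it aloud, "
        "spelling symbols out in words (e.g. 'y equals five x squared plus three').\n\n"
        f"Command: {processed}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# e.g. to_spoken_utterance("Graph y = 5x^2 + 3")
# -> "Graph y equals five x squared plus three"
```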

3.2 System

The MathViz-E system consists of four main components: the creation of the solver query, the generation of a written explanation based on the solver’s output, the generation of the visual-calculator graphing expressions based on the solver’s output, and the validation and correction of the graphing expressions through LLM self-critique. We incorporate multi-turn functionality into the system by including a calculator state in our prompts, which describes the expressions currently graphed on the calculator. As a result, the system is able to understand queries within the context of the expressions that have already been graphed. In this paper, we use Desmos as the visual calculator, Wolfram Alpha as the solver and GPT-4 as the LLM, but the main principles of the system can be broadly applied to other tools and LLMs. The system is voice-driven; the presented version uses the Web Speech API [48] for speech recognition, but other ASR pipelines can also be used.
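A minimal sketch of this control flow is shown below; the four helper functions are hypothetical stubs standing in for the LLM and solver calls described in the rest of this section, not our actual implementation.

```python
# Minimal sketch of the MathViz-E control flow, with hypothetical stubs for the
# LLM/solver calls described in the rest of this section.
from dataclasses import dataclass, field

def build_solver_query(utterance, graphed): ...                            # LLM, few-shot
def query_solver(query): return None, None                                 # solver: (solution, steps)
def generate_explanation(utterance, graphed, solution, steps): ...         # LLM, zero-shot
def generate_expressions(utterance, graphed, solution, steps): return []   # LLM, CoT + examples
def critique_expressions(exprs): return exprs                              # LLM self-critique

@dataclass
class CalculatorState:
    expressions: list = field(default_factory=list)  # LaTeX expressions currently graphed

def handle_utterance(utterance: str, state: CalculatorState):
    query = build_solver_query(utterance, state.expressions)                   # 1. solver query
    solution, steps = query_solver(query)                                      # 2. solve
    explanation = generate_explanation(
        utterance, state.expressions, solution, steps)                         # 3. explanation
    exprs = critique_expressions(
        generate_expressions(utterance, state.expressions, solution, steps))   # 4. graph + check
    state.expressions.extend(exprs)  # updated state enables multi-turn queries
    return explanation, exprs
```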

For a given problem, we create the solver query by prompting the LLM with instructions and a series of examples demonstrating how to write queries for certain math problems. These examples were chosen by identifying problems the LLM consistently misunderstood. The LLM is also provided with the spoken-utterance version of the problem and the calculator state. The calculator state contains the equations that have previously been graphed in the graphing window, and passing this state allows the system to incorporate this information into its problem-solving process. Below is a truncated version of the prompt used to create the Wolfram Alpha query:

Write a Wolfram Alpha query that can be used to solve the problem. The main purpose of the task is to find the numerical answer to the problem, not to graph the problem. When writing a query for a word problem, only include the necessary equation to solve the problem. Ensure that the query is acceptable by the Wolfram Alpha engine.

For example, if you are asked:

Graph \(y = 6x^2 + 4\) and find the local maxima and minima.

Calculator state: []

You generate:

Find the local maxima and minima of \(y = 6x^2 + 4\)

Figure 1: Overview of the MathViz-E automated graphing system

Once the query has been generated, we input it to our solver. Wolfram Alpha provides a set of pods for each query, with each pod containing a different category of information related to the query. Wolfram Alpha also provides step-by-step solutions for some problems. From these results, we extract the solution (generally the second pod, after the “Input Interpretation” pod) and the step-by-step solution if it is present.
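The sketch below shows one plausible way of issuing the query to the Wolfram Alpha Full Results API and pulling out the solution and step-by-step text; the parameter choices and pod-selection heuristic are illustrative assumptions, and the exact pod titles vary by query type.

```python
# Sketch: query the Wolfram Alpha Full Results API and extract the solution pod
# and (when present) step-by-step text. Pod titles vary by query type, so the
# selection heuristic here is illustrative, not exhaustive.
import requests

WA_URL = "https://api.wolframalpha.com/v2/query"

def query_solver(query: str, app_id: str):
    params = {
        "appid": app_id,
        "input": query,
        "output": "json",
        "podstate": "Step-by-step solution",  # request step-by-step subpods where available
    }
    result = requests.get(WA_URL, params=params, timeout=30).json()["queryresult"]
    solution, steps = None, None
    for pod in result.get("pods", []):
        if "input" in pod.get("title", "").lower():  # skip the input-interpretation pod
            continue
        for sub in pod.get("subpods", []):
            text = sub.get("plaintext") or ""
            if "step" in sub.get("title", "").lower():
                steps = text
            elif solution is None and text:
                solution = text  # typically the pod right after the input interpretation
    return solution, steps
```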

To generate an explanation of the problem, we prompt the LLM with a zero-shot instruction. Along with the prompt, we provide the natural language utterance version of the problem, the calculator state, the numerical solution as given by the solver, and the solver’s step-by-step solution, if it is present.

In the cases where Wolfram Alpha provides a step-by-step solution, the LLM only has to expand upon this solution by providing more detail and explaining the reasoning behind the steps. When there is no step-by-step solution given, it must write its own explanation from scratch based on the problem and numerical solution.

In order to generate the Desmos graphing expressions, we prompt the LLM with instructions and a set of examples. In the prompt, we ask for a chain-of-thought, which helps to generate more accurate expressions; chain-of-thought prompting has been shown to improve the accuracy of LLM reasoning, especially on math problems [49], [30]. As with the explanation prompt, we also provide the natural language utterance version of the problem, the calculator state, the solver’s numerical solution, and the solver’s step-by-step solution, if there is one. The large number of examples in the prompt helps guide the LLM towards producing valid Desmos expressions. The provided examples were created by identifying common points of failure, and writing problems that demonstrate how to accurately deal with these issues.
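The template below conveys the shape of this prompt (chain-of-thought first, then one Desmos LaTeX expression per line); the wording is illustrative rather than our exact prompt, and the few-shot examples are omitted.

```python
# Illustrative shape of the expression-generation prompt; the actual prompt
# additionally contains a large set of few-shot examples.
EXPRESSION_PROMPT = """You convert math problems into Desmos graphing expressions.
Think step by step, then output the final expressions, one per line, in LaTeX.

Problem (spoken form): {utterance}
Calculator state: {calculator_state}
Solver answer: {solution}
Solver steps: {steps}

Reasoning:"""
```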

In the last self-critique step, we ask the LLM to validate and refine the previously-generated Desmos expressions by checking for common errors. A majority of these errors arise from Desmos API deviations from standard LaTeX (e.g. the use of \le and \ge instead of \leq and \geq for inequalities, or the use of abs(\(\cdot\)) instead of \(|\cdot|\) for the absolute value function). These checks also include ensuring that correct graphing variables are used, operations are formatted correctly, and functions are named properly. This step helps to eliminate basic errors made by the LLM in the expression-generating step.
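A sketch of this step is shown below: a few deterministic normalizations for the example deviations named above, followed by an LLM critique pass. The regexes and critique prompt are illustrative assumptions, not the full set of checks we apply.

```python
# Sketch: normalize a few common Desmos/LaTeX deviations, then ask the LLM to
# self-critique and correct the expressions. Illustrative only.
import re

def normalize_desmos_latex(expr: str) -> str:
    expr = re.sub(r"\\geq\b", r"\\ge", expr)   # \geq -> \ge
    expr = re.sub(r"\\leq\b", r"\\le", expr)   # \leq -> \le
    expr = re.sub(r"\babs\(([^()]*)\)", r"\\left|\1\\right|", expr)  # abs(x) -> |x| (non-nested)
    return expr

CRITIQUE_PROMPT = (
    "Check these Desmos graphing expressions: verify graphing variables, operator "
    "formatting, and function names, and return the corrected expressions only.\n{exprs}"
)

def critique_expressions(exprs, llm):
    cleaned = [normalize_desmos_latex(e) for e in exprs]
    return llm(CRITIQUE_PROMPT.format(exprs="\n".join(cleaned))).splitlines()
```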

Figure 2: UI of MathViz-E demonstrated through a multi-turn inverse problem

4 Experiments

4.1 Autoevaluation

Traditional text-similarity metrics fail when comparing mathematical statements because of the precise nature of such statements. Consider the statement 5=2+3. A lexical similarity metric such as Jaccard distance would consider the statement 5=2+4 more similar to it than 5=4+1, since the former shares more tokens in common, even though only the latter is mathematically equivalent. Furthermore, many existing similarity metrics do not recognize common mathematical symbols as tokens, so mathematical statements cannot be reliably converted into numerical representations. Similarly, directly examining the visual graph output through a multimodal approach is unlikely to be precise enough for our purposes. Although LLMs may be able to evaluate equivalence for simple expressions, their judgement becomes inconsistent for more complex expressions. As a result, evaluating the equations output by the automated graphing system at scale is nontrivial.
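To make the failure concrete, the toy computation below compares character-level Jaccard similarity for the three statements above (the character-level tokenization is a simplifying assumption).

```python
# Toy example: token-overlap similarity ranks the wrong statement as "closer".
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a), set(b)              # character-level tokens for simplicity
    return len(ta & tb) / len(ta | tb)

print(jaccard("5=2+3", "5=2+4"))  # ~0.67: high similarity, but 5=2+4 is false
print(jaccard("5=2+3", "5=4+1"))  # ~0.43: low similarity, yet both statements are true
```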

Due to these limiting factors, we create a new autoevaluation pipeline that can precisely compare two mathematical statements. We use the computer algebra system SymPy [50] to evaluate the mathematical equations output by the LLM in our LLM+Solver system when responding to given questions. In order to compare two equations, we use SymPy to isolate a variable and compare the resulting expressions on the other side of the equality.
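The sketch below illustrates this check with SymPy: parse both equations, solve each for the same variable, and compare the resulting expressions symbolically. The parsing transformations and single-variable assumption are simplifications of our pipeline.

```python
# Sketch: SymPy-based equivalence check between two equation strings.
import sympy as sp
from sympy.parsing.sympy_parser import (
    parse_expr, standard_transformations, implicit_multiplication_application)

TRANSFORMS = standard_transformations + (implicit_multiplication_application,)
y = sp.Symbol("y")

def parse_equation(s: str) -> sp.Eq:
    lhs, rhs = s.split("=")
    return sp.Eq(parse_expr(lhs, transformations=TRANSFORMS),
                 parse_expr(rhs, transformations=TRANSFORMS))

def equivalent(eq1: str, eq2: str) -> bool:
    s1 = sp.solve(parse_equation(eq1), y)   # isolate y in each equation
    s2 = sp.solve(parse_equation(eq2), y)
    # Equivalent iff the solution sets match after simplification.
    return len(s1) == len(s2) and all(
        any(sp.simplify(a - b) == 0 for b in s2) for a in s1)

# equivalent("y = 5x - 4", "y - 5x + 4 = 0")  -> True
# equivalent("y = -5x - 4", "y = 5x - 4")     -> False
```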

Although this approach leads to accurate checking of math statements, one issue is that SymPy cannot parse certain formats of equations that the LLM in the LLM+Solver system may produce. To address this, we use an LLM as a backup in the autoevaluation process: if SymPy cannot parse an equation, the pipeline falls back to the LLM to compare the two math statements and output a verdict.
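Putting the two pieces together, the fallback logic can be sketched as follows, reusing the equivalent() helper from the previous sketch; the llm_judge callable and its prompt are placeholders.

```python
# Sketch: try SymPy first; fall back to an LLM judge if parsing/solving fails.
def auto_evaluate(predicted: str, ground_truth: str, llm_judge) -> bool:
    try:
        return equivalent(predicted, ground_truth)
    except Exception:
        verdict = llm_judge(
            "Are these two math statements equivalent? Answer YES or NO.\n"
            f"A: {predicted}\nB: {ground_truth}")
        return verdict.strip().upper().startswith("YES")
```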

We construct a set of ground truth evaluations by running all the questions in our datasets through the LLM-only system, and manually evaluating if the system’s output matches the correct answer. This manually-benchmarked dataset allows us to run different versions of the autoevaluator on the dataset and check how its evaluations compare to our manually-written evaluations.

The simpler version of our autoevaluator uses only an LLM to compare two equations. Using GPT-4 as the LLM, we compare the results of the LLM-only and LLM+SymPy autoevaluators on the entirety of our utterance-focused and textbook-focused datasets. An older version of the textbook-focused dataset was used, with the same categories and question styles as the current version. Table 3 displays the dataset-wide results as well as the results for selected categories.

Table 3: Accuracy of the LLM-only and LLM+SymPy autoevaluators as compared to manual evaluations
          | Utterance-Focused Dataset | Textbook-Focused Dataset | Systems of Linear Equations | Graph Inverse Functions | Graph Lines
LLM-only  | 77% | 76% | 50% | 40% | 92%
LLM+SymPy | 86% | 88% | 100% | 80% | 85%

The addition of SymPy to the autoevaluation pipeline increases the accuracy of evaluations significantly on almost all categories, especially Systems of Linear Equations. In general, the LLM+SymPy autoevaluator outperforms the LLM-only autoevaluator on problems that are well-structured but computationally difficult. Such problems are easy to parse and can be solved by SymPy, which handles complex algebraic manipulation reliably, whereas an LLM-only approach struggles to carry out the algebra accurately.

We see a minor drop in performance for a few categories, such as Graph Lines. This decrease in accuracy is generally due to SymPy and the LLM both misunderstanding the formatting of the equation. An important point to note is that the overwhelming majority of incorrectly evaluated answers are the result of a correct input being marked as incorrect: SymPy only marks expressions equivalent if they are genuinely equivalent, and GPT-4 rarely marks inequivalent expressions as equivalent. As a result, when using the autoevaluation pipeline, we can trust that nearly all the responses marked as correct are actually correct, and need only manually check whether the responses marked as incorrect are in fact incorrect. This greatly reduces the burden of evaluation when testing successive iterations of the system.

4.2 Results

For the results reported in this section, we comprehensively evaluate the LLM+Solver system by manually validating all of the outputs generated for the three datasets (utterance-focused, textbook-focused, and multi-turn). We compare the performance of the LLM+Solver system to that of the LLM-only system. The LLM-only system consists of directly prompting an LLM with instructions and examples for writing Desmos expressions, while also providing the natural language utterance version of the problem and the calculator state; no solver solution is provided. Although the framework of the system can be applied to LLMs and solvers broadly, in this paper we evaluate using GPT-4 as the LLM and Wolfram Alpha as the solver. Tables 4, 5 and 6 compare the results of the LLM-only and LLM+Solver systems.

Table 4: Accuracy of the LLM-only and LLM+Solver systems
LLM-only LLM+Solver
Utterance-Focused Dataset 66% 90%
Textbook-Focused Dataset 64% 86%
Multi-turn Dataset 86% 91%
Table 5: Accuracy of individual categories in the utterance-focused dataset
LLM-only LLM+Solver
Graph Circles 90% 100%
Transform Shapes 70% 70%
Intersections of Lines 20% 100%
Transform Functions 100% 90%
Graph Lines 90% 100%
Local Minima and Maxima 40% 90%
X Intercepts, Y Intercepts 50% 80%
Table 6: Accuracy of individual categories in the textbook-focused and multi-turn datasets
LLM-only LLM+Solver
Proportional Relationships 100% 100%
Linear Inequality Systems 100% 90%
Graph Inequalities 90% 100%
Graph Lines 100% 100%
Graph Lines (multi-turn) 91% 100%
Graph Polynomials + Identify Zeros 67% 100%
Graph Polynomials (multi-turn) 67% 100%
Systems of Linear + Quadratic Equations 15% 100%
Systems of Lin + Quad Eqns (multi-turn) 60% 100%
Graph Circles 100% 100%
Graph Circles (multi-turn) 96% 83%
Transformations of Functions 100% 100%
Transform Functions (multi-turn) 86% 90%
Graph Inverse Functions 70% 100%
Graph Inverse Functions (multi-turn) 90% 100%
Tangents to Parabolas 0% 11%
Tangents to Circles 0% 75%
Linear and Nonlinear Functions 60% 70%
Linear and Nonlinear Functions (multi-turn) 100% 70%
Systems of Linear Equations 40% 100%
Rigid Transformations + Dilations 70% 60%

LLM+Solver vs. LLM-only system Table 4 shows that the addition of the solver results in a significant overall performance increase across all three datasets. Tables 5 and 6 further show that the greatest performance increase occurs in categories such as Local Minima and Maxima, X and Y intercepts, Intersections of Lines, Systems of Equations, and Tangents to Circles. These problems require complex calculations which are difficult for GPT-4 to carry out by itself, meaning that GPT-4 will frequently get them wrong. However, Wolfram Alpha can solve these problems, given a well-formulated query. As a result, the LLM+Solver system shows strong improvement over the LLM-only system for these categories.

Error Modes for LLM+Solver system The proposed system demonstrates excellent accuracy across a wide variety of problem types, as demonstrated in Table 5 and Table 6. However, there are some notable categories on which it has poor accuracy. It is instructive to delve deeper into these error modes.

  • For the category "Tangents to Parabolas", the problem arises from the Wolfram Alpha solver API, which does not return the correct answer in its Step-by-Step Solution response. This is a specific instance of a broader problem: it is not always clear which of the Wolfram Alpha API response fields should be input to the LLM for further reasoning. Fixing this requires conditioning response interpretation on the problem type (e.g. "Tangents to Parabolas" and "Tangents to Circles" produce different types of API responses).

  • For "Rigid Function Transformations", errors arise from improper handling of polygon transformations. While the solution is very robust to transformations of parametric shapes, it struggles with non-parametric shape transformations. Improving this is an area of future work.

  • For other categories, some common (though occasional) error modes include: incomplete specifications (like "Draw a circle" without any radius or center specified) sometimes result in the system producing parametric expressions without associated sliders; a single query containing a list of sub-queries ("Plot the x-intercept, the y-intercept, the local minima and maxima, and the asymptotes") occasionally results in some sub-queries being missed; and tasks of sufficient complexity ("Move the circle so it is tangent to a line which satisfies property x and property y") may be computed incorrectly. The last error mode is not generally an issue for the pedagogical levels that the system is designed for, but could be an issue for higher-ed (compared to K-12).

Analysis of the LLM-only system The LLM-only system struggles the most on problems that require complicated reasoning and calculations to solve, such as "Tangents to Circles", "Systems of Linear + Quadratic Equations", "Intercepts" and others. It tends to fail either by solving the problem with an incorrect method or by executing calculations incorrectly. It also sometimes plots the underlying functions or shapes, but then does not successfully compute points of intersection, extrema, etc.

For many categories, the performance of the LLM-only system was strong to begin with, such as Circles, Proportional Relationships, and Graph Lines. These categories contain simple problems with little mathematical calculation, so GPT-4 is able to succeed at them without the help of a solver. There are some categories for which there are no Wolfram Alpha queries that can be used to solve the problem, such as Transform Shapes and Transformations of Functions. These categories show little difference in performance between the two systems, as Wolfram Alpha cannot provide answers for them. Incorporating a more powerful tool, such as a Python interpreter, could allow the system to successfully solve these problems.

5 Discussion

The results presented in this paper highlight the potential of domain-specific automation created via LLM orchestration of specialized tools. The presented case-study highlights some of the main challenges that need to be overcome in such systems. The paucity of preexisting datasets requires careful creation of new datasets for benchmarking; there is a need to identify and control specialized tools; and there is a need for auto-evaluation to validate and improve system performance.

For the case of mathematical pedagogy, we showed that through the design and development of an LLM-orchestrated system incorporating multiple specialized tools, the creation of new benchmark datasets based on the Common Core, and the creation of an auto-evaluation pipeline, we can build an effective automated system along with the means to easily evaluate future iterations. The system has strong performance, and for many categories of problems it consistently produces accurate outputs.

For the domain under consideration, there are categories of problems for which the system is not yet fully reliable. To improve system performance in these categories, there are several directions that we can take in the future. An especially promising approach is to add retrieval-based techniques to make adjustments to the system for the individual problem categories. This individualized approach could improve accuracy for some categories as compared to our current, one-size-fits-all system. Another key direction is to reduce latency; we plan to evaluate our system using smaller open-source LLMs, trained via supervised fine-tuning, instead of a large general model like GPT-4.

More broadly, for domain-specific agentic solutions, the approach presented in this paper offers a set of patterns that can be generalized. We expect that the generalization and application of these patterns to other domains and problems will continue to remain fertile ground for future research.

Acknowledgments

We would like to thank Ravi Kokku, Marc Pickett, Prasenjit Dey, and Paul Haley for helpful discussions and feedback.

References

[1]
Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tenenholtz. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning, 2022.
[2]
Saghar Hosseini, Ahmed Hassan Awadallah, and Yu Su. Compositional generalization for natural language interfaces to web apis, 2021.
[3]
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings, 2023.
[4]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
[5]
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis, 2023.
[6]
Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Autodroid: Llm-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pp. 543–557, 2024.
[7]
Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, and Giovanni Campagna. Wilbur: Adaptive in-context learning for robust and accurate web agents. arXiv preprint arXiv:2404.05902, 2024.
[8]
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023.
[9]
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models, 2023.
[10]
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as i can, not as i say: Grounding language in robotic affordances, 2022.
[11]
McKinsey & Company. The economic potential of generative ai: The next productivity frontier. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier, 2024.
[12]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023.
[13]
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023.
[14]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
[15]
McKinsey & Company. What’s the future of generative AI? an early view in 15 charts. https://www.mckinsey.com/featured-insights/mckinsey-explainers/whats-the-future-of-generative-ai-an-early-view-in-15-charts, 2023.
[16]
Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms in grade school math, 2024.
[17]
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
[18]
Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: achieving >80% on gsm8k with small language models, 2023.
[19]
T.H. Trinh, Y. Wu, Q.V. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 625: 476–482, 2024.
[20]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.
[21]
Dermot Francis Donnelly-Hermosillo, Libby F. Gerard, and Marcia C. Linn. Impact of graph technologies in k-12 science and mathematics education. Computers & Education, 146: 103748, 2020.
[22]
PBC Desmos Studio. Desmos graphing calculator. https://www.desmos.com/calculator.
[23]
CCSSO and NGA-Center. Common core state standards. https://www.thecorestandards.org/Math/, 2010.
[24]
Wolfram Alpha LLC. Wolfram alpha. https://www.wolframalpha.com/, 2024.
[25]
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos, 2022.
[26]
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X. Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, Antoine Laurens, Claudio Fantacci, Valentin Dalibard, Martina Zambelli, Murilo Martins, Rugile Pevceviciute, Michiel Blokzijl, Misha Denil, Nathan Batchelor, Thomas Lampe, Emilio Parisotto, Konrad Żołna, Scott Reed, Sergio Gómez Colmenarejo, Jon Scholz, Abbas Abdolmaleki, Oliver Groth, Jean-Baptiste Regli, Oleg Sushkov, Tom Rothörl, José Enrique Chen, Yusuf Aytar, Dave Barker, Joy Ortiz, Martin Riedmiller, Jost Tobias Springenberg, Raia Hadsell, Francesco Nori, and Nicolas Heess. Robocat: A self-improving foundation agent for robotic manipulation, 2023.
[27]
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023.
[28]
Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic offline rl from internet videos via value-function pre-training, 2023.
[29]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[30]
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. A survey of chain of thought reasoning: Advances, frontiers and future, 2023.
[31]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
[32]
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning, 2023.
[33]
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023.
[34]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023.
[35]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022.
[36]
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models, 2023.
[37]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
[38]
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
[39]
Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, and Sanjiv Kumar. Rest meets react: Self-improvement for multi-step reasoning llm agent, 2023.
[40]
Zichao Wang, Jakob Valdez, Debshila Basu Mallick, and Richard G Baraniuk. Towards human-like educational question generation with large language models. In International conference on artificial intelligence in education, pp. 153–166. Springer, 2022.
[41]
Sabina Elkins, Ekaterina Kochmar, Iulian Serban, and Jackie CK Cheung. How useful are educational questions generated by large language models? In International Conference on Artificial Intelligence in Education, pp. 536–542. Springer, 2023.
[42]
Sahan Bulathwela, Hamze Muse, and Emine Yilmaz. Scalable educational question generation with pre-trained language models. In International Conference on Artificial Intelligence in Education, pp. 327–339. Springer, 2023.
[43]
Chaitali Diwan, Srinath Srinivasa, Gandharv Suri, Saksham Agarwal, and Prasad Ram. Ai-based learning content generation and learning pathway augmentation to increase learner engagement. Computers and Education: Artificial Intelligence, 4: 100110, 2023.
[44]
Paul Rodway and Astrid Schepman. The impact of adopting ai educational technologies on projected course satisfaction in university students. Computers and Education: Artificial Intelligence, 5: 100150, 2023.
[45]
Ibrahim Adeshola and Adeola Praise Adepoju. The opportunities and challenges of chatgpt in education. Interactive Learning Environments, pp. 1–14, 2023.
[46]
David Baidoo-Anu and Leticia Owusu Ansah. Education in the era of generative artificial intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching and learning. Journal of AI, 7 (1): 52–62, 2023.
[47]
OpenAI. Gpt-4 technical report. https://arxiv.org/abs/2303.08774, 2023.
[48]
MDN Web Docs. Web speech api. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API, 2023.
[49]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903, 2022. URL https://arxiv.org/abs/2201.11903.
[50]
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. Sympy: symbolic computing in python. PeerJ Computer Science, 3: e103, January 2017.