Exploring Communication Strategies for Collaborative LLM Agents in Mathematical Problem-Solving¹


Abstract

Large Language Model (LLM) agents are increasingly utilized in AI-aided education to support tutoring and learning. Effective communication strategies among LLM agents improve collaborative problem-solving efficiency and facilitate cost-effective adoption in education. However, little research has systematically evaluated the impact of different communication strategies on agents’ problem-solving. Our study examines four communication modes (teacher-student interaction, peer-to-peer collaboration, reciprocal peer teaching, and critical debate) in a dual-agent, chat-based mathematical problem-solving environment using the OpenAI GPT-4o model. Evaluated on the MATH dataset, our results show that dual-agent setups outperform single agents, with peer-to-peer collaboration achieving the highest accuracy. Dialogue acts such as statements, acknowledgments, and hints play a key role in collaborative problem-solving. While multi-agent frameworks enhance performance on computational tasks, effective communication strategies remain essential for tackling complex problems in AI education.

1 Introduction

Recent advances in Large Language Models (LLMs) have shown significant promise in enhancing educational tasks such as feedback generation, learning guidance, interaction strategies, understanding student behaviors, and fostering tutoring dialogues through answer evaluation and content generation [1]–[5]. However, most implementations rely on single-agent approaches using prompt engineering for logical reasoning (e.g., chain of thought [6]). Emerging research on multi-agent collaboration among LLMs further improves performance through collaborative strategies and holds promise for expanding the effectiveness of AI in education [7], [8]. In multi-agent settings, LLM agents leverage complementary reasoning processes to cross-validate ideas, refine their understanding, and ultimately produce more robust solutions than single-agent approaches [9]. Despite these promising developments, systematic exploration of communication strategies among collaborative LLM agents in problem-solving contexts remains limited.

Mathematical problem-solving presents an ideal testing ground for multi-agent LLM collaboration in education. The domain of mathematics demands precise, step-by-step reasoning and offers clear, binary outcomes, making it well-suited for evaluating how multiple LLM agents can effectively reason, communicate, and learn together. Recent studies on chat-based math problem solving [10] and conversation programming for mathematical problem solving [11] illustrate the potential of multi-agent approaches to address the unique challenges of mathematical reasoning. Systematic analysis of diverse problem-solving strategies in math can provide valuable insights into optimizing agent communication and enhancing overall performance in AI-driven educational systems [12].

Figure 1: An illustration of the different dual-LLM-agent modes for problem-solving.

In this study, we explore a range of collaborative communication strategies using dual LLM-based agents to solve mathematical problems via chat-based interactions. Drawing inspiration from Vygotsky’s educational theories, particularly the Zone of Proximal Development (ZPD) and the More Knowledgeable Other (MKO) [13], and informed by insights from machine psychology [14], we adopt several interaction modes in our study. These include teacher-student interactions, where one agent guides another through direct instruction and the Socratic method; peer-to-peer collaboration, in which agents work as equals to exchange ideas and solve problems; reciprocal peer teaching, where agents alternate roles to reinforce understanding; and critical debate, where agents challenge one another’s solutions to refine their approaches. Fig. 1 provides further details on these communication strategies. By assigning distinct roles to the agents, we aim to enhance their collaborative problem-solving effectiveness and evaluate the performance of these communication strategies in math problem-solving. Dialogue Act (DA) analysis will be applied to conversations across different strategies to identify unique traits and derive actionable insights. Our investigation is guided by the following Research Questions:

  • RQ 1: Which communication strategies among dual-agent LLM systems most effectively enhance mathematical problem-solving performance?

  • RQ 2: How do DA patterns vary across dual-agent communication modes and influence mathematical problem-solving?

The explorations and findings will guide the development of collaborative multi-agent systems powered by LLMs, paving the way for more adaptive and human-centric AI educational platforms. This research contributes to creating knowledgeable agents that leverage effective collaborative strategies to drive advanced intelligent tutoring systems. Ultimately, these insights promise to bridge fundamental AI research with practical educational applications, thereby enhancing learning outcomes in AI-aided education.

2 Method

All LLM agents in this study were built on the OpenAI GPT-4o model [15].

Dataset. We used a 700-problem subset of the MATH dataset [16], selecting only level-5 problems (the highest difficulty level)².
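For illustration, such a level-5 subset can be drawn with the Hugging Face datasets library as sketched below; the per-subject sampling of 100 problems (seven subjects × 100 = 700) is an illustrative assumption, not a documented selection procedure.

    # Illustrative sketch: draw a level-5 subset of MATH with Hugging Face
    # `datasets`. Sampling 100 problems from each of the seven subjects is an
    # assumption about how a 700-problem pool could be formed.
    from datasets import load_dataset

    # Loading details may vary with the `datasets` version, since this dataset
    # historically shipped with a loading script.
    math_test = load_dataset("hendrycks/competition_math", split="test")

    # MATH labels difficulty as "Level 1" ... "Level 5"; keep the hardest tier.
    level5 = math_test.filter(lambda ex: ex["level"] == "Level 5")

    subset = []
    for subject in sorted(set(level5["type"])):  # the seven math subjects
        pool = level5.filter(lambda ex, s=subject: ex["type"] == s)
        pool = pool.shuffle(seed=0).select(range(min(100, len(pool))))
        subset.extend(pool)

    print(len(subset), "problems selected")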

Dual-Agent Protocol Definition. We define four dual-agent communication protocols, each tailored to a distinct communication mode (see Fig. 1). Each protocol involves two LLM agents, \(\mathcal{A}\) and \(\mathcal{B}\), to which we assign specific personas and comprehensive guidance for response generation via prompt engineering. For example, in the teacher–student interaction setup, agent \(\mathcal{A}\) adopts the teacher persona with the summarized prompt template \(\mathcal{T}_A(Q,\mathcal{H}) =\) “You are a supportive math teacher. Present \(Q\) clearly, summarize key points, offer hints, and ask probing questions based on dialogue history \(\mathcal{H}\) without revealing the full solution,” and agent \(\mathcal{B}\) adopts the student persona with the summarized prompt template \(\mathcal{T}_B(Q, \mathcal{H}) =\) “You are a curious math student. Follow the teacher’s guidance to solve \(Q\) step-by-step, articulate your reasoning, and ask for clarifications as needed.” For further details, including comprehensive agent definitions and prompt templates for the other communication modes, please refer to our GitHub repository: https://github.com/LiangZhang2017/aied_dual_agent.
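To make the protocol concrete, the following is a minimal sketch of the teacher–student loop using the OpenAI Python SDK. The persona prompts are abridged from \(\mathcal{T}_A\) and \(\mathcal{T}_B\) above; the 8-turn cap and the “FINAL ANSWER” stop marker are illustrative assumptions rather than the exact implementation.

    # Minimal sketch of the teacher-student dual-agent loop (OpenAI Python SDK).
    # The turn cap and the "FINAL ANSWER" stop marker are assumptions.
    from openai import OpenAI

    client = OpenAI()

    TEACHER = ("You are a supportive math teacher. Present the problem clearly, "
               "summarize key points, offer hints, and ask probing questions "
               "based on the dialogue so far, without revealing the full solution.")
    STUDENT = ("You are a curious math student. Follow the teacher's guidance to "
               "solve the problem step by step, articulate your reasoning, and "
               "ask for clarifications. Prefix your final result with 'FINAL ANSWER:'.")

    def turn(persona: str, question: str, history: list[str]) -> str:
        """One agent turn conditioned on the question Q and dialogue history H."""
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": persona},
                {"role": "user",
                 "content": f"Problem: {question}\n\nDialogue so far:\n" + "\n".join(history)},
            ],
        )
        return resp.choices[0].message.content

    def solve_dual(question: str, max_turns: int = 8) -> list[str]:
        """Alternate teacher and student turns until a final answer appears."""
        history: list[str] = []
        for _ in range(max_turns):
            history.append("Teacher: " + turn(TEACHER, question, history))
            reply = turn(STUDENT, question, history)
            history.append("Student: " + reply)
            if "FINAL ANSWER" in reply:  # illustrative stop condition
                break
        return history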

Baseline. We established a baseline single-agent protocol in a zero-shot setting. The single agent is defined as a helpful assistant tasked with solving math problems, generating both the solution process and the final answer by sequentially working through each problem.
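A corresponding zero-shot baseline call, reusing the client from the sketch above, might look like the following; the system prompt is an abridged stand-in for the actual baseline prompt.

    # Zero-shot single-agent baseline (sketch), reusing `client` from above.
    def solve_single(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant. Solve the math problem "
                            "step by step and state the final answer."},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content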

Accuracy Evaluation. The final answer produced by both dual-agent and single-agent setups is rigorously compared to the ground truth to assess correctness. Accuracy is defined as the percentage of correctly answered questions, normalized to a 100-point scale for each math subject, communication mode, and run. Average accuracy is computed across different communication modes, with the standard error (SE) calculated from three independent runs to account for observed inconsistencies in the final answers.
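Concretely, the accuracy and standard-error computation can be sketched as follows, with SE defined as the sample standard deviation across the three runs divided by \(\sqrt{n}\); exact string match stands in for the answer check, which in practice may normalize answer formats.

    # Sketch of the accuracy and SE computation over independent runs.
    import statistics

    def accuracy(predictions: list[str], truths: list[str]) -> float:
        """Percentage of final answers matching the ground truth (0-100 scale)."""
        correct = sum(p == t for p, t in zip(predictions, truths))
        return 100.0 * correct / len(truths)

    def mean_and_se(run_accuracies: list[float]) -> tuple[float, float]:
        """Mean accuracy and SE = s / sqrt(n) over n independent runs (n = 3 here)."""
        n = len(run_accuracies)
        return statistics.mean(run_accuracies), statistics.stdev(run_accuracies) / n ** 0.5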

Dialogue Act Classification Analysis. The conversations exhibit in-depth reasoning that reveals the underlying logical structure of mathematical problem solutions. To further examine the nuances of dialogue communication and explore the distinct functions of the dual agents in collaborative math problem solving, we perform a dialogue act classification analysis. Dialogue acts refer to the functional roles that utterances or responses play in a conversation [17]–[19]; they include actions such as asking questions, providing explanations, and offering feedback. First, each agent’s response is segmented into chunks, where a chunk may consist of a single sentence, a series of strongly logically related sentences, or a comprehensive multi-step equation process. We then use a pre-trained BERT-based model introduced in [17], which has been trained and validated on large-scale tutoring dialogue data (3,629 utterances) collected from math tutoring sessions between human tutors and students. Additionally, we apply a framework based on the principles of [19], which provides a structured method for categorizing different types of dialogue acts. Below is a brief reference list of DA tags from the scheme in [19]: H (Hint), DIR (Directive), ACK (Acknowledge), RC (Request Confirmation), RF (Request Feedback), PF (Positive Feedback), NF (Negative Feedback), LF (Lukewarm Feedback), Q (Question), A (Answer), and S (Statement).
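A schematic version of this tagging pipeline is sketched below. The checkpoint path is a placeholder (the classifier of [17] is not reproduced here), and chunking is reduced to one chunk per utterance rather than the grouping rules described above.

    # Schematic DA tagging pipeline (sketch). The checkpoint path is a
    # placeholder, not the released classifier of [17]; chunking is reduced to
    # one chunk per utterance for brevity.
    from transformers import pipeline

    DA_TAGS = ["H", "DIR", "ACK", "RC", "RF", "PF", "NF", "LF", "Q", "A", "S"]

    classifier = pipeline("text-classification", model="path/to/da-bert-checkpoint")

    def tag_dialogue(chunks: list[str]) -> list[tuple[str, str]]:
        """Assign one DA tag from the Vail & Boyer scheme to each chunk."""
        return [(c, classifier(c)[0]["label"]) for c in chunks]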

3 Results and Discussion

3.1 Performance Accuracy in Four Communication Modes

Table 1: Average Accuracy (%) across Different Modes with SE.
Model    Single Agent        Dual Agents
                             Teacher-Student     Peer-to-Peer        Critical Debate     Reciprocal Peer
GPT-4o   \(47.43_{3.66}\)    \(52.33_{4.13}\)    \(54.10_{3.91}\)    \(52.76_{4.14}\)    \(51.95_{3.95}\)

Table 1 displays the average accuracy for each communication mode, calculated by averaging performance across all math problems solved within that mode. The results demonstrate that employing dual agents in mathematical problem-solving can yield significant improvements over single-agent settings. Notably, the peer-to-peer collaboration mode, where agents work as equals by sharing intermediate results and cross-verifying outputs, achieved the highest average accuracy (54.10%). The reciprocal interactions observed in this mode significantly boost error-checking and reasoning, both of which are crucial for tackling complex mathematical tasks [20], [21]. Moreover, the peer-to-peer collaboration mode exhibited the lowest standard error (SE = 3.91) among all dual-agent approaches, signifying more robust and consistent performance. The teacher-student interaction (52.33%), critical debate (52.76%), and reciprocal peer teaching (51.95%) modes also yield improvements over the single agent (47.43%), although the differences among them are relatively small.

3.2 Dialogue Act Classification Result

Figure 2: DA Classification Results Across Four Modes.

Fig. 2 illustrates the frequency distribution of DAs across the four communication modes, exemplified by the algebra subject. Both agents’ utterances were merged to capture their collective impact on problem solving, and the top five DA tags in each mode are highlighted in the upper histogram. Notably, statements (S), hints (H), and acknowledgments (ACK) emerge as the three most frequent dialogue acts across all modes, contributing substantially to collaborative problem-solving. In particular, statement (S) is the predominant DA across all four modes, with its peak frequency (35.5%) observed in the teacher-student interaction mode. For example, an agent might state, “The algebraic manipulation and the verification steps are clear and logical,” or “However, I would like to explore the solution further to ensure there are no other possible pairs of x and y that satisfy the given conditions.” These examples illustrate how agents express their own perspectives and guide the subsequent moves in problem solving. Hints (H) emerge as the most frequent dialogue act in critical debate (33.0%), where agents actively challenge each other’s solutions. The utterances “Let’s denote the two whole numbers I pick as x and y.” and “According to the problem, my friend picks the numbers x-4 and 2y-1.” serve as hints intended to guide the problem-solving process rather than provide complete solutions. This high prevalence of hints reflects an active effort to explore, test, and improve upon proposed solutions [22]. Acknowledgments (ACK), for instance, “Thank you, Agent B, for your systematic approach to verifying the solution,” or “Thank you for the constructive discussion!”, serve as bridging elements that affirm successful communication between agents. These acknowledgments not only recognize individual contributions but also strengthen the overall collaborative dynamic. Additionally, answers (A) are among the top five most frequently occurring DAs. Overall, the high frequency of these dialogue acts underscores their crucial role in driving the collaborative problem-solving process. Using Kendall’s correlation [23], we found that the DA patterns of peer-to-peer collaboration are highly correlated with those of teacher-student interaction (0.99) and reciprocal peer teaching (0.96), but less so with critical debate (0.67), suggesting that even subtle variations in these patterns can significantly impact performance.
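For reference, the correlation between two modes’ DA frequency profiles can be computed as sketched below; the frequency vectors are illustrative placeholders, not the measured distributions.

    # Sketch: Kendall's tau between the DA frequency profiles of two modes.
    # The frequency vectors are illustrative, not the paper's measurements.
    from scipy.stats import kendalltau

    # Frequencies (%) over the same ordered tag list, e.g. [S, H, ACK, A, Q]
    peer_to_peer    = [34.0, 20.1, 15.2, 9.8, 6.4]
    critical_debate = [25.7, 33.0, 12.5, 8.9, 7.1]

    tau, p_value = kendalltau(peer_to_peer, critical_debate)
    print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")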

4 Conclusion

In this study, we explored the impact of multi-agent communication strategies on mathematical problem-solving using LLM-based tutoring agents. By systematically comparing four distinct dual-agent communication modes (teacher-student interaction, peer-to-peer collaboration, critical debate, and reciprocal peer teaching), we demonstrated that multi-agent frameworks can significantly enhance problem-solving performance relative to single-agent approaches. Notably, the peer-to-peer collaboration mode achieved the highest accuracy, underscoring the benefits of equal, collaborative interaction among agents. Our dialogue act analysis revealed that specific communicative behaviors, such as statements (S), acknowledgments (ACK), and hints (H), play pivotal roles in the agents’ collaborative reasoning process. These findings offer valuable insights into how different interaction patterns contribute to effective problem solving and provide a foundation for designing more adaptive AI tutoring systems. Future work will aim to refine agent communication protocols, integrate advanced prompt optimization techniques, and explore heterogeneous agent configurations to further enhance performance. Overall, our work demonstrates that structured multi-agent communication not only boosts accuracy in mathematical problem-solving but also enriches the underlying dialogue dynamics, paving the way for more robust and human-centric AI educational platforms.

5 Acknowledgments

The research reported here was partially supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305C240010 to the University of Georgia Research Foundation. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

References

[1]
L. Zhang et al., “Predicting learning performance with large language models: A study in adult literacy,” in International conference on human-computer interaction, 2024, pp. 333–353.
[2]
L. Zhang et al., “Data augmentation for sparse multidimensional learning performance data using generative AI,” IEEE Transactions on Learning Technologies, 2025.
[3]
L. Zhang, J. Lin, Z. Kuang, S. Xu, and X. Hu, “SPL: A socratic playground for learning powered by large language model,” arXiv preprint arXiv:2406.13919, 2024.
[4]
W. Tan et al., “Does informativeness matter? Active learning for educational dialogue act classification,” in International conference on artificial intelligence in education, 2023, pp. 176–188.
[5]
G.-G. Lee, E. Latif, X. Wu, N. Liu, and X. Zhai, “Applying large language models and chain-of-thought for automatic scoring,” Computers and Education: Artificial Intelligence, vol. 6, p. 100213, 2024.
[6]
J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[7]
J. Yu et al., “From MOOC to MAIC: Reshaping online teaching and learning through LLM-driven agents,” arXiv preprint arXiv:2409.03512, 2024.
[8]
M. Yue, W. Lyu, W. Mifdal, J. Suh, Y. Zhang, and Z. Yao, “MathVC: An LLM-simulated multi-character virtual classroom for mathematics education,” arXiv preprint arXiv:2404.06711, 2024.
[9]
E. Latif et al., “A systematic assessment of OpenAI o1-preview for higher order thinking in education,” arXiv preprint arXiv:2410.21287, 2024.
[10]
Z. Liang et al., “MathChat: Benchmarking mathematical reasoning and instruction following in multi-turn interactions,” arXiv preprint arXiv:2405.19444, 2024.
[11]
V. Keating, “Zero-shot mathematical problem solving with large language models via multi-agent conversation programming,” in AI for education: Bridging innovation and responsibility at the 38th AAAI annual conference on AI, 2024.
[12]
L. Zhang et al., “Exploring communicative strategies for dual LLM agents in mathematical problem solving,” 2022.
[13]
L. S. Vygotsky, Mind in society: The development of higher psychological processes, vol. 86. Harvard University Press, 1978.
[14]
T. Hagendorff, “Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods,” arXiv preprint arXiv:2303.13988, vol. 1, 2023.
[15]
A. Hurst et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
[16]
D. Hendrycks et al., “Measuring mathematical problem solving with the MATH dataset,” NeurIPS, 2021.
[17]
J. Lin et al., “Is it a good move? Mining effective tutoring strategies from human–human tutorial dialogues,” Future Generation Computer Systems, vol. 127, pp. 194–207, 2022.
[18]
J. Lin et al., “Robust educational dialogue act classifiers with low-resource and imbalanced datasets,” in International conference on artificial intelligence in education, 2023, pp. 114–125.
[19]
A. K. Vail and K. E. Boyer, “Identifying effective moves in tutoring: On the refinement of dialogue act annotation schemes,” in International conference on intelligent tutoring systems, 2014, pp. 199–209.
[20]
H. Li et al., “Theory of mind for multi-agent collaboration via large language models,” in Proceedings of the 2023 conference on empirical methods in natural language processing, Dec. 2023, pp. 180–192, doi: 10.18653/v1/2023.emnlp-main.13.
[21]
X. Bo et al., “Reflective multi-agent collaboration based on large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 138595–138631, 2025.
[22]
K. VanLehn, “The behavior of tutoring systems,” International Journal of Artificial Intelligence in Education, vol. 16, no. 3, pp. 227–265, 2006.
[23]
M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1–2, pp. 81–93, 1938.

  1. The manuscript has been accepted for presentation at the 26th International Conference on Artificial Intelligence in Education (AIED 2025), to be held from July 22–26, 2025, in Palermo, Italy.

  2. https://huggingface.co/datasets/hendrycks/competition_math