May 07, 2025
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints—such as limited local perception and communication, characteristic of
natural swarms—remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete
spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five
foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (\(k\times k\) view) and local
communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs (e.g., deepseek-v3, o4-mini) in a zero-shot setting, we find significant performance variations
across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs
under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit—built upon a customizable and scalable physical system with defined mechanical properties. It
provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is
available at https://github.com/x66ccff/swarmbench.
Large Language Models (LLMs) have demonstrated remarkable capabilities in language understanding and generation [1], leading to growing interest in deploying them as autonomous agents capable of perception, tool use, and collaboration [2], [3]. Consequently, research is increasingly investigating the collaborative potential of LLM-driven agents, particularly in tasks requiring spatial reasoning and interaction, connecting to broader studies of collective intelligence in artificial and human systems [4]. However, current evaluations predominantly focus on individual agent skills or multi-agent scenarios characterized by ample communication, global visibility, or predefined organizational structures [5]–[7]. Such settings often sidestep the fundamental challenge of achieving coordination when agents operate under severe, decentralized constraints.
Inspired by decades of research across biology, physics, and robotics, a critical question remains largely unexplored in the LLM context: Can effective coordination and collective intelligence emerge from the decentralized actions of numerous LLM agents operating with strictly limited perception and communication, akin to natural swarms? This principle forms the bedrock of Swarm Intelligence, which investigates how complex group behaviors arise from simple, local interactions [8]. Nature provides compelling examples, from army ants forming living structures [9], [10] and locusts achieving coordinated marching [11], to microswimmers forming vortices [12]. Seminal simulations, like Reynolds’ flocking model [13], demonstrated that sophisticated global patterns can emerge purely from local rules. This paradigm, emphasizing decentralization, local sensing, and minimal communication, has been successfully applied in Swarm Robotics, where collectives of simple robots achieve complex tasks like shape formation [14], [15]. Therefore, a key unknown is whether LLMs, despite their advanced cognitive capabilities, can effectively participate in such decentralized swarms, bridging individual sophistication and emergent collective action under classic swarm constraints.
Figure 1: Swarm Intelligence: Natural Inspiration and SwarmBench Tasks. Top row: Examples of collective behavior in nature driven by local interactions: (a) cooperative wolf pursuit, (b) firefly
synchronization, (c) ant foraging [9], [10], (d) bird flocking [13], and (e) cooperative ant transport. Bottom row: Corresponding abstract tasks simulated
in SwarmBench’s 2D grid environment, depicting agents (represented by colored squares) facing analogous coordination challenges involving targets (P), food (F), nests (N), and obstacles (B), constrained by walls (W). Agents rely solely on local perception and communication, providing a testbed for emergent decentralized coordination.
Existing benchmarks do not adequately address this question rooted in the classical swarm intelligence paradigm. Some evaluate spatial reasoning fundamentals without multi-agent dynamics [16], while others test single-agent reasoning in complex tasks [17]. Multi-agent benchmarks often employ structured games [18], collaborative tasks with rich communication [5], [6], [19], or scenarios where coordination structures are imposed rather than emerging organically [20]. While valuable, these approaches often do not center on the core challenge of achieving robust decentralized coordination despite severe limitations on perception and communication. Consequently, whether LLM collectives can exhibit complex swarm phenomena—like spontaneous leadership [20], the role of noise/diversity [21], [22], or information cascades—under such stringent, decentralized conditions is a critical open question. Existing approaches often bypass the core constraints necessary to observe such organically emerging complexities, highlighting a gap SwarmBench aims to address.
Figure 2: Conceptual Architecture of SwarmBench. The SwarmBench framework is designed for modularity and extensibility in evaluating LLM-based swarm intelligence. An evaluation begins by launching SwarmBench, which orchestrates the core interactions between the defined task, the simulation environment, the LLM-powered agents, and a comprehensive logger for data capture. Snippets from underlying modules illustrate how SwarmBench implements this: defining diverse coordination tasks, managing the environment and its physics, structuring agent perception and decision-making logic, providing the overall experimental framework, and logging results. This architecture enables systematic benchmarking, generates comprehensive Datasets from experiments, and facilitates the creation of a LeaderBoard for model comparison. The framework is also designed to readily support future extensions, such as the integration of an RLVR Pipeline.
To fill this gap, we introduce SwarmBench, a benchmark specifically designed to evaluate the emergent coordination capabilities of LLM agents acting as individuals in a decentralized swarm. Inspired by benchmarks like SnakeBench [23] and ARC-AGI [24], SwarmBench presents five fundamental multi-agent coordination challenges—Pursuit, Synchronization, Foraging, Flocking, and Transport—within a flexible 2D grid world. Critically, agents operate with restricted perception (a small local view) and minimal, optional local communication, forcing reliance on local cues and implicit coordination. We propose metrics to quantify task success and efficiency, and the nature of the emergent collective behavior, including measures related to behavioral diversity.
The SwarmBench framework (Figure 2) was used to conduct extensive zero-shot evaluations of several prominent LLMs. Our contributions are:
SwarmBench: A novel benchmark grounded in swarm intelligence principles, designed to assess emergent decentralized coordination in LLM swarms under strict perception and communication constraints.
A systematic evaluation of contemporary LLMs on SwarmBench, characterizing their current abilities and limitations in canonical swarm scenarios.
An analysis of emergent group dynamics, connecting LLM swarm behavior (e.g., behavioral variability, failure modes) to established collective intelligence concepts.
An open-source toolkit, built upon a customizable and scalable physical system with defined mechanical properties, comprising environments, standardized prompts, evaluation scripts, and the comprehensive datasets generated, to facilitate reproducible research into LLM-based swarm intelligence.
Our findings indicate that while LLMs exhibit potential for basic coordination, they struggle significantly with long-range planning and robust spatial reasoning under uncertainty when operating under severe decentralization. SwarmBench provides a dedicated platform to measure progress and guide future research towards developing LLMs capable of genuine collective intelligence in decentralized settings. Understanding such capabilities is vital, given the growing focus on collective behavior in artificial and human systems [25], [26].
Our research builds upon the foundations of swarm intelligence, multi-agent systems (MAS), and LLMs. We investigate classical swarm principles using modern LLMs, positioning our work relative to recent multi-agent evaluation methodologies.
Swarm intelligence examines how complex, adaptive group behaviors emerge from local interactions among individuals with relatively simple capabilities, drawing from natural systems like insect colonies, bird flocks, and microscopic swimmers [8], [12], [13]. Biological examples include army ants constructing living bridges [9], [10] and locusts transitioning to coherent marching [11]. This natural blueprint inspired swarm robotics, focusing on coordinating large groups of robots, often with limited individual abilities [27]. Seminal projects like Kilobots demonstrated collective shape formation by a thousand simple robots using only local communication [14], [15]. Further advancements include adaptive control hierarchies like Self-Organizing Neural Systems (SoNS) [28], and specialized simulators like ARGoS [29] and Kilombo [30]. The functional role of diversity or noise, potentially enhancing coherence [21], [22], resonates with our exploration. SwarmBench adopts core operational constraints (local sensing, minimal communication) but substitutes simpler agents with powerful LLMs, to investigate how swarm intelligence manifests with sophisticated cognitive entities. This focus on physical action differentiates our work from data processing approaches like Swarm Learning [31].
Utilizing LLMs as the decision-making core for autonomous agents is a rapidly growing field [2], [32]. LLMs bring vast world knowledge and reasoning, enabling more adaptable agent interactions [33]. This extends to MAS, where LLMs might enhance communication and teamwork [34], [35], reflecting human collective intelligence factors [4].
LLMs are applied in diverse multi-agent contexts: software development (MetaGPT [35]), simulated scientific
discovery [36], [37], complex social simulations [3], [33], [38], [39], and code generation [40]. Studies on emergent
cooperation and Theory of Mind in LLM teams identify promising capabilities alongside consistency limitations [41]. Imposed organizational structures can improve efficiency [20]. LLM integration into multi-robot systems is
also surveyed [42]. However, many
investigations involve rich communication channels, predefined roles, or assume reliable information transfer, contrasting with swarm systems where noise and limited propagation are defining features [43]. A key unaddressed question is how LLMs perform in large, decentralized groups where
coordination emerges from local perception and constrained signaling. SwarmBench directly targets this.
Meaningful evaluation of LLM agents in MAS requires appropriate benchmarks. Recent efforts use cooperative games: LLM-Coordination (Hanabi, Overcooked [18]), Collab-Overcooked (natural language in Overcooked [6]), and COBLOCK (3D construction [44]). SnakeBench uses competitive games [23] without strict swarm constraints. These game-based benchmarks often provide full visibility or structured communication distinct from swarm intelligence’s local-information world. Other benchmarks explore complex tasks: MultiAgentBench (MARBLE) with diverse scenarios, often with roles [5]; VillagerBench (Minecraft group tasks [19]); and Generative Agents (emergent social dynamics [33]). These showcase LLM capabilities but typically rely on higher-level coordination mechanisms. Foundational reasoning is assessed by SpatialEval (spatial understanding [16]) and BALROG (single-player agentic reasoning [17]). LLMs reportedly struggle with patterns like multi-agent flocking [45]. SwarmBench distinguishes itself by concentrating on emergent decentralized coordination within LLM swarms, adopting classical swarm intelligence constraints (restricted perception/communication). It employs foundational, nature-inspired tasks and analyzes collective dynamics often overlooked in benchmarks focused on structured frameworks.
Significant research explores LLMs for coordination within embodied multi-agent systems, often using sophisticated 3D simulators (e.g., AI-THOR, Habitat) for tasks like household assistance and navigation [46]–[48]. These systems frequently address challenges like heterogeneous capabilities, task allocation [49], and integrating perception with planning [50], [51], alongside developing modular architectures [52], world models [53], or deadlock resolution [54]. While demonstrating progress, this body of work typically operates with richer sensory inputs, more sophisticated communication (sometimes low-distortion [43]), or different architectural assumptions (e.g., centralized components). In contrast, SwarmBench specifically isolates and evaluates the emergence of fundamental swarm intelligence driven by minimal local interactions under highly constrained, decentralized conditions, reflecting classical swarm intelligence principles. By focusing on these aspects in a simplified 2D grid world, SwarmBench provides a complementary evaluation focused on raw emergent coordination potential from local constraints.
To evaluate the capacity of Large Language Models (LLMs) for emergent decentralized coordination under constraints typical of swarm intelligence, we introduce SwarmBench. This benchmark provides a suite of multi-agent tasks within a configurable 2D grid-world environment, coupled with standardized protocols for evaluating LLM-driven agents. SwarmBench focuses on scenarios where agents possess only limited local perception and rely on local communication capabilities, necessitating the emergence of collective strategies from decentralized interactions rather than global planning or explicit centralized control. Further details on the environment, agent capabilities, and evaluation protocol are provided in Appendix 11.
SwarmBench utilizes a simulation environment based on a 2D grid world, a design choice aligned with foundational AI benchmarking (e.g., SnakeBench [23], ARC-AGI [24]) to facilitate focused investigation of core coordination dynamics. This environment is a customizable physical system with explicitly modeled mechanical interactions (detailed in Appendix 10). The benchmark includes five core multi-agent coordination tasks: Pursuit, Synchronization, Foraging, Flocking, and Transport. These tasks, visualized in Figure 1 and detailed in Appendix 11, probe different facets of emergent swarm behavior. The environment framework is extensible and supports procedural generation of instances to ensure robust evaluation.
Agents operate with restricted perception, primarily an egocentric \(k \times k\) grid view, and can engage in optional local, anonymous message passing. Based on their local observation (which includes the grid view, self-status, and received messages), agents decide on a primary action (e.g., movement, task-specific) and an optional message. This setup compels reliance on local cues and implicit coordination. Specifics of the observation packet, action space, physics of movement, and communication mechanics are elaborated in Appendix 11, with the agent prompt structure in Appendix 12.
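The egocentric \(k \times k\) view described above can be illustrated with a minimal sketch. It assumes the grid is stored as a list of equal-length strings and that the agent sits at the centre of an odd-sized window; the function name `local_view`, the padding symbol `#`, and the symbols `A` (agent) and `.` (empty cell) are hypothetical and not taken from the SwarmBench code. Only `W` and `F` follow the symbols named in Figure 1.

```python
from typing import List

PAD = "#"  # hypothetical filler for cells outside the grid (an assumption)


def local_view(grid: List[str], row: int, col: int, k: int = 5) -> List[str]:
    """Return the k x k window of `grid` centred on (row, col)."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the agent sits at the centre")
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        cells = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
            cells.append(grid[r][c] if inside else PAD)
        view.append("".join(cells))
    return view


# Toy 5 x 6 world: walls (W), one food item (F), one agent (A), empty cells (.)
grid = [
    "WWWWWW",
    "W..F.W",
    "W.A..W",
    "W....W",
    "WWWWWW",
]

for line in local_view(grid, 2, 2, k=3):
    print(line)
```

With \(k=3\) the agent at row 2, column 2 sees only its eight neighbours, so the food item is visible but most walls are not. Anything outside the window must be inferred from messages or from memory carried in the prompt.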
We employ a zero-shot evaluation protocol where each agent is controlled by an independent, stateless LLM instance. Persistence of memory is managed through the prompt. SwarmBench is model-agnostic; our experiments (Section 4) utilize several contemporary LLMs without task-specific fine-tuning. The detailed evaluation methodology is described in Appendix 11.
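One way to picture "stateless LLM, memory in the prompt" is the sketch below. It is an illustration under stated assumptions, not the SwarmBench implementation: the class `PromptMemoryAgent`, the prompt wording, the memory length, and the stand-in `fake_llm` are all hypothetical. The point is only that the model call receives no hidden state; everything the agent remembers is re-sent as text each round.

```python
from collections import deque
from typing import Callable, Deque


class PromptMemoryAgent:
    """Keeps state outside the model and re-sends it in every prompt."""

    def __init__(self, llm: Callable[[str], str], max_memory: int = 3) -> None:
        self.llm = llm  # any stateless text-in, text-out callable
        self.memory: Deque[str] = deque(maxlen=max_memory)

    def build_prompt(self, observation: str) -> str:
        history = "\n".join(self.memory) if self.memory else "(none)"
        return (
            "Previous rounds:\n" + history + "\n"
            "Current observation:\n" + observation + "\n"
            "Reply with one action."
        )

    def act(self, observation: str) -> str:
        action = self.llm(self.build_prompt(observation)).strip()
        # The only persistence is this text record, bounded by max_memory.
        self.memory.append(f"saw {observation!r}, chose {action}")
        return action


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "UP"


agent = PromptMemoryAgent(fake_llm, max_memory=2)
for obs in ["round 1 view", "round 2 view", "round 3 view"]:
    agent.act(obs)
print(agent.build_prompt("round 4 view"))
```

Because the memory is bounded, the oldest round drops out of the prompt once the limit is reached, which is one plausible source of the long-horizon planning difficulties discussed later.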
To quantitatively analyze emergent collective behaviors, we compute metrics based on agent positions and actions, capturing aspects like behavioral diversity and movement coordination. These metrics, detailed in Appendix 14, facilitate the analysis of emergent strategies and performance.
Figure 3: Overview of LLM Performance on SwarmBench Tasks. Average scores achieved by different LLMs across the five core tasks (Pursuit, Synchronization, Foraging, Flocking, Transport). Bars represent the mean score over 5 simulation runs. Performance varies significantly depending on the model and the specific coordination challenge. Detailed scores and standard deviations are provided in Table 1 in Appendix 15.
We evaluated thirteen contemporary LLMs (Fig. 3) on the five core SwarmBench tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) under the zero-shot protocol. Agents operated with a \(5 \times 5\) local view (\(k=5\)), making decisions based on this restricted perception and the potential for local communication via the MSG action (as detailed in Section 11.2). Performance, averaged over 5 simulation runs per model per task, reveals significant variation based on both the model and the specific coordination challenge, highlighting the difficulty of decentralized coordination under strict local constraints.
Figure 3 visually summarizes the average task scores achieved by the evaluated LLMs across the five core SwarmBench challenges: Pursuit, Synchronization, Foraging, Flocking, and Transport. The results reveal significant performance variability, contingent on both the specific LLM and the nature of the coordination task.
Overall trends indicate that tasks presented varying levels of difficulty, with Flocking generally yielding the highest scores across models, while Synchronization showed greater divergence in performance. Model strengths also differed considerably; for instance, gemini-2.0-flash and o4-mini demonstrated relative strength in spatial tasks like Pursuit and Foraging, whereas claude-3.7-sonnet excelled specifically in Synchronization. Notably, no single model dominated all tasks, and several models (deepseek-r1, claude-3.5-haiku) struggled significantly across the board in this zero-shot setting, underscoring the inherent difficulty of these swarm coordination problems. Figure 4 further illustrates the score progression dynamics over time.
These results emphasize that LLM coordination ability under swarm-like constraints is highly task-dependent and relies heavily on emergent strategies formed from local information. The observed variability points to diverse capabilities and limitations among current models when faced with decentralized coordination challenges. For detailed numerical results, including mean scores and standard deviations over the 5 simulation runs per model and task, please refer to Table 1 in Appendix 15.
Figure 4: LLM Score Progression on SwarmBench Tasks Over Time. Average task score accumulation over 100 simulation rounds for different LLMs across the five core tasks (Pursuit, Synchronization, Foraging, Flocking, Transport). Lines represent the mean score trajectory, and shaded areas indicate the standard deviation across 5 simulation runs. This illustrates the dynamics of performance development during each task.
To understand the behavioral underpinnings of performance across all tasks, we conducted two main analyses on the simulation data.
First, we performed a primary analysis focusing on the correlation between physical group dynamics metrics (defined in Section 3.4 and detailed in Appendix 14) and the final task scores, aggregating data from all 325 simulation runs. This analysis aimed to identify general principles of effective swarm coordination emerging from agents’ physical actions and local observations. The detailed results of this dynamics analysis are presented in Appendix 17. Key findings (see the combined correlations table) indicate that higher scores are significantly correlated with behavioral variability (e.g., std_action_entropy, \(r=0.300\); std_dominant_action_prop, \(r=0.274\)) and efficiency (e.g., prop_stay_actions, \(r=0.297\)). Conversely, excessive movement (prop_move_actions, \(r=-0.222\)) and persistent alignment (avg_polarization_index, \(r=-0.241\)) are negatively correlated with performance. Visualizations are provided (Figures 12, 13 in Appendix 17), and a linear regression model based solely on these dynamics features explains approximately 24.5% of score variance (Table 2 in Appendix 17), emphasizing the primary role of emergent physical coordination.
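To make metric names such as std_action_entropy and avg_polarization_index more concrete, the sketch below computes two quantities of that kind. The exact definitions used in the paper are given in Appendix 14 and are not reproduced here, so the formulas are assumptions: Shannon entropy of the actions chosen in a round, its standard deviation across rounds, and a polarization index taken as the length of the mean unit heading vector. The action names DOWN and LEFT are also assumed.

```python
import math
from collections import Counter
from statistics import pstdev
from typing import List, Sequence, Tuple


def action_entropy(actions: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the actions taken in one round."""
    if not actions:
        return 0.0
    n = len(actions)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(actions).values())


def polarization(headings: Sequence[Tuple[int, int]]) -> float:
    """Length of the mean unit heading: 1.0 if all agents move the same way."""
    units = []
    for dx, dy in headings:
        norm = math.hypot(dx, dy)
        if norm > 0:  # stationary agents carry no heading
            units.append((dx / norm, dy / norm))
    if not units:
        return 0.0
    mean_x = sum(u[0] for u in units) / len(units)
    mean_y = sum(u[1] for u in units) / len(units)
    return math.hypot(mean_x, mean_y)


# Two toy rounds for a four-agent swarm: uniform behavior, then fully mixed.
rounds: List[List[str]] = [
    ["UP", "UP", "UP", "UP"],
    ["UP", "DOWN", "LEFT", "RIGHT"],
]
per_round_entropy = [action_entropy(r) for r in rounds]
std_entropy = pstdev(per_round_entropy)  # variability of entropy across rounds

print(per_round_entropy, std_entropy)
print(polarization([(0, 1), (0, 1)]), polarization([(1, 0), (-1, 0)]))
```

Under these assumed definitions, a swarm that alternates between uniform and mixed behavior has a high entropy standard deviation, which is the kind of behavioral variability the correlations above associate with higher scores.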
Second, to investigate the potential contribution of explicit communication, we conducted a supplementary analysis focused on messages generated by agents via the MSG action. We quantified basic communication patterns (frequency, length) and assessed semantic properties using standard NLP techniques (methodology detailed in Appendix 16). Correlating these communication-related metrics with task scores revealed weaker relationships than observed for physical dynamics (detailed results in Appendix 18). The statistically significant findings from this sampled analysis were:
Message Length: A weak positive correlation between the average length of non-empty messages per run (avg_non_empty_msg_length_run) and score (\(r \approx 0.19\), \(p < 0.001\)).
Semantic Stability: A weak negative correlation between the variability (standard deviation) of semantic similarity among messages within a run (std_similarity_run) and score (\(r \approx -0.17\), \(p < 0.01\)).
Message frequency and average semantic similarity did not show significant correlations in this supplementary analysis (see Appendix 18). These findings suggest that while physical dynamics predominantly drive performance, characteristics of the communication itself—specifically, message length and semantic consistency—may exert a minor influence.
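A quantity in the spirit of std_similarity_run can be sketched as follows. The paper's exact procedure is in Appendix 16 and is not reproduced here; this sketch assumes that each message has already been mapped to an embedding vector (the paper mentions a Sentence-BERT model) and that "semantic similarity" means pairwise cosine similarity. The toy vectors and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import pstdev
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_std(embeddings: List[Sequence[float]]) -> float:
    """Standard deviation of pairwise cosine similarities within one run."""
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return pstdev(sims) if len(sims) > 1 else 0.0


# Toy embeddings: a run with consistent messages and a run with mixed ones.
consistent_run = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
mixed_run = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(similarity_std(consistent_run))
print(similarity_std(mixed_run))
```

Under this reading, a lower value means the messages in a run are more uniformly related to one another, which matches the direction of the weak negative correlation reported above.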
Visual examples of agent trajectories and interactions for each core task are provided in Appendix 13 (Figures 6 through 10). These visualizations illustrate the challenges faced by agents, such as forming effective containment in Pursuit, achieving synchrony in Synchronization, navigating efficiently in Foraging, maintaining cohesion in Flocking, and coordinating pushes in Transport. They also qualitatively reveal differences in strategies and success levels across different models and runs, complementing the quantitative analysis.
Qualitative observation of simulation runs, particularly those resulting in low scores (see Appendix 15 for score distributions and Appendix 13 for visual examples), reveals common failure patterns. In tasks like Pursuit and Transport, agents often struggle with sustained coordination; initial promising formations may dissolve due to individual agents making suboptimal local decisions or failing to interpret implicit cues from neighbours. Cascading failures, where one agent’s poor move disrupts others, were observed. In Foraging, agents sometimes exhibited inefficient exploration or got stuck in loops. The difficulty in robust planning under uncertainty, stemming from the limited local view, appears to be a major factor. These failures are also reflected in the dynamics analysis (Appendix 17), where significant negative correlations with score were found for metrics indicating disorganised or excessive movement (e.g., prop_move_actions, \(r=-0.222\)).
Our SwarmBench evaluations offer key insights into LLM-driven decentralized coordination. The primary analysis (Appendix 17) reveals that emergent physical group dynamics, particularly behavioral flexibility (e.g., std_action_entropy) and efficiency (e.g., prop_stay_actions), are significant drivers of performance, collectively explaining approximately 24.5% of score variance. In contrast, while our supplementary communication analysis (Appendices 16 and 18) suggests that characteristics like message length and semantic consistency have a statistically significant, albeit weaker, positive correlation with outcomes, a substantial portion of performance variation remains unexplained. This indicates that factors intrinsic to the LLMs themselves (such as differences in their pre-trained reasoning capabilities, their proficiency in spatial understanding derived from textual inputs, or their interpretation of the task prompts; see Appendix 12) are highly influential yet not fully captured by our current set of group dynamics metrics.
The evaluated LLMs, operating zero-shot under these strict local constraints, appear to compensate for what might be underdeveloped or inefficient explicit communication strategies by relying more heavily on adaptive physical behaviors and implicit coordination cues. This approach differs notably from many natural swarm systems, which often achieve robust coordination through efficient, low-distortion signaling pathways [43]. The difficulties LLM swarms face in maintaining sustained collective alignment and robust planning under uncertainty likely stem from challenges in spatial reasoning and effective conflict resolution when information is severely limited.
Furthermore, the sensitivity analysis (Appendix 19) highlights that performance is non-trivially dependent on environmental parameters like agent density (\(N\)) and perception range (\(k\)). This underscores the challenge of developing LLM agents that are not only capable of coordination but are also robustly adaptable to varying information availability and group dynamic conditions. Understanding and overcoming these limitations are crucial for harnessing the full potential of LLMs in decentralized, multi-agent systems.
In this work, we introduced SwarmBench, a novel benchmark specifically designed to assess the emergent decentralized coordination capabilities of Large Language Models operating under conditions characteristic of swarm intelligence. Our evaluations reveal that while contemporary LLMs demonstrate foundational abilities for basic coordination within SwarmBench tasks, they face significant challenges in achieving the sophisticated and robust collective behaviors observed in natural systems. Particularly, their emergent collective strategies, especially concerning information propagation and resilience to noise under strict decentralization, do not yet mirror the highly efficient, low-distortion signaling and coordinated action typical of, for instance, bird flocks [43].
This observed gap between individual LLM sophistication and emergent collective intelligence under decentralized constraints highlights a critical direction for future research. Developing LLM-based systems that can more effectively bridge this divide is paramount. SwarmBench offers a systematic platform to measure progress in this domain, guiding the development of LLMs towards more robust, adaptive, and genuinely collective behavior. Such capabilities will be increasingly vital as AI systems become more deeply integrated into complex, real-world contexts requiring decentralized coordination, from autonomous robotic swarms to distributed computational networks.
It is important to acknowledge the scope defined by SwarmBench’s current design. The adoption of a 2D grid world—though the SwarmBench engine is a customizable physical system with defined mechanics (Appendix 10)—is a deliberate choice aligning with foundational AI benchmarks [23], [24]. This facilitates focused investigation of core coordination dynamics but necessarily abstracts from the complexities of continuous 3D spaces relevant to physical swarm robotics. Our primary focus on zero-shot evaluation provides a crucial baseline for LLMs’ intrinsic abilities but defers exploration of adaptive learning mechanisms. The sensitivity analysis (Appendix 19) indicates that results can vary with parameters like agent count (\(N\)) and field of view (\(k\)), implying that findings are contingent on specific configurations. Furthermore, our communication analysis employed a particular Sentence-BERT model; a broader suite of NLP techniques might yield additional insights. These design choices establish clear boundaries, enabling SwarmBench to serve as a foundational platform for systematically dissecting emergent coordination under the specific, well-defined constraints central to classical swarm intelligence.
Building on SwarmBench’s framework and our initial findings, future work will focus on several key areas. These include exploring agent adaptation through learning mechanisms like reinforcement learning or fine-tuning, potentially integrating an RLVR Pipeline (e.g., [55]–[57]), and extending the benchmark with 3D environments, more complex physics, and novel tasks. Deeper investigations into inter-agent communication—analyzing diverse protocols, studying learned messaging, and employing broader NLP methods—are planned. Additionally, developing novel agent architectures and prompting strategies (cf. Appendix 12) to enhance decentralized reasoning under constraint, alongside theoretical models for LLM-based swarms, will be crucial for advancing the field.
As research into decentralized AI systems like LLM-driven swarms progresses, spurred by tools such as SwarmBench, it is imperative to proactively consider the broader societal implications. The enhanced capabilities under development could, if deployed without robust ethical guidelines and safeguards, be misused for purposes such as automated surveillance, sophisticated disinformation campaigns, or the disruption of critical infrastructure. Consequently, technical advancements must be paralleled by dedicated research into mechanisms for controllability, transparency in agent and swarm behavior, and the steadfast alignment of these complex AI systems with human values and societal norms to ensure their responsible development and deployment.
The SwarmBench simulation employs a sophisticated discrete physics engine to govern interactions between agents and objects within the 2D grid world. This engine is designed to resolve complex multi-body pushing scenarios, ensuring that collective actions, such as those required in the Transport task, are subject to consistent and non-trivial physical laws.
Two primary constructs define physical entities:
Mesh: Represents a discrete physical object on the grid. Each Mesh has:
pos: Its global top-left coordinate \((i, j)\).
shape: A 2D array defining its footprint (e.g., a \(1 \times 1\) square for an agent, or a \(1 \times 4\) rectangle for a large obstacle).
static: A boolean indicating if the object is immovable (e.g., walls ‘W’).
mass (\(m\)): Resistance to motion, calculated as \(m = \lfloor \sqrt{\text{area}} \rfloor\), where area is the number of non-empty cells in its shape. For a standard \(1 \times 1\) agent, area is 1, thus mass \(m=1\).
Node: A computational representation used during physics resolution. A Node can represent a single Mesh or, crucially, an aggregate of Meshes that form a Strongly Connected Component (SCC) in the interaction graph (see below). Each Node aggregates:
Total mass and static status of the Mesh(es) it represents.
Net external force applied to it by agents or other Nodes.
Agents are a specific type of Mesh with mass \(m=1\). When an agent performs a movement action (e.g., UP, RIGHT), it attempts to apply a directed force, typically of magnitude \(F=2\), to an adjacent entity or into empty space.
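As a worked example of the mass rule \(m = \lfloor \sqrt{\text{area}} \rfloor\), the sketch below computes the mass of a few footprints. The function name `mesh_mass` and the encoding of a shape as a list of 0/1 rows are assumptions made for illustration; only the formula itself comes from the text.

```python
import math
from typing import List


def mesh_mass(shape: List[List[int]]) -> int:
    """Mass = floor(sqrt(area)), where area counts the non-empty cells."""
    area = sum(1 for row in shape for cell in row if cell)
    return math.isqrt(area)  # integer floor of the square root


agent = [[1]]                  # 1 x 1 footprint, area 1
obstacle = [[1, 1, 1, 1]]      # 1 x 4 footprint, area 4
block = [[1, 1, 1]] * 3        # 3 x 3 footprint, area 9

print(mesh_mass(agent), mesh_mass(obstacle), mesh_mass(block))
```

Because mass grows with the square root of the area, a \(1 \times 4\) obstacle has mass 2, equal to the typical force \(F=2\) of a single agent, while a \(3 \times 3\) block has mass 3 and exceeds it.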
The simulation resolves all potential movements and pushes within a single time step through a multi-stage process:
Contact Graph Construction: The engine identifies all Mesh objects that are adjacent and could potentially exert force on one another based on intended agent actions or ongoing pushes. This forms a directed graph where an edge \(v \to u\) indicates that Mesh \(v\) could potentially push Mesh \(u\).
Strongly Connected Component (SCC) Reduction: Tarjan’s algorithm is applied to the contact graph to identify all SCCs. An SCC represents a group of Meshes that are mutually pushing each other or form a rigid cluster that must move as one unit (or not at all). Each SCC is collapsed into a single aggregate Node. Meshes not part of any cycle become individual Nodes. This process transforms the potentially cyclic contact graph into a Directed Acyclic Graph (DAG) of Nodes, representing the pathways of force transmission. The mass and applied forces for an aggregate Node are summed from its constituent Meshes.
Integer Linear Program (ILP) Formulation and Solution: The core of the physics resolution is an ILP problem formulated and solved using the PuLP library.
Variables: Binary variables \(x_v\) indicate if Node \(v\) moves; continuous variables represent net forces on nodes and forces transmitted between connected nodes in the DAG.
Objective: Maximize \(\sum x_v\), i.e., maximize the number of (aggregate) Nodes that are successfully moved.
Key Constraints:
Movement Condition: A Node \(v\) can only move (\(x_v=1\)) if the total net force \(F_{\text{net},v}\) acting on it in the direction of potential movement is greater than or equal to its total mass \(m_v\) (i.e., \(F_{\text{net},v} \ge m_v\)).
Force Transmission: Force is transmitted along the DAG. A Node \(v\) can only exert force on its children in the DAG if it itself moves (\(x_v=1\)) and has sufficient "leftover" force (\(F_{\text{net},v} - m_v\)).
Static Objects: Nodes marked as static (e.g., containing walls) are constrained such that \(x_v=0\).
Grid Boundaries: Movement is implicitly constrained by grid boundaries and collisions with other static objects, handled by the graph construction.
The ILP solver finds the optimal set of Nodes that can move simultaneously while satisfying all physical constraints.
Position Update: The global positions of the Meshes belonging to the Nodes determined to be movable by the ILP solution are updated on the simulation grid.
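The SCC reduction step can be sketched in plain Python. This is an illustrative version under stated assumptions, not the benchmark's code: the contact graph is assumed to be a dictionary mapping each Mesh identifier to the identifiers it could push, and the names `tarjan_scc`, `condense`, and the example graph are hypothetical. The ILP stage is not reproduced here.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph maps each vertex to a list of vertices it could push.
    """
    index, low, on_stack = {}, {}, set()
    stack, sccs = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for u in graph.get(v, []):
            if u not in index:
                visit(u)
                low[v] = min(low[v], low[u])
            elif u in on_stack:
                low[v] = min(low[v], index[u])
        if low[v] == index[v]:  # v is the root of an SCC
            comp = []
            while True:
                u = stack.pop()
                on_stack.discard(u)
                comp.append(u)
                if u == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def condense(graph, mass):
    """Collapse each SCC into one aggregate node with summed mass."""
    sccs = tarjan_scc(graph)
    owner = {v: comp for comp in sccs for v in comp}
    node_mass = {comp: sum(mass[v] for v in comp) for comp in sccs}
    dag = {comp: set() for comp in sccs}
    for v, targets in graph.items():
        for u in targets:
            if owner[v] != owner[u]:  # keep only edges between distinct SCCs
                dag[owner[v]].add(owner[u])
    return dag, node_mass


# Hypothetical contact graph: a and b push each other, and b also pushes box.
contact = {"a": ["b"], "b": ["a", "box"], "box": []}
dag, node_mass = condense(contact, {"a": 1, "b": 1, "box": 2})
```

In this example the mutually pushing pair collapses into one aggregate node of mass 2 with a single outgoing edge to the box, which is the acyclic structure the ILP stage is described as operating on.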
This physics model, particularly the SCC reduction and ILP-based resolution, allows SwarmBench to simulate complex, emergent physical interactions that require genuine coordination, such as multiple agents cooperatively pushing a heavy object that no single agent could move alone.
Figure 5: Illustration of the core physics principle: an applied force (\(F\)) versus object mass (\(m\)). An agent applies a standard force (typically \(F=2\)). An object moves if the net applied force is greater than or equal to its mass (\(F \ge m\)). Top: A single agent (\(F=2\)) pushes an object of \(m=1\). Since \(2 \ge 1\), the object moves. Middle: A single agent (\(F=2\)) pushes an object of \(m=2\). Since \(2 \ge 2\), the object moves. Bottom: Two agents cooperatively push. Agent 3b pushes Agent 3a, which in turn pushes the object. The effective combined force (e.g., \(F_{\text{total}} \gtrsim 3\) after accounting for Agent 3a’s own mass if it also moves) is applied to the object of \(m=2\). Since \(F_{\text{total}} \ge 2\), the object moves. The ILP solver handles the precise calculation of force transmission and movement thresholds in such multi-body scenarios based on the DAG of interactions.
This appendix provides a detailed description of the SwarmBench environment, agent capabilities, and the evaluation protocol used in our experiments, complementing Section 3 of the main text.
SwarmBench utilizes a simulation environment based on a 2D grid world where multiple agents (\(N\) agents), controlled by LLMs, operate and interact. The adoption of a 2D grid world, while an abstraction, is a deliberate design choice aligned with foundational AI benchmarking practices (e.g., SnakeBench [23] and ARC-AGI [24]). This facilitates a focused investigation of core coordination dynamics while maintaining tractable complexity for initial explorations. This environment itself is designed as a customizable and scalable physical system, where mechanical interactions such as forces and multi-body dynamics (further detailed in Appendix 10) are explicitly modeled.
The simulation proceeds in discrete time steps (rounds, \(t=1, \dots, T\)). In each round, all agents perceive their local environment (including messages from the previous round) simultaneously and decide upon their next action and potential message based on the state at the beginning of the round. Environment updates, including agent movement and object interactions, occur only after all agents have committed to their actions for that round. Interactions between agents and objects, particularly pushing and collision resolution, are governed by this discrete physics simulation that handles complex multi-body dynamics, ensuring that the mechanical properties of the system are consistently applied.
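The simultaneous-decision semantics can be made concrete with a small sketch. All names here (`run_round`, `step_toward_other`, `apply_actions`) and the one-dimensional toy world are hypothetical; the sketch only illustrates that every agent decides from the same start-of-round snapshot and that the state changes after all actions are committed.

```python
def run_round(state, policies, apply_actions):
    """One synchronous round: observe, commit, then update."""
    # Every agent decides from the same snapshot taken at the start of the round.
    snapshot = dict(state)
    committed = {agent: policy(agent, snapshot) for agent, policy in policies.items()}
    # The environment changes only after all actions have been committed.
    return apply_actions(state, committed)


def step_toward_other(agent, snapshot):
    """Toy policy on a 1D line: move one cell toward the other agent."""
    other = next(name for name in snapshot if name != agent)
    delta = snapshot[other] - snapshot[agent]
    return (delta > 0) - (delta < 0)  # sign of delta: -1, 0, or +1


def apply_actions(state, committed):
    return {agent: pos + committed[agent] for agent, pos in state.items()}


state = {"0": 0, "1": 4}
policies = {"0": step_toward_other, "1": step_toward_other}
state = run_round(state, policies, apply_actions)
print(state)
```

Because neither agent sees the other's move before committing, both step inward in the same round rather than one reacting to the other.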
The benchmark includes several core multi-agent coordination tasks designed to probe different facets of emergent swarm behavior (visualized in Figure 1 in the main text and detailed with examples in Appendix 13):
Pursuit: Agents (e.g., ‘0’-‘9’) must collaboratively track and corner a faster-moving prey (‘P’). Tests coordination for containment, potentially aided by communication.
Synchronization: Agents aim to synchronize an internal binary state (‘Number’ vs. ‘$Number’) across the swarm and collectively alternate this state via a SWITCH action. Assesses consensus formation leveraging local cues and communication.
Foraging: Agents navigate an environment with walls (‘W’) to find a food source (‘F’) and transport it to a nest (‘N’), changing appearance (‘Number’ to ‘$Number’) while carrying. Evaluates exploration, pathfinding, and potential communication-driven task allocation.
Flocking: Agents must move as a cohesive group, maintaining alignment and separation while potentially navigating towards a target or avoiding obstacles. Tests emergent formation control and coordinated movement.
Transport: Multiple agents must cooperate to push a large object (‘B’) towards a designated goal area. Tests coordinated force application and navigation around obstacles.
The environment framework supports additional tasks (e.g., Obstacle Pushing, Shape Formation) and is extensible. Interactions follow simplified physics rules detailed in Appendix 10. Environment instances, including initial agent positions, object placements, and potentially other environmental features, are procedurally generated based on a random seed. To ensure robust evaluation and prevent overfitting to specific scenarios, each simulation run across different models or trials utilizes a distinct seed. This presents varied initial conditions and environmental layouts, promoting the development of generalizable coordination strategies.
Consistent with the goal of studying emergent behavior from local information, agents operate with significantly restricted perception. The primary input is an egocentric \(k \times k\) grid view (e.g., \(5 \times 5\) in our main experiments) centered on the agent at position \(\mathbf{x}_{i,t} \in \mathbb{R}^2\). This view displays local entities using symbols: the agent itself (‘Y’), other agents (by ID, e.g., ‘1’/‘$1’), walls (‘W’), obstacles (‘B’), empty space (‘.’), off-map markers (‘*’), and task-specific objects (‘P’, ‘N’, ‘F’). The view includes global coordinate labels.
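A minimal sketch of how such an egocentric view could be cut from a global grid is given below. The function name `local_view` and the toy grid are hypothetical, the grid is assumed to be a list of equal-length strings, and the global coordinate labels mentioned above are omitted for brevity.

```python
def local_view(grid, pos, k=5):
    """Return the k x k egocentric view around pos as a list of strings."""
    rows, cols = len(grid), len(grid[0])
    half = k // 2
    ci, cj = pos
    view = []
    for i in range(ci - half, ci + half + 1):
        line = []
        for j in range(cj - half, cj + half + 1):
            if (i, j) == (ci, cj):
                line.append("Y")  # the observing agent sees itself as Y
            elif 0 <= i < rows and 0 <= j < cols:
                line.append(grid[i][j])
            else:
                line.append("*")  # off-map marker
        view.append("".join(line))
    return view


# Toy 4 x 4 map with walls and two agents (IDs 0 and 1).
grid = [
    "WWWW",
    "W.1W",
    "W0.W",
    "WWWW",
]
view = local_view(grid, pos=(2, 1), k=3)
print("\n".join(view))
```

Cells beyond the map boundary are filled with the off-map marker, so an agent near an edge still receives a full \(k \times k\) window.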
The full observation packet provided to the LLM includes:
The local \(k \times k\) grid view.
The agent’s global coordinates \(\mathbf{x}_{i,t}\).
Task-specific status (e.g., carrying_food).
Messages received from other agents in the previous round (\(t-1\)). An agent receives a message only if it was within the sender’s perception radius at time \(t-1\).
The task description and current progress indicators (e.g., score).
A limited history of the agent’s own recent observations and actions (e.g., the last memory=5 rounds).
The detailed structure and content of the prompt given to the LLM are provided in Appendix 12.
Based on this observation, the agent’s LLM must decide on two outputs for round \(t\):
A primary action \(A_{i,t}\) chosen from a set \(\mathcal{A}\) typically including basic movements (UP, DOWN, LEFT, RIGHT, STAY). Movement actions correspond to an agent attempting to apply a directed force (default magnitude = 2). Agents and objects possess inherent weight (referred to as ‘mass’ in the simulation, default agent mass = 1 calculated from a 1x1 size). Movement or pushing only occurs if the net applied force is at least the resistance (mass) of the target object(s), considering potentially complex chain reactions resolved by the physics engine (see Appendix 10). Task-specific actions (e.g., SWITCH, PICKUP, DROP) are also included.
A message \(M_{i,t}\) (a string, potentially empty) intended for local broadcast via the MSG action.
The message \(M_{i,t}\) (if non-empty) is broadcast locally and anonymously to agents within the sender’s perception radius, becoming part of their observation packet in the next round (\(t+1\)). Messages are subject to a character limit (e.g., 120 characters). This setup compels reliance on interpreting local visual cues and utilizing the constrained communication channel for effective coordination.
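The local, anonymous broadcast can be sketched as follows. This is an illustration under assumptions rather than the benchmark's code: the perception radius is taken to be \(\lfloor k/2 \rfloor\) in Chebyshev distance (matching a square \(k \times k\) view), truncation is assumed to keep the first 120 characters and append "...", and the names `cap_message` and `deliver` are hypothetical.

```python
MAX_LEN = 120  # character limit stated in the text


def cap_message(text, limit=MAX_LEN):
    """Truncate an over-long message, marking the cut with '...' (assumed form)."""
    return text if len(text) <= limit else text[:limit] + "..."


def deliver(positions, outgoing, k=5):
    """Map each agent to the anonymous messages it will see next round."""
    radius = k // 2  # assumed: square view of side k centred on the sender
    inbox = {agent: [] for agent in positions}
    for sender, text in outgoing.items():
        if not text:
            continue  # empty messages are not broadcast
        si, sj = positions[sender]
        for receiver, (ri, rj) in positions.items():
            if receiver == sender:
                continue
            if max(abs(ri - si), abs(rj - sj)) <= radius:
                inbox[receiver].append(cap_message(text))  # no sender ID attached
    return inbox


positions = {"0": (5, 3), "1": (4, 5), "2": (10, 10)}
outgoing = {"0": "push LEFT", "1": "", "2": "anyone?"}
inbox = deliver(positions, outgoing)
print(inbox)
```

Only the text is stored in each inbox, which reflects the anonymity of the channel: a receiver cannot tell which neighbor sent a given message unless the sender states it.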
We define a standardized protocol focusing on zero-shot LLM evaluation. Each agent \(i\) is controlled by an independent LLM instance. In round \(t\), the agent receives its full observation packet (including received messages from \(t-1\)), formulates a prompt containing this information (see Appendix 12), and queries the LLM. Each query is stateless regarding the LLM’s internal conversational context; persistence is managed via the prompt’s explicit inclusion of observation history and received messages.
The LLM response is parsed to extract the intended primary action \(A_{i,t} \in \mathcal{A}\) and the message content \(M_{i,t}\). An episode ends when the task success criteria are met or the maximum round limit (max_round) is reached.
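One way the parsing step could look is sketched below. The regular expressions, the fallback to STAY for missing or unrecognized actions, and the name `parse_response` are assumptions made for illustration; the actual parser in the repository may differ.

```python
import re

ACTIONS = {"UP", "DOWN", "LEFT", "RIGHT", "STAY"}


def parse_response(text, actions=ACTIONS, default="STAY"):
    """Extract (action, message) from a free-form LLM reply."""
    action_hits = re.findall(r"ACTION:\s*\[?([A-Za-z_]+)\]?", text)
    msg_hits = re.findall(r"MSG:\s*(.*)", text)
    # Use the last occurrence, since the reply is asked to end with its choice.
    action = action_hits[-1].upper() if action_hits else default
    if action not in actions:
        action = default  # assumed fallback for unrecognized actions
    message = msg_hits[-1].strip() if msg_hits else ""
    return action, message


reply = 'Analysis...\nACTION: UP \nMSG: "At (5,3), moving UP to (4,3)."'
action, message = parse_response(reply)
print(action, message)
```

Task-specific actions such as SWITCH or PICKUP would be passed through the `actions` argument for the levels that define them.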
SwarmBench is model-agnostic. Our experiments (Section 4) utilize several contemporary closed-source (API-based) and open-source LLMs, evaluated without task-specific fine-tuning to assess their inherent zero-shot coordination potential derived from pre-training.
The following box shows the exact structure and content of the prompt string generated by the gen_prompt function and provided to each LLM agent in SwarmBench at each decision step. Placeholders within curly braces (e.g., {name}, {task_desc}, {view_str}) are dynamically filled with actual simulation data during runtime.
"""You are Agent {name}, operating in a multi-agent environment. Your goal is to complete the task through exploration and collaboration.
Task description:
{task_desc}
Round: {round_num}
Your recent {self.memory}-step vision (not the entire map):
{view_str}
Your current observation:
{level_obs_str}
Message you received:
{messages_str}
Your action history:
{history_str}
Symbol legend:
- Number: An agent whose id is this number (do not mistake column no. and line no. as agent id).
- Y: Yourself. Others see you as your id instead of "Y".
- W: Wall.
- B: Pushable obstacle (requires at least 5 agents pushing in the same direction).
- .: Empty space (you can move to this area).
- *: Area outside the map.
And other symbols given in task description (if any).
Available actions:
1. UP: Move up
2. DOWN: Move down
3. LEFT: Move left
4. RIGHT: Move right
5. STAY: Stay in place
6. MSG: Send a message
And other actions given in task description (if any).
Physics rules:
1. Your own weight is 1, and you can exert a force of up to 2.
2. An object (including yourself) can only be pushed if the total force in one direction is greater than or equal to its weight.
3. Static objects like W (walls) cannot be pushed; only B can be pushed.
4. Force can be transmitted, but only between directly adjacent objects. That means, if an agent is applying force in a direction, you can push that agent from behind to help.
5. Only pushing is allowed - there is no pulling or lateral dragging. In other words, to push an object to the right, you must be on its left side and take the RIGHT action to apply force.
Message rules:
1. A message is a string including things you want to tell other agents.
2. Your message can be received by all agents within your view, and you can receive messages from all agents within your view.
3. Messages are broadcast-based. The source of a message is anonymous.
4. Write only what’s necessary in your message. Avoid any ambiguity in your message.
5. Messages is capped to no more than 120 characters, exceeding part will be replaced by "...".
Other rules:
1. Coordinates are represented as (i, j), where i is the row index and j is the column index. Your 5x5 vision uses global coordinates, so please use global coordinates.
2. The direction of increasing i is downward, and increasing j is to the right.
3. Objects that are completely outside the map (marked with "*") will be removed.
Please think carefully and choose your next action. You will need to collaborate with other agents to successfully complete the task.
Your response should include:
1. Analysis of the current situation
2. Your decision and reasoning
3. The message to be left (if any)
End your response clearly with your chosen action: "ACTION: [YOUR_ACTION]" and/or "MSG: [Your message (no line breaks).]"
"""
You are Agent Agent_5, operating in a multi-agent environment. Your goal is to complete the task through exploration and collaboration.
Task description:
The boundary of the map is surrounded by walls (denoted as W), with a gap leading to the outside of the map (denoted as ’*’). The gap is blocked by an obstacle (denoted as B).
The goal is to first locate the obstacle (B), then have five robots simultaneously push it through the exit, and finally escape to the outside of the map (denoted as ’*’).
Round: 62
Your recent 5-step vision (not the entire map):
Current Step:
BBBB
..Y.
.....
.....
1 Steps Before:
BBBB
.
..Y.
.....
.....
2 Steps Before:
BBBB
.
..Y.
.....
.....
3 Steps Before:
BBBB
.
..Y.
.....
.....
4 Steps Before:
BBBB
.
..Y.
.....
.....
Your current observation:
Message you received:
Message: "At (5,3), moving UP to (4,3) for LEFT push on B at (3,3). Ready for 5-force."
Message: "Moving to (4,5) to join LEFT push on B at (3,5). Ready for 5-force push."
Message: "Moving RIGHT to (4,3) then (4,4) for LEFT push on B. Ready to join force."
Message: "At (4,5) pushing LEFT on B at (3,5). 5-force confirmed. Executing now."
Your action history:
Round 58: Action: UP, Message: "Moving to (4,3) then (3,4) for LEFT push on B. Need 4 more at (3,4)/(4,4)."
Round 59: Action: UP, Message: "Moving to (4,3) then (3,4) for LEFT push on B. Need 4 more at (3,4)/(4,4)."
Round 60: Action: UP, Message: ""Moving to (3,3) for LEFT push on B. Need 1 more at (3,3)/(4,3) to complete 5-force.""
Round 61: Action: UP, Message: ""Moving to (4,3) for LEFT push on B at (3,3). Ready to join 5-force push.""
Round 62: Action: UP, Message: "At (5,3), moving UP to (4,3) for LEFT push on B at (3,3). Ready for 5-force."
Symbol legend:
- Number: An agent whose id is this number (do not mistake column no. and line no. as agent id).
- Y: Yourself. Others see you as your id instead of "Y".
- W: Wall.
- B: Pushable obstacle (requires at least 5 agents pushing in the same direction).
- .: Empty space (you can move to this area).
- *: Area outside the map.
And other symbols given in task description (if any).
Available actions:
1. UP: Move up
2. DOWN: Move down
3. LEFT: Move left
4. RIGHT: Move right
5. STAY: Stay in place
6. MSG: Send a message
And other actions given in task description (if any).
Physics rules:
1. Your own weight is 1, and you can exert a force of up to 2.
2. An object (including yourself) can only be pushed if the total force in one direction is greater than or equal to its weight.
3. Static objects like W (walls) cannot be pushed; only B can be pushed.
4. Force can be transmitted, but only between directly adjacent objects. That means, if an agent is applying force in a direction, you can push that agent from behind to help.
5. Only pushing is allowed - there is no pulling or lateral dragging. In other words, to push an object to the right, you must be on its left side and take the RIGHT action to apply force.
Message rules:
1. A message is a string including things you want to tell other agents.
2. Your message can be received by all agents within your view, and you can receive messages from all agents within your view.
3. Messages are broadcast-based. The source of a message is anonymous.
4. Write only what’s necessary in your message. Avoid any ambiguity in your message.
5. Messages is capped to no more than 120 characters, exceeding part will be replaced by "...".
Other rules:
1. Coordinates are represented as (i, j), where i is the row index and j is the column index. Your 5x5 vision uses global coordinates, so please use global coordinates.
2. The direction of increasing i is downward, and increasing j is to the right.
3. Objects that are completely outside the map (marked with "*") will be removed.
Please think carefully and choose your next action. You will need to collaborate with other agents to successfully complete the task.
Your response should include:
1. Analysis of the current situation
2. Your decision and reasoning
3. The message to be left (if any)
End your response clearly with your chosen action: "ACTION: [YOUR_ACTION]" and/or "MSG: [Your message (no line breaks).]"
{
"response": "### Analysis of the Current Situation:\n1. **Obstacle (B) Location**: The obstacle B is located at positions (3,2), (3,3), (3,4), and (3,5). The exit is likely at one of these positions, but the exact exit (*) is not visible in the current vision. However, based on the messages, the agents are focusing on pushing B at (3,3) and (3,5) LEFT (towards column 1).\n2. **Agent Positions**: \n - You (Y) are at (5,3). \n - Other agents are at (4,3), (4,5), and possibly others not visible in the current vision. \n - Messages indicate that agents are ready for a 5-force push on B at (3,3) and (3,5).\n3. **Messages**: \n - Agents are coordinating to push B LEFT at (3,3) and (3,5). \n - A 5-force push is confirmed for B at (3,5), and agents are moving into position for B at (3,3).\n4. **Your Role**: You are at (5,3) and can move UP to (4,3) to join the push on B at (3,3). Alternatively, you could move to (4,5) to assist with the push on B at (3,5), but the messages suggest the push at (3,5) is already confirmed with enough agents.\n\n### Decision and Reasoning:\n- **Primary Goal**: Assist in the 5-force push on B at (3,3). The messages indicate that agents are ready at (4,3) and (4,4) for this push. You are at (5,3) and can move UP to (4,3) to join the push.\n- **Action**: Move UP to (4,3) to align yourself for the LEFT push on B at (3,3). This will contribute to the 5-force needed to move the obstacle.\n- **Message**: Confirm your readiness to join the push at (3,3) to ensure coordination.\n\n### Message:\n\"At (5,3), moving UP to (4,3) for LEFT push on B at (3,3). Ready for 5-force.\"\n\n### Action:\nACTION: UP \nMSG: \"At (5,3), moving UP to (4,3) for LEFT push on B at (3,3). Ready for 5-force.\"",
"action": "UP",
"message": "\"At (5,3), moving UP to (4,3) for LEFT push on B at (3,3). Ready for 5-force.\""
}
This appendix provides visual examples of the simulation environment for each of the five core SwarmBench tasks. Each figure shows a snapshot from a simulation run, illustrating agent positions, environment features, and the overall task objective.
Figure 6: Example visualization for the Pursuit task. Agents (0-11) attempt to surround the prey (P).
Figure 7: Example visualization for the Synchronization task. Agents (Number/$Number) aim to reach a consensus state.
Figure 8: Example visualization for the Foraging task. Agents (Number/$Number) collect food (F) and return it to the nest (N).
Figure 9: Example visualization for the Flocking task. Agents (0-11) attempt to move cohesively.
Figure 10: Example visualization for the Transport task. Agents (0-11) coordinate to push a large obstacle (B).
To quantitatively analyze emergent collective behaviors, we compute metrics primarily based on agent positions \(\mathbf{x}_{i,t}\) and their primary actions \(A_{i,t}\) (which include movements like UP, DOWN, LEFT, RIGHT, inaction STAY, and any task-specific primary actions defined for a level). While message content analysis is possible (see Appendix 16), the metrics below focus on overt behavioral patterns and spatial configurations.
The following metrics characterize the distribution and diversity of primary actions taken by the agents.
Action Proportions: Calculate the overall frequency of specific types of primary actions across all agents and all rounds. For example, the proportion of ‘stay’ actions is computed as: \[\texttt{prop\_stay\_actions} = \frac{1}{N T} \sum_{t=1}^{T} \sum_{i=1}^{N} \mathbb{I}(A_{i,t} = \texttt{STAY}) \label{eq:prop95stay}\tag{1}\] where \(N\) is the number of agents, \(T\) is the total number of rounds, \(A_{i,t}\) is the primary action of agent \(i\) at round \(t\), \(\mathcal{A}\) is the set of all available primary actions, and \(\mathbb{I}(\cdot)\) is the indicator function (1 if the condition is true, 0 otherwise). Proportions for other action subsets (e.g., movement actions \(\mathcal{A}_{\text{move}} = \{\texttt{UP}, \texttt{DOWN}, \texttt{LEFT}, \texttt{RIGHT}\}\)) are calculated similarly (e.g., prop_move_actions).
Per-Round Primary Action Entropy (\(H_t\)): Measures the instantaneous diversity (unpredictability) of primary actions within a single round \(t\). It is calculated using the Shannon entropy formula based on the proportion \(p_t(a)\) of agents performing action \(a \in \mathcal{A}\) in round \(t\): \[H_t = -\sum_{a \in \mathcal{A}} p_t(a) \log_2 p_t(a) \quad \text{where} \quad p_t(a) = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(A_{i,t} = a) \label{eq:round95entropy}\tag{2}\] By convention, \(0 \log_2 0 = 0\). We compute the average over all rounds: \(\texttt{avg\_action\_entropy} = \frac{1}{T} \sum_{t=1}^T H_t\), and the standard deviation: \(\texttt{std\_action\_entropy} = \sqrt{\frac{1}{T} \sum_{t=1}^T (H_t - \overline{H})^2}\), where \(\overline{H}\) is the mean entropy over rounds.
Total Primary Action Entropy (\(H_{\text{total}}\)): Measures the overall diversity of primary actions used throughout the entire episode, considering the global frequency \(p(a)\) of each action \(a \in \mathcal{A}\): \[\texttt{action\_entropy\_total} = -\sum_{a \in \mathcal{A}} p(a) \log_2 p(a) \quad \text{where} \quad p(a) = \frac{1}{N T} \sum_{t=1}^T \sum_{i=1}^N \mathbb{I}(A_{i,t} = a) \label{eq:total95entropy}\tag{3}\]
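Equations (1) to (3) can be computed directly from a log of primary actions. The sketch below assumes the log is a list of rounds, each a list with one action per agent; the function names and the two-round example are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of action occurrences."""
    total = sum(counts.values())
    # Actions with zero count never appear in the Counter, so 0 * log2(0) is skipped.
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def action_metrics(actions):
    """actions[t][i] is the primary action of agent i in round t."""
    T, N = len(actions), len(actions[0])
    flat = [a for round_actions in actions for a in round_actions]
    per_round = [entropy(Counter(r)) for r in actions]  # H_t, Eq. (2)
    mean_h = sum(per_round) / T
    return {
        "prop_stay_actions": flat.count("STAY") / (N * T),  # Eq. (1)
        "avg_action_entropy": mean_h,
        "std_action_entropy": math.sqrt(sum((h - mean_h) ** 2 for h in per_round) / T),
        "action_entropy_total": entropy(Counter(flat)),  # Eq. (3)
    }


log = [
    ["UP", "UP", "UP", "UP"],        # all agents agree: H_1 = 0 bits
    ["UP", "DOWN", "LEFT", "STAY"],  # four distinct actions: H_2 = 2 bits
]
metrics = action_metrics(log)
print(metrics)
```

The example shows how the per-round and total entropies differ: the per-round average is 1 bit, while the episode-level entropy is higher because it pools the action counts across both rounds.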
The following metrics assess the alignment and uniformity of agent actions related to movement or staying put. Let the set of coordination-relevant actions be \(\mathcal{A}_{\text{coord}} = \mathcal{A}_{\text{move}} \cup \{\texttt{STAY}\} \subseteq \mathcal{A}\).
Per-Round Dominant Action Proportion (\(D_t\)): Measures the proportion of agents performing the single most frequent action within the set \(\mathcal{A}_{\text{coord}}\) at round \(t\). Let \(N_t^{\text{coord}} = \sum_{i=1}^N \mathbb{I}(A_{i,t} \in \mathcal{A}_{\text{coord}})\) be the number of agents performing a coordination-relevant action in round \(t\). If \(N_t^{\text{coord}} > 0\): \[D_t = \max_{a \in \mathcal{A}_{\text{coord}}} \left( \frac{1}{N_t^{\text{coord}}} \sum_{i=1}^N \mathbb{I}(A_{i,t} = a) \right) \label{eq:dominant95prop}\tag{4}\] Otherwise, \(D_t = 0\). We compute the average \(\texttt{avg\_dominant\_action\_prop} = \frac{1}{T} \sum_{t=1}^T D_t\) and the standard deviation \(\texttt{std\_dominant\_action\_prop} = \sqrt{\frac{1}{T} \sum_{t=1}^T (D_t - \overline{D})^2}\) (where \(\overline{D}\) is the mean) to measure the average level and temporal variability of behavioral uniformity, respectively.
Per-Round Polarization Index (\(P_t\)): Measures the degree of alignment of intended movement vectors \(\mathbf{v}(a)\) associated with primary actions in \(\mathcal{A}_{\text{coord}}\). We assign standard unit vectors for movement (e.g., \(\mathbf{v}(\texttt{UP})=(0,-1)\), \(\mathbf{v}(\texttt{RIGHT})=(1,0)\), assuming grid coordinates where row index increases downwards) and a zero vector for inaction (\(\mathbf{v}(\texttt{STAY})=(0,0)\)). The index is the magnitude of the average movement vector:
\[P_t = \left\| \frac{1}{N} \sum_{i=1}^{N} \mathbf{v}(A_{i,t}) \right\|_2 \quad \text{(where } A_{i,t} \in \mathcal{A}_{\text{coord}}) \label{eq:polarization}\tag{5}\]
where \(\|\cdot\|_2\) denotes the Euclidean norm (vector length). A value near 1 indicates strong alignment, while a value near 0 indicates disordered movement or widespread inaction. We compute the average avg_polarization_index and the standard deviation std_polarization_index.
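The two coordination metrics, Eqs. (4) and (5), can be sketched for a single round as follows. The vectors for DOWN and LEFT are inferred by symmetry from the two given in the text, and treating actions outside \(\mathcal{A}_{\text{coord}}\) as contributing a zero vector to \(P_t\) is an assumption; the function names are hypothetical.

```python
import math
from collections import Counter

# v(UP) and v(RIGHT) follow the text; DOWN and LEFT are the assumed opposites.
VEC = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0), "STAY": (0, 0)}


def dominant_action_prop(round_actions):
    """D_t: share of the most frequent action among coordination-relevant ones."""
    coord = [a for a in round_actions if a in VEC]
    if not coord:
        return 0.0  # no coordination-relevant action this round
    return Counter(coord).most_common(1)[0][1] / len(coord)


def polarization_index(round_actions):
    """P_t: length of the mean intended-movement vector over all N agents."""
    n = len(round_actions)
    sx = sum(VEC.get(a, (0, 0))[0] for a in round_actions)
    sy = sum(VEC.get(a, (0, 0))[1] for a in round_actions)
    return math.hypot(sx / n, sy / n)


aligned = ["LEFT", "LEFT", "LEFT", "LEFT"]
scattered = ["UP", "DOWN", "LEFT", "RIGHT"]
print(dominant_action_prop(aligned), polarization_index(aligned))
print(dominant_action_prop(scattered), polarization_index(scattered))
```

The two metrics are not redundant: a round in which every agent chooses STAY has \(D_t = 1\) but \(P_t = 0\), since uniform inaction produces no net movement vector.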
In addition to these behavioral metrics, standard spatial metrics such as average pairwise distance (avg_pairwise_distance) measuring group dispersion and average centroid speed (avg_centroid_speed) measuring collective displacement are computed based on agent positions \(\mathbf{x}_{i,t}\) over time using their standard definitions. These combined metrics provide a quantitative basis for analyzing the emergent strategies and collective dynamics discussed in Section 4 and Appendix 17.
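Since the text refers only to "standard definitions" for the spatial metrics, the sketch below states one plausible reading explicitly: Euclidean distance, averaged over all agent pairs and then over rounds, and the centroid displacement between consecutive rounds. These choices and the function names are assumptions, not confirmed details of the evaluation scripts.

```python
import math
from itertools import combinations


def avg_pairwise_distance(positions):
    """Mean Euclidean distance over all agent pairs, averaged over rounds."""
    per_round = []
    for pts in positions:
        pairs = list(combinations(pts, 2))
        per_round.append(sum(math.dist(p, q) for p, q in pairs) / len(pairs))
    return sum(per_round) / len(per_round)


def avg_centroid_speed(positions):
    """Mean displacement of the swarm centroid between consecutive rounds."""
    centroids = [
        (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for pts in positions
    ]
    steps = [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]
    return sum(steps) / len(steps)


# positions[t][i] is the (row, column) of agent i in round t (toy data).
positions = [
    [(0, 0), (0, 3)],
    [(1, 0), (1, 3)],
]
print(avg_pairwise_distance(positions), avg_centroid_speed(positions))
```

In the toy data both agents move one row down while keeping their spacing, so dispersion stays constant and the centroid moves one cell per round.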
Table 1 provides the detailed numerical results corresponding to the performance overview presented in Figure 3 in the main text (Section 4.1). It shows the mean scores and standard deviations for each evaluated LLM across the five SwarmBench tasks, averaged over 5 simulation runs. Models are ordered by their total score (sum across the five tasks) in descending order.
Model | Pursuit | Synchronization | Foraging | Flocking | Transport | Total Score |
gemini-2.0-flash | \(8.80 \pm 1.60\) | \(3.40 \pm 2.94\) | \(5.80 \pm 4.35\) | \(10.40 \pm 4.72\) | \(7.66 \pm 2.43\) | 36.06 |
claude-3.7-sonnet | \(4.40 \pm 1.20\) | \(12.60 \pm 9.62\) | \(1.20 \pm 1.47\) | \(13.60 \pm 3.20\) | \(4.14 \pm 5.07\) | 35.94 |
o4-mini | \(9.60 \pm 0.49\) | \(2.80 \pm 1.17\) | \(4.80 \pm 2.64\) | \(17.80 \pm 1.47\) | \(0.16 \pm 0.31\) | 35.16 |
deepseek-v3 | \(4.20 \pm 2.48\) | \(4.00 \pm 1.41\) | \(2.60 \pm 2.06\) | \(15.00 \pm 1.67\) | \(5.84 \pm 4.81\) | 31.64 |
gpt-4.1 | \(8.40 \pm 1.85\) | \(2.80 \pm 0.75\) | \(3.20 \pm 1.94\) | \(13.40 \pm 2.24\) | \(1.53 \pm 3.07\) | 29.33 |
o3-mini | \(3.60 \pm 2.06\) | \(2.20 \pm 1.17\) | \(2.60 \pm 3.88\) | \(13.40 \pm 1.74\) | \(2.57 \pm 3.52\) | 24.37 |
gpt-4o | \(3.40 \pm 1.50\) | \(1.80 \pm 1.33\) | \(1.60 \pm 1.85\) | \(14.60 \pm 2.24\) | \(1.38 \pm 2.76\) | 22.78 |
gpt-4.1-mini | \(1.40 \pm 0.80\) | \(0.60 \pm 0.49\) | \(1.40 \pm 1.02\) | \(12.20 \pm 2.14\) | \(6.30 \pm 2.11\) | 21.90 |
llama-4-scout | \(1.20 \pm 0.75\) | \(0.20 \pm 0.40\) | \(1.00 \pm 1.55\) | \(14.80 \pm 2.71\) | \(1.15 \pm 2.31\) | 18.35 |
qwq-32b | \(2.20 \pm 1.94\) | \(1.20 \pm 0.98\) | \(0.80 \pm 0.75\) | \(7.00 \pm 0.63\) | \(7.11 \pm 3.67\) | 18.31 |
llama-3.1-70b | \(1.80 \pm 0.40\) | \(1.00 \pm 1.10\) | \(0.00 \pm 0.00\) | \(13.80 \pm 1.17\) | \(0.56 \pm 1.12\) | 17.16 |
deepseek-r1 | \(1.00 \pm 0.63\) | \(1.20 \pm 1.17\) | \(1.00 \pm 1.10\) | \(8.40 \pm 1.96\) | \(0.00 \pm 0.00\) | 11.60 |
claude-3.5-haiku | \(0.60 \pm 0.49\) | \(1.00 \pm 0.00\) | \(0.00 \pm 0.00\) | \(3.60 \pm 3.38\) | \(1.70 \pm 2.77\) | 6.90 |
To investigate the potential role of explicit communication, facilitated by the MSG action, we performed a supplementary analysis focused on the content and patterns of messages exchanged between agents. This analysis utilized standard natural language processing techniques applied to the simulation log data.
This communication analysis was conducted on the complete set of simulation logs across all evaluated models and tasks. The primary goals were to quantify basic communication tendencies (frequency, message length), assess the semantic properties of messages, and identify prevalent keywords.
For subsequent correlation with overall task performance (final score, see Appendix 18), several metrics characterizing communication within each individual simulation run were computed:
Non-Empty Message Frequency (non_empty_msg_freq_run): This metric quantifies the propensity of agents within a run to communicate. It represents the proportion of total agent actions in that run which included the generation of a non-empty message string.
Average Non-Empty Message Length (avg_non_empty_msg_length_run): This metric reflects the average verbosity of communication within a run. It is the mean character length calculated over all non-empty messages produced by any agent during that specific simulation run.
Average Semantic Similarity (avg_similarity_run): To gauge the overall semantic coherence of communication within a run, message embeddings were generated using the pre-trained Sentence-BERT model ‘all-MiniLM-L6-v2’ [58]. This metric represents the average pairwise cosine similarity computed across all embeddings of non-empty messages within that run. Higher values suggest that messages conveyed, on average, more similar semantic content.
Semantic Similarity Standard Deviation (std_similarity_run): This metric captures the variability or stability of semantic content communicated throughout a run. It is the standard deviation of the pairwise cosine similarities used to compute the average similarity. Lower values indicate greater semantic consistency among messages, while higher values suggest more fluctuation or divergence in the topics or intents expressed.
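The four per-run metrics can be sketched as below. To keep the example self-contained, the embedding function is passed in as a parameter and a toy bag-of-words embedding stands in for Sentence-BERT; the toy vocabulary, the use of the population standard deviation, and the function names are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def communication_metrics(messages, embed):
    """messages: one string per agent action in a run; embed: text -> vector.

    Assumes at least two non-empty messages with non-zero embeddings.
    """
    non_empty = [m for m in messages if m]
    sims = [cosine(embed(a), embed(b)) for a, b in combinations(non_empty, 2)]
    mean_sim = sum(sims) / len(sims)
    return {
        "non_empty_msg_freq_run": len(non_empty) / len(messages),
        "avg_non_empty_msg_length_run": sum(len(m) for m in non_empty) / len(non_empty),
        "avg_similarity_run": mean_sim,
        "std_similarity_run": math.sqrt(sum((s - mean_sim) ** 2 for s in sims) / len(sims)),
    }


VOCAB = ["push", "left", "food"]  # toy vocabulary, for illustration only


def toy_embed(text):
    """Stand-in for a sentence embedding: word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


run = ["push left", "", "push left", "food", ""]
m = communication_metrics(run, toy_embed)
print(m)
```

With a real sentence-embedding model, `embed` would return the model's vector for each message; the rest of the computation is unchanged.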
We also performed keyword extraction on the sampled message data to understand the specific terminology used by different models across tasks. Messages were preprocessed (lowercasing, punctuation removal, English stopword removal using NLTK [59]), and the most frequent terms were identified for each model and task combination.
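The preprocessing and counting steps can be sketched with the standard library alone. The small inline stopword set is a stand-in for the NLTK English stopword list, and the function name and example messages are hypothetical.

```python
import string
from collections import Counter

# Tiny illustrative stopword set; the analysis described above uses NLTK's list.
STOPWORDS = {"to", "the", "on", "at", "for", "a", "is", "and", "of", "then"}


def top_keywords(messages, n=3):
    """Lowercase, strip punctuation, drop stopwords, and count remaining terms."""
    counts = Counter()
    strip_punct = str.maketrans("", "", string.punctuation)
    for msg in messages:
        words = msg.lower().translate(strip_punct).split()
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


msgs = [
    "Moving to join LEFT push on B.",
    "Pushing LEFT on B now.",
    "Ready to join the push.",
]
print(top_keywords(msgs, n=4))
```

Note that without stemming, "push" and "pushing" are counted as separate terms, which is one reason raw keyword frequencies should be read with some care.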
Figure 11 visually summarizes the results of this analysis, displaying the frequency of the most prominent keywords for each task (grouped by color) across the different LLM models (x-axis). As the figure illustrates, the dominant keywords strongly reflected the specific objectives and entities of each task. For instance, terms like ‘push’, ‘left’, ‘obstacle’, and coordinates (e.g., ‘(3,3)’) were prevalent in the Transport task, while ‘target’, ‘surround’, ‘corner’, and agent IDs featured heavily in Pursuit messages. Similarly, ‘food’, ‘nest’, ‘carry’, and ‘search’ were common in Foraging.
This qualitative analysis, visualized in Figure 11, confirms that the agents’ messages frequently contained task-relevant vocabulary. The figure also reveals variations in keyword usage patterns between different LLM models even when performing the same task, suggesting model-specific communication styles or strategies. While the presence of relevant keywords indicates some level of task understanding channeled into communication, their sheer frequency does not directly translate to coordination effectiveness, which, as discussed in the main text (Section 4.2), appeared more strongly linked to emergent physical dynamics and, to a lesser extent, semantic consistency rather than just keyword usage.
Figure 11: Keyword Frequency Analysis from Agent Messages. Frequency (y-axis) of the top keywords extracted from agent messages, grouped by LLM model (x-axis labels) and task (indicated by color and the vertical plot structure; see legend at top). Message data was preprocessed before frequency counting. The visualization highlights task-specific terminology (e.g., ‘push’ in Transport, ‘food’ in Foraging) and reveals variations in keyword usage across different models for the same task.
This section provides the detailed quantitative results and visualizations supporting the analysis of emergent group dynamics across all five tasks presented in Section 4.2. The data is aggregated from all 325 simulation runs (across 13 models, 5 tasks, 5 runs each). The metrics used are defined in Appendix 14.
Table [tab:combined95correlations] lists all primary action-based group dynamics features exhibiting statistically significant Pearson correlations (\(p < 0.05\)) with the final task score (score) across the combined dataset, sorted by the absolute value of the correlation coefficient (\(r\)).
Feature | \(r\) | \(p\)-value | Direction |
---|---|---|---|
std_action_entropy | \(0.300\) | \(< 0.001\) | Positive |
prop_stay_actions | \(0.297\) | \(< 0.001\) | Positive |
std_dominant_action_prop | \(0.274\) | \(< 0.001\) | Positive |
avg_polarization_index | \(-0.241\) | \(< 0.001\) | Negative |
prop_move_actions | \(-0.222\) | \(< 0.001\) | Negative |
num_rounds | \(-0.141\) | \(0.011\) | Negative |
avg_action_entropy | \(-0.121\) | \(0.029\) | Negative |
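As a rough consistency check, the reported \(p\)-values can be related to the correlation coefficients through the usual test statistic \(t = r\sqrt{(n-2)/(1-r^2)}\). The sketch below assumes that all 325 runs enter each correlation and substitutes a standard normal for Student's \(t\) distribution, which is a large-sample approximation rather than the exact test; the helper name `approx_p_value` is introduced here for illustration only.

```python
import math
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r computed from n samples.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a standard normal in place
    of Student's t (an approximation that is reasonable for large n).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

N_RUNS = 325  # 13 models x 5 tasks x 5 runs (assumed sample size per feature)

# Correlation coefficients as reported in the table above.
reported_r = {
    "std_action_entropy": 0.300,
    "prop_stay_actions": 0.297,
    "std_dominant_action_prop": 0.274,
    "avg_polarization_index": -0.241,
    "prop_move_actions": -0.222,
    "num_rounds": -0.141,
    "avg_action_entropy": -0.121,
}

# Rank by absolute correlation, as the table does.
ranked = sorted(reported_r, key=lambda name: abs(reported_r[name]), reverse=True)
p_values = {name: approx_p_value(r, N_RUNS) for name, r in reported_r.items()}
```

Under these assumptions the approximation gives values close to the reported \(0.011\) for num_rounds and \(0.029\) for avg_action_entropy, and values far below \(0.001\) for the five strongest features.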
Table 2 presents the standardized coefficients for the linear regression model built to predict the final task score using ten key primary action-based dynamics features across the combined dataset. The model’s overall performance was \(R^2\approx 0.245\) and Mean Squared Error (MSE) \(\approx 21.519\), using all \(325\) samples. Features are ranked by the absolute value of their standardized coefficient, indicating their relative importance in the linear model.
Feature | Coefficient (Standardized) | Interpretation (Approx. effect of +1 std dev) |
---|---|---|
avg_dominant_action_prop | \(-5.698\) | Score decreases by \(5.70\) |
avg_action_entropy | \(-4.996\) | Score decreases by \(5.00\) |
prop_move_actions | \(-3.260\) | Score decreases by \(3.26\) |
action_entropy_total | \(2.224\) | Score increases by \(2.22\) |
std_dominant_action_prop | \(1.211\) | Score increases by \(1.21\) |
std_polarization_index | \(-0.555\) | Score decreases by \(0.56\) |
num_rounds | \(-0.450\) | Score decreases by \(0.45\) |
avg_polarization_index | \(-0.360\) | Score decreases by \(0.36\) |
std_action_entropy | \(-0.279\) | Score decreases by \(0.28\) |
prop_stay_actions | \(0.009\) | Score increases by \(0.01\) |
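To make the interpretation column concrete, the following is a minimal sketch of how standardized coefficients can be obtained: each feature is z-scored, and an ordinary least-squares fit is then run against the raw score, so each coefficient reads as the change in score per one standard deviation of that feature. The function names, the use of the population standard deviation, and the synthetic data are assumptions; this is not the paper's analysis script.

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score a column using the population standard deviation (assumption)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def standardized_ols(features, y):
    """Fit y on z-scored features; return (intercept, {name: coefficient})."""
    names = list(features)
    z = [standardize(features[name]) for name in names]
    n = len(y)
    y_mean = mean(y)
    y_centered = [v - y_mean for v in y]
    # Normal equations Z^T Z beta = Z^T y; z-scored columns have zero mean,
    # so the intercept is simply the mean of y.
    gram = [[sum(zi[k] * zj[k] for k in range(n)) for zj in z] for zi in z]
    rhs = [sum(zi[k] * y_centered[k] for k in range(n)) for zi in z]
    return y_mean, dict(zip(names, solve(gram, rhs)))

# Synthetic data: score = 10 + 2*x1 - 3*x2 exactly (invented for illustration).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
score = [10 + 2 * a - 3 * b for a, b in zip(x1, x2)]
intercept, coefs = standardized_ols({"x1": x1, "x2": x2}, score)
```

In this synthetic case the standardized coefficient of each feature equals its raw slope multiplied by that feature's standard deviation, which is the sense in which a coefficient of \(-5.698\) corresponds to a score decrease of about \(5.70\) per standard deviation, holding the other features fixed.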
Figures 12 and 13 provide visual representations of the correlation analysis performed on the combined dataset across all five tasks.
Figure 12: Correlation Matrix of Top 10 Numeric Features and Score (Combined Tasks). This heatmap visualizes the Pearson correlation coefficients (\(r\)) between the score and the 10 primary action-based dynamics features analyzed in the combined dataset. The color intensity and hue indicate the strength and direction of the correlation (red: positive, blue: negative). Values are shown within each cell. This provides a visual overview of both feature-score relationships and inter-feature correlations.
Figure 13: Score vs. Top 6 Correlated Features (Combined Tasks, Colored by Task). These scatter plots show the relationship between the final score and the six group dynamics features exhibiting the strongest absolute Pearson correlation (\(r\)) across the combined dataset. Each point represents one simulation run, colored according to its task (experiment_id). The red line indicates the overall linear regression fit for the combined data. Plot titles show the feature name and the overall correlation coefficient (\(r\)) with score (* indicates \(p < 0.05\)). These plots help visualize the data distribution, linearity, and potential task-specific patterns underlying the overall correlations.
This section presents the quantitative correlation results from the supplementary communication analysis, the methodology for which is described in Appendix 16. These findings should be considered in conjunction with the primary dynamics correlations reported in Appendix 17.
Table 3 summarizes the statistically significant Pearson correlations (\(p < 0.05\)) observed between run-level communication/semantic features and the final task score within the analyzed sample.
Feature | \(r\) | \(p\)-value | Type / Interpretation |
---|---|---|---|
avg_non_empty_msg_length_run | \(0.191\) | \(0.0005\) | Communication (Length +) |
std_similarity_run | \(-0.170\) | \(0.002\) | Semantics (Instability -) |
Figure 14 visualizes the relationships between the task score and the two significantly correlated communication/semantic features identified in the sample.
Figure 14: Score vs. Significant Communication/Semantic Features. Scatter plots illustrating the relationship between final task score and (left) average non-empty message length per run (avg_non_empty_msg_length_run), and (right) standard deviation of semantic similarity per run (std_similarity_run). Red lines depict the overall linear regression trend. Plot titles include the feature name and Pearson correlation coefficient (\(r\)) with score (* denotes \(p < 0.05\)).
As highlighted in Section 5, the magnitudes of these communication-related correlations are notably lower than those observed for the physical group dynamics metrics (Appendix 17, Table [tab:combined95correlations]). The features non_empty_msg_freq_run (\(p=0.77\)) and avg_similarity_run (\(p=0.91\)) did not show statistically significant correlations with the score in this sample analysis.
We examined how agent performance responds to changes in local perception range (\(k\), the size of the square view) and group size (\(N\), the number of agents), revealing key aspects of decentralized coordination challenges (Figure 15).
Expanding the field of view from \(k=3\) to \(k=5\) consistently improved outcomes across diverse tasks like Pursuit, Synchronization, Foraging, and Flocking. This suggests that a minimal level of environmental awareness is crucial for agents to coordinate effectively, likely enabling better anticipation of and response to neighbors’ actions. However, further increasing the view to \(k=7\) yielded only marginal gains and was sometimes less effective than \(k=5\), particularly in the Transport task, which demands precise collective alignment. This plateau, together with the dip observed in Transport, implies a potential trade-off. While more information can be beneficial, an overly broad view might lead to information overload, making it harder for the LLM agents to discern critical local cues within a larger, potentially noisier perceptual field. This could dilute focus on the immediately relevant neighbors or environmental features that tightly coupled maneuvers depend on. The increased cognitive load of processing a larger input space, without a corresponding improvement in strategic depth, might thus be counterproductive in certain scenarios. The effectiveness of \(k=5\) in our main experiments (Section 4) likely reflects a better balance between sufficient environmental awareness and manageable perceptual complexity for the current LLM architectures in these zero-shot settings.
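To illustrate how the perceptual field grows with \(k\), the sketch below extracts a \(k\times k\) window centered on an agent from a 2D grid. The grid representation, the function name `local_view`, and the symbols ‘A’ (agent), ‘.’ (empty), and ‘#’ (outside the grid) are assumptions for illustration and may differ from the SwarmBench implementation.

```python
def local_view(grid, row, col, k, out_of_bounds="#"):
    """Return the k x k window of grid centered on (row, col); k must be odd."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the agent sits at the center")
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            # Cells beyond the grid edge are filled with a placeholder symbol.
            line.append(grid[r][c] if inside else out_of_bounds)
        view.append(line)
    return view

# Hypothetical 5x5 grid: 'A' agent, 'F' food, '.' empty.
grid = [list(row) for row in [
    ".....",
    ".F...",
    "..A..",
    ".....",
    "....F",
]]
view3 = local_view(grid, 2, 2, 3)  # 9 cells, sees one food item
view5 = local_view(grid, 2, 2, 5)  # 25 cells, sees both food items
```

The number of observed cells grows quadratically (9, 25, and 49 for \(k=3,5,7\)), so each step in \(k\) substantially lengthens the observation an agent must process, which is one way to picture the trade-off discussed above.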
The influence of group size (\(N\)) presented a more complex picture, strongly modulated by task demands. Predictably, performance in Transport improved with more agents (\(N=16\) vs \(N=8\)), as the task fundamentally relies on accumulating sufficient physical force. Conversely, Foraging performance deteriorated as \(N\) increased, suggesting that larger groups introduced detrimental effects like congestion or interference near critical locations (nest ‘N’, food ‘F’), outweighing any potential benefits. Intriguingly, Pursuit exhibited peak performance at an intermediate size (\(N=12\) compared to \(N=8\) and \(N=16\)), hinting that while more agents can help initially encircle a target, too many may hinder coordinated containment through increased complexity and potential self-obstruction. Flocking remained relatively robust to changes in \(N\) within the tested range.
These varied scaling behaviors highlight a core challenge for LLM-based swarms: managing the increased interaction density and potential for conflicting local decisions in larger groups without centralized control. The sensitivity to both \(k\) and \(N\) underscores that robust swarm intelligence requires strategies adaptable to varying information availability and group dynamics, motivating evaluation across diverse parametric settings as discussed in Sections 5 and 7.
Figure 15: Parameter Sensitivity Analysis. Performance (Score) across selected tasks varies with the number of agents (\(N\)) and field of view size (\(k \times k\)). Mean scores over runs shown; error bars indicate std. dev. Results suggest optimal parameter ranges can be task-dependent.
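The aggregation behind such a plot can be sketched as follows: runs are grouped by task, \(N\), and \(k\), and each group is reduced to a mean and a standard deviation. The record layout, the function name `summarize`, the use of the sample standard deviation, and the scores themselves are assumptions invented for illustration; they are not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(runs):
    """Group scores by (task, n_agents, k) and return {key: (mean, std)}."""
    grouped = defaultdict(list)
    for task, n_agents, k, score in runs:
        grouped[(task, n_agents, k)].append(score)
    return {
        # Sample standard deviation (assumption); 0.0 for a single run.
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in grouped.items()
    }

# Invented scores, not benchmark results.
runs = [
    ("Pursuit", 8, 5, 4.0), ("Pursuit", 8, 5, 6.0), ("Pursuit", 8, 5, 8.0),
    ("Pursuit", 12, 5, 9.0), ("Pursuit", 12, 5, 11.0),
]
summary = summarize(runs)
```

Each entry of `summary` then supplies one plotted point (the mean) and its error bar (the standard deviation).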