April 18, 2025
Multi-agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi-agent communication approaches are hindered by the limitations of their communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter-agent communication. LangCoop features two key innovations: Mixture Model Modular Chain-of-thought (M\(^3\)CoT) for structured zero-shot vision-language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language-based messages. Through extensive experiments conducted in CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96% reduction in communication bandwidth (\(<\) 2KB per message) compared to image-based communication while maintaining competitive driving performance in closed-loop evaluation. Our project page and code are at https://xiangbogaobarry.github.io/LangCoop/.
Recent advances in autonomous driving have demonstrated that multi-agent collaboration [1] significantly enhances both safety and efficiency compared to single-vehicle operations, primarily through real-time information sharing and intention communication. This collaborative approach has become increasingly crucial as autonomous vehicles navigate complex environments where interaction with other traffic participants is inevitable and constant. However, the selection of an appropriate collaboration medium remains a critical challenge that has attracted substantial research attention.
A key element of multi-agent collaboration is the medium used for inter-vehicle communication. Researchers have proposed various modalities for exchanging information, including raw sensor data, neural network features, and downstream task results. Despite their utility, each of these communication media suffers from one or more critical drawbacks. Specifically, they often: (1) require high bandwidth, placing a heavy load on communication infrastructures and increasing the risk of latency or packet loss; (2) fail to accommodate the inherent heterogeneity across agents, which may use different sensor configurations, model architectures, or target different downstream tasks; (3) lose critical contextual information when data are overly compressed, abstracted, or otherwise transformed into a limited representation; and (4) do not support planning-level or control-level collaboration.
To address these issues, we propose that human natural language can serve as an effective communication medium for multi-agent collaborative driving. Unlike conventional sensor-based or feature-based communications, natural language is inherently flexible and capable of conveying a broad range of contextual and semantic cues, therefore offering additional advantages. First, it bridges the gap between machine-readable modalities [2] (e.g., numbers, features, embeddings) and human-spoken language, making the reasoning [3], [4], communication [5], negotiation [6], and decision-making process more transparent. Such transparency benefits research, development, and debugging by enabling human operators to understand and verify the messages being exchanged among autonomous vehicles. Second, ongoing research in leveraging LVLMs within autonomous driving has already demonstrated their utility in understanding [7], reasoning [8], decision-making [9], [10], and even low-level vehicle control [11]. Consequently, natural language collaboration can synergistically exploit the general intelligence of LVLMs to achieve more robust, versatile, and explainable multi-agent collaboration. Third, natural language enables high-level fusion or negotiation at the planning and prediction levels, allowing agents—including automated vehicles, human drivers, and road-side units—to communicate intention and decision rationale rather than just perception data. This capability simplifies the coordination process, allowing agents to reach mutual understanding and agreements rapidly and clearly, ultimately promoting smoother, safer, and more socially acceptable driving behaviors. Lastly, natural language naturally provides scalability and generalization across diverse scenarios and heterogeneous vehicle platforms. Using standardized language-based communication seamlessly integrates autonomous and human-driven vehicles, regardless of sensor suites or underlying technologies. Moreover, natural language communication is inherently model-agnostic, compatible with both open-source (LLAMA [12], DeepSeek [13]) and commercial LLMs (ChatGPT [14], Gemini [15]), enabling easy adoption and interoperability across diverse autonomous vehicle systems.
Another compelling rationale emerges from real-world autonomous driving incidents, such as a case where a Waymo driverless car stopped dead inside a construction zone, causing disruptions and creating hazards [16]. Such incidents highlight the fundamental limitation of conventional sensor-based communication: it fails to transparently communicate the vehicle’s internal decision-making and reasoning processes to nearby human drivers or traffic controllers. In contrast, an interface that uses natural language as a universal information protocol could explicitly communicate an autonomous vehicle’s internal reasoning and intentions in real-time (e.g., “I’ve stopped due to unclear construction signage”), thereby clarifying otherwise confusing behaviors, reducing driver frustration, and facilitating timely human intervention. Furthermore, such a natural language-based approach allows real-time human-in-the-loop interaction, enabling remote operators or nearby traffic managers to quickly communicate with or disengage the vehicle in intuitive terms (e.g., “Please move slowly to the side”) to promptly resolve ambiguous or problematic situations.
Leveraging these insights, we introduce LangCoop, a novel framework for collaborative autonomous driving that uses natural language as the primary medium for inter-vehicle communication. Our framework consists of three key components: (1) a Mixture Model Modular Chain-of-thought (M\(^3\)CoT) module that structures reasoning into distinct stages for comprehensive scene understanding; (2) a Natural Language Information Packaging (LangPack) system that compresses rich semantic information into compact messages; and (3) multiple driving signal generation approaches that translate natural language reasoning into actionable controls. Our experimental results in closed-loop evaluations using the CARLA simulator [17] show that, by using zero-shot LVLMs, LangCoop achieves driving scores of up to 48.8 and route completion rates of up to 90.3%, significantly outperforming non-collaborative baselines while maintaining exceptional communication efficiency (\(<\)2 KB). The framework also operates effectively with heterogeneous agent capabilities, demonstrating the viability of natural language as a medium for autonomous vehicle collaboration.
The integration of Large Vision-Language Models (LVLMs) into autonomous driving has enabled a unified approach to perception, reasoning, and decision-making, offering enhanced interpretability and adaptability [18]–[21]. Early studies have explored LVLMs for closed-loop driving, where multimodal sensor data is processed alongside natural language instructions to generate vehicle control outputs. Shao et al. [22] introduced one of the first LVLM-based end-to-end driving models, while Wang et al. [23] focused on translating language instructions into high-level driving commands. Xu et al. [9] and Sima et al. [10] further emphasized explainability, using question-answering and graph-based reasoning to interpret scene dynamics and decision rationales, making autonomous systems more transparent and human-interpretable. Hwang et al. [24] used LVLMs to directly output future planning waypoints. Xing et al. [19] proposed a comprehensive benchmark for evaluating the truthfulness, safety, fairness, security, and generalizability of LVLMs in autonomous driving scenes.
Beyond perception, LVLMs have demonstrated robustness in out-of-distribution (OOD) scenarios, addressing challenges that conventional deep-learning models struggle with in unseen environments. Wang et al. [25] showed that LVLMs could simulate novel situations through latent space editing, improving generalization. Mei et al. [26] introduced a dual-process framework, combining slow but rigorous reasoning from an LVLM with fast real-time execution from a smaller model, mimicking human cognitive processes. Additionally, Dong et al. [27] and Xing et al. [8] explored zero-shot prompting, demonstrating how LLMs can guide autonomous systems without extensive retraining.
LVLMs also play a pivotal role in multi-agent collaboration and human-centric driving by improving vehicular communication [28] and personalized decision-making [18], [28]. Liang et al. [29] and Zhang et al. [30] explored how generative AI models enable semantic-rich, context-aware inter-vehicle communication, surpassing traditional bandwidth-intensive numeric exchanges. In personalized driving, Li et al. [31] highlighted that LVLMs improve context understanding and human-like reasoning, while Lan et al. [32] and Duan et al. [33] demonstrated their ability to simulate human driving behaviors and dynamically adjust trajectories. As LVLMs continue evolving, their integration into autonomous systems paves the way for more interpretable, adaptable, and collaborative driving solutions that better align with human expectations and real-world challenges.
Effective collaboration among autonomous agents in multi-agent driving scenarios hinges on the choice of communication medium. Several approaches have been explored, including the exchange of raw sensor data [34]–[36], neural network features [37]–[47], and perception results [48]–[52]. Specifically, raw sensor data (such as LiDAR point clouds or camera images) offers comprehensive environmental perception but demands high communication bandwidth and latency. Meanwhile, neural network features (intermediate embeddings, BEV feature maps, or feature queries) can reduce bandwidth usage yet introduce incompatibility when agents rely on heterogeneous feature extraction networks. Another approach is sharing perception results, such as predicted depth maps [53], object detection outputs [54], occupancy grids [55], or BEV map segmentations [42]. While enumerating all possible perception outputs can strain communication bandwidth, limiting the shared set risks losing critical semantic details.
Figure 1: Overview of the LangCoop framework.
Given these challenges, natural language has emerged as a promising alternative for communication in multi-agent driving. Unlike numeric-based representations, natural language is compact, human-interpretable, and adaptable to heterogeneous agents. It also supports planning- and control-level interactions. Recent studies in robotics and autonomous driving have begun to explore language-based communication, leveraging its ability to capture rich contextual information with minimal overhead. For instance, Hu et al. [56], Yao et al. [57], and Fang et al. [58] use Large Language Models (LLMs) for driving-scenario reasoning on highly abstract traffic descriptions but overlook pedestrians, cyclists, unknown obstacles, and environmental conditions that are pivotal in real-world driving. Another approach, V2V-LLM [2], augments an LLM backbone with pre-trained perception features (such as object detections) to incorporate environmental cues. However, it does not exploit the vision-based reasoning capabilities of LVLMs. V2X-VLM [59] is the first work to combine perception and reasoning within an LVLM framework, yet it essentially treats multi-agent collaboration as a multi-sensor fusion problem, neglecting important factors like cross-sensor coordinate transformations and collaboration at the planning or control level. Moreover, its evaluation remains limited to open-loop benchmarks, and its model is not open-sourced.
In this work, we advance the field by harnessing both the perception and reasoning capabilities of LVLMs, enabling planning- and control-level collaboration among autonomous vehicular agents. Unlike previous approaches, we conduct closed-loop evaluations to assess real-time performance and provide open-source code for the research community to facilitate further exploration and benchmarking.
In this section, we present LangCoop, a novel framework that natively leverages Large Vision-Language Models (LVLMs) for collaborative driving among Connected Autonomous Vehicles (CAVs). As illustrated in Fig. 1, our framework establishes a systematic pipeline for information extraction, processing, exchange, and decision-making in collaborative driving scenarios. Each CAV initially captures front-view images through its onboard cameras, which serve as the primary sensory input. These images are passed through our Mixture Model Modular Chain-of-thought (M\(^3\)CoT) module (detailed in Section 3.2), which systematically extracts environmental and object-level information as well as goal-oriented information and behavioral intentions.
The extracted information is then packaged into a compact, structured natural language format via our Natural Language Information Packaging (LangPack) module. This standardized format facilitates information exchange between connected vehicles while minimizing bandwidth requirements. Concurrently, each vehicle receives packets from other CAVs within the communication range. Upon receiving the packets, each vehicle integrates the messages with its own and feeds them into the LVLMs to generate appropriate driving signals. The driving signals are formulated as discrete trajectories, continuous trajectories, or direct control commands depending on the specific implementation context (detailed in Section 3.4). These signals guide the vehicle’s planning and control systems to execute safe and efficient maneuvers.
The Mixture Model Modular Chain-of-thought (M\(^3\)CoT) module forms the cognitive foundation of our LangCoop framework, expanding upon the chain-of-thought reasoning process introduced by OpenEmma [8]. M\(^3\)CoT systematically decomposes the complex task of driving scene understanding into four distinct prompting stages, each addressing a specific aspect of the driving context: (1) driving scene description, which focuses on holistic environmental understanding; (2) interactive object description, which identifies and characterizes specific objects relevant to the driving task; (3) navigation goal prompting, which informs the agent about the relative location of its next navigational goal, shifting the agent’s perspective from mere trajectory prediction to goal-oriented planning; and (4) future intent description, which articulates the vehicle’s intended actions and decision rationale.
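To make the staged prompting concrete, the sketch below chains the four stages through a generic vision-language callable; the stage wording and the `query_lvlm` interface are illustrative assumptions rather than the exact prompts used in LangCoop.

```python
from typing import Callable, Dict

# The four M^3CoT prompting stages, issued sequentially with earlier answers carried forward.
M3COT_STAGES = [
    ("scene", "Describe the overall driving scene: weather, road layout, and traffic conditions."),
    ("objects", "List the objects relevant to driving (vehicles, pedestrians, cyclists) with "
                "their location, status, and likely intent."),
    ("goal", "The next navigation goal is {goal_dx:.2f} m to the right and {goal_dy:.2f} m ahead. "
             "Explain how it constrains the plan."),
    ("intent", "Given the scene, objects, and goal, describe the intended action and its rationale."),
]

def run_m3cot(query_lvlm: Callable[[str, bytes], str], image: bytes,
              goal_dx: float, goal_dy: float) -> Dict[str, str]:
    """Run the four prompting stages in order, feeding each answer into the next prompt."""
    context, results = "", {}
    for name, template in M3COT_STAGES:
        prompt = context + template.format(goal_dx=goal_dx, goal_dy=goal_dy)
        results[name] = query_lvlm(prompt, image)
        context += f"[{name}] {results[name]}\n"  # chain-of-thought context for later stages
    return results

# Usage with a stubbed model (replace the lambda with a real LVLM call):
# outputs = run_m3cot(lambda prompt, img: "...", image=b"", goal_dx=0.86, goal_dy=36.0)
```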
A key innovation in our approach is the flexibility to employ different specialized LVLMs for each prompting stage. This design choice offers several significant advantages: First, it acknowledges that different prompting tasks demand distinct capabilities—driving scene and object description rely predominantly on visual understanding capabilities, while navigation goal interpretation and future intent formulation necessitate stronger logical reasoning skills. By selecting models optimized for these specific competencies, our system potentially outperforms monolithic approaches that use a single model for all tasks. Second, this modular design offers practical benefits related to computational efficiency and cost management. Given that zero-shot LVLM inference can be resource-intensive, particularly for high-performance models, our approach allows for strategic resource allocation—deploying more powerful (and potentially more expensive) models only for the stages that critically require their capabilities. This alleviates the need for a single large generalist model, potentially reducing inference time and operational costs without compromising performance.
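One lightweight way to realize this mixture-of-models design is a stage-to-model map consulted before each call. The assignment and backend registry below are illustrative placeholders only; Section 4.6 evaluates concrete mixtures.

```python
from typing import Callable, Dict

# Hypothetical routing table: cheaper vision models for descriptive stages,
# a stronger reasoning model for goal and intent interpretation.
STAGE_MODEL_ASSIGNMENT = {
    "scene":   "gemini-2.0-flash-lite",
    "objects": "gemini-2.0-flash-lite",
    "goal":    "gpt-4o-mini",
    "intent":  "gpt-4o-mini",
}

def query_for_stage(stage: str, prompt: str, image: bytes,
                    backends: Dict[str, Callable[[str, bytes], str]]) -> str:
    """Dispatch the prompt for a given M^3CoT stage to its assigned LVLM backend."""
    return backends[STAGE_MODEL_ASSIGNMENT[stage]](prompt, image)
```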
Our framework introduces Natural Language Information Packaging (LangPack) as an innovative medium for information sharing. LangPack gathers diverse information sources into a standardized, human-readable, and machine-processable format that balances comprehensiveness with transmission efficiency. Upon completing the M\(^3\)CoT processing stages, each vehicle constructs a LangPack packet that integrates prompting results with agent metadata, including location, velocity, acceleration, etc.
| Natural Language Information Packaging |
|---|
| Agent Metadata: location, velocity, acceleration, etc. |
| Scene Description: The image shows … |
| Objects Description: Vehicle (light-colored car) - Moving forward … |
| Navigation Goal: We need to keep moving ahead … |
| Intent Description: Slight left adjustment while maintaining safe … |
| … |
| Total Package Size: \(<\) 2 KB |
The LangPack approach offers several distinct advantages for collaborative driving systems. First, the inherent compactness of natural language representation allows for information-dense communication with minimal bandwidth requirements—typical LangPack packages require less than 2KB of data, making them suitable for transmission even in bandwidth-constrained V2X communication environments. Furthermore, natural language provides a flexible and extensible medium that can accommodate diverse information types without requiring rigid structural redesigns. This adaptability is particularly valuable for autonomous driving systems that must process heterogeneous and sometimes unexpected environmental elements.
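As a rough illustration of how such a package might be assembled and its size tracked, consider the sketch below; the field names mirror the example package above, while the JSON serialization itself is an illustrative choice rather than the released format.

```python
import json

def build_langpack(agent_id: int, location, speed_mps: float, m3cot: dict):
    """Assemble a LangPack-style message from agent metadata and M^3CoT outputs."""
    packet = {
        "agent_id": agent_id,
        "metadata": {"location": list(location), "speed_mps": round(speed_mps, 3)},
        "scene_description": m3cot.get("scene", ""),
        "objects_description": m3cot.get("objects", ""),
        "navigation_goal": m3cot.get("goal", ""),
        "intent_description": m3cot.get("intent", ""),
    }
    message = json.dumps(packet)
    size_kb = len(message.encode("utf-8")) / 1024  # typical packages stay well under 2 KB
    return message, size_kb
```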
Upon receiving LangPack packages from other connected vehicles, each CAV performs essential post-processing operations, including coordinate transformation and temporal alignment. The processed information is then aggregated with the vehicle’s own perceptions and prompting results to form a comprehensive knowledge base that is passed to the subsequent decision-making module.
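This post-processing amounts to standard geometric bookkeeping. The minimal sketch below shows a 2D coordinate transformation into the ego frame and a constant-velocity temporal alignment; the actual implementation may account for additional state such as heading and acceleration.

```python
import math
from typing import Tuple

def to_ego_frame(point_xy: Tuple[float, float], ego_xy: Tuple[float, float],
                 ego_yaw_rad: float) -> Tuple[float, float]:
    """Express a point given in the shared map frame in the ego vehicle's local frame."""
    dx, dy = point_xy[0] - ego_xy[0], point_xy[1] - ego_xy[1]
    cos_y, sin_y = math.cos(ego_yaw_rad), math.sin(ego_yaw_rad)
    # Rotate the world-frame offset by -yaw to obtain ego-frame coordinates.
    return cos_y * dx + sin_y * dy, -sin_y * dx + cos_y * dy

def align_in_time(point_xy: Tuple[float, float], velocity_xy: Tuple[float, float],
                  latency_s: float) -> Tuple[float, float]:
    """Roughly compensate for message latency by constant-velocity extrapolation."""
    return (point_xy[0] + velocity_xy[0] * latency_s,
            point_xy[1] + velocity_xy[1] * latency_s)
```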
The final component of our LangCoop framework involves translating the aggregated, multi-vehicle understanding into actionable driving signals. We propose three driving signal formulations, each offering specific advantages depending on the implementation context and downstream control requirements:
Discrete Trajectory Generation: The LVLM outputs a sequence of waypoints \((x_i, y_i)\) for the future \(n\) seconds. This high-precision path representation is suitable for complex maneuvers and enables straightforward validation against environmental boundaries.
Continuous Trajectory Generation: Rather than discrete positions, this approach defines vehicle motion through speed and turning curvature parameters over time. It produces smoother motion profiles that better align with vehicle dynamics for natural-feeling behavior.
Direct Control Signal Generation: In this most direct formulation, the LVLM outputs low-level control signals—specifically steering angle, throttle position, and brake pressure—for each time step. A key advantage of this approach is that outputs can be explicitly constrained within physically feasible ranges (e.g., steering angle limits, maximum acceleration rates), ensuring generated commands never exceed the vehicle’s operational capabilities.
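The three formulations can be viewed as different output schemas with different validation rules. One possible encoding is sketched below; the normalized ranges follow CARLA-style vehicle control conventions and are illustrative rather than prescriptive.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DiscreteTrajectory:
    waypoints: List[Tuple[float, float]]  # (x_i, y_i) over the next n seconds

@dataclass
class ContinuousTrajectory:
    speed_mps: List[float]        # target speed profile over time
    curvature_inv_m: List[float]  # turning curvature (1/m) over time

@dataclass
class ControlSignal:
    steer: float     # normalized steering in [-1, 1]
    throttle: float  # [0, 1]
    brake: float     # [0, 1]

def _clip(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))

def clamp_control(raw: ControlSignal) -> ControlSignal:
    """Clip LVLM-proposed control outputs into physically feasible ranges before execution."""
    return ControlSignal(steer=_clip(raw.steer, -1.0, 1.0),
                         throttle=_clip(raw.throttle, 0.0, 1.0),
                         brake=_clip(raw.brake, 0.0, 1.0))
```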
In Section 4.2, we present a comparative analysis of all three driving signal formulations across diverse driving scenarios.
In this section, we present comprehensive experimental evaluations of our LangCoop framework through closed-loop simulations in the CARLA environment [17]. We first outline our experimental setup and evaluation metrics (§ 4.1), followed by a systematic assessment of key components within our framework, including driving signal formulations (§ 4.2), prompting methods (§ 4.3), communication strategies (§ 4.4), LVLM selection (§ 4.5), and modular design approaches (§ 4.6). We investigate the framework’s performance under heterogeneous agent configurations [45], [60] (§ 4.7). Finally, we display some visualization results and analysis in § 4.8.
In this work, we conduct closed-loop evaluations using the CARLA simulation platform. We use 10 testing scenarios in Town05, each involving two CAVs controlled by our LangCoop framework while interacting with various dynamic actors, including other vehicles, pedestrians, and cyclists controlled by CARLA’s traffic manager. The two CAVs are initialized at different positions within the same general vicinity. We implement V2V communication with a simulated range of 200 meters. For perception, each vehicle receives front-view RGB camera images at 800\(\times\)600 resolution.
We employ three primary evaluation metrics to assess performance comprehensively:
- Driving Score (DS): calculated as \(\text{DS} = \text{RC} \times (1 - \text{IP})\), where RC is route completion and IP is the infraction penalty. Infractions include collisions, traffic light violations, and lane invasions, each weighted according to severity.
- Route Completion (RC): the percentage of the predefined route successfully traversed by the vehicle, measured from 0 to 100.
- Time Consumed (TC): the total time in seconds required to complete the route or until a terminal failure.
For communication efficiency assessment, we additionally track:
- Transmission Bandwidth (TB): the average data size in KB transmitted between vehicles.
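For clarity, the driving score reduces to a single multiplication; the infraction penalty in the usage note below is a back-of-envelope illustration, not a measured value.

```python
def driving_score(route_completion: float, infraction_penalty: float) -> float:
    """DS = RC * (1 - IP), with RC in [0, 100] and IP in [0, 1] as defined above."""
    return route_completion * (1.0 - infraction_penalty)

# e.g., 90.3% route completion with a hypothetical combined infraction penalty of 0.46:
# driving_score(90.3, 0.46) -> about 48.8
```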
Unless otherwise specified, our baseline configuration employs GPT-4o-mini [61] as the LVLM, utilizes a concise version of the M\(^3\)CoT module described in Section 3.2, and exchanges both front-view images (compressed JPEG) and LangPack messages between vehicles.
As described in Section 3.4, our framework supports three distinct driving signal formulations: discrete trajectory, continuous trajectory, and direct control signals. We first compare these formulations to identify the most effective approach for subsequent experiments.
Table 2: Comparison of driving signal formulations.

| Driving Signal | Vehicle 1 DS \(\uparrow\) | Vehicle 1 RC \(\uparrow\) | Vehicle 2 DS \(\uparrow\) | Vehicle 2 RC \(\uparrow\) | TC (s) \(\downarrow\) |
|---|---|---|---|---|---|
| Discrete Traj. | 5.0 | 23.1 | 1.3 | 19.4 | 139.9 |
| Continuous Traj. | 33.1 | 74.9 | 48.8 | 90.3 | 124.6 |
| Control Signal | 33.7 | 89.0 | 18.1 | 70.2 | 124.8 |
Table 2 reveals that the discrete trajectory approach performs poorly for both vehicles. This underperformance can be attributed to the limited capability of zero-shot LVLMs to reason over discrete waypoints; it is difficult for them to output waypoint sequences that are smooth and dynamically feasible. In comparison, both the continuous trajectory and direct control signal approaches demonstrate better performance. The continuous trajectory formulation achieves better results for Vehicle 2 (DS: 48.8, RC: 90.3), while the direct control signal approach performs better for Vehicle 1 (DS: 33.7, RC: 89.0). The continuous trajectory approach also finishes the route slightly faster than the other methods. We postulate that the strong performance of the continuous trajectory and direct control signal approaches stems from a more natural action space that better aligns with vehicle dynamics and control systems. Based on these results, we adopt the continuous trajectory approach as our default driving signal formulation for subsequent experiments, given its balanced performance across both vehicles.
Next, we evaluate three different prompting strategies to assess the impact of reasoning structure on driving performance: Naive Prompting, which directly asks the LVLM to generate driving signals without structured reasoning; Chain-of-thought (CoT); and Concise CoT, which induces the LVLM to produce more concise descriptions by simply appending “Please be very concise” to the end of each prompt.
Table 3: Comparison of prompting strategies.

| Prompting | Vehicle 1 DS \(\uparrow\) | Vehicle 1 RC \(\uparrow\) | Vehicle 2 DS \(\uparrow\) | Vehicle 2 RC \(\uparrow\) | TC (s) \(\downarrow\) |
|---|---|---|---|---|---|
| Naive | 2.7 | 23.0 | 0.7 | 21.1 | 248.7 |
| CoT | 37.0 | 85.2 | 41.1 | 80.3 | 105.2 |
| CoT (concise) | 33.1 | 74.9 | 48.8 | 90.3 | 124.6 |
Table 3 demonstrates that the naive prompting approach performs poorly for both vehicles. This underscores the critical importance of structured reasoning for the autonomous driving task. Both CoT approaches substantially outperform the naive method, with no clear winner between the standard and concise variants. The standard CoT approach achieves the highest performance for Vehicle 1 (DS: 37.0, RC: 85.2) and completes navigation in the shortest time (105.2 seconds), while the concise CoT variation achieves the best performance for Vehicle 2 (DS: 48.8, RC: 90.3). The performance differences between standard and concise CoT prompting highlight an interesting tradeoff: the standard CoT provides more comprehensive reasoning, potentially allowing for more nuanced decision-making, while the concise version reduces computational overhead and may focus the model on the most critical aspects of the driving task. For subsequent experiments, we adopt the concise CoT method as our default prompting strategy, as it provides strong overall performance while maintaining computational efficiency.
A central aspect of our collaborative driving approach is the mechanism and content of inter-vehicle communication. We compare four different communication strategies: no collaboration (baseline), image-only sharing, LangPack-only sharing, and combined image+LangPack sharing.
Table 4: Comparison of communication strategies.

| Message | Vehicle 1 DS \(\uparrow\) | Vehicle 1 RC \(\uparrow\) | Vehicle 2 DS \(\uparrow\) | Vehicle 2 RC \(\uparrow\) | TC (s) \(\downarrow\) | TB (KB) \(\downarrow\) |
|---|---|---|---|---|---|---|
| Non-collab | 13.5 | 33.1 | 11.35 | 29.44 | 200.1 | 0 |
| Image (JPEG) | 15.3 | 38.9 | 31.3 | 60.7 | 65.8 | 43.1 |
| LangPack | 35.1 | 71.6 | 42.8 | 80.1 | 114.6 | 1.8 |
| Image+LangPack | 33.1 | 74.9 | 48.8 | 90.3 | 124.6 | 44.9 |
As shown in Table 4, the non-collaborative baseline yields poor driving scores, which affirms the importance of multi-vehicle collaboration. The image-only strategy shows modest improvements over the non-collaborative baseline but falls significantly short of the LangPack-based methods. This suggests that raw visual data, while information-rich, may not be optimally structured for inter-vehicle understanding without additional processing. The LangPack-only approach achieves remarkable performance (Vehicle 1: DS 35.1, RC 71.6; Vehicle 2: DS 42.8, RC 80.1) while requiring minimal bandwidth (1.8 KB), demonstrating the exceptional efficiency of our natural language packaging approach. This represents a bandwidth reduction of over 96% compared to image sharing while delivering substantially better performance. The combined Image+LangPack approach achieves the highest overall performance, particularly for Vehicle 2 (DS: 48.8, RC: 90.3), but incurs the highest bandwidth consumption (44.9 KB).
These results demonstrate that LangPack offers an exceptional balance between performance and communication efficiency, highlighting the information density and semantic richness of structured natural language representations. For bandwidth-constrained applications, LangPack-only communication provides near-optimal performance with minimal data requirements. When bandwidth constraints are less severe, the combined approach offers incremental performance improvements at the cost of substantially higher data transmission.
Figure 2: Visualization of a natural-language-based collaborative driving scenario. CAV 2 slows down upon receiving the ‘slow down’ intent description from CAV 1. The context is slightly paraphrased for better visualization.
The choice of LVLM significantly impacts collaborative driving performance. We evaluate six popular vision-language models (GPT-4o, Claude-3.7 Sonnet, GPT-4o-mini, Gemini Flash Lite 2.0, Qwen-2.5-VL-7B, and Llama 3.2 11B Vision Instruct) to determine their effectiveness within our framework. In the following, we refer to these models as GPT-4o, Claude-3.7, GPT-4o-mini, Gemini-2.0, Qwen-2.5, and Llama-3.2, respectively.
Table 5: Comparison of LVLM backbones.

| Model | Vehicle 1 DS \(\uparrow\) | Vehicle 1 RC \(\uparrow\) | Vehicle 2 DS \(\uparrow\) | Vehicle 2 RC \(\uparrow\) | TC (s) \(\downarrow\) |
|---|---|---|---|---|---|
| GPT-4o | 41.3 | 70.0 | 47.7 | 91.0 | 79.0 |
| Claude-3.7 | 32.0 | 67.0 | 72.1 | 94.1 | 88.5 |
| GPT-4o-mini | 33.1 | 74.9 | 48.8 | 90.3 | 124.6 |
| Gemini-2.0 | 12.1 | 33.7 | 25.6 | 58.0 | 46.5 |
| Qwen-2.5 | 15.5 | 32.2 | 19.4 | 28.8 | 70.7 |
| Llama-3.2 | 11.6 | 31.1 | 19.0 | 42.2 | 102.5 |
Table 5 shows that GPT-4o, Claude-3.7, and GPT-4o-mini consistently outperform the other options across both vehicles, suggesting these models possess superior capabilities for understanding complex driving scenes and generating appropriate driving actions in collaborative contexts. The remaining models, Gemini-2.0, Qwen-2.5, and Llama-3.2, demonstrate lower performance. Interestingly, Gemini-2.0 completes routes in the shortest time (46.5 seconds), suggesting more aggressive driving behavior that may prioritize speed over safety or adherence to traffic rules.
Our M\(^3\)CoT architecture enables the use of different specialized LVLMs for distinct reasoning stages. To evaluate the potential benefits of this modular approach, we implement two experimental configurations with varying model assignments for each prompting stage. In Experiment 6.A, we use Gemini-2.0 for the driving scene and interactive object descriptions, Llama-3.2 for the navigation goal and future intent descriptions, and GPT-4o-mini for driving signal generation. In Experiment 6.B, we use Qwen-2.5 for the driving scene and interactive object descriptions, Llama-3.2 for the navigation goal and future intent descriptions, and GPT-4o-mini for driving signal generation.
Table 6: Mixture-of-models configurations for M\(^3\)CoT.

| M\(^3\)CoT Setup | Vehicle 1 DS \(\uparrow\) | Vehicle 1 RC \(\uparrow\) | Vehicle 2 DS \(\uparrow\) | Vehicle 2 RC \(\uparrow\) | TC (s) \(\downarrow\) |
|---|---|---|---|---|---|
| GPT-4o-mini | 33.1 | 74.9 | 48.8 | 90.3 | 124.6 |
| Exp 6.A | 31.4 | 67.9 | 37.2 | 71.3 | 144.6 |
| Exp 6.B | 35.2 | 68.5 | 42.1 | 82.6 | 119.3 |
From Table 6, we observe that in Experiments 6.A and 6.B, replacing the reasoning modules with LVLMs other than GPT-4o-mini results in slightly lower but still competitive performance compared to the pure GPT-4o-mini configuration. Given that the API costs of Gemini-2.0 and Llama-3.2 are lower than that of GPT-4o-mini, these results suggest that in practical scenarios with limited computational budgets, our Mixture Model Modular Chain-of-thought module makes it feasible to replace individual reasoning modules with a mixture of models.
In real-world deployments, collaborative driving systems will likely operate in environments where different vehicles utilize AI models with varying capabilities. To assess our framework’s effectiveness in such heterogeneous settings, we conduct two experiments with vehicle pairs using different LVLMs. In experiment 7.A, the vehicles are equipped with GPT-4o-mini and Gemini-2.0, while in experiment 7.B, they are equipped with GPT-4o-mini and Llama-3.2.
Table 7: Collaboration between vehicles equipped with heterogeneous LVLMs.

| Setup | Communication | Model | DS \(\uparrow\) | RC \(\uparrow\) | TC (s) \(\downarrow\) |
|---|---|---|---|---|---|
| Exp 7.A | Non-collab | GPT-4o-mini | 18.2 | 56.1 | 167.3 |
| Exp 7.A | Non-collab | Gemini-2.0 | 12.6 | 61.1 | 167.3 |
| Exp 7.A | Image+LangPack | GPT-4o-mini | 59.1 | 73.2 | 126.8 |
| Exp 7.A | Image+LangPack | Gemini-2.0 | 45.3 | 70.2 | 126.8 |
| Exp 7.B | Non-collab | GPT-4o-mini | 16.7 | 70.2 | 142.0 |
| Exp 7.B | Non-collab | Llama-3.2 | 11.5 | 51.0 | 142.0 |
| Exp 7.B | Image+LangPack | GPT-4o-mini | 51.9 | 96.1 | 144.5 |
| Exp 7.B | Image+LangPack | Llama-3.2 | 12.6 | 40.1 | 144.5 |
As shown in Table 7, collaboration improves both driving scores and route completion rates across both experiments. In Experiment 7.A (GPT-4o-mini paired with Gemini-2.0) and Experiment 7.B (GPT-4o-mini paired with Llama-3.2), both vehicles benefit from the collaborative setup. This demonstrates that our framework is adaptable not only to homogeneous settings but also to heterogeneous environments.
Figure 2 displays a scenario where a leading CAV approaches an intersection and decides to slow down. After sharing its ‘slow down’ intent with other CAVs, the following vehicle also decides to slow down despite originally intending to continue forward. This demonstrates effective collaborative decision-making, as the follower vehicle appropriately adjusts its behavior based on the other CAV’s communicated intent. The example illustrates how language-based communication enables real-time adaptive driving behaviors, enhancing overall traffic safety through multi-agent decision-level collaboration. Furthermore, this interaction highlights the practical value of our framework in translating natural language intents into concrete driving decisions across multiple autonomous vehicles. For more visualization results, please refer to our project page at https://xiangbogaobarry.github.io/LangCoop/.
Our experiments with LangCoop reveal several key insights that inform future research directions:
Advantage of Zero-shot LVLMs. Despite the benefits of domain-specific training for LVLMs, zero-shot approaches offer clear advantages. They eliminate costly dataset collection and training while maintaining adaptability across diverse driving scenarios. Additionally, proprietary models like the GPT and Gemini series cannot be fine-tuned by third parties. A zero-shot pipeline that leverages all LVLMs without domain-specific fine-tuning provides flexibility and accessibility for resource-limited institutions.
Computational and Latency Concerns. Regarding computational concerns, we note that LVLM efficiency is rapidly improving, and large models can generate trajectories for training more compact deployment models. Some novel dual-system designs [7], [26] may also alleviate the computational intensity. The conceptual advantages of language-based collaboration outweigh current computational demands, opening new possibilities for interpretable, efficient, and adaptable multi-agent driving systems.
Prompting Strategies for Driving. We observed significant sensitivity to prompt formulation in driving contexts. For example, explicitly instructing the model to “avoid collisions” (which might seem obvious in driving) substantially improved performance. This suggests that current LVLMs may not fully internalize driving-specific common knowledge and indicates potential for improvement through specialized prompts or fine-tuning approaches focused on autonomous driving scenarios.
Physics-Informed Control Integration. Our current implementation does not fully incorporate detailed vehicle dynamics into the planning pipeline. Future extensions could address this by integrating physical vehicle models (e.g., the bicycle model). Using techniques like quintic polynomial trajectory planning could ensure physically realizable motion while preserving the high-level reasoning capabilities of language models.
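As a concrete illustration of the kind of smoothing we have in mind, the sketch below fits a quintic polynomial to boundary conditions on position, velocity, and acceleration; this is a standard construction shown for illustration and is not part of the current LangCoop implementation.

```python
import numpy as np

def quintic_coeffs(x0, v0, a0, xT, vT, aT, T):
    """Coefficients of x(t) = c0 + c1*t + ... + c5*t^5 matching position, velocity,
    and acceleration at t = 0 and t = T."""
    # Conditions at t = 0 fix the first three coefficients directly.
    c0, c1, c2 = x0, v0, a0 / 2.0
    # The remaining three come from the conditions at t = T (a 3x3 linear system).
    A = np.array([[T**3,     T**4,      T**5],
                  [3 * T**2, 4 * T**3,  5 * T**4],
                  [6 * T,   12 * T**2, 20 * T**3]])
    b = np.array([xT - (c0 + c1 * T + c2 * T**2),
                  vT - (c1 + 2 * c2 * T),
                  aT - 2 * c2])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([c0, c1, c2, c3, c4, c5])

def sample_position(coeffs, T, n=20):
    """Sample positions along the quintic at n evenly spaced times in [0, T]."""
    t = np.linspace(0.0, T, n)
    return sum(c * t**i for i, c in enumerate(coeffs))
```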
Expanding V2X Communication Paradigms. While we currently focus on vehicle-to-vehicle communication, the approach naturally extends to broader V2X ecosystems [62]. The unstructured nature of messages like Emergency Vehicle Alerts, Signal Phase and Timing, and Roadside Safety Alerts [63] aligns well with natural language representations. Future research could incorporate these additional message types, leveraging language models’ ability to process diverse information streams within a unified framework.
This work introduces LangCoop, a novel framework that leverages natural language as a communication medium for multi-agent collaborative driving. We introduce Mixture Model Modular Chain-of-thought (M\(^3\)CoT) for reasoning and Natural Language Information Packaging (LangPack) for efficient data sharing. Extensive closed-loop experiments in simulation environments demonstrate that language-based collaboration not only reduces bandwidth requirements but also enhances driving performance and interpretability by including decision-level communication. Looking forward, further optimization of prompting strategies and deeper integration of vehicle dynamics promise to extend the capabilities of language-driven autonomous systems, marking a significant step toward safer and more efficient collaborative driving.
Example LangPack message (1.5 KB):
Agent 1, located at: [2.69048, 69.03092], current speed: 4.518 m/s.
Its scene description:
The driving scenario shows a clear, daytime environment with sunny weather. The road appears to be well-maintained, featuring multiple lanes in both directions. There is moderate traffic, including vehicles such as cars and a motorcycle. Trees line the roadside, indicating a suburban or semi-rural area. Overall, conditions are favorable for driving.
Its object description:
1. Motorcycle
- Location: Center lane, mid-distance
- Status: Riding in the same direction as the vehicle
- Intent: Likely continuing straight ahead on the road.
2. Black SUV
- Location: Right lane, close to the front of the vehicle
- Status: Stationary
- Intent: Appears to be waiting or preparing to merge into the lane.
3. Red Car
- Location: Left lane, further ahead
- Status: Moving slowly, closer to the center divider
- Intent: Preparing to turn or change lanes.
Its target description:
The target is 0.15402 meters to my right and 32.39753 meters to my front.
Its intent description:
1. Check Environment: Identify the surrounding vehicles and road conditions.
2. Speed Control: Maintain a speed and adhere to the speed limit.
3. Adjust Direction: Slightly adjust to the right to align with the target location, ensuring no obstacles are in the way.
4. Avoid Collisions: The car ahead is too close, slow down while adjusting to ensure safe distance.
Example LangPack message (1.4 KB):
Agent 0, located at: [2.25215, 80.50174], current speed: 5.630 m/s.
Its scene description:
The scenario depicts a clear, sunny day with good visibility. The road is a multi-lane paved highway with light to moderate traffic. Several vehicles are moving in both directions, and there are trees lining the roadside, indicating a suburban or rural environment. The road condition appears smooth and well-maintained.
Its object description:
1. Car (front center): Positioned in the right lane, traveling at a moderate speed. Intent appears to be straightforward driving.
2. Motorcycle (left center): Located in the left lane, approaching the vehicle. Likely intent is to overtake the car in the center lane.
3. Car (right lane, further ahead): Positioned in the right lane, moving in the same direction as others. Status indicates normal driving behavior.
Its target description:
The target is 0.86387 meters to my right and 36.0 meters to my front. The target is not an endpoint—continue moving forward after reaching it.
Its intent description:
1. Decelerate: Reduce speed to stay within the speed limit of 20 m/s.
2. Slightly Adjust Direction: Steer right towards the target (0.15402 meters to your right).
3. Monitor Traffic: Vehicles are ahead. To ensure a safe distance, slow down or change lanes if necessary.
4. Continue Forward: Maintain forward motion, adjusting as needed for further navigation.