DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang\(^*\), Zhao Xu, Weihua Luo, Kaifu Zhang
Alibaba International Digital Commerce
\(^*\) Corresponding Author: Longyue Wang


1 Introduction

Figure 1: Categorization of existing agent benchmarks along two dimensions: search width (number of information units to be searched) and search depth (average search steps per unit).

Large Language Models (LLMs) with advanced reasoning capabilities [1][3] have driven substantial progress across a wide range of natural language tasks. Building on these advances, LLM-based agents equipped with planning, tool use, and multi-step reasoning capabilities [4], [5] have achieved strong performance on complex real-world challenges, including computer operation [6], deep research [7], and information seeking [8], [9].

Figure 2: Detailed comparison among deep search benchmarks, wide search benchmarks, and our proposed DeepWideSearch.

So far, existing benchmarks for evaluating agents can be systematically categorized along two critical dimensions (Figure 1): search width (measured by the number of information units to be searched) and search depth (measured by the average number of search steps per unit), revealing four distinct categories: (1) Low-width, high-depth benchmarks (e.g., GAIA [8], BrowseComp [9]), which focus on intricate deep reasoning over multi-hop retrieval to locate target answers; (2) Low-width, low-depth benchmarks (e.g., TriviaQA, HotpotQA), which address simple fact-finding tasks; (3) High-width, low-depth benchmarks (e.g., WideSearch [10] and PaSa [11]), which emphasize broad information collection about specific questions; and critically, (4) High-width, high-depth tasks, which collect extensive information that requires deep reasoning—a critical capability for real-world applications like comprehensive market analysis and business development, yet entirely unaddressed by current benchmarks. For instance, as shown in Figure 2, the case “identifying the Top-10 EV maker in China by MoM sales growth (Aug 2025) and its Top-3 best-selling new EV cars (price and range)” exemplifies this challenge. It requires the agent to gather a large volume of candidates, i.e., EV makers, to fill the result table through wide-scale search, and to verify each candidate by performing deep reasoning, a combinatorial complexity that exceeds both the scope of width-focused evaluations and the scale of depth-focused benchmarks.

To address this critical evaluation gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate the capability of agents in deep and wide information seeking. Since constructing deep and wide search instances is challenging even with human annotation, we develop two methods for converting established datasets: (1) Deep2Wide Conversion, which extends deep search benchmarks (e.g., GAIA and BrowseComp) by augmenting their information scope through human-annotated table schemas; and (2) Wide2Deep Conversion, which enhances wide search queries by replacing explicit entities with synthesized complex sub-questions that necessitate multi-hop search steps. Both approaches integrate rigorous human validation protocols to ensure data quality while maintaining the combinatorial complexity inherent in real-world information-seeking scenarios. The final benchmark comprises 220 meticulously curated questions spanning 15 diverse domains, featuring both Chinese and English queries with human-verified ground truths, with 85 instances derived from the Deep2Wide and 135 from the Wide2Deep construction method.

We conduct comprehensive experiments with state-of-the-art LLMs and agent systems on DeepWideSearch. Our results demonstrate that even the most advanced agent systems achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial difficulty of this kind of information-seeking task. Notably, while agent frameworks consistently improve core entity identification (e.g., +15.91 absolute percentage points in Core Entity Accuracy), they exhibit limited efficacy in wide-scale information collection, frequently underperforming their LLM counterparts that rely on internal knowledge. Through systematic error analysis, we identify four fundamental failure modes: (1) lack of effective reflection mechanisms when encountering problematic search trajectories; (2) overreliance on parametric internal knowledge, leading to outdated or inaccurate information; (3) insufficient retrieval despite accessing relevant webpages; and (4) context overflow exceeding the limitations of current agent architectures. These empirical findings expose key limitations of current agent architectures for deep and wide information seeking. To facilitate further research in this critical domain, we publicly release the DeepWideSearch benchmark, including the datasets and evaluation codebase.

2 Related Work

2.1 LLM-based Search Agents

The emergence of LLM-based agent systems has enabled sophisticated information-seeking capabilities, with frameworks ranging from closed-source implementations (e.g., OpenAI Deep Research) to open-source platforms (e.g., WebAgent [12] and Cognitive Kernel-Pro [13]). These systems have demonstrated proficiency across numerous application domains, including computer use, deep research for complex problem investigation [14], and multi-step information retrieval through tool use [4]. Among these applications, information-seeking agents represent a critical frontier with direct real-world impact. Current research in this domain primarily addresses five technical challenges: (1) agentic system architecture design [15][18], (2) synthetic data generation for complex scenarios [19][21], (3) optimization techniques for retrieval efficiency [22][24], (4) knowledge management for multi-hop reasoning [15], [25], and (5) evaluation methodologies for performance assessment [26], [27].

2.2 Benchmarks for LLM-based Agents

Existing evaluation frameworks for information-seeking agents primarily target two distinct capabilities: (1) Depth in multi-hop reasoning, measured by benchmarks like GAIA [8] and BrowseComp [9] for general complex reasoning, and domain-specific variants in healthcare [28] and E-commerce [29]; and (2) Width in information collection, assessed by WideSearch [10] for comprehensive retrieval of atomic information, and by PaSa [11] and SPAR [30] for academic literature retrieval. Crucially, no existing benchmark captures the combinatorial complexity inherent in real-world information-seeking tasks that simultaneously demand extensive exploration (width) and intricate multi-step reasoning (depth). This fundamental gap in evaluation methodology has prevented meaningful progress toward agents capable of handling complex real-world information-seeking tasks. To address this limitation, we propose DeepWideSearch, the first benchmark explicitly designed to evaluate the capability of agents in the deep and wide information-seeking task.

3 Task Formulation

As shown in Figure 3, DeepWideSearch establishes an evaluation framework that explicitly captures the combinatorial complexity of real-world information-seeking tasks—requiring agents to perform deep reasoning and wide-scale information collection. The evaluation metrics (Column F1, Row F1, Item F1, and Success Rate) illustrated in Figure 3 will be formally described in Section 4.4.

3.0.0.1 Input

Formally, each task in DeepWideSearch is defined as a tuple (\(Q, C\)): (1) Question \(\boldsymbol{Q}\) represents a complex natural language query for deep and wide information seeking; and (2) Columns \(\boldsymbol{C=\{c_i\}_{i=1}^N}\) define the table schema as a set of \(N\) attributes and constraints that need to be collected and verified, such as the EV price and MoM sales growth in Figure 3 (right).

3.0.0.2 Output

As shown in Figure 3 (middle), agents are required to generate a structured tabular response \(R\) by performing wide search to gather numerous candidates and deep search to verify each candidate.
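
For concreteness, a task instance (\(Q, C\)) together with its ground-truth table can be represented as a small structured record. The sketch below is a hypothetical Python representation under our own naming assumptions (`DeepWideTask`, `gold_table`); it is not the released data schema.

```python
from dataclasses import dataclass, field

@dataclass
class DeepWideTask:
    """Hypothetical container for one DeepWideSearch instance (Q, C) plus its answer."""
    question: str                    # Q: the deep and wide natural-language query
    columns: list[str]               # C = {c_i}: attributes to collect for every entity
    gold_table: list[dict[str, str]] = field(default_factory=list)  # human-verified rows

# Example mirroring Figure 3: each row is an EV maker with the attributes to verify.
task = DeepWideTask(
    question=("Identify the Top-10 EV makers in China by MoM sales growth (Aug 2025) "
              "and their Top-3 best-selling new EV cars (price and range)."),
    columns=["EV maker", "MoM sales growth", "Car model", "Price", "Range"],
)
print(len(task.columns))  # N = 5 attributes per entity
```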

Figure 3: Task formulation of the DeepWideSearch task. The evaluation metrics (highlighted in red) are detailed in Section 4.4.

4 Methodology of Dataset Construction

Constructing DeepWideSearch instances from scratch presents significant challenges due to the substantial human effort required. To address this challenge while maintaining methodological rigor, we propose two methods to convert established datasets into deep and wide search questions: (1) Deep2Wide Conversion and (2) Wide2Deep Conversion. Both methodologies are complemented by human annotation procedures to ensure data quality.

Figure 4: The pipelines of our proposed Deep2Wide and Wide2Deep data construction methods.

4.1 Convert Deep Search Datasets (Deep2Wide)

Existing deep search benchmarks such as GAIA [8], BrowseComp [9] and BrowseComp-zh [31] require agents to employ multi-hop web browsing and deep reasoning to identify target answers. Building upon these resources, we develop the Deep2Wide conversion methodology by expanding the scope of searched information. As illustrated in Figure 4 (Top), our approach follows a three-stage pipeline inspired by WideSearch [10]: (1) Core Entity Filtering: We sample 80 Chinese questions from BrowseComp-zh and 20 English questions from BrowseComp, filtering out instances where answers are unsuitable as core entities (e.g., dates and numerical values). For example, as shown in Figure 5, Dan Lin is the core entity of the deep search question; (2) Table Schema Definition: Human annotators design structured table schemas by defining relevant information about the core entities; (3) Comprehensive Annotation: Annotators perform exhaustive web searches to populate the tables. Each instance requires approximately 30 minutes of human annotation time, ensuring high-quality and verified data. Following a design similar to that of the WideSearch benchmark [10], we incorporated timestamps into each question to ensure that the answers remain invariant over time.


Figure 5: One deep search question in BrowseComp-zh.

4.2 Convert Wide Search Datasets (Wide2Deep)

Given that WideSearch [10] is the only publicly available dataset providing human-annotated tabular answers for wide-scale information seeking, we develop the Wide2Deep conversion methodology to transform these wide search queries by introducing complexity in entity identification. This approach reuses the valuable human-annotated tables in WideSearch while strengthening the deep reasoning requirements. Inspired by WebWalker [12], we implement a human-in-the-loop pipeline (Figure 4, bottom) comprising four stages: (1) Entity Extraction: Advanced LLMs identify the core entities in 160 English and Chinese WideSearch questions, analogous to the core entity in deep search benchmarks (Figure 5); (2) Deep Sub-Question Synthesis: Following prior work [20], [21], a web search agent is implemented to recursively traverse official websites about the core entities and collect rich entity information. A complex sub-question is then generated from this information, adhering to two critical constraints: (a) Uniqueness: the answer to the question must be a single, well-defined entity; and (b) Complexity: direct derivation of the entity from the question must require at least one additional web search step; (3) Question Fusion: Claude Sonnet 4 fuses the deep sub-question with the original wide search query; and (4) Human Annotation: A team of seven master’s-level annotators validates and refines the synthesized questions to ensure uniqueness, complexity, and linguistic naturalness. This process requires approximately 40 minutes of human annotation per instance, maintaining the high-quality standards essential for a rigorous benchmark. The prompts for core entity extraction, deep sub-question synthesis, and question fusion are provided in Appendix 11.
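
A minimal sketch of this pipeline's control flow is shown below. The `llm`, `search_agent`, and `annotate` callables are placeholders we assume for illustration (the released prompts in Appendix 11 define the actual instructions), so this is a schematic outline rather than the released implementation.

```python
def wide2deep_convert(wide_question: str, llm, search_agent, annotate) -> str:
    """Schematic Wide2Deep conversion; llm / search_agent / annotate are placeholders."""
    # (1) Entity Extraction: identify the core entity of the wide search question.
    core_entity = llm(f"Extract the core entity of this question:\n{wide_question}")

    # (2) Deep Sub-Question Synthesis: browse pages about the entity, then write a
    #     sub-question whose unique answer is the entity and needs >= 1 extra search.
    evidence = search_agent(core_entity)  # recursively collected entity information
    sub_question = llm(
        "Write a question that (a) has this entity as its unique answer and "
        "(b) cannot be answered without at least one additional web search.\n"
        f"Entity: {core_entity}\nEvidence: {evidence}"
    )

    # (3) Question Fusion: replace the explicit entity with the deep sub-question.
    fused_question = llm(
        f"Rewrite the original question, replacing '{core_entity}' with the "
        f"sub-question.\nOriginal: {wide_question}\nSub-question: {sub_question}"
    )

    # (4) Human Annotation: annotators verify uniqueness, complexity, naturalness.
    return annotate(fused_question)
```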

4.3 Data Statistics

Table 1 provides a comprehensive comparison of our DeepWideSearch benchmark against existing datasets across multiple dimensions. Our benchmark exhibits significantly higher search complexity than prior work, with an average table volume of 414.10 information units, substantially exceeding deep search benchmarks like GAIA and BrowseComp. Crucially, DeepWideSearch requires 4.21 search steps on average to identify core entities, more than three times the 1.24 steps of WideSearch. The dataset spans 15 diverse domains, covering both English and Chinese queries, with 220 carefully curated instances (85 from Deep2Wide and 135 from Wide2Deep). These statistics empirically validate the deep and wide attributes of our proposed DeepWideSearch. Cases and further details about the data in Table 1 can be found in Appendix 9.

Table 1: Data statistics comparison across benchmarks. GAIA refers to the text-only split.
| Benchmarks | Domains | Data Size | Avg. Sample Per Domain | Table Volume | Avg. Steps to Search Entity | Lang. |
|---|---|---|---|---|---|---|
| TriviaQA [32] | - | 95K | - | 1 | \(\approx\) 1 | EN |
| HotpotQA [33] | - | 113K | - | 1 | \(\approx\) 2 | EN |
| GAIA [8] | - | 103 | - | 1 | 7.73 | EN |
| BrowseComp [9] | 9 | 1266 | 126.6 | 1 | - | EN |
| BrowseComp-zh [31] | 11 | 289 | 26.27 | 1 | - | ZH |
| WideSearch [10] | 14 | 200 | 12.80 | 450.67 | 1.24 | EN,ZH |
| **Our Proposed DeepWideSearch** | | | | | | |
| Deep2Wide | 15 | 85 | 7.08 | 247.74 | 3.22 | EN,ZH |
| Wide2Deep | 13 | 135 | 10.38 | 518.84 | 4.55 | EN,ZH |
| Overall | 15 | 220 | 14.67 | 414.10 | 4.21 | EN,ZH |

4.4 Evaluation Metrics of DeepWideSearch

As shown in Figure 3, we evaluate agent performance on DeepWideSearch along three complementary axes: Depth, Width, and Efficiency.

4.4.0.1 Depth Evaluation

The depth dimension evaluates the capability of agents to correctly identify target entities through deep reasoning over multi-hop retrieval. Following previous works [8], [9], we introduce the Column-F1 metric. As shown in Figure 3, Column-F1 is computed as the F1 score over the unique columns in the table; these unique columns correspond to the core attributes that uniquely identify the entities (i.e., rows). Column-F1 can therefore be seen as an extension of the accuracy metric used in established deep search benchmarks, computing the precision of a group of entities (rows in the table). Higher Column-F1 scores indicate more precise entity identification across the entire table structure. Moreover, since both of our construction methods annotate the core entity of each question, we also introduce Core Entity Accuracy (CE Acc.) as an additional indicator of deep reasoning capability.
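
As a rough illustration of the depth metrics, the sketch below computes Column-F1 over the values of the unique (key) column and CE Acc. for a single question, assuming simple exact-match normalization; the released evaluation code may use different matching rules.

```python
def column_f1(pred_keys: list[str], gold_keys: list[str]) -> float:
    """F1 over the values of the unique (key) column, i.e. the identified entities."""
    norm = lambda s: s.strip().lower()
    pred, gold = {norm(k) for k in pred_keys}, {norm(k) for k in gold_keys}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                      # correctly identified entities
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def core_entity_accuracy(pred_entity: str, gold_entity: str) -> float:
    """CE Acc. for one question: 1.0 if the core entity matches the annotation."""
    return float(pred_entity.strip().lower() == gold_entity.strip().lower())
```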

4.4.0.2 Width Evaluation

The width dimension measures how comprehensively and accurately the agent retrieves all associated information units for entities (rows in the table). Building upon the evaluation framework of WideSearch [10], we assess performance at three granularities: (1) Success Rate: A binary metric indicating whether the agent’s output table exactly matches the human-annotated ground truth (all rows, columns, and values identical); (2) Row-level F1: Computes precision, recall, and F1 scores at the row level (i.e., for each entity and its associated attributes), capturing whether the agent retrieves complete contextual information per entity; (3) Item-level F1: The finest-grained metric evaluating accuracy at the individual cell level, reflecting fidelity in retrieving atomic information units within the structured table.
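
Analogously, the three width metrics can be sketched as set comparisons between the predicted and gold tables keyed by entity. The exact-match normalization below is our simplifying assumption, not the official matching logic.

```python
def width_metrics(pred: list[dict], gold: list[dict], key: str) -> dict:
    """Success Rate, Row F1 and Item F1 for one question (exact match, simplified)."""
    norm = lambda v: str(v).strip().lower()

    def f1(tp: int, n_pred: int, n_gold: int) -> float:
        if tp == 0 or n_pred == 0 or n_gold == 0:
            return 0.0
        p, r = tp / n_pred, tp / n_gold
        return 2 * p * r / (p + r)

    # Row level: a predicted row is correct if every cell of the gold row sharing
    # the same key value matches exactly.
    gold_by_key = {norm(row[key]): row for row in gold}
    row_tp = sum(
        1 for row in pred
        if norm(row.get(key, "")) in gold_by_key
        and all(norm(row.get(col, "")) == norm(val)
                for col, val in gold_by_key[norm(row.get(key, ""))].items())
    )

    # Item level: individual (entity, attribute, value) cells.
    cells = lambda rows: {(norm(r.get(key, "")), col, norm(val))
                          for r in rows for col, val in r.items() if col != key}
    pred_cells, gold_cells = cells(pred), cells(gold)
    item_tp = len(pred_cells & gold_cells)

    return {
        "success_rate": float(row_tp == len(gold) == len(pred)),  # table fully identical
        "row_f1": f1(row_tp, len(pred), len(gold)),
        "item_f1": f1(item_tp, len(pred_cells), len(gold_cells)),
    }
```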

4.4.0.3 Efficiency Evaluation

To address the substantial computational costs inherent in web-scale tool usage (including search and browsing APIs), we further evaluate system efficiency through two metrics: (1) Input/Output Tokens: the total tokens consumed during reasoning and tool calls; and (2) Cost: the estimated expenditure based on standard model inference API pricing during query resolution. These efficiency metrics are critical for real-world deployment, particularly given the demanding requirements of extensive multi-round search and browsing.
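
Cost estimation itself is straightforward once token counts are logged; the helper below is a minimal sketch in which the per-million-token prices are caller-supplied placeholders rather than any provider's actual pricing, so its output is illustrative only.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated API cost of one query from logged token counts and unit prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6

# Placeholder prices ($/1M tokens), so the printed value is illustrative only.
print(round(estimate_cost(186_200, 3_500, 3.0, 15.0), 2))
```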

To account for stochasticity in LLM-based agent behavior, we conduct four independent runs per question for each baseline system. For both depth and width metrics, we report three complementary statistics: (1) Avg@4: The mean performance across all four runs; (2) Max@4: The best performance observed across the four runs; and (3) Pass@4: The proportion of questions solved successfully in at least one run (only for Success Rate). This comprehensive evaluation protocol ensures robustness against sampling variance while also highlighting the system’s peak performance potential.
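
The three run-level statistics reduce to simple aggregations over the four per-run scores, as in the following minimal sketch.

```python
def aggregate_runs(scores: list[float], successes: list[bool]) -> dict:
    """Aggregate four independent runs of one question into Avg@4 / Max@4 / Pass@4."""
    return {
        "avg@4": sum(scores) / len(scores),  # mean performance over the runs
        "max@4": max(scores),                # best single-run performance
        "pass@4": float(any(successes)),     # solved successfully in at least one run
    }

print(aggregate_runs([0.42, 0.55, 0.38, 0.61], [False, False, False, True]))
```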

5 Experiments

5.1 Experimental Setup

We evaluate three kinds of baselines on our proposed DeepWideSearch benchmark: (1) Closed-source LLMs (without tool calls): OpenAI o3-mini, GPT-4o, GPT-5, Claude Sonnet 4, Gemini 2.5 Pro, and Qwen-Max; (2) Open-source LLMs (without tool calls): DeepSeek-V3/R1 [2], [3], KIMI-K2 [34], and the Qwen3 series [35]; and (3) Open-source Agent Systems: WebSailor [20], Smolagents [36], and OWL [37], each equipped with advanced GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro backbone models. All agent systems utilize identical tools: (1) the Google Search API; and (2) a Webpage Visit tool. Since webpages in HTML format are often very lengthy, we use the same LLM in the agents to condense the HTML into a concise summary. The cost of this summarization process is also counted toward the efficiency metrics. We utilize the official API endpoints of these LLMs with their default decoding parameters.
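
The shared tool layer can be sketched as two thin wrappers: a search wrapper and a visit wrapper whose page summarization is delegated to the same backbone LLM (and therefore billed toward the efficiency metrics). The `google_search`, `fetch_html`, and `llm` callables below are assumed placeholder interfaces, not the exact APIs used in our experiments.

```python
def search_tool(query: str, google_search) -> list[dict]:
    """Search tool: thin wrapper over an assumed Google Search API interface."""
    return google_search(query, num_results=10)   # e.g. [{'title', 'url', 'snippet'}, ...]

def visit_tool(url: str, fetch_html, llm, max_chars: int = 100_000) -> str:
    """Visit tool: fetch a webpage and let the same backbone LLM condense it.

    The summarization call is billed like any other LLM call, so its tokens are
    counted toward the efficiency metrics in Section 6.1.
    """
    html = fetch_html(url)[:max_chars]            # raw HTML pages are often very long
    return llm("Summarize the key facts on this webpage concisely:\n" + html)
```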

5.2 Main Results

Table 2: Main results on our proposed DeepWideSearch benchmark.
| Model / System | Success Rate (%) Avg@4 | Success Rate (%) Pass@4 | Row F1 (%) Avg@4 | Row F1 (%) Max@4 | Item F1 (%) Avg@4 | Item F1 (%) Max@4 | Column F1 (%) Avg@4 | Column F1 (%) Max@4 | CE Acc. (%) Avg@4 | CE Acc. (%) Pass@4 |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source LLMs** | | | | | | | | | | |
| OpenAI o3-mini | 0.0 | 0.0 | 3.35 | 4.55 | 13.59 | 16.85 | 27.36 | 35.68 | 61.59 | 69.55 |
| GPT-5 | 0.30 | 1.36 | 9.61 | 13.42 | 21.67 | 28.21 | 31.71 | 41.05 | 58.41 | 72.72 |
| Claude Sonnet 4 | 0.9 | 0.9 | 7.31 | 8.97 | 19.94 | 23.38 | 32.63 | 40.16 | 57.95 | 64.09 |
| Gemini 2.5 Pro | 0.9 | 1.82 | 15.42 | 18.96 | 32.06 | 37.10 | 45.27 | 52.86 | 73.98 | 81.82 |
| Qwen-Max | 0.0 | 0.0 | 4.16 | 6.18 | 14.32 | 18.48 | 28.81 | 36.19 | 56.02 | 63.64 |
| GPT-4o | 0.0 | 0.0 | 4.18 | 7.01 | 11.86 | 16.41 | 19.66 | 27.07 | 54.20 | 63.64 |
| **Open-source LLMs** | | | | | | | | | | |
| DeepSeek-V3 | 0.23 | 0.45 | 6.52 | 9.99 | 19.08 | 24.32 | 31.26 | 39.56 | 60.68 | 69.09 |
| DeepSeek-R1 | 0.28 | 0.45 | 10.72 | 14.39 | 25.01 | 30.56 | 38.42 | 47.77 | 66.93 | 75.45 |
| KIMI-K2 | 0.34 | 0.91 | 7.74 | 11.92 | 20.44 | 27.54 | 31.48 | 41.83 | 64.32 | 73.18 |
| Qwen3-235B-A22B | 0.0 | 0.0 | 2.94 | 5.74 | 12.38 | 19.53 | 22.03 | 34.99 | 52.39 | 67.73 |
| Qwen3-235B-A22B-Instruct | 0.0 | 0.0 | 3.50 | 5.34 | 13.28 | 17.85 | 24.64 | 33.03 | 56.82 | 64.09 |
| Qwen3-32B | 0.0 | 0.0 | 2.28 | 3.67 | 12.05 | 16.26 | 26.37 | 35.97 | 54.66 | 66.36 |
| **Open-source Agent Framework with Advanced LLMs** | | | | | | | | | | |
| OWL (Gemini 2.5 Pro) | 0.0 | 0.0 | 11.11 | 16.93 | 28.75 | 41.70 | 34.84 | 50.39 | 66.14 | 81.82 |
| OWL (Claude Sonnet 4) | 0.68 | 1.36 | 8.29 | 14.81 | 20.44 | 31.65 | 30.08 | 45.50 | 67.39 | 81.82 |
| Smolagents (Gemini 2.5 Pro) | 0.11 | 0.45 | 9.01 | 15.65 | 18.53 | 30.91 | 27.39 | 45.09 | 60.00 | 79.09 |
| Smolagents (Claude Sonnet 4) | 0.91 | 0.91 | 5.06 | 8.94 | 14.49 | 22.68 | 21.60 | 33.83 | 62.95 | 74.09 |
| Smolagents (GPT-5) | 0.45 | 0.45 | 8.18 | 14.27 | 20.26 | 30.66 | 31.83 | 44.41 | 66.48 | 80.00 |
| WebSailor (Gemini 2.5 Pro) | 1.25 | 2.73 | 12.51 | 20.49 | 25.29 | 39.11 | 34.41 | 52.69 | 70.57 | 81.36 |
| WebSailor (Claude Sonnet 4) | 2.39 | 3.64 | 16.88 | 24.26 | 32.90 | 42.35 | 42.01 | 54.01 | 70.91 | 80.90 |
| WebSailor (GPT-5) | 0.34 | 1.36 | 10.97 | 16.17 | 25.96 | 35.65 | 37.18 | 49.48 | 74.32 | 85.00 |

The complete results are presented in Table 2. Most baselines demonstrate near-zero success rates, with only WebSailor (Gemini 2.5 Pro) and WebSailor (Claude Sonnet 4) exceeding 1% Success Rate (Avg@4), confirming the inherent complexity of simultaneously handling deep reasoning and wide-scale information collection. Notably, Gemini 2.5 Pro emerges as the top-performing LLM, achieving the highest Column F1 (45.27%, Avg@4), Core Entity Accuracy (73.98%, Avg@4), and Pass@4 Success Rate (1.82%) among LLMs, even outperforming several agent systems. This performance indicates that Gemini 2.5 Pro possesses advanced reasoning capabilities for entity identification and extensive internal knowledge for filling result tables without external search. Below, we detail the performance of the baselines on the depth and width metrics.

5.2.0.1 Depth Metrics

Our analysis reveals that agent systems generally enhance the deep search capabilities of base LLMs, as evidenced by consistent improvements in Core Entity Accuracy (CE Acc.). For example, the CE Acc. (Avg@4) of GPT-5 increases from 58.41% (base LLM) to 74.32% when integrated into WebSailor, a gain of +15.91 percentage points. Similarly, Claude Sonnet 4 improves from 57.95% to 70.91% under WebSailor, demonstrating the effectiveness of iterative tool calls and multi-step reasoning in complex information retrieval. However, Gemini 2.5 Pro is a notable exception to this trend. Upon close inspection of the generated outputs, we find that Gemini 2.5 Pro in agent systems frequently fails due to three critical issues: (a) producing invalid markdown-formatted tables; (b) executing incorrect tool call APIs; and (c) leaving tasks incomplete due to inference errors, which occurs in 24.24% of cases on average—substantially more often than for GPT-5 (16.36%) and Claude Sonnet 4 (17.80%). This suggests that Gemini 2.5 Pro’s output formatting becomes brittle under multi-step tool orchestration. Critically, while agent systems improve core entity identification, they fail to consistently enhance column-level precision. For instance, the Column F1 (Avg@4) of Claude Sonnet 4 declines from 32.63% (base LLM) to 30.08% in OWL and 21.60% in Smolagents. This pattern highlights a fundamental limitation: even when agents successfully identify core entities through multi-hop reasoning, current agent architectures cannot reliably collect complete entities, and their effectiveness often falls below that of base LLMs relying on internal knowledge.

5.2.0.2 Width Metrics

When evaluating the width metrics that measure comprehensive information collection, we observe that most agent frameworks do not significantly improve the base LLMs’ wide search capabilities. Only three combinations demonstrate consistent improvements across all width metrics: OWL (Claude Sonnet 4), WebSailor (Claude Sonnet 4), and WebSailor (GPT-5). The remaining agents show substantial performance degradation compared to their counterpart base LLMs. Beyond the Gemini 2.5 Pro issues described above, the Smolagents framework also consistently underperforms across nearly all metrics. Our investigation reveals that Smolagents performs minimal reasoning before tool calls, which restricts the effectiveness of the subsequent calls. This architectural constraint prevents Smolagents from formulating precise search queries, resulting in inadequate information coverage and poor performance on width metrics.

6 Analysis

In this section, we conduct several detailed analyses covering Efficiency (Section 6.1), Tool Calls (Section 6.2), Differences in Dataset Construction Methods (Section 6.3), Per-topic Performance (Section 6.4), and Errors (Section 6.5).

6.1 Efficiency Analysis


Table 3: Average token usage and cost statistics for some agents on DeepWideSearch questions.
| Agents | Input Tokens | Output Tokens | Cost ($) |
|---|---|---|---|
| OWL (Gemini 2.5 Pro) | 65K | 2.5K | \(\approx 0.2\) |
| OWL (GPT-5) | 1.8M | 50K | \(\boldsymbol{\approx 2.75}\) |
| Smolagents (Claude Sonnet 4) | 224K | 2.4K | \(\approx 2.14\) |
| Smolagents (GPT-5) | 120K | 25K | \(\approx 0.90\) |
| WebSailor (Gemini 2.5 Pro) | 65K | 2.5K | \(\approx 0.49\) |
| WebSailor (Claude Sonnet 4) | 186.2K | 3.5K | \(\approx 1.40\) |
| WebSailor (GPT-5) | 17.7K | 6.2K | \(\approx 0.36\) |

Compared to deep search or wide search, DeepWideSearch imposes significantly higher computational and operational overhead. As shown in Table 3, even state-of-the-art agents incur substantial resource costs per query. For instance, OWL (GPT-5) and WebSailor (Claude Sonnet 4) incur average costs of $2.75 and $1.40 per question, with many queries remaining unresolved despite this expense. Due to unstable network conditions and tool call errors, agents often require multiple retry attempts to complete operations such as search, further increasing computational overhead; for instance, OWL (GPT-5) incurs an average cost exceeding $6.8 under retry conditions. These results underscore a critical inefficiency of current agent architectures on complex deep and wide information-seeking tasks. This suggests that existing systems are not yet scalable enough for real-world deployment on DeepWideSearch-style tasks, motivating future work on efficient planning, memory reuse, and adaptive resource allocation.

6.2 Tool Calls Analysis


Table 4: Average tool calls in the WebSailor system.
| Agents | Search | Visit |
|---|---|---|
| WebSailor (Gemini 2.5 Pro) | 4.77 | 2.68 |
| WebSailor (Claude Sonnet 4) | 23.23 | 4.57 |
| WebSailor (GPT-5) | 8.72 | 5.35 |

Table 4 shows the average number of tool calls (Search and Visit) per sample across different backbone LLMs in WebSailor. Notably, WebSailor (Claude Sonnet 4) issues significantly more Search tool calls (23.23) than Gemini 2.5 Pro (4.77) and GPT-5 (8.72). This aligns with its superior performance in Table 2, suggesting that scaling up search tool calls improves performance.

Table 5: Performance comparison between Deep2Wide and Wide2Deep methods.
| Model / System | Success Rate (%) Avg@4 | Success Rate (%) Pass@4 | Row F1 (%) Avg@4 | Row F1 (%) Max@4 | Item F1 (%) Avg@4 | Item F1 (%) Max@4 | Column F1 (%) Avg@4 | Column F1 (%) Max@4 | Entity Acc. (%) Avg@4 | Entity Acc. (%) Pass@4 |
|---|---|---|---|---|---|---|---|---|---|---|
| **Wide Search \(\boldsymbol{\rightarrow}\) DeepWideSearch (Wide2Deep)** | | | | | | | | | | |
| Avg. LLMs | 1.17 | 2.22 | 17.23 | 21.80 | 38.04 | 43.95 | 50.94 | 59.09 | 90.12 | 93.83 |
| Avg. Agents | 1.23 | 2.13 | 15.55 | 24.13 | 33.51 | 46.98 | 44.13 | 60.70 | 88.36 | 96.76 |
| Avg. All | 1.21 | 2.15 | 16.00 | 23.49 | 34.75 | 46.16 | 45.96 | 60.26 | 88.84 | 95.96 |
| **Deep Search \(\boldsymbol{\rightarrow}\) DeepWideSearch (Deep2Wide)** | | | | | | | | | | |
| Avg. LLMs | 0.0 | 0.0 | 2.67 | 3.92 | 8.52 | 13.25 | 13.67 | 21.81 | 31.77 | 46.27 |
| Avg. Agents | 0.15 | 0.44 | 3.25 | 5.99 | 9.21 | 16.43 | 13.75 | 24.92 | 33.86 | 54.56 |
| Avg. All | 0.11 | 0.32 | 3.09 | 5.42 | 9.02 | 15.56 | 13.73 | 24.07 | 33.29 | 52.30 |
| **Overall** | | | | | | | | | | |
| Avg. LLMs | 0.72 | 1.36 | 11.60 | 14.89 | 26.64 | 32.09 | 36.54 | 44.69 | 67.58 | 75.45 |
| Avg. Agents | 0.75 | 1.25 | 9.36 | 14.88 | 20.76 | 30.60 | 27.77 | 40.74 | 58.05 | 69.89 |
| Avg. All | 0.74 | 1.28 | 9.97 | 14.88 | 22.36 | 31.01 | 30.16 | 41.82 | 60.65 | 71.40 |

6.3 Differences in Dataset Construction Methods

Table 5 reports the average performance of advanced LLMs (GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro) and their counterpart agent systems. The Deep2Wide construction method produces substantially more challenging data than the Wide2Deep method. For example, LLMs and agents achieve a nearly 0.0% success rate on Deep2Wide (Avg. LLMs: 0.0% Avg@4; Avg. Agents: 0.15% Avg@4), compared to Wide2Deep (Avg. LLMs: 1.17% Avg@4; Avg. Agents: 1.23% Avg@4). Critically, the overall Entity Accuracy on Deep2Wide is only 33.29% (vs. 88.84% on Wide2Deep). This observation indicates that the deep sub-questions synthesized by the Wide2Deep method are easier for LLMs to solve. Nevertheless, the Column F1 on Wide2Deep remains below 51%, indicating that comprehensively collecting entities is still challenging.

6.4 Per-topic Performance Analysis

Figure 6: Per-topic analysis on two depth metrics (Column F1 and CE Acc.) and two width metrics (Item F1 and Row F1).

As shown in Figure 6, we analyze topic-wise performance through bidirectional bar charts evaluating depth metrics (Column F1, CE Acc.) and width metrics (Item F1, Row F1), excluding domains with fewer than 5 samples. Four key patterns emerge: (1) The top-5 most frequent topics (sample count >20) are Film & Movies, Politics, Finance, Technology, and Sports; (2) Politics achieves the highest item- and row-level F1 scores (35% and 19%), indicating that wide search is more tractable in this topic, while Politics and Finance attain the highest Column F1 and CE accuracy, suggesting that deep search is comparatively easier here; (3) Despite strong depth performance on Finance, Travel, and Education, baselines exhibit substantially lower width metrics on these three topics (e.g., 20% Item F1 on Travel and 8% Row F1 on Finance), revealing that strong deep search capability does not guarantee effective wide search; and (4) History and Games consistently underperform across all metrics (e.g., 5% Column F1 on History), establishing them as the most challenging topics. These findings highlight the heterogeneous nature of search complexity across topics.

6.5 Error Analysis

As shown in Table 2, agent systems can underperform their backbone LLMs on DeepWideSearch tasks. Our error analysis reveals four key failure patterns: (1) Lack of Reflection: agents often lack effective reflection mechanisms. When encountering wrong trajectories (Figure 13) or tool call errors (Figure 14), they prematurely conclude that the task is unsolvable and output empty tables rather than analyzing the failure causes and exploring alternative paths; (2) Overreliance on Internal Knowledge: even when correctly identifying core entities (Figure 15), agents often populate tables solely from their parametric knowledge rather than performing proper web queries, resulting in outdated or inaccurate information due to the limited scope of their training data; (3) Insufficient Retrieval: despite identifying relevant pages (Figure 17), agents frequently fail to access the complete context through visit operations, leading to significant information omissions. Even when visit operations are executed correctly, the summarized webpage content may still miss critical details. This limitation motivates the design of a question-aware, customized webpage summarization process in agent systems; and (4) Context Overflow: deep and wide search requires extensive multi-step reasoning and numerous search tool calls, significantly expanding the context length (Figure 16). This issue occurs in 24.96% of cases, exceeding the context management capabilities of current agent architectures. In summary, these four error patterns highlight that current agents face substantial limitations when addressing the combined challenges of depth and width in complex information-seeking tasks, and addressing them requires architectures specialized for deep and wide search scenarios.

7 Conclusion

This paper addresses a critical gap in the evaluation of information-seeking agents by introducing DeepWideSearch, the first benchmark designed to simultaneously assess deep reasoning and wide-scale information collection. Our experiments demonstrate that state-of-the-art agents achieve only a 2.39% average success rate on this challenging benchmark, revealing fundamental limitations of current agents. These results underscore the combinatorial complexity of deep and wide search as a key frontier and should guide future research toward more capable information-seeking agents.

8 Limitations and Future Work

Despite the contributions of the DeepWideSearch benchmark, three key limitations remain to be addressed in future work: (1) As shown in Table 5, the Wide2Deep construction method produces significantly easier questions than Deep2Wide, as evidenced by the substantially higher Entity Accuracy. We will iteratively refine the sub-questions to increase question complexity while maintaining natural language quality; (2) Our current dataset exhibits slight differences from real-world deep and wide search questions in terms of solution paths (see the cases in Appendix 10). In future work, we will iteratively refine the DeepWideSearch dataset to better align with real-world applications; and (3) Our dataset construction relies heavily on human annotation, limiting scalability. Future work should explore automated data generation techniques and develop reference-free evaluation metrics that avoid complex, human-verified tabular answers, enabling efficient dataset expansion and model optimization across diverse domains.

9 Details of Datasets

The table volume in Table 1 denotes the amount of information to be searched for each DeepWideSearch question, defined as the product of the number of rows and columns of the answer table. The average number of steps to search an entity is counted as the number of reasoning steps and tool calls. Specifically, the average steps for GAIA are computed from the reference trajectories provided in the dataset, and the average steps for WideSearch are annotated by our three human raters. In addition, Figure 7 and Figure 8 present two cases from our proposed DeepWideSearch dataset.
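
As a small illustration of the table-volume statistic, the following snippet computes rows × columns for a hypothetical answer table.

```python
def table_volume(gold_table: list[dict]) -> int:
    """Table volume in Table 1: information units = number of rows x number of columns."""
    if not gold_table:
        return 0
    return len(gold_table) * len(gold_table[0])

# A hypothetical 10-row table with 5 attributes per row holds 50 information units.
row = {"EV maker": "A", "MoM growth": "1%", "Model": "X", "Price": "Y", "Range": "Z"}
print(table_volume([row] * 10))  # 50
```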

Figure 7: One case in DeepWideSearch dataset.
Figure 8: One case in DeepWideSearch dataset.

10 Differences between Our Dataset and Real-world Questions


Figure 9: Two cases of deep and wide search questions.

Figure 9 illustrates two representative deep and wide search questions: the first is an example from our constructed DeepWideSearch dataset, and the second is drawn from a real-world e-commerce scenario. While our dataset captures the essential characteristics of deep and wide search, the primary difference from real-world settings lies in the solution path. In our dataset, the process emphasizes first performing a deep search to gather critical information, followed by a wide search to expand the relevant attributes. In contrast, real-world tasks often begin with a wide search to collect a large pool of candidates, followed by a deep search over each candidate for verification. Nevertheless, despite this procedural difference, our dataset still exhibits the traits of deep and wide search. Specifically, during the initial deep search phase, the model also needs to list and reason over a set of candidates, systematically applying deep verification to determine which candidates satisfy the problem constraints and thereby identify the correct target entity. Consequently, even this first-stage deep search inherently incorporates characteristics of wide search.

11 Prompts for DeepWideSearch Data Construction

This section presents the three prompts used in the Wide2Deep method: (1) the Core Entity Extraction prompt in Figure 10; (2) the Deep Sub-Question Synthesis prompt in Figure 11; and (3) the Question Fusion prompt in Figure 12.


Figure 10: The prompt for core entity extraction in the Wide2Deep method.


Figure 11: The prompt for deep sub-question synthesis in the Wide2Deep method.


Figure 12: The prompt for deep and wide question fusion in the Wide2Deep method.

12 Error Cases in DeepWideSearch

This section provides four representative kinds of agent errors: (1) Lack of Reflection (Figure 13 and Figure 14); (2) Overreliance on Internal Knowledge (Figure 15); (3) Context Overflow (Figure 16); and (4) Insufficient Retrieval (Figure 17).


Figure 13: Lack of reflection when diving into a wrong trajectory.


Figure 14: Lack of reflection when tool calls are wrong.


Figure 15: Overreliance on the internal knowledge of LLMs.


Figure 16: Multi-turn tool calls and reasoning lead to the context overflow problem, interrupting agents before they can output the table.


Figure 17: Complete information from the webpages is not passed to the agents, leading to insufficient retrieval errors.

References

[1]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[3]
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[4]
Y. Xi, J. Lin, Y. Xiao, Z. Zhou, R. Shan, T. Gao, J. Zhu, W. Liu, Y. Yu, and W. Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges. arXiv preprint arXiv:2508.05668, 2025.
[5]
H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025.
[6]
X. Wang, B. Wang, D. Lu, J. Yang, T. Xie, J. Wang, J. Deng, X. Guo, Y. Xu, C. H. Wu, Z. Shen, Z. Li, R. Li, X. Li, J. Chen, B. Zheng, P. Li, F. Lei, R. Cao, Y. Fu, D. Shin, M. Shin, J. Hu, Y. Wang, J. Chen, Y. Ye, D. Zhang, D. Du, H. Hu, H. Chen, Z. Zhou, H. Yao, Z. Chen, Q. Gu, Y. Wang, H. Wang, D. Yang, V. Zhong, F. Sung, Y. Charles, Z. Yang, and T. Yu. Opencua: Open foundations for computer-use agents, 2025. URL https://arxiv.org/abs/2508.09123.
[7]
M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.11763.
[8]
G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants, 2023. URL https://arxiv.org/abs/2311.12983.
[9]
J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516.
[10]
R. Wong, J. Wang, J. Zhao, L. Chen, Y. Gao, L. Zhang, X. Zhou, Z. Wang, K. Xiang, G. Zhang, et al. Widesearch: Benchmarking agentic broad info-seeking. arXiv preprint arXiv:2508.07999, 2025.
[11]
Y. He, G. Huang, P. Feng, Y. Lin, Y. Zhang, H. Li, et al. Pasa: An llm agent for comprehensive academic paper search. arXiv preprint arXiv:2501.10120, 2025.
[12]
J. Wu, W. Yin, Y. Jiang, Z. Wang, Z. Xi, R. Fang, L. Zhang, Y. He, D. Zhou, P. Xie, and F. Huang. Webwalker: Benchmarking llms in web traversal, 2025. URL https://arxiv.org/abs/2501.07572.
[13]
T. Fang, Z. Zhang, X. Wang, R. Wang, C. Qin, Y. Wan, J.-Y. Ma, C. Zhang, J. Chen, X. Li, H. Zhang, H. Mi, and D. Yu. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training, 2025. URL https://arxiv.org/abs/2508.00414.
[14]
R. Han, Y. Chen, Z. CuiZhu, L. Miculicich, G. Sun, Y. Bi, W. Wen, H. Wan, C. Wen, S. Maître, G. Lee, V. Tirumalashetty, E. Xue, Z. Zhang, S. Haykal, B. Gokturk, T. Pfister, and C.-Y. Lee. Deep researcher with test-time diffusion, 2025. URL https://arxiv.org/abs/2507.16075.
[15]
W. Zhang, C. Cui, Y. Zhao, R. Hu, Y. Liu, Y. Zhou, and B. An. Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving. arXiv preprint arXiv:2506.12508, 2025.
[16]
H. Zhou, X. Wan, R. Sun, H. Palangi, S. Iqbal, I. Vulić, A. Korhonen, and S. Arık. Multi-agent design: Optimizing agents with better prompts and topologies, 2025. URL https://arxiv.org/abs/2502.02533.
[17]
Y. Xia, J. Fan, W. Chen, S. Yan, X. Cong, Z. Zhang, Y. Lu, Y. Lin, Z. Liu, and M. Sun. AgentRM: Enhancing agent generalization with reward modeling. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19277–19290, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. URL https://aclanthology.org/2025.acl-long.945/.
[18]
T. Fang, H. Zhang, Z. Zhang, K. Ma, W. Yu, H. Mi, and D. Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model, 2025. URL https://arxiv.org/abs/2504.21024.
[19]
J. Wu, B. Li, R. Fang, W. Yin, L. Zhang, Z. Tao, D. Zhang, Z. Xi, G. Fu, Y. Jiang, P. Xie, F. Huang, and J. Zhou. Webdancer: Towards autonomous information seeking agency, 2025. URL https://arxiv.org/abs/2505.22648.
[20]
K. Li, Z. Zhang, H. Yin, L. Zhang, L. Ou, J. Wu, W. Yin, B. Li, Z. Tao, X. Wang, et al. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025.
[21]
Z. Tao, J. Wu, W. Yin, J. Zhang, B. Li, H. Shen, K. Li, L. Zhang, X. Wang, Y. Jiang, P. Xie, F. Huang, and J. Zhou. Webshaper: Agentically data synthesizing via information-seeking formalization, 2025. URL https://arxiv.org/abs/2507.15061.
[22]
Z. Zhang, Z. Chen, M. Li, Z. Tu, and X. Li. Rlvmr: Reinforcement learning with verifiable meta-reasoning rewards for robust long-horizon agents, 2025. URL https://arxiv.org/abs/2507.22844.
[23]
Y. Fan, K. Zhang, H. Zhou, Y. Zuo, Y. Chen, Y. Fu, X. Long, X. Zhu, C. Jiang, Y. Zhang, L. Kang, G. Chen, C. Huang, Z. He, B. Wang, L. Bai, N. Ding, and B. Zhou. Ssrl: Self-search reinforcement learning, 2025. URL https://arxiv.org/abs/2508.10874.
[24]
H. Sun, Z. Qiao, J. Guo, X. Fan, Y. Hou, Y. Jiang, P. Xie, Y. Zhang, F. Huang, and J. Zhou. Zerosearch: Incentivize the search capability of llms without searching, 2025. URL https://arxiv.org/abs/2505.04588.
[25]
W. Xu, K. Mei, H. Gao, J. Tan, Z. Liang, and Y. Zhang. A-mem: Agentic memory for llm agents, 2025. URL https://arxiv.org/abs/2502.12110.
[26]
M. Zhuge, C. Zhao, D. R. Ashley, W. Wang, D. Khizbullin, Y. Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian, Y. Shi, V. Chandra, and J. Schmidhuber. Agent-as-a-judge: Evaluating agents with agents, 2025. URL https://openreview.net/forum?id=DeVm3YUnpj.
[27]
B. Gou, Z. Huang, Y. Ning, Y. Gu, M. Lin, W. Qi, A. Kopanev, B. Yu, B. J. Gutiérrez, Y. Shu, C. H. Song, J. Wu, S. Chen, H. N. Moussa, T. Zhang, J. Xie, Y. Li, T. Xue, Z. Liao, K. Zhang, B. Zheng, Z. Cai, V. Rozgic, M. Ziyadi, H. Sun, and Y. Su. Mind2web 2: Evaluating agentic search with agent-as-a-judge, 2025. URL https://arxiv.org/abs/2506.21506.
[28]
S. Chen, P. Moreira, Y. Xiao, S. Schmidgall, J. Warner, H. Aerts, T. Hartvigsen, J. Gallifant, and D. S. Bitterman. Medbrowsecomp: Benchmarking medical deep research and computer use, 2025. URL https://arxiv.org/abs/2505.14963.
[29]
Y. Lyu, X. Zhang, L. Yan, M. de Rijke, Z. Ren, and X. Chen. Deepshop: A benchmark for deep research shopping agents, 2025. URL https://arxiv.org/abs/2506.02839.
[30]
X. Shi, Y. Li, Q. Kou, L. Yu, J. Xie, and H. Zhou. Spar: Scholar paper retrieval with llm-based agents for enhanced academic search, 2025. URL https://arxiv.org/abs/2507.15245.
[31]
P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, Y. Gu, S. Hong, J. Ren, J. Chen, C. Liu, and Y. Hua. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese, 2025. URL https://arxiv.org/abs/2504.19314.
[32]
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL https://aclanthology.org/P17-1147/.
[33]
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/1809.09600.
[34]
K. Team, Y. Bai, Y. Bao, and G. C. et al. Kimi k2: Open agentic intelligence, 2025. URL https://arxiv.org/abs/2507.20534.
[35]
A. Yang, A. Li, and B. Y. et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
[36]
A. Roucher, A. V. del Moral, T. Wolf, L. von Werra, and E. Kaunismäki. Smolagents: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025.
[37]
M. Hu, Y. Zhou, W. Fan, Y. Nie, B. Xia, T. Sun, Z. Ye, Z. Jin, Y. Li, Q. Chen, Z. Zhang, Y. Wang, Q. Ye, B. Ghanem, P. Luo, and G. Li. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation, 2025. URL https://arxiv.org/abs/2505.23885.