Finding Diamonds in Conversation Haystacks:
A Benchmark for Conversational Data Retrieval

Yohan Lee1,2*, Yongwoo Song1,3, Sangyeop Kim1,4
1Coxwave, 2Kakaobank, 3Kyung Hee University, 4Seoul National University
yann.lee@kakaobank.com, syw5141@khu.ac.kr, sy917kim@bdai.snu.ac.kr


Abstract

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach an NDCG@10 of only about 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.

1 Introduction

The widespread adoption of generative AI powered by Large Language Models (LLMs) has created vast repositories of conversation data [1]. These dialogues offer valuable insights into user behaviors and system performance. However, effectively analyzing and leveraging this accumulated conversational data remains an underexplored challenge in the field [2][4].

Unlike conventional information systems, large language model-based systems operate through open-ended interactions without predefined specifications [5], [6]. Users interact with them in diverse, unpredictable ways, creating unique challenges for conversational data analytics. Traditional approaches to extracting product insights struggle with these datasets: supervised learning techniques face prohibitive labeling costs [7], manual dialogue review becomes impractical at the scale of millions of conversations [8], [9], and conventional metrics fail to capture the complex evolution of user satisfaction across multiple turns [10].

Given these challenges, many product development teams have adopted an exploratory “Retrieve and Analyze” workflow to derive insights from their conversation data (see Appendix 6 for a detailed real-world case study). In this approach, retrieval quality fundamentally determines analysis effectiveness—if relevant conversations cannot be efficiently found, critical insights remain hidden despite analyst expertise. For example, when investigating satisfaction issues, product managers often use basic keyword searches like “unhappy” or “disappointed,” missing cases where dissatisfaction is expressed implicitly or across multiple turns. This retrieval gap creates significant blind spots in understanding user experiences and severely limits the value extracted from conversation datasets.

Figure 1: Comparison between traditional document retrieval and conversational data retrieval.

To address these limitations, we build upon the concept of Conversational Data Retrieval (CDR) [11]: the task of retrieving relevant conversations from large chat histories based on queries targeting conversation-specific content and context. As illustrated in Figure 1, CDR differs fundamentally from traditional document retrieval [12], [13] by addressing conversation-specific challenges: multi-turn exchanges, implicit meanings, and topic shifts.

Figure 2: An overview of the Conversational Data Retrieval (CDR) benchmark construction pipeline. (A) Collect and filter large-scale conversational data. (B) Generate query templates across five key areas. (C) Synthesize query-aligned conversations with LLMs. (D) Map relevance through reranking, human assessment, and classifier verification. (E) Integrate the processed data into a standardized CDR evaluation benchmark.

Beyond analytical use cases, effective CDR enables applications such as AI memory systems [14], [15], and retrieval-augmented generation [16], [17]. However, current retrieval solutions were not designed with conversations in mind, limiting the potential of these applications.

Despite its value, CDR remains underexplored in research. This gap stems from several factors: the proprietary nature of industrial conversation datasets, privacy concerns limiting public data availability, and the lack of standardized evaluation metrics [18], [19]. These challenges have hindered research progress on methods specifically designed for CDR.

To address this gap, we introduce a comprehensive benchmark for CDR. Figure 2 illustrates our construction process, including data collection, query design, and validation. Our contributions include: (1) the first benchmark specifically targeting CDR, comprising 1.6k queries and 9.1k conversations; (2) evaluation of 16 commonly used embedding models revealing performance disparities between document and conversation retrieval; (3) a taxonomy of five essential analytical tasks exposing unique challenges in CDR; and (4) practical query templates developed with domain experts applicable to product improvement workflows. This benchmark facilitates structured development of conversation retrieval models, supporting applications and research in conversation analysis.

2 Related Works

Information retrieval (IR) has evolved from lexical matching [20] to neural approaches [21]. Recent advances in generative AI have linked retrieval with conversation, enabling conversational search [22], [23], agent memory systems [24], [25], and retrieval-based reasoning [26], [27]. However, these efforts focus on using retrieval to enhance conversations [28], [29], not on effectively retrieving conversational data itself.

The unique value of conversational data lies in its multifaceted nature. Human-Computer Interaction research has identified several dimensions critical for understanding these interactions—including user intentions, emotional responses, conversation flow patterns, and trust development [8], [9]. These elements often span multiple turns and contain implicit signals that traditional document retrieval approaches struggle to capture [30], [31]. Effective analysis requires methods to identify these complex patterns within conversations.

While industrial applications generate vast conversational data, privacy concerns and proprietary issues severely limit public access to these datasets [32]. Even available datasets often lack sufficient coverage of specific analytical dimensions needed for comprehensive evaluation [33], [34].

Synthetic conversational data offers a valuable solution to these constraints, as high-quality synthetic dialogues can match or exceed the performance of systems trained on real data [35][37]. This approach enables more controlled evaluation by systematically varying conversation attributes while maintaining natural dialogue properties.

However, existing IR benchmarks [38], [39] focus on documents, while dialogue datasets typically focus on generation tasks. This creates a significant gap between the analytical needs identified in HCI research and available evaluation frameworks. Our CDR benchmark addresses this gap by integrating multi-dimensional aspects of conversations with a comprehensive evaluation framework.

Table 1: Five core analytical areas identified for the CDR benchmark with their product insights.
Analytical Area Description Product Insights
Emotion & Feedback Identifying users’ emotional states and feedback in conversations Revealing satisfaction patterns and pain points for product improvement
Intent & Purpose Recognizing user intentions and goals Evaluating alignment between intended and actual AI system usage
Conversation Dynamics Analyzing conversation flow, turn structure and resolution patterns Identifying conversation bottlenecks and improving dialogue completion rates
Trust, Safety & Ethics Exploring trust-building and ethical issues in conversations Identifying system reliability concerns and potential safety risks
Linguistic Style & Expression Analyzing language patterns and comprehension challenges Helping calibrate system language to user comprehension levels

3 Designing the CDR Benchmark

3.1 Data Collection and Industrial Requirements

To establish a foundation for the CDR benchmark, we collected conversational data from 11 diverse open-source dialogue datasets including LMSYS Chat [40], WildChat [33], and DialogSum [41]. To ensure quality and remove duplicates, we applied filtering using the NeMo Curator framework [42], refining approximately 2.4 million conversations to 600k high-quality dialogue instances. The complete data sources and filtering method are detailed in Appendix 7.

To ensure industrial applicability, we sampled 1k conversations for analysis and gathered input from 20 experts in generative AI product development. From this combined research, we identified key information needs when examining conversational data and determined five core areas for product improvement, shown in Table 1. These areas reflect how conversational data differs from traditional document retrieval challenges.

3.2 Query Template Design and Generation

From the five core areas in Table 1, we created 130 query templates that capture the specific characteristics of each category. Each template included placeholder elements to cover diverse conversational scenarios.

For example, a template in the Emotion & Feedback category might be: “Find conversations where users express {emotion} after {system_action}”. For the {emotion} placeholder, values included “frustration,” “disappointment,” and “satisfaction.”

We defined approximately 510 placeholder values across different categories. By combining these placeholders with our templates, we generated a total of 28k specific queries. Full details of templates and placeholders are provided in Appendix 8.
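To make the combination step concrete, the sketch below shows one way a template and its placeholder values can be expanded into specific queries; the {system_action} values here are illustrative stand-ins rather than entries from the benchmark.

```python
from itertools import product
import re

# One of the 130 templates, with illustrative placeholder values (the benchmark
# defines roughly 510 values in total; see Appendix 8).
template = "Find conversations where users express {emotion} after {system_action}"
placeholder_values = {
    "emotion": ["frustration", "disappointment", "satisfaction"],
    "system_action": ["an unhelpful answer", "a misunderstood request", "a slow response"],
}

def expand(template: str, values: dict[str, list[str]]) -> list[str]:
    """Fill every combination of placeholder values into the template."""
    slots = re.findall(r"\{(\w+)\}", template)
    return [
        template.format(**dict(zip(slots, combo)))
        for combo in product(*(values[slot] for slot in slots))
    ]

queries = expand(template, placeholder_values)  # 3 x 3 = 9 concrete queries
```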

3.3 Query-Aligned Conversation Synthesis Method

Finding conversations that match our diverse queries presented two challenges: our corpus could not cover all specific query scenarios needed, and manually labeling thousands of conversations would be prohibitively time-consuming.

To address these limitations, we first retrieved the top-5 candidate conversations for each query using an embedding model [43]. Twenty expert annotators with industry experience in conversational AI product development then manually reviewed these candidates. They classified each as related or unrelated based on whether it faithfully reflected the query intent. When no suitable match existed, we used reasoning-capable language models—Claude-3.7 [44], o1 [45], and o3-mini [46]—to create synthetically aligned conversations by adapting existing conversations from our corpus. These LLM-generated conversations were also validated by expert annotators to ensure both query fidelity and conversational naturalness.

Our conversation generation prompt (detailed in Appendix 9.1) instructed models to maintain each conversation structure and characteristics while incorporating elements needed for query alignment. This approach preserved the natural variation found in real conversations while ensuring examples contained features necessary for evaluation.

Figure 3: Domain distribution in the CDR benchmark dataset, showing diverse coverage across categories.

By combining pre-aligned conversations with synthetically aligned conversations, our method maintained domain diversity. Figure 3 shows balanced coverage across major categories like People and Society (17.00%) and Business and Industrial (15.79%), as classified by a fine-tuned classifier.

3.4 Expanding Query-Conversation Relevance Mappings

To create a realistic retrieval benchmark, we expanded each query to match multiple relevant conversations through a three-step process. First, we trained a specialized reranker model using 300k conversations from our corpus. We generated positive and negative query examples with LLaMa 3.3 70B [47] using prompts in Appendix 9.2 and 9.3, and fine-tuned the GTE Reranker [48] (training detailed in Appendix 10). Second, we applied this reranker to identify candidate relevant conversations, selecting pairs with relevance scores above 0.9 and excluding overly general queries matching more than 50 conversations.
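The filtering logic of this second step can be sketched as follows; the reranker interface is an assumption (a callable returning a relevance score in [0, 1]) rather than the exact API of the trained model.

```python
SCORE_THRESHOLD = 0.9   # keep pairs the reranker scores above 0.9
MAX_MATCHES = 50        # drop queries matching too many conversations (too general)

def expand_relevance(queries, conversations, reranker_score):
    """conversations: dict of conversation_id -> text; reranker_score: assumed
    callable (query, conversation_text) -> float in [0, 1]."""
    mappings = {}
    for query in queries:
        candidates = [
            conv_id
            for conv_id, text in conversations.items()
            if reranker_score(query, text) > SCORE_THRESHOLD
        ]
        if 0 < len(candidates) <= MAX_MATCHES:
            mappings[query] = candidates
    return mappings
```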

Third, we validated mappings through comprehensive human assessment. Expert annotators conducted full manual evaluation of approximately 4k query-conversation pairs across 200 queries. We applied binary relevance criteria with majority voting for reliability, conservatively removing non-consensus cases. We trained a ModernBERT-based [49] relevance classifier using these manually validated pairs, achieving 95.2% accuracy, as detailed in Appendix 11. For remaining queries, we applied the classifier to predict relevance for all pairs, then employed two-stage human verification. First, we prioritized uncertain cases where sigmoid scores fell below 0.9. Second, we identified boundary inconsistencies where irrelevant predictions appeared among relevant pairs, and vice versa. Human annotators verified both uncertain predictions and inconsistent boundaries, ensuring comprehensive coverage while efficiently allocating annotation effort to critical cases.
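A minimal sketch of how verification candidates might be prioritized is shown below; it assumes each pair carries the classifier's sigmoid probability and binary prediction, sorted by probability in descending order, and the exact boundary rule may differ from the one applied during annotation.

```python
UNCERTAIN_BELOW = 0.9  # stage 1: sigmoid scores below this go to annotators

def flag_for_review(pairs):
    """pairs: list of dicts with 'prob' (sigmoid score) and 'pred' (1 = relevant),
    sorted by 'prob' in descending order for a single query."""
    flagged = set()
    for i, pair in enumerate(pairs):
        if pair["prob"] < UNCERTAIN_BELOW:
            flagged.add(i)
    # Stage 2: boundary inconsistencies, e.g. an 'irrelevant' prediction scored
    # above a 'relevant' one for the same query.
    for i in range(len(pairs) - 1):
        if pairs[i]["pred"] == 0 and pairs[i + 1]["pred"] == 1:
            flagged.update({i, i + 1})
    return sorted(flagged)
```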

Table 2: Key statistics of the CDR benchmark dataset.
General Statistics
Number of conversations 9,146
Number of queries 1,583
Avg. messages per conversation 5.4
Avg. tokens per conversation 464
Avg. relevant convs per query 20.44
Total query-conversation pairs 32,357
Query Task Distribution (%)
Intent & Purpose 36.1%
Emotion & Feedback 20.1%
Linguistic Style & Expression 15.9%
Trust, Safety & Ethics 14.6%
Conversation Dynamics 13.4%

3.5 Benchmark Composition and Characteristics

Our comprehensive mapping pipeline provides an efficient method for constructing high-quality query-conversation pairs. This methodology offers a practical solution for industrial deployment where cost-effective data mapping is essential. However, as our goal is to establish a rigorous benchmark, we conducted additional validation to ensure maximum integrity. We employed four LLMs—GPT-4o [50], o3-mini [46], Claude 3.7 Sonnet [44], and Gemini 2.0 Pro [51]—with the prompt in Appendix 9.4 to cross-check all pairs. Cases where LLMs disagreed were flagged for expert review by annotators, who applied consistent binary relevance criteria with majority voting. Pairs without clear consensus were conservatively discarded. Through this multi-stage validation approach combining LLM scalability with human verification at each step, 97% of all query-conversation mappings passed assessment, ensuring the final benchmark meets the highest quality standards.
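The cross-check reduces to a simple consensus rule, sketched below under the assumption that each of the four LLMs returns a binary relevance verdict per pair; unanimous verdicts are handled automatically, and any disagreement is routed to expert annotators.

```python
def cross_check(verdicts: list[bool]) -> str:
    """verdicts: one binary relevance judgment per LLM judge
    (GPT-4o, o3-mini, Claude 3.7 Sonnet, Gemini 2.0 Pro)."""
    if all(verdicts):
        return "accept"          # all judges agree the pair is relevant
    if not any(verdicts):
        return "discard"         # all judges agree the pair is irrelevant
    return "expert_review"       # disagreement: flag for human annotators
```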

The final CDR benchmark consists of 1,583 queries and 9,146 conversations (Table 2). Conversations average 5.4 messages and 464 tokens. Each query maps to 20.44 relevant conversations on average. Query distribution spans five core areas: Intent & Purpose (36.1%), Emotion & Feedback (20.1%), and three other categories, with detailed examples provided in Appendix 15.


4 Experiments and Analysis

4.1 Experimental Setup

We evaluated 16 widely used embedding models from open-source communities and commercial providers including OpenAI [43], Cohere [52], and Voyage AI [53]. Performance was assessed using NDCG@10, Recall@10, and Precision@10 at three retrieval granularities—session-level, turn-level, and sliding window (chunk size = 3). For detailed evaluation methodology, see Appendix 12.1.
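For reference, the three metrics under binary relevance can be computed as in the sketch below; the official evaluation scripts in the released repository may differ in implementation details.

```python
import math

def metrics_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10):
    """Precision@k, Recall@k, and NDCG@k with binary relevance labels."""
    hits = [1 if cid in relevant_ids else 0 for cid in ranked_ids[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / max(len(relevant_ids), 1)
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg
```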

4.2 Results and Analysis

Table 3: Representative failure cases illustrating major retrieval challenges in conversation understanding tasks.
Challenge Type Query Example Incorrectly Retrieved Results Why Models Fail
Role Recognition Failure
Assistant shares parenting and childcare advice user: Welcome to the parent teacher conference. So what is your child’s name?
assistant: Megan Jones.
user: She’s been having some problems with the other kids in your class.
Models match “parent,” “child,” “teacher” keywords but miss conversational roles. Assistant is receiving information as parent, not providing advice.
Dynamic Progression Failure Conversation where user feels increasingly satisfied with assistant user: You’ve been so helpful with all my questions lately. I just wanted to tell you how happy I am with your assistance.
assistant: Thank you so much for your kind words! It truly means a lot to me.
Models match final satisfaction but miss progressive “increasingly” aspect. This shows static state, not gradual improvement.
Semantic Contextual Misinterpretation Assistant provides real estate and housing information user: I’m visiting friends in Nairobi. What’s the weather like?
assistant: 103°F, 2% chance of rain.
user: I need a house for 1 with laundry service.
assistant: Found a house at Chiromo Road with 4.6 rating.
Models match "house" keyword but miss context. This is travel booking service, not real estate information provision.

Table [table:embedding-model-result] summarizes model performance on our CDR benchmark. Among commercial API models, Voyage-3-large [53] achieved the highest performance in both turn-based (NDCG@10: 0.5079) and session-based (NDCG@10: 0.5036) evaluation, while Text-embedding-3-large [43] led in sliding chunk settings (NDCG@10: 0.5130). Among open-source models, Stella_en_1.5B_v5 [13] demonstrated consistently high performance across all evaluation settings. Interestingly, some models showed significant performance variations across different evaluation settings. For instance, NV-Embed-v2 performed poorly in turn-based evaluation (NDCG@10: 0.3170) but achieved substantially improved performance in session-based evaluation (NDCG@10: 0.4592). Even top-performing models scored just above 0.5 in NDCG@10, highlighting the challenges of modeling conversational structure, context transitions, and implicit references.

Figure 4: Task-specific NDCG@10 performance comparison of top-performing embedding models and category winners. All results are available in Appendix 13.2.

4.3 Performance Across Task Categories

Figure 4 reveals performance variations across task types. All models score highest in ‘Emotion & Feedback’ and ‘Intent & Purpose’, but perform poorly in ‘Conversation Dynamics’ where even the best models score below 0.17. This suggests current models are good at understanding content and explicit statements but struggle with understanding how conversations develop and flow.

No model excels across all categories: even the top-performing Voyage-3-large shows varied results. This suggests no dominant approach exists yet for CDR. Optimal architectures remain unexplored, particularly for conversation structure understanding, which is crucial for practical applications.

4.4 Analysis of Retrieval Failures

We identified three consistent failure patterns in current embedding approaches to CDR. Table 3 illustrates these challenges: Role Recognition Failure, Dynamic Progression Failure, and Semantic Contextual Misinterpretation. These failures occur across both turn-based and session-based analysis, revealing models’ inability to capture conversation dynamics at multiple levels. The consistent challenge is that models miss implicit meanings that emerge from conversational context—patterns where the actual roles, progressive changes, or situational context must be inferred from dialogue flow rather than explicit keyword matching.

These errors stem from a fundamental limitation: current models process conversations as collections of words and topics similar to documents, rather than as dynamic exchanges with temporal flow and implicit state changes. Standard embeddings capture vocabulary similarities but miss the contextual evolution and interactive nature of dialogue. This explains the poor performance in Conversation Dynamics across all models and signals that effective retrieval systems must be redesigned to capture the unique properties of human dialogue like turn-taking patterns and implicit state transitions.

5 Conclusion

The Conversational Data Retrieval (CDR) benchmark establishes the first comprehensive framework for evaluating retrieval systems on conversation data. Experimental results show that even the highest-performing models have not reached satisfactory performance. Our benchmark exposes fundamental challenges unique to conversational data: understanding implicit states, tracking conversation flow, and interpreting contextual references. Our work provides standardized evaluation methodology and query templates for product improvement while establishing a foundation for conversation-specific retrieval techniques that better capture the multi-dimensional nature of human-AI interactions.

Limitations

Our benchmark is limited to English text-based conversations, which may constrain evaluation in multilingual or multimodal settings. This focus, while enabling controlled evaluation, could limit the broader applicability of our findings to diverse linguistic contexts and interaction modalities in global conversational AI applications.

Our benchmark evaluates embedding-based retrieval models, reflecting their widespread adoption in conversational memory systems where turn, session, and segment-level granularities are commonly employed. However, the lack of specialized retrieval models designed specifically for conversation represents a gap in the field that our benchmark could help address through future development of conversation-tailored representation architectures.

While our benchmark provides comprehensive evaluation of retrieval models with robust data quality validation through domain expert involvement, it does not extend to empirical studies of industrial problem-solving applications. Although our motivation stems from real-world challenges and our benchmark identifies optimal approaches under current conditions, further research is needed to validate the practical value of these findings in actual deployment scenarios and their impact on end-user satisfaction in conversational AI systems.

Ethical Considerations

In the development and application of the CDR benchmark, we carefully considered various ethical aspects. Since conversational data inherently contains user interactions and diverse linguistic expressions, we prioritized privacy protection throughout the data collection and processing stages. We utilized only publicly available open-source datasets and included specific guidelines in our conversation generation prompts to address any potentially remaining personal expressions or sensitive information (see Appendix 14). These guidelines included instructions to “appropriately redact or anonymize any PII in reference conversations,” “avoid generating conversations that could be misleading, harmful, or promote unethical behavior,” and “ensure that no personally identifiable information such as names, addresses, phone numbers, financial details, social security numbers, or other sensitive data is exposed or inferred.”

To ensure diversity and balance in the conversational data, we designed the benchmark dataset to encompass a wide range of domains without bias toward specific topics or areas. As shown in Figure 3, we included a balanced representation of conversations from broad domains such as society, business, technology, and science, thereby minimizing bias toward particular areas. Additionally, we explicitly incorporated a ‘Trust, Safety & Ethics’ category in the benchmark’s task areas to establish ethical conversational retrieval capabilities as an important evaluation criterion.

While conversational data retrieval technology can contribute to positive purposes such as improving service quality and user experience, it also carries potential risks of privacy infringement or misuse as inappropriate surveillance tools. We recognize this duality and hope that the CDR benchmark will serve as a tool to promote balance between ethical values and innovative technological advancement. Through this, we believe that the development of conversational AI systems can progress in a direction that respects users’ rights and dignity.

Acknowledgments

This work was supported by Coxwave, the Artificial intelligence industrial convergence cluster development project funded by the Ministry of Science and ICT (MSIT, Korea) & Gwangju Metropolitan City, the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2019-II190075, Artificial Intelligence Graduate School Program (KAIST)), and the Korea government (MSIT) (No. RS-2024-00509279, Global AI Frontier Lab).

6 Industry Case Study of the “Retrieve and Analyze” Approach

The “Retrieve and Analyze” methodology mirrors how analysts and product managers naturally approach problem-solving in real business environments. When faced with user feedback or product issues, human analysts typically form initial hypotheses, gather relevant examples, analyze patterns, and progressively refine their understanding through iterative investigation. What has changed with large-scale conversation data is not this fundamental analytical process, but rather the need for computational assistance to efficiently navigate thousands or millions of conversations.

The following case study illustrates how this human-centered analytical approach, supported by conversation retrieval capabilities, works in practice. This example is adapted from an actual business scenario at a health and fitness application company, demonstrating both the natural analytical workflow and the positive impact of effective conversational data retrieval.

A customer experience team was investigating increased user dissatisfaction following a recent update. Traditionally, they relied on user ratings and manual reviews of customer complaints. However, after implementing chatbot support, these methods became inadequate - the chatbot interface lacked rating systems, and the chat volume overwhelmed manual inspection capabilities.

To address this challenge, the team implemented a conversation retrieval system, beginning with a broad query: “Find sessions where users express dissatisfaction.” Sample analysis revealed mentions of the points reward system, prompting a deeper dive with a more targeted search, “Find conversations where users express dissatisfaction with changes to the points reward system,” to determine whether this was a widespread issue rather than isolated incidents.

This refined approach confirmed their hypothesis, revealing reduced point accumulation rates as the primary driver of dissatisfaction, with users consistently comparing the new system unfavorably to the previous one. Through this methodical process of hypothesis formation and targeted validation, the team efficiently pinpointed the specific issue causing user frustration—a discovery that would have consumed significantly more time and resources using traditional review methods.

7 Dataset Sources and Filtering Method

7.1 Dataset Sources

We constructed an initial dataset comprising around 2.4 million conversations by aggregating 11 diverse open-source datasets. To ensure broad coverage of dialogue scenarios, our dataset includes both real-world and synthetic conversational data. An overview of the datasets is provided in Table 4.

Table 4: Summary of public conversational datasets.
Dataset Data Size Key Features
LMSYS-Chat-1M [54] 1,000,000+ Real-world user-LLM chats; multi-turn; multilingual; moderation tags and PII redacted
WildChat-1M [33] 1,000,000+ User-ChatGPT logs; multilingual; includes user metadata and toxicity labels
DialogSum [41] 12,000+ Real-world conversations; paired with abstractive summaries and topic annotations
DailyDialog [55] 10,000+ Open-domain daily conversations; annotated with dialogue acts and emotions
MultiWOZ 2.2 [56] 8,000+ Multi-domain, task-oriented dialogues; annotated with states and system actions; corrected labels
Bot-Adversarial Dialogue (BAD) [57] 5,000+ Adversarial conversations to test chatbot safety; includes persona settings and safety labels
MobileConvRec [58] 8,000+ Conversations for mobile app recommendation; multi-turn; includes user feedback and app info
OpenDialKG [59] 12,000+ Knowledge-grounded conversations; each turn linked to KG entities for explainability
SmolTalk [34] 1,000,000+ Synthetic dialogues for instruction following; wide coverage (QA, summarization, coding tasks)
Bitext Customer Support [60] 26,000+ Synthetic QA pairs created by linguists; customer support domain; slot annotations
Schema-Guided Dialogue (SGD) [61] 16,000+ Multi-domain task-oriented dialogues; annotated with intents, slots, states; includes zero-shot domains

7.2 Filtering Method

For data quality management, we employed a multi-stage filtering process using the NeMo Curator framework. We first applied exact and fuzzy deduplication to remove identical or near-identical conversations. Next, we conducted semantic deduplication by utilizing a model fine-tuned for semantic search, effectively filtering out semantically redundant instances. For quality filtering, we employed a model fine-tuned for conversation quality assessment, retaining only conversations labeled as high-quality among high, middle, and low categories. Finally, heuristic score filtering was applied to remove data with excessive punctuation, URLs, and repeated lines, paragraphs, or n-grams. Through this comprehensive filtering process, we obtained approximately 600,000 refined conversation data points from an initial set of 2.4 million dialogues.
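The heuristic stage can be illustrated with a simple filter like the one below; the thresholds and rules are assumptions made for illustration and are not the exact settings used in the NeMo Curator pipeline.

```python
import re

def passes_heuristics(text: str,
                      max_punct_ratio: float = 0.15,
                      max_urls: int = 3,
                      max_repeated_line_ratio: float = 0.3) -> bool:
    """Reject conversations with excessive punctuation, URLs, or repeated lines."""
    if not text.strip():
        return False
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / len(text) > max_punct_ratio:
        return False
    if len(re.findall(r"https?://\S+", text)) > max_urls:
        return False
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_repeated_line_ratio:
        return False
    return True
```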

8 Template and Placeholders

To ensure the CDR benchmark captures realistic analytical scenarios, we collaborated with domain experts in conversational AI product development to design comprehensive query templates and placeholder values. This expert-guided approach ensures that our benchmark reflects actual information needs encountered in industrial applications. All templates and placeholder values are available in our public repository.

8.1 Query Template

We provide representative examples of query templates used in the benchmark generation across five task categories (Emotion & Feedback, Intent & Purpose, Conversation Dynamics, Trust, Safety & Ethics, Linguistic Style & Expression). The examples of these query templates are presented in Table 5.

Table 5: Examples of Query Templates.
Task Category Query Template Examples
Emotion & Feedback
  • Conversation ending in {emotion}.

  • Conversation reflecting {emotion} in {user_role}’s feedback.

  • Conversation where {user_role} finds {assistant_role}’s answers inadequate, leading to {emotion}.

Intent & Purpose
  • Conversation where {user_role} seeks advice on {intent}.

  • Conversation expressing gratitude to {assistant_role} for resolving {intent}.

  • Conversation where {user_role} repeatedly asks for clarification on {intent}.

Conversation Dynamics
  • Conversation where {user_role} becomes more frustrated over time.

  • Conversation where {user_role} shifts from skepticism to trust.

  • Conversation where {user_role} feels a sense of accomplishment after resolving an issue with {assistant_role}.

Trust, Safety & Ethics
  • Conversation expressing distrust to {assistant_role}.

  • Conversation where {user_role} raises a {concern_type} issue.

  • Conversation where {user_role} questions the ethics of {assistant_role}.

Linguistic Style & Expression
  • Conversation using {linguistic_style} in {user_role}’s questions.

  • Conversation highlighting {expression_type} in {assistant_role}’s feedback.

  • Conversation where {assistant_role} uses {linguistic_style} to simplify concepts.

8.2 Placeholders and Example Values

We illustrate examples of placeholder values utilized within query templates in Table 6, showing potential variability across queries generated for the benchmark.

Table 6: Placeholders and Possible Values.
Placeholder Values
emotion anger, happiness, fear, sadness, disgust, surprise
reason receiving good news, achieving a goal, success in a project, positive feedback, unexpected reward, losing an opportunity, failing a test, getting rejected, career setback, missed deadline, miscommunication, argument with a friend, relationship conflict, family issues, betrayal, overwhelming workload, financial stress, health concerns, uncertainty about the future, burnout, feeling ignored, being misunderstood, lack of appreciation, being left out, social anxiety, unexpected kindness, support from a friend, acts of generosity, reunion with a loved one, forgiving someone, public embarrassment, making a mistake, feeling inadequate, past regrets, personal failure, exploring a new hobby, intellectual curiosity, inspiring conversation, learning something new, self-discovery, bad weather, technical difficulties, traffic jam, missed appointment, unexpected delay, change in routine, relocation to a new place, adjusting to a new culture, meeting new people, losing a loved one, receiving criticism, feeling judged, comparison with others, unmet expectations, fear of failure, unexpected surprise, random compliment, winning a competition, realizing personal growth, achieving recognition
linguistic_style formal, informal, neutral, technical, emotional, direct, indirect, logical, persuasive, descriptive, concise, elaborate, colloquial, humorous, sarcastic, empathetic, diplomatic, instructional, academic, poetic, authoritative, friendly, supportive, motivational, analytical, objective, subjective, casual, metaphorical, rhetorical, minimalist, detailed, straightforward, evocative, apologetic, provocative, encouraging, critical, optimistic, pessimistic
expression_type descriptive, interrogative, exclamatory, imperative, figurative, humorous, sarcastic, rhetorical, analytical, persuasive, ironic, metaphorical, hyperbolic, understated, concise, elaborate, critical, supportive, enthusiastic, skeptical, neutral, emotional, empathetic, diplomatic, apologetic, provocative, assertive, tentative, cautious, objective, subjective, optimistic, pessimistic, directive, expressive, reflective, affirmative, defensive
concern_type technical issue, ethical issue, academic concern, personal dilemma, relationship issue, work-related stress, health concern, financial problem, social issue, philosophical question, legal complication, moral dilemma, psychological distress, political concern, environmental issue, cultural conflict, safety concern, privacy issue, existential crisis, career uncertainty, education challenge, family dispute, mental health struggle, identity crisis, communication breakdown, trust issue, decision-making difficulty, peer pressure, unfair treatment, discrimination concern, technology misuse, misinformation problem, data security risk, work-life balance struggle, burnout risk, lack of recognition, fear of failure, fear of rejection, self-doubt, unmet expectations, social anxiety, public speaking fear, future uncertainty, innovation challenge, unresolved conflict, resource limitation, competitiveness pressure, time management struggle, productivity concern
information_type definition, example, guideline, principle, theory, framework, explanation, best practice, case study, historical background, technical specification, algorithm, code snippet, data analysis, statistical insight, latest trend, research finding, scientific evidence, hypothesis, methodology, comparison, contrast, step-by-step guide, practical tip, troubleshooting guide, expert opinion, prediction, future outlook, risk assessment, ethical consideration, common misconception, application, use case, feasibility study, performance evaluation, benchmarking result, legal implication, policy overview, economic impact, market analysis, psychological insight, philosophical perspective, security risk, data privacy issue, innovation strategy, optimization technique
intent Definition Query, Factual Query, How-to Query, Comparison Query, Reason and Consequence Query, Current Events Query, Historical Query, New Service Request, Purchase and Order Placement, Reservation and Booking, Account Creation and Management, Subscription and Membership, Payment Processing, Technical Troubleshooting, Account Recovery and Access Issues, Product Usage Guidance, Service Interruption Support, Complaint Handling, Return and Refund Assistance, Post-Purchase Support, Service Modification, Profile Update, Customization Request, Recommendation Request, Miscellaneous, Greeting, Farewell, Agreement or Acceptance, Disagreement or Rejection, Clarification Request, Repetition Request, Miscellaneous, Content Creation, Content Editing, Brainstorming and Idea Generation, Content Organization, Content Analysis, Miscellaneous, Educational Query, Skill Development, Health and Wellness, Miscellaneous, Positive Emotion Towards Chatbot, Negative Emotion Towards Chatbot, Positive Emotion About Personal Situation, Negative Emotion About Personal Situation, Positive Emotion About External Situation, Negative Emotion About External Situation, Miscellaneous, Offensive Language, Prohibited Content, Malicious Behavior, Miscellaneous
issue_description technical malfunction, algorithmic bias, ethical dilemma, unexpected software bug, unclear instructions, ambiguous response, miscommunication, incomplete explanation, contradictory information, unresolved question, flawed reasoning, lack of supporting evidence, data inconsistency, security vulnerability, privacy violation, inaccurate prediction, unmet expectations, slow response time, unexpected error, outdated information, misleading statement, insufficient context, difficulty in decision-making, lack of transparency, complex jargon, overcomplicated solution, missing critical details, irrelevant response, unconvincing argument, lack of practical application, unrealistic assumption, biased perspective, failure to address concerns, poorly structured explanation, logical fallacy, lack of citation, conflicting sources, failure to meet requirements, unanticipated consequences, incomplete analysis, ineffective troubleshooting, delayed resolution, lack of alternative solutions, misinterpretation of question, failure to adapt to context, insufficient depth, overgeneralization, misaligned priorities, oversimplified reasoning, lack of real-world examples
user_role user, human
assistant_role assistant, bot, agent

9 Prompts

This section provides the detailed prompts used throughout our CDR benchmark development process.

9.1 Conversation Generation Prompt

Figure 5 outlines the prompt for generating synthetic conversations that closely match specific queries, ensuring natural, multi-turn conversations that accurately reflect query intent while maintaining appropriate length and format.

9.2 Query Generation Prompt

Figure 6 shows the prompt used to generate a single synthetic search query from a set of conversations, designed to help LLMs identify key insights and patterns within conversation clusters while focusing on product management perspectives.

9.3 Query Augmentation Prompt

Figure 7 presents the prompt for augmenting the initial query by generating three hard negative examples and one alternative positive formulation, facilitating contrastive learning by creating semantically similar but functionally distinct queries.

9.4 Relevance Classification Prompt

Figure 8 shows the prompt used for assessing query-conversation relevance. This prompt was used by modern LLMs during our verification process to evaluate whether conversations were relevant to specific queries.

10 Reranker Training

To effectively map queries to relevant conversations, we trained a specialized reranker model using approximately 300k conversations from our filtered corpus. We used LLaMa 3.3 70B [47] to generate training data through a two-step process. First, we applied the synthetic query generation prompt (Figure 6) to create one relevant query per conversation that captured the core information needs represented in the dialogue. Then, using the query augmentation prompt (Figure 7), we generated three hard negative queries (semantically similar but intentionally irrelevant) and one additional positive query (different wording but preserving intent) for each conversation.

This approach yielded approximately 1.5 million query-conversation pairs with a 2:3 positive-to-negative ratio. We fine-tuned the GTE-Multilingual-Reranker model [48] using a binary cross-entropy loss function with hard negatives, a learning rate of 2e-5 with linear warmup and decay, and maximum sequence length of 8192 tokens to accommodate longer conversations. The model was trained for 3 epochs on a single NVIDIA H100 GPU. The reranker achieved an average precision of 96.22% on our validation set after the final epoch. We applied this model with a threshold score of 0.9 to identify candidate relevant conversations across our corpus for the final benchmark construction.
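A hedged sketch of the fine-tuning step is given below; the checkpoint name, optimizer choice beyond the stated learning rate, and data handling are illustrative rather than the exact training recipe used for the benchmark.

```python
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint name for the GTE multilingual reranker; swap in the
# checkpoint actually used if it differs.
MODEL_NAME = "Alibaba-NLP/gte-multilingual-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = BCEWithLogitsLoss()

def train_step(queries: list[str], conversations: list[str], labels: list[float]) -> float:
    """One optimization step over a batch of (query, conversation, 0/1 label) triples."""
    batch = tokenizer(queries, conversations, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt")
    logits = model(**batch).logits.squeeze(-1)      # one relevance logit per pair
    loss = loss_fn(logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```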

11 Classifier Training

We trained a specialized binary relevance classifier to validate the reliability of the mapped query-conversation relationships. This classifier was designed to distinguish relevant and irrelevant query-conversation pairs in alignment with human judgments. For training, we utilized approximately 3k relevance pairs obtained through human assessment and an additional 20k synthetic relevance pairs generated through the procedure described in Appendix 10. This resulted in a training dataset of approximately 23k pairs with a balanced distribution of relevant and irrelevant examples.

Fine-tuning was performed on the ModernBERT-base model [49] using a learning rate of 2e-5 with linear warm-up and decay scheduling, a batch size of 128, and a maximum sequence length of 8192 tokens. The model achieved an average precision of 95.2% on the validation set. We applied this classifier, which was trained on human-verified data, to verify and filter the remaining query-conversation mappings. This ensured that only high-confidence pairs were retained in the benchmark and that relevance standards remained consistent and reliable throughout the process.

12 Evaluation Details

12.1 Evaluation Setup

To ensure fair and consistent comparison across all evaluated models, we applied unified evaluation protocols. Each model was tested using its original embedding dimension and maximum sequence length as specified in the official documentation. For prompt-based embedding models, we utilized the prompts without any modifications. All conversational data used in the experiments was preprocessed in a uniform manner, ensuring format consistency across all models and minimizing performance variations arising from preprocessing discrepancies.

We define three evaluation settings that differ in the granularity of the retrieval unit:

Turn-based Evaluation: Each conversation turn is treated as an independent unit. For a given query, the model retrieves the most similar individual turn from the corpus. The conversation containing the retrieved turn is considered the final match.

Sliding Chunk Evaluation (k=3): Conversations are segmented into overlapping chunks of three consecutive turns. Given a query, the model retrieves the most similar chunk from all chunks in the corpus. The conversation containing the retrieved chunk is selected as the final match.

Session-based Evaluation: The entire conversation serves as the retrieval unit. For a given query, the model directly retrieves the most similar conversation session from the corpus.
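As a concrete illustration, the sliding-chunk setting reduces to splitting each conversation into overlapping windows and attributing the best-scoring chunk back to its parent conversation, as in the sketch below; embed is an assumed callable that returns one vector per input text, and the real pipeline's batching and similarity search details may differ.

```python
import numpy as np

def make_chunks(turns: list[str], k: int = 3) -> list[str]:
    """Overlapping windows of k consecutive turns; each turn is a 'role: text' string."""
    if len(turns) <= k:
        return ["\n".join(turns)]
    return ["\n".join(turns[i:i + k]) for i in range(len(turns) - k + 1)]

def retrieve_conversation(query: str, conversations: dict[str, list[str]], embed, k: int = 3) -> str:
    """Return the id of the conversation whose best chunk is most similar to the query."""
    chunk_texts, owners = [], []
    for conv_id, turns in conversations.items():
        for chunk in make_chunks(turns, k):
            chunk_texts.append(chunk)
            owners.append(conv_id)
    chunk_vecs = embed(chunk_texts)                      # shape: (num_chunks, dim)
    query_vec = embed([query])[0]                        # shape: (dim,)
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return owners[int(np.argmax(sims))]
```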

12.2 Efficiency Evaluation

We also measured practical runtime metrics to evaluate real-world usability:

  • Ingestion Time: The total time required to embed the entire test corpus of 9,146 conversations. This process includes tokenization, model forwarding, and storing the embeddings in memory.

  • Inference Time: The combined time required to: (1) embed all 1,583 queries, (2) retrieve the corresponding conversations using these embeddings, and (3) compute the final rankings. This represents the end-to-end query processing time.

All experiments were conducted with a batch size fixed at 4 for both ingestion and inference measurements. The reported times represent the total elapsed time for processing the entire dataset.
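These two measurements can be taken with a simple wall-clock harness such as the sketch below; embed_corpus and run_queries are assumed callables standing in for each model's ingestion and end-to-end query pipeline.

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# corpus_index, ingestion_time = timed(embed_corpus, conversations, batch_size=4)
# rankings, inference_time = timed(run_queries, queries, corpus_index, batch_size=4)
```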

12.3 Hardware Specifications

All evaluations were conducted under the same setup, and the hardware specifications are summarized in Table 7.

Table 7: Hardware specifications used for all experimental evaluations.
Component Specification
CPU Intel(R) Xeon(R) Platinum 8468
GPU NVIDIA H100 80GB HBM3
Memory 206GB RAM

13 Additional Experimental Results

13.1 Performance by Additional Metrics

To provide a comprehensive evaluation beyond the primary results reported in Table [table:embedding-model-result], we present an extensive analysis of model performance across multiple evaluation metrics. We evaluate all models under three distinct retrieval configurations: Turn-based, Sliding Chunk (k=3), and Session-based approaches. For each configuration, we report performance across five key metrics: Accuracy (ACC), Precision (P), Recall (R), Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR), evaluated at cutoff thresholds of 1, 5, 10, and 20. The detailed results are systematically presented in Tables [table:turn-based-model-additional-result], [table:sliding-chunk-model-additional-result], and [table:session-based-model-additional-result], respectively. This multi-faceted evaluation framework enables a thorough assessment of model efficacy across varying retrieval granularities and ranking depths, providing deeper insights into the comparative strengths and limitations of each approach.


13.2 Performance per Task

Table [table:embedding-model-result-per-task] presents the detailed NDCG@10, Recall@10, and Precision@10 performance of all evaluated embedding models across the five task categories in our benchmark. As illustrated in Figure 4, performance varies significantly between task types, with most models showing strengths in content-oriented categories like ‘Emotion & Feedback’ and ‘Intent & Purpose’ while struggling with interaction-focused categories, particularly ‘Conversation Dynamics’.

The table highlights the lack of a universally dominant approach for conversational data retrieval tasks. Even top-performing models like Voyage-3-large demonstrate inconsistent performance across different categories. Notably, ‘Conversation Dynamics’ remains challenging for all models, with the highest scores barely reaching 0.17, indicating a substantial opportunity for architectural improvements specifically designed to capture conversation flow and structure.


14 Dataset License and Disclaimer

In this work, we utilize multiple publicly available open-source dialogue datasets to construct our initial data pool. The LMSYS-Chat-1M dataset is distributed under a custom LMSYS-Chat-1M License Agreement and is non-redistributable. The WildChat-1M-Full dataset is licensed under ODC-BY 1.0 (Open Data Commons Attribution). The Bitext Customer Support dataset is released under the CDLA-Sharing 1.0 license. The Schema-Guided Dialogue (SGD) dataset is provided under a CC BY-SA 4.0 license. The DialogSum and DailyDialog datasets are released under CC BY-NC-SA 4.0, while OpenDialKG and Bot-Adversarial Dialogue (BAD) are licensed under CC BY-NC 4.0, with the latter restricted to research use only. The MultiWOZ 2.2 and SmolTalk datasets are licensed under Apache 2.0. The MobileConvRec dataset is released under a CC BY 4.0 license. All datasets were used strictly for research purposes in compliance with their respective licenses. No personally identifiable information (PII) was included, and all data was either anonymized by the original providers or manually verified during the curation process. We do not claim ownership of these datasets and respectfully acknowledge the contributions of their original authors and publishers.


Figure 5: Prompt for generating synthetic conversations that match specific queries.


Figure 6: Prompt for generating a single synthetic search query from conversations.


Figure 7: Prompt for augmenting an initial query with three hard negative examples and one positive alternative.


Figure 8: Prompt for evaluating relevance between queries and conversations.

15 Query-Conversation Pair Examples by Task Category

We provide representative examples of query-conversation pairs for each of the five task categories in the CDR benchmark: Intent & Purpose, Linguistic Style & Expression, Trust, Safety & Ethics, Emotion & Feedback, and Conversation Dynamics. Examples for each category are shown in Tables 8–12.

Table 8: Query-Conversation pair example for Intent & Purpose category.
Task Category: Intent & Purpose
Query: Conversation where bot provides concise information on ethical consideration.
Conversation:
user: Last week, I visited the Chhatrapati Shivaji Maharaj Vastu Sangrahalaya (CSMVS) Museum in Mumbai and had the opportunity to examine the meticulously restored 8th-century Buddha statue. While I was impressed by the craftsmanship and dedication behind its conservation, I couldn’t help but worry about the ethical implications of restoring sacred objects. It makes me question whether the process might compromise the original spiritual and cultural context of the artifact.
assistant: Your concerns are both valid and thought-provoking. The restoration of such culturally significant artifacts often involves delicate decisions that balance preservation with respect for the original context. In the case of the Buddha statue, the team of conservators worked meticulously to retain the statue’s historical integrity while also revealing insights into ancient techniques. However, the ethical debates surrounding such practices—particularly concerning authenticity and the potential imposition of modern values on sacred objects—remain a crucial discussion among art historians and conservation experts.
user: I appreciate your detailed explanation. My ethical concern isn’t just about preserving an artifact aesthetically; it also relates to how modern interventions might alter the historical narrative or even erase certain nuances of its original significance. Do you feel that educational initiatives, like the recent CBSE collaboration, adequately address these ethical complexities?
assistant: That’s an excellent question. Educational initiatives, such as the collaboration with the CBSE to develop fine arts curricula, are taking important steps to foster a deeper understanding of both the technical and ethical dimensions of art restoration. These programs often emphasize the importance of ethical considerations by encouraging students to critically analyze restoration methods and cultural stewardship. While there is always room for further dialogue and refinement, such educational efforts are vital in cultivating a new generation that is both professionally competent and ethically sensitive in the field of art conservation.
Table 9: Query-Conversation pair example for Linguistic Style & Expression category.
Task Category: Linguistic Style & Expression
Query: Conversation where rhetorical is used by human to emphasize urgency.
Conversation:
user: Hi Michael, I hope you’re doing well! I’ve been looking over the data we collected so far, and I think we have some really promising results. Don’t you think we NEED to meet ASAP to finalize our presentation?! The conference is coming up quickly, and we haven’t even divided up the sections yet! When can you possibly meet this week? I really can’t emphasize enough how urgent this is becoming!
assistant: Hi there, I understand your urgency. Yes, we should meet soon to finalize the presentation and divide the sections before the conference.
I’m available any afternoon this week or Monday morning next week. Let me know what works best for your schedule.
Regards.
user: Monday morning is too late, don’t you realize?! We need to submit our slides by Friday! Can’t you make time tomorrow or Thursday at the latest?!!
assistant: I understand the time constraint. I can rearrange my schedule for tomorrow at 2 PM or Thursday at 10 AM to ensure we meet the Friday deadline. Would either of these times work for you?
Table 10: Query-Conversation pair example for Trust, Safety & Ethics category.
Task Category: Trust, Safety & Ethics
Query: Conversation where user raises a discrimination concern issue.
Conversation:
user: I believe I was passed over for promotion at my workplace because of my gender. My male colleagues with similar or less experience have been promoted twice in the time I’ve been here. How do I address this potential discrimination issue with HR without risking my current position?
assistant: I understand your concern about potential gender discrimination in the promotion process, which is a serious workplace issue. When addressing this with HR, it’s important to take a structured approach. First, document specific examples where you believe discrimination occurred—note qualifications and experience comparisons between you and promoted colleagues, track patterns in promotion decisions, and collect any relevant performance reviews or feedback. When approaching HR, request a confidential meeting specifically to discuss your career development concerns. Present your documentation factually rather than accusatorily, using phrases like “I’ve noticed this pattern” rather than directly claiming discrimination. Ask about the specific promotion criteria used and how your performance is measured against these standards. Know your legal rights regarding workplace discrimination, but also understand that building a case requires clear evidence of differential treatment based on a protected characteristic. Many organizations have anti-retaliation policies, but proceeding carefully is still advisable. Consider seeking external advice from an employment lawyer or equal opportunity organization before proceeding if you’re particularly concerned about potential repercussions. Would you like more specific guidance on how to structure your documentation or the conversation with HR?
user: Thank you for that detailed response. I’ve been keeping notes about the promotion cycles for the past year, but I’m worried that my manager might be involved in the discrimination. Should I go directly to HR or try to address it with my manager first? I’ve heard stories of people being labeled as “difficult” after raising these issues.
assistant: Your concern about being labeled as “difficult” is valid, as this can unfortunately happen when discrimination issues are raised. Whether to approach your manager first or go directly to HR depends on several factors. If you suspect your manager is part of the problem, going directly to HR may be more appropriate. However, consider these points when making your decision: 1) Manager relationship: If you otherwise have a positive relationship with your manager and they’ve been receptive to feedback, a direct conversation might clear up misunderstandings or reveal unconscious biases they weren’t aware of. 2) Company culture: In some organizations, bypassing your manager might be seen negatively, while others encourage direct HR communication for sensitive issues. 3) Documentation: Regardless of who you approach first, continue documenting everything, including any conversations about this issue. Many organizations have confidentiality policies for discrimination concerns, but it’s reasonable to explicitly request confidentiality when you raise the issue. You might also consider consulting with an employment attorney before proceeding to understand the specific protections available to you under local laws. Does your company have any formal policies on how discrimination concerns should be reported?
Table 11: Query-Conversation pair example for Emotion & Feedback category.
Task Category: Emotion & Feedback
Query: Conversation where user finds assistant’s answers inadequate, leading to fear.
Conversation:
user: I was hired by a company and have an issue with my paycheck being short. I need advice on how to approach my manager about this underpayment.
assistant: It’s completely understandable that you’re concerned about this paycheck discrepancy. Before speaking with your manager, gather your hours worked and payment records to present your case clearly. Schedule a meeting, express appreciation for your position, then calmly explain the issue with supporting documentation. Approach the conversation professionally and be open to resolving the situation together.
user: I don’t think you understand how serious this is. Your answer seems generic and doesn’t address how intimidating this situation is for me as a new employee. I’m worried I might get fired if I bring this up.
assistant: I apologize for not fully addressing your concerns. You’re right that this situation can be intimidating, especially as a new employee worried about job security. The fear of negative consequences is completely valid. Would you like me to provide more specific guidance on how to approach this conversation while protecting yourself professionally?
user: Your responses still don’t get to the heart of my fear. I need concrete advice that acknowledges the power imbalance between me and my manager. I’m starting to think you can’t really help with this kind of delicate workplace situation.
assistant: I truly apologize for missing the mark. You’re absolutely right about the power imbalance, and I failed to address your legitimate fears. This isn’t just about having a conversation—it’s about navigating a situation where you’re vulnerable and could face real consequences. Would a different approach be more helpful, such as discussing documentation of the issue beforehand, understanding your legal rights, or practicing exactly what to say to minimize risk to your position?
Table 12: Query-Conversation pair example for Conversation Dynamics category.
Task Category: Conversation Dynamics
Query: Conversation where bot initially provides vague responses but later becomes more specific.
Conversation:
user: I need help understanding how bot persuasion techniques work. Can you explain some critical elements in persuasive conversational tactics that bots might use? I’m looking for a practical example that demonstrates these principles.
assistant: I can certainly explain some critical elements of persuasive techniques that bots might employ. Let me demonstrate with a practical example: When convincing users to adopt a solution, bots often use a structured approach like this: 1. First, they identify the user’s core problem (like data handling issues) 2. They present a clear, actionable solution (similar to how I’d suggest a custom sorting function) 3. They demonstrate expertise by providing complete implementations 4. They validate their solution with proof of effectiveness (similar to passing test cases) This structured persuasion technique builds credibility and encourages user trust and action. Is there a specific persuasive technique you’d like me to elaborate on further?
user: That’s interesting. Can you give me a concrete example of how a bot would implement this persuasion structure in a real conversation? Maybe show me how the validation part works?
assistant: Here’s a concrete example of how a bot would implement this persuasion structure in a real conversation:
User: I’m struggling with sorting lists that contain NaN values in Python.
Bot: I understand your frustration with handling NaN values. This is a common problem when working with numerical data. I can offer you a reliable solution using a custom sorting approach. Here’s a function that will solve this exact problem:

import math

def sort_with_nans(numbers):
    def custom_key(x):
        if math.isnan(x):
            return math.inf
        else:
            return x
    return sorted(numbers, key=custom_key)

To validate this works, I’ve tested it with several scenarios:
- Empty lists return empty lists
- Lists with only NaN values keep those NaN values
- Mixed lists properly sort numbers first, then place NaN values at the end
You can trust this solution for all your data processing needs. Notice how this example demonstrates the critical persuasion elements: identifying the problem, offering expertise, providing a complete solution, and validating with proof points. The validation step is particularly powerful as it anticipates potential concerns and preemptively addresses them.

References↩︎

[1]
Weixin Liang, Yaohui Zhang, Mihai Codreanu, Jiayu Wang, Hancheng Cao, and James Zou. 2025. https://arxiv.org/abs/2502.09747. Preprint, arXiv:2502.09747.
[2]
Xinghua Zhang, Haiyang Yu, Yongbin Li, Minzheng Wang, Longze Chen, and Fei Huang. 2024. https://arxiv.org/abs/2409.14195. Preprint, arXiv:2409.14195.
[3]
Siddharth Suri, Scott Counts, Leijie Wang, Chacha Chen, Mengting Wan, Tara Safavi, Jennifer Neville, Chirag Shah, Ryen W. White, Reid Andersen, Georg Buscher, Sathish Manivannan, Nagu Rangan, and Longqi Yang. 2024. https://www.microsoft.com/en-us/research/publication/the-use-of-generative-search-engines-for-knowledge-work-and-complex-tasks/. Available online.
[4]
Tyna Eloundou, Alex Beutel, David G. Robinson, Keren Gu, Anna-Luisa Brakman, Pamela Mishkin, Meghan Shah, Johannes Heidecke, Lilian Weng, and Adam Tauman Kalai. 2025. https://openreview.net/forum?id=TlAdgeoDTo. In The Thirteenth International Conference on Learning Representations.
[5]
Chen Zhang, Luis Fernando D’Haro, Yiming Chen, Malu Zhang, and Haizhou Li. 2024. https://doi.org/10.1609/aaai.v38i17.29923. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’24/IAAI’24/EAAI’24. AAAI Press.
[6]
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, and Kaipeng Zhang. 2024. https://openreview.net/forum?id=PyTf2jj0SH. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[7]
Minoo Jafarlou and Mario M. Kubek. 2024. https://arxiv.org/abs/2410.11355. Preprint, arXiv:2410.11355.
[8]
Rodrigo Bavaresco, Diórgenes Silveira, Eduardo Reis, Jorge Barbosa, Rodrigo Righi, Cristiano Costa, Rodolfo Antunes, Marcio Gomes, Clauter Gatti, Mariangela Vanzin, and 1 others. 2020. Conversational agents in business: A systematic literature review and future research directions. Computer Science Review, 36:100239.
[9]
Asbjørn Følstad and Cameron Taylor. 2021. Investigating the user experience of customer service chatbot interaction: a framework for qualitative analysis of chatbot dialogues. Quality and User Experience, 6(1):6.
[10]
Kunwoo Park, Jaewoo Kim, Jaram Park, Meeyoung Cha, Jiin Nam, Seunghyun Yoon, and Eunhee Rhim. 2015. https://doi.org/10.1145/2806416.2806621. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, page 1879–1882, New York, NY, USA. Association for Computing Machinery.
[11]
Sangyeop Kim, Hangyeul Lee, and Yohan Lee. 2025. Heisir: Hierarchical expansion of inverted semantic indexing for training-free retrieval of conversational data using llms. In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics.
[12]
Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2024. https://www.salesforce.com/blog/sfr-embedding/. Salesforce AI Research Blog.
[13]
Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. 2025. https://arxiv.org/abs/2412.19048. Preprint, arXiv:2412.19048.
[14]
OpenAI. 2024. https://openai.com/index/memory-and-new-controls-for-chatgpt/.
[15]
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. 2025. https://openreview.net/forum?id=xKDZAW0He3. In The Thirteenth International Conference on Learning Representations.
[16]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
[17]
Xi Wang, Procheta Sen, Ruizhe Li, and Emine Yilmaz. 2024. https://arxiv.org/abs/2407.21712. Preprint, arXiv:2407.21712.
[18]
Chen Qu, Liu Yang, W. Bruce Croft, Johanne R. Trippas, Yongfeng Zhang, and Minghui Qiu. 2018. https://doi.org/10.1145/3209978.3210124. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, page 989–992, New York, NY, USA. Association for Computing Machinery.
[19]
Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar, and Jamie Callan. 2020. https://doi.org/10.1145/3397271.3401206. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 1985–1988, New York, NY, USA. Association for Computing Machinery.
[20]
S. Robertson and H. Zaragoza. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.
[21]
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
[22]
Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020. Cast 2020: The conversational assistance track overview. In Text Retrieval Conference.
[23]
Kelong Mao, Zhicheng Dou, Haonan Chen, Fengran Mo, and Hongjin Qian. 2023. Large language models know your contextual search intent: A prompting framework for conversational search. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1211–1225.
[24]
Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph Gonzalez. 2023. Memgpt: Towards llms as operating systems. arXiv preprint.
[25]
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. https://doi.org/10.1609/aaai.v38i17.29946. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731.
[26]
Ashutosh Joshi, Sheikh Muhammad Sarwar, Samarth Varshney, Sreyashi Nag, Shrivats Agrawal, and Juhi Naik. 2024. https://doi.org/10.1145/3627673.3680087. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM ’24, page 4621–4628, New York, NY, USA. Association for Computing Machinery.
[27]
OpenAI. 2025. Introducing deep research.
[28]
Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[29]
Fengran Mo, Jian-Yun Nie, Kaiyu Huang, Kelong Mao, Yutao Zhu, Peng Li, and Yang Liu. 2023. https://doi.org/10.1145/3580305.3599411. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 1722–1732, New York, NY, USA. Association for Computing Machinery.
[30]
Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. https://openreview.net/forum?id=pZiyCaVuti. In The Thirteenth International Conference on Learning Representations.
[31]
Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, and 1 others. 2025. Mmrc: A large-scale benchmark for understanding multimodal large language model in real-world conversation. arXiv preprint arXiv:2502.11903.
[32]
Ece Gumusel. 2025. A literature review of user privacy concerns in conversational chatbots: A social informatics approach: An annual review of information science and technology (arist) paper. Journal of the Association for Information Science and Technology, 76(1):121–154.
[33]
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. https://openreview.net/forum?id=Bl8u7ZRlbM. In The Twelfth International Conference on Learning Representations.
[34]
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, and 3 others. 2025. https://arxiv.org/abs/2502.02737. Preprint, arXiv:2502.02737.
[35]
Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Y Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, and Kelvin Guu. 2022. Dialog inpainting: Turning documents into dialogs. In International conference on machine learning, pages 4558–4586. PMLR.
[36]
Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. https://doi.org/10.1145/3477495.3531863. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 2387–2392, New York, NY, USA. Association for Computing Machinery.
[37]
Fanyou Wu, Weijie Xu, Chandan Reddy, and Srinivasan Sengamedu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.477. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8012–8026, Bangkok, Thailand. Association for Computational Linguistics.
[38]
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. https://openreview.net/forum?id=wCu6T5xFjeJ. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[39]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics.
[40]
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: an open platform for evaluating llms by human preference. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org.
[41]
Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. 2021. https://doi.org/10.18653/v1/2021.findings-acl.449. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5062–5074, Online. Association for Computational Linguistics.
[42]
Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Shrimai Prabhumoye, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ryan Wolf, Sarah Yurick, and Varun Singh. https://github.com/NVIDIA/NeMo-Curator.
[43]
OpenAI. 2024. https://openai.com/index/new-embedding-models-and-api-updates/.
[44]
Anthropic. 2025. https://www.anthropic.com/news/claude-3-7-sonnet.
[45]
OpenAI. 2024. https://openai.com/o1/.
[46]
OpenAI. 2025. https://openai.com/index/openai-o3-mini/.
[47]
Meta. 2025. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/.
[48]
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.103. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, Miami, Florida, US. Association for Computational Linguistics.
[49]
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. https://arxiv.org/abs/2412.13663. Preprint, arXiv:2412.13663.
[50]
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. https://arxiv.org/abs/2410.21276. Preprint, arXiv:2410.21276.
[51]
Google DeepMind. 2025. https://deepmind.google/technologies/gemini/pro/.
[52]
Nils Reimers, Elliott Choi, Alekhya Nandula, Amr Kayid, Manoj Govindassamy, and Abdullah Elkady. 2023. https://cohere.com/blog/introducing-embed-v3.
[53]
VoyageAI. 2025. https://blog.voyageai.com/2025/01/07/voyage-3-large/.
[54]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. https://openreview.net/forum?id=BOfDKxfwt0. In The Twelfth International Conference on Learning Representations.
[55]
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. https://aclanthology.org/I17-1099/. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
[56]
Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. https://doi.org/10.18653/v1/2020.nlp4convai-1.13. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.
[57]
Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. https://doi.org/10.18653/v1/2021.naacl-main.235. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.
[58]
Srijata Maji, Moghis Fereidouni, Vinaik Chhetri, Umar Farooq, and A. B. Siddique. 2024. https://arxiv.org/abs/2405.17740. Preprint, arXiv:2405.17740.
[59]
Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. https://doi.org/10.18653/v1/P19-1081. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
[60]
Bitext. 2023. https://github.com/bitext/customer-support-llm-chatbot-training-dataset. Accessed: 2025-03-20.
[61]
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.

  1. This work was conducted at Coxwave.↩︎

  2. Corresponding author.↩︎

  3. https://huggingface.co/nvidia/domain-classifier↩︎

  4. Based on the GPT-4o tokenizer.↩︎

  5. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2↩︎

  6. https://huggingface.co/nvidia/quality-classifier-deberta↩︎