February 24, 2025
Retrieval-Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations of existing hallucination detection methods: (1) the context-window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM-based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it practical for real-world RAG applications.
Large Language Models (LLMs) have made significant progress in recent years in terms of their performance [1]–[3]. However, the biggest obstacle to their use in real-world applications is their tendency to hallucinate [4], [5]. Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving knowledge from external sources and supplying the retrieved context documents alongside the question, prompting the LLM to ground its response in this information [6]. This technique is widely used to minimize LLM hallucinations. Despite the incorporation of context documents in RAG, LLMs continue to hallucinate [7].
Hallucinations are defined as outputs that are nonsensical, factually incorrect, or inconsistent with the provided evidence [8]. [8] categorizes these errors into two types: Intrinsic hallucinations, which arise from the model’s inherent knowledge, and Extrinsic hallucinations, which occur when responses fail to be grounded in the provided context, such as in the case of RAG hallucinations [7]. While RAG can mitigate intrinsic hallucinations by grounding LLMs in external knowledge, extrinsic hallucinations persist due to imperfect retrieval processes or the model’s tendency to prioritize its intrinsic knowledge over external context [9], leading to factual contradictions. As LLMs remain prone to hallucinations, their utilization in high-risk settings, such as medical or legal fields, may be jeopardized [10], [11].
We present LettuceDetect, a hallucination detection framework that utilizes ModernBERT [12]. Our approach trains a token-classification model to predict, for each token of an answer, whether it is supported by the context documents and the question, i.e., whether it is hallucinated. We frame this as a token-labeling task over answers generated by large language models (LLMs), conditioned on the provided context documents and the posed question. Our models are trained on the RAGTruth dataset [7]. The architecture we employ is similar to Luna [13], as we also train an encoder-based model for this task. A demonstration of our web application is displayed in Figure 1.
All components of our system are released under an MIT license and can be accessed on GitHub and installed via pip as the lettucedetect package. The trained models are published on Hugging Face, also under MIT licenses: we have made available both a large and a base model.
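For illustration, a minimal usage sketch is given below. The package name comes from the text above, but the import path, class name, method arguments, and Hugging Face model identifier are assumptions and may differ from the released API; the GitHub README is authoritative.

```python
# Hypothetical usage sketch -- import path, class name, arguments, and the
# model identifier are assumptions; consult the lettucedetect README.
# pip install lettucedetect

from lettucedetect.models.inference import HallucinationDetector  # assumed import

detector = HallucinationDetector(
    method="transformer",  # assumed argument
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",  # assumed model ID
)

spans = detector.predict(
    context=["Q2, the median, splits the data in half (50%)."],
    question="How to explain quartiles?",
    answer="The third quartile (Q3) splits the highest 75% of the data.",
    output_format="spans",  # assumed flag; span-level output
)
print(spans)  # expected: character spans of the answer flagged as hallucinated
```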
We believe our contribution will be valuable to the community, particularly since many effective hallucination detection methods are either under non-permissive licenses or depend on larger LLM-based models.
The remainder of this paper is structured as follows: Section 2 reviews recent methods for hallucination detection. Section 3 describes the RAGTruth dataset used for training and evaluation. Section 4 details our method for training an encoder-based hallucination detection model built on ModernBERT. Section 5 presents our findings on the example- and span-level tasks using the RAGTruth dataset.
BERT [14] was one of the first major successes of applying the Transformer architecture [15] to natural language understanding. BERT uses only the Transformer’s encoder blocks in a bidirectional fashion, allowing it to learn context from both directions. As a result, BERT quickly became the backbone of many NLP pipelines for tasks like classification, question answering, named entity recognition, etc.
BERT's initial design included certain limitations, such as a maximum sequence length of 512 tokens and less efficient attention mechanisms, leaving room for architectural upgrades and larger-scale training. Despite the current popularity of LLM-based architectures in NLP, such as GPT-4 [1], Mistral [16], or Llama-3 [2], encoder-based models are still widely used in many applications because their much smaller size and lower inference cost make them well suited to real-world deployments.
Figure 1: A web demo of our application built in Streamlit. It features three input fields: question, context, and answer. The output shows the highlighted hallucinated spans.
ModernBERT [12] is a state-of-the-art encoder-only transformer architecture that incorporates several modern design improvements over the original BERT model. It utilizes rotary positional embeddings (RoPE) [17] instead of traditional absolute positional embeddings and features an alternating local-global attention mechanism, as described in [3], allowing it to efficiently handle sequences of up to 8,192 tokens. This makes it significantly more effective for long-context tasks, such as modern information retrieval [18], [19]. ModernBERT also features a hardware-aware design and an expanded training corpus of 2 trillion tokens, including textual and code data. As a result, it achieves superior performance on various downstream benchmarks, such as GLUE for classification and BEIR for retrieval, while also maintaining faster inference speed [18], [19]. Motivated by these findings, the core of our paper is to apply ModernBERT's advancements to hallucination detection for LLMs in a RAG setting, a domain where long-context awareness is essential.
Hallucination detection can vary in granularity, ranging from example-level detection (which assesses whether an answer contains any hallucination) to token-, span-, or sentence-level detection [7]. Detection methods also differ in the techniques they employ.
Prompt-based methods typically utilize zero- or few-shot large language models (LLMs) to identify hallucinations in LLM-generated responses. Few-shot or fine-tuned evaluation frameworks such as RAGAS [20], Trulens, and ARES [21] have emerged to provide hallucination detection at scale using LLM judges; however, real-time prediction remains a challenge for these methods. Other prompt-based approaches, like the zero-shot method SelfCheckGPT [22], employ stochastic sampling to identify inconsistencies across multiple response variants. Rather than relying on a single prompt, ChainPoll [23] implements a series of verification steps to detect hallucinations. [24] presents a method of cross-examination between two LLMs to uncover inconsistencies, and [25] trains LLM-based classifiers on synthetic errors to detect both hallucinations and coverage errors in LLM-generated responses.
Fine-tuned LLM approaches involve training LLMs on hallucination detection tasks using dedicated training data. [7] not only introduced the RAGTruth data but also presented a fine-tuned Llama-2-13B LLM, which achieved state-of-the-art performance on their test set, even surpassing larger models like GPT-4. RAG-HAT [26] introduced a novel approach called Hallucination Aware Tuning (HAT), which trains models to generate detection labels and provide detailed descriptions of identified hallucinations. The authors created a preference dataset to facilitate Direct Preference Optimization (DPO) training; fine-tuning through DPO yields state-of-the-art performance on the RAGTruth test set.
Encoder-based methods focus on addressing computational efficiency constraints through domain-specific adaptations. RAGHalu [27] employs a two-tiered encoder model that performs binary classification at each tier, fine-tuning a Natural Language Inference (NLI) model based on DeBERTa [28]. The approach most similar to our work is Luna [13], which also builds on DeBERTa and NLI to create a lightweight hallucination detection system capable of handling longer contexts. Luna draws connections between detecting entailment in NLI tasks and identifying hallucinations. It is fine-tuned on a large, cross-domain corpus of question-answering-based RAG samples, with annotations provided by GPT-4. During inference, Luna conducts sentence- or token-level checks of each model response against the retrieved passages, flagging unsupported fragments. FACTOID [29] introduces a Factual Entailment (FE) framework, a new form of textual entailment aimed at locating hallucinations at the token or span level. Other approaches, such as ReDeEp [9], analyze internal model states for hallucination detection.
We trained and evaluated our models using the RAGTruth dataset [7]. RAGTruth is the first large-scale benchmark for evaluating hallucinations in RAG settings. The dataset contains 18,000 annotated examples at the span level across three tasks: question answering, data-to-text generation, and news summarization.
For the question answering task, data was sampled from the MS MARCO dataset [30], where each question had up to three corresponding contexts. The authors then prompted LLMs to generate answers based on the retrieved passages. In the data-to-text generation task, LLMs were asked to generate reviews for sampled businesses from the Yelp Open Dataset [31]. For the news summarization task, randomly selected documents were taken from the training set of the CNN/Daily Mail dataset [32], and LLMs were prompted to create summaries.
For response generation, various LLMs were employed, including GPT-4-0613 [1], Mistral-7B-Instruct [16], and selections from the Llama models, such as Llama-2-7b-chat and Llama-2-13B-chat [2]. Each sample in the dataset includes one response from each model, resulting in six responses per sample in RAGTruth.
The entire dataset was annotated by human evaluators, who marked annotations in the responses and provided rationales. RAGTruth categorizes hallucinations into types such as Evident Conflict, Subtle Conflict, Evident Introduction of Baseless Information, and Subtle Introduction of Baseless Information. For our training, we focused solely on the binary task of recognizing hallucinations without considering these specific categories. An example from the dataset can be seen in Table 1.
We tokenized RAGTruth and found that the mean token length is 801 tokens, with a median of 741 tokens, a minimum of 194, and a maximum of 2632 tokens. This highlights the need for long-context models, such as ModernBERT, to effectively detect hallucinations.
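These statistics can be reproduced once the corpus is loaded; the sketch below is our own illustration and assumes each example exposes its context, question, and answer as plain strings (the field names, concatenation scheme, and placeholder data are assumptions, not the RAGTruth schema).

```python
# Sketch: measure tokenized input lengths of context + question + answer.
# The example dicts below are placeholders; field names are assumptions.
import statistics
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def input_length(example: dict) -> int:
    text = " ".join([example["context"], example["question"], example["answer"]])
    return len(tokenizer(text)["input_ids"])

examples = [
    {"context": "Q2 is the median.", "question": "How to explain quartiles?",
     "answer": "The median splits the data in half."},
]
lengths = [input_length(ex) for ex in examples]
print(statistics.mean(lengths), statistics.median(lengths), min(lengths), max(lengths))
```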
Table 1: An annotated example from the RAGTruth dataset; the hallucinated span in the response conflicts with the reference passage.

| QUESTION | How to explain quartiles? |
|---|---|
| REFERENCE | Passage 2: Second quartile (Q2) which is more commonly known as median splits the data in half (50%). Median divides the data into a lower half and an upper half. Third quartile (Q3), also known as upper quartile, splits lowest 75% (or highest 25%) of data […] |
| RESPONSE | The first quartile (Q1) splits the lowest 25% of the data, while the second quartile (Q2) splits the data into two equal halves, with the median being the middle value of the lower half. Finally, the third quartile (Q3) splits the highest 75% of the data. […] |
| ANNOTATION | Span: "highest 75%"; Type: Evident Conflict; Reason: Original: "lowest 75% (or highest 25%)"; Generative: "highest 75%" |
Figure 2: The architecture of LettuceDetect. The figure illustrates an example of a Question, Context, and Answer triplet as input to our architecture. After the tokenization step, the tokens are fed into LettuceDetect for token-level classification. Tokens from both the question and the context are masked (indicated by the red line) for loss calculations. In the output of LettuceDetect, we provide probabilities for each answer token. If the output type is span-level, we aggregate subsequent tokens that are hallucinated for the span-level output.
We trained ModernBERT-base and -large variants as token classifiers on the RAGTruth dataset. Input sequences were constructed by concatenating the context, question, and answer segments using special tokens ([CLS] for the context, [SEP] as separator) and tokenized to a maximum length of 4,096 tokens (the current version does not yet utilize ModernBERT's full 8,192-token context length). Tokenization was handled with Hugging Face's AutoTokenizer [33]. Our models are based solely on the ModernBERT architecture and, unlike previous encoder-based architectures, were not pretrained on the NLI task.
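To make the input construction concrete, the sketch below shows one way such sequences could be built and label-masked. It is a simplification under our assumptions: the helper, the mapping of annotations to per-token labels, and the exact concatenation details are not taken from the released preprocessing code.

```python
# Minimal sketch (not the released preprocessing): build a
# [CLS] context [SEP] question [SEP] answer sequence and mask non-answer tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def build_example(context: str, question: str, answer: str, answer_labels: list[int]):
    """answer_labels: one 0/1 label per answer token (1 = hallucinated),
    e.g. derived from RAGTruth's span annotations."""
    prompt = tokenizer(context + tokenizer.sep_token + question, add_special_tokens=True)
    answer_enc = tokenizer(answer, add_special_tokens=False)

    input_ids = prompt["input_ids"] + answer_enc["input_ids"]
    # label = -100 masks context/question tokens so the loss ignores them.
    labels = [-100] * len(prompt["input_ids"]) + answer_labels

    # Truncate to the 4,096-token limit used in this version.
    input_ids, labels = input_ids[:4096], labels[:4096]
    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": labels}
```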
The architecture leverages Hugging Face's AutoModelForTokenClassification [33] with ModernBERT as the backbone and a classification head on top. Context and question tokens were masked out for the loss (label = -100), while answer tokens were labeled as 0 (supported) or 1 (hallucinated). Training used the AdamW optimizer [34] (learning rate \(1\times10^{-5}\), weight decay 0.01) for 6 epochs on a single NVIDIA A100 GPU. Data and batch handling relied on a PyTorch DataLoader [35] (batch size 8, shuffling enabled). We evaluated models with the token-level F1 score and saved the best-performing checkpoint in the safetensors format. Dynamic padding was implemented with DataCollatorForTokenClassification to process variable-length sequences efficiently.
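A condensed sketch of this training setup is shown below. The hyperparameters (learning rate, weight decay, epochs, batch size) come from the text above; the placeholder training data and the bare-bones loop are our assumptions rather than the released training code.

```python
# Sketch of the training setup (requires a transformers release with ModernBERT support).
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2  # 0 = supported, 1 = hallucinated
).to("cuda")

# Placeholder features; in practice these come from the preprocessed RAGTruth examples.
train_dataset = [
    {"input_ids": [0, 1, 2, 3], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 0, 1]},
]

collator = DataCollatorForTokenClassification(tokenizer)  # dynamic padding
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

model.train()
for epoch in range(6):
    for batch in train_loader:
        batch = {k: v.to("cuda") for k, v in batch.items()}
        loss = model(**batch).loss  # token-level cross-entropy; -100 labels are ignored
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # After each epoch: compute token-level F1 on the dev set and keep the best
    # checkpoint, e.g. model.save_pretrained("checkpoint", safe_serialization=True).
```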
The final model predicts a hallucination probability for each answer token; span-level outputs are obtained by aggregating consecutive tokens whose probability exceeds a 0.5 threshold. The best models are uploaded to Hugging Face. Our method is illustrated in Figure 2, and the results are discussed in Section 5.
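As an illustration of this aggregation step, the sketch below (our own simplification, not the released inference code) merges consecutive answer tokens whose hallucination probability exceeds the threshold into character spans, assuming per-token probabilities and character offsets for the answer tokens are available:

```python
# Sketch: merge consecutive answer tokens with P(hallucinated) > 0.5 into spans.
# `token_probs` and `offsets` are assumed to cover only the answer tokens.
def aggregate_spans(token_probs, offsets, threshold=0.5):
    """token_probs: list of floats; offsets: list of (char_start, char_end) pairs."""
    spans, current = [], None
    for p, (start, end) in zip(token_probs, offsets):
        if p > threshold:
            if current is None:
                current = [start, end]   # open a new hallucinated span
            else:
                current[1] = end         # extend the running span
        elif current is not None:
            spans.append(tuple(current)) # close the span on a supported token
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

# Example: tokens 3 and 4 exceed the threshold and are merged into one span.
print(aggregate_spans([0.1, 0.2, 0.9, 0.8, 0.3],
                      [(0, 3), (4, 9), (10, 17), (18, 22), (23, 27)]))
# -> [(10, 22)]
```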
Table 2: Example-level results of different methods on the RAGTruth test set. Precision (Prec.), recall (Rec.), and F1 are reported per task (QA = question answering, Data-to-Text = data-to-text writing, Summ. = summarization) and overall.

| Method | QA Prec. | QA Rec. | QA F1 | Data-to-Text Prec. | Data-to-Text Rec. | Data-to-Text F1 | Summ. Prec. | Summ. Rec. | Summ. F1 | Overall Prec. | Overall Rec. | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prompt (gpt-3.5-turbo) | 18.8 | 84.4 | 30.8 | 65.1 | 95.5 | 77.4 | 23.4 | 89.2 | 37.1 | 37.1 | 92.3 | 52.9 |
| Prompt (gpt-4-turbo) | 33.2 | 90.6 | 45.6 | 64.3 | 100.0 | 78.3 | 31.5 | 97.6 | 47.6 | 46.9 | 97.9 | 63.4 |
| SelfCheckGPT (gpt-3.5-turbo) | 35.0 | 58.0 | 43.7 | 68.2 | 82.8 | 74.8 | 31.1 | 56.5 | 40.1 | 49.7 | 71.9 | 58.8 |
| LMvLM (gpt-4-turbo) | 18.7 | 76.9 | 30.1 | 68.0 | 76.7 | 72.1 | 23.2 | 81.9 | 36.2 | 36.2 | 77.8 | 49.4 |
| Finetuned Llama-2-13B | 61.6 | 76.3 | 68.2 | 85.4 | 91.0 | 88.1 | 64.0 | 54.9 | 59.1 | 76.9 | 80.7 | 78.7 |
| RAG-HAT | 76.5 | 73.1 | 74.8 | 92.9 | 90.3 | 91.6 | 77.7 | 59.8 | 67.6 | 87.3 | 80.8 | 83.9 |
| ChainPoll (gpt-3.5-turbo) | 33.5 | 51.3 | 40.5 | 84.6 | 35.1 | 49.6 | 45.8 | 48.0 | 46.9 | 54.8 | 40.6 | 46.7 |
| RAGAS Faithfulness | 31.2 | 41.9 | 35.7 | 79.2 | 50.8 | 61.9 | 64.2 | 29.9 | 40.8 | 62.0 | 44.8 | 52.0 |
| Trulens Groundedness | 22.8 | 92.5 | 36.6 | 66.9 | 96.5 | 79.0 | 40.2 | 50.0 | 44.5 | 46.5 | 85.8 | 60.4 |
| Luna | 37.8 | 80.0 | 51.3 | 64.9 | 91.2 | 75.9 | 40.0 | 76.5 | 52.5 | 52.7 | 86.1 | 65.4 |
| lettucedetect-base-v1 | 60.64 | 71.25 | 65.52 | 89.30 | 86.53 | 87.89 | 53.89 | 47.55 | 50.52 | 76.64 | 75.50 | 76.07 |
| lettucedetect-large-v1 | 65.93 | 75.00 | 70.18 | 90.45 | 86.70 | 88.54 | 64.04 | 55.88 | 59.69 | 80.44 | 78.05 | 79.22 |
Table 3: Span-level results of different methods on the RAGTruth test set, measured by character-level overlap between predicted and gold spans (same task abbreviations as in Table 2).

| Method | QA Prec. | QA Rec. | QA F1 | Data-to-Text Prec. | Data-to-Text Rec. | Data-to-Text F1 | Summ. Prec. | Summ. Rec. | Summ. F1 | Overall Prec. | Overall Rec. | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prompt Baseline (gpt-3.5-turbo) | 7.9 | 25.1 | 12.1 | 8.7 | 45.1 | 14.6 | 6.1 | 33.7 | 10.3 | 7.8 | 35.3 | 12.8 |
| Prompt Baseline (gpt-4-turbo) | 23.7 | 52.0 | 32.6 | 17.9 | 66.4 | 28.2 | 14.7 | 65.4 | 24.3 | 18.4 | 60.9 | 28.3 |
| Finetuned Llama-2-13B | 55.8 | 60.8 | 58.2 | 56.5 | 50.7 | 53.5 | 52.4 | 30.8 | 38.6 | 55.6 | 50.2 | 52.7 |
| lettucedetect-base-v1 | 62.65 | 60.40 | 61.50 | 58.24 | 56.57 | 57.39 | 52.98 | 28.08 | 36.71 | 59.36 | 52.01 | 55.44 |
| lettucedetect-large-v1 | 66.85 | 62.14 | 64.41 | 64.71 | 55.99 | 60.04 | 60.17 | 35.47 | 44.63 | 64.92 | 53.96 | 58.93 |
We evaluate our models on the RAGTruth test data across all task types: question answering (QA), data-to-text writing, and summarization. Following the methodology outlined in [7], we report precision, recall, and F1 score for both example-level and span-level detection. Our models are compared against the state-of-the-art baselines presented in [7], [13], [26]. This includes prompt-based methods, such as gpt-4-turbo and gpt-3.5-turbo, as well as fine-tuned LLMs that have shown state-of-the-art performance on the RAGTruth data, including the previously established state-of-the-art model of [7] (a fine-tuned Llama-2-13B) and the current best result of [26] (a Llama-3-8B-based LLM fine-tuned with DPO). We also compare against encoder-based approaches similar to ours, including the token-classifier method of [13], which is based on DeBERTa.
Table 2 illustrates our results on the example-level task. Our large model (lettucedetect-large-v1) outperforms all prompt-based methods (gpt-4-turbo achieves an overall F1 score of 63.4%, compared to lettucedetect-large-v1's 79.22%). It also surpasses the previous state-of-the-art encoder-based model, Luna (65.4% vs. 79.22%), and the previously state-of-the-art fine-tuned LLM of [7] (fine-tuned Llama-2-13B, 78.7% vs. 79.22%). The only model that exceeds our large model's performance is the current state-of-the-art fine-tuned LLM based on Llama-3-8B presented in the RAG-HAT paper [26] (83.9% vs. 79.22%). Our base model (lettucedetect-base-v1) also demonstrates strong performance across tasks while being less than half the size of the large model. Thanks to their compact size (150M parameters for the base model and 396M for the large model) and the optimized ModernBERT architecture, our models process approximately 30 to 60 examples per second on a single GPU. At this inference speed, they fall short of only one, much larger 8B Llama-based model. Overall, our models are highly efficient while being about 30 times smaller.
In Table 3, we present our results on the span-level task, where we evaluate the overlap between the gold and predicted spans. Following the RAGTruth paper, we measure character-level overlap and calculate precision, recall, and F1 score. Our models achieve state-of-the-art performance: the fine-tuned Llama-2-13B model reaches an overall F1 score of 52.7%, while our large model achieves 58.93%. Note that we could not compare our results with RAG-HAT on this task because the authors did not evaluate at this level. Additionally, RAGTruth did not include this evaluation in their published code, so we relied on our own implementation (sketched below) for this analysis.
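Since the span-level metric relies on our own implementation, the sketch below shows one way character-level overlap precision, recall, and F1 could be computed. It reflects our reading of the RAGTruth protocol; the exact evaluation details are assumptions.

```python
# Sketch of character-level span overlap metrics (assumed protocol, not the
# official RAGTruth evaluation code): spans are (start, end) character offsets.
def char_overlap_prf(pred_spans, gold_spans):
    pred_chars = {i for s, e in pred_spans for i in range(s, e)}
    gold_chars = {i for s, e in gold_spans for i in range(s, e)}
    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the prediction covers half of the gold span and nothing else.
print(char_overlap_prf(pred_spans=[(10, 20)], gold_spans=[(10, 30)]))
# -> (1.0, 0.5, 0.666...)
```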
We present LettuceDetect, a lightweight and efficient framework for hallucination detection in RAG systems. By leveraging ModernBERT's long-context capabilities, our baseline models achieve strong performance on the RAGTruth benchmark while remaining highly efficient in inference settings. This work serves as a foundation for our future research, where we plan to expand the framework to include more datasets, additional languages, and enhanced architectures. Even in its current form, LettuceDetect demonstrates that effective hallucination detection can be achieved with lean, purpose-built models.