AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Hamzah Luqman
Information and Computer Science Department, King Fahd University of Petroleum and Minerals
SDAIA-KFUPM Joint Research Center for Artificial Intelligence


ABSTRACT↩︎

Recent research on hallucination in large language models (LLMs) has focused mainly on English. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLM hallucination in the Arabic context remains relatively underexplored. This knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. We evaluate a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLM outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that capture the distinct characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than the multilingual models and performance comparable to the reasoning-based models. The code is available at: Github link.

1 Introduction↩︎

The emergence of large language models (LLMs) has marked a new era in natural language processing (NLP). LLMs demonstrate exceptional competence in generating coherent and contextually relevant text in multiple languages [1]. However, hallucination remains a critical issue: it occurs when an LLM generates outputs that are factually inaccurate, nonsensical, or misleading [2]. This issue not only undermines LLMs’ trustworthiness but also limits their practical use in real-world applications.

Figure 1: An example of LLM hallucination errors in the GQA task. Named-entity error denotes incorrect names of people, places, or organizations; value error denotes wrong dates, ages, or time references; factual contradiction represents information that does not exist in the real world; and response conflict represents contradictory information within the response itself.

Hallucination is commonly classified into factuality and faithfulness hallucinations [3]. Factuality hallucination describes the divergence between the produced content and known real-world facts, often appearing as factual inconsistency or fabrication. Faithfulness hallucination, on the other hand, refers to divergence from the input or context, misalignment with user instructions, or a lack of internal consistency. Figure 1 illustrates an example of hallucination in Arabic generative question answering (GQA). In this example, the model introduces named-entity errors (e.g., incorrect names), value errors (e.g., wrong dates), factual contradictions (e.g., claims not supported by real-world facts), and response conflicts (e.g., internal contradictions within the generated response). Extensive research on hallucination in LLMs has predominantly focused on high-resource languages, such as English and Chinese [1], [3]. Evaluating LLM hallucination in the Arabic context remains relatively underexplored despite the growing number of multilingual and Arabic-specific LLMs [4], [5]. Arabic presents unique linguistic challenges due to its morphological richness, complex syntax, and diversity of dialects [6], [7]. These challenges make hallucination evaluation more complex and necessitate specialized benchmarks and methodologies [8], [9], [10].

To address this limitation, we conduct a comprehensive evaluation of state-of-the-art (SOTA) Arabic and multilingual LLMs on two critical generative tasks: GQA and text summarization. Twelve LLMs are evaluated in this work. We also evaluate the performance of four reasoning-based models on the TruthfulQA hallucination benchmark. Our evaluation goes beyond conventional metrics by incorporating fine-grained human evaluation to assess hallucinations using a multi-dimensional criterion encompassing both factuality and faithfulness. Twelve fine-grained hallucination types are identified in this study and used to evaluate the LLMs. Through this comparative analysis, we identify the strengths and shortcomings of the evaluated LLMs in generating factual outputs. The main contributions of this study can be summarized as follows:

  • Propose a multi-dimensional assessment criterion for LLMs’ hallucination in Arabic.

  • Evaluate hallucination in Arabic, multilingual, and reasoning-based LLMs on Arabic GQA and text summarization tasks.

  • Present a manually annotated dataset for evaluating hallucinations in Arabic LLM outputs across GQA and summarization tasks.

  • Compare four reasoning-based LLMs on the TruthfulQA hallucination benchmark using parallel English and Arabic questions.

2 Related work↩︎

2.0.0.1 Hallucination in LLMs.

Hallucination in LLMs compromises model reliability and poses safety concerns in real-world applications such as healthcare, education, and law. Previous studies have extensively explored hallucination in LLMs within English contexts, focusing primarily on detection and mitigation strategies [1], [3], [11], [12]. To mitigate hallucination in LLMs, prior studies proposed strategies such as self-verification approaches [13], grounding model outputs in external knowledge [14], self-consistency decoding [15], and contrastive decoding [16].

Despite these advances in LLMs, hallucination remains understudied in low-resource languages like Arabic. While reasoning-focused models such as GPT-4o [17] and DeepSeek-R1 [18] show promise in mitigating hallucinations in English, their effectiveness in Arabic generative tasks is largely unknown. Meanwhile, Arabic-specific LLMs such as Jais [5], Fanar [19], and Allam [4] have been developed, but their hallucination behavior has yet to be systematically evaluated. Given Arabic’s morphological complexity and dialectal variation, dedicated benchmarks are essential for evaluating factuality and faithfulness in Arabic LLM outputs [8]. In addition, cross-lingual comparisons between Arabic-focused and multilingual LLMs, such as Gemma3 [20], LLaMA3 [21], and Qwen2.5 [22], are crucial for understanding how language-specific features affect hallucination, since language-specific behaviors may lead to significant differences in hallucination tendencies and factual reliability when generating Arabic content.

2.0.0.2 Hallucination Evaluation.

Evaluating hallucination in LLMs is essential to understand their factual reliability and to ensure alignment with user intent. Accordingly, another line of research concentrates on assessing the hallucination of models across various NLP tasks. For instance, [2] provided a comprehensive study of hallucinations in abstractive summarization, revealing that SOTA models frequently generate factually and faithfully inconsistent summaries. Their study shows that even summaries with high ROUGE scores can be unfaithful, which highlights the need for better evaluation methods.

A variety of metrics have been developed to evaluate the faithfulness of abstractive summarization. These include entailment-based measures [23]–[25] as well as question-generation and question-answering metrics [26]–[28]. More recently, attention has shifted to LLM-based metrics [29]–[31] that use LLMs to evaluate the fidelity of a summary. To evaluate hallucination in GQA, prior research has explored multiple approaches, including fine-tuning LLMs to detect factual inconsistencies [32] and analyzing internal model states to identify hallucinated or factually incorrect claims [33], [34].

In parallel, several benchmark datasets have been introduced to facilitate standardized evaluation, including TruthfulQA [35], which targets common misconceptions; FreshQA [36], which focuses on time-sensitive knowledge; and HaluEval [37], which is designed for hallucination categorization. These datasets enable a more comprehensive analysis of hallucination tendencies in GQA. Despite these advancements, hallucination evaluation remains largely unexplored for Arabic. Most existing benchmarks and evaluation metrics have been developed for English, leaving a significant gap in assessing the factuality and faithfulness of Arabic generative outputs.

Our work bridges this research gap by providing an extensive comparative evaluation of hallucination phenomena across Arabic-specific, multilingual, and reasoning-based LLMs on Arabic GQA and summarization tasks. We aim to systematically measure hallucination in LLMs, identify linguistic features contributing to hallucinations, and benchmark reasoning-enhanced models in an Arabic linguistic context.

3 Results and Discussion↩︎


Table 1: Hallucination scores on the Arabic GQA task. NE = Named-entity errors, Val = Value errors, Contr. = Factual contradictions, Conflic. = Conflict hallucinations, Gramm. = Grammar errors, Gen. = Generic/Imprecise hallucinations, KSC = Knowledge source conflict, Instr. = Instruction inconsistency, CSw. = Code-switching.

| Model | Lang. | NE | Val | Contr. | Conflic. | Gramm. | Gen. | KSC | Fact. Total | Instr. | CSw. | Faith. Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Allam | Arabic | 0.083 | 0.240 | 0.307 | 0.000 | 0.003 | 0.070 | 0.023 | 0.727 | 0.007 | 0.030 | 0.037 | 0.382 |
| Fanar | Arabic | 0.120 | 0.227 | 0.313 | 0.000 | 0.003 | 0.143 | 0.030 | 0.837 | 0.033 | 0.147 | 0.180 | 0.508 |
| Jais-6.7b | Arabic | 0.137 | 0.103 | 0.240 | 0.000 | 0.000 | 0.527 | 0.003 | 1.010 | 0.480 | 0.063 | 0.543 | 0.777 |
| Noon | Arabic | 0.197 | 0.393 | 0.547 | 0.003 | 0.003 | 0.243 | 0.020 | 1.407 | 0.050 | 0.070 | 0.120 | 0.763 |
| Gemma | Multilingual | 0.193 | 0.297 | 0.453 | 0.003 | 0.000 | 0.193 | 0.020 | 1.160 | 0.040 | 0.090 | 0.130 | 0.645 |
| Bloom-7b | Multilingual | 0.213 | 0.303 | 0.510 | 0.003 | 0.003 | 0.287 | 0.020 | 1.339 | 0.037 | 0.083 | 0.120 | 0.730 |
| Llama | Multilingual | 0.163 | 0.207 | 0.313 | 0.000 | 0.000 | 0.257 | 0.023 | 0.963 | 0.030 | 0.090 | 0.120 | 0.542 |
| Qwen2.5-7b | Multilingual | 0.220 | 0.267 | 0.300 | 0.003 | 0.003 | 0.310 | 0.030 | 1.133 | 0.060 | 0.117 | 0.177 | 0.655 |
| DeepSeek-r1 | Reasoning | 0.070 | 0.127 | 0.200 | 0.000 | 0.003 | 0.193 | 0.010 | 0.603 | 0.067 | 0.083 | 0.150 | 0.377 |
| GPT-4o | Reasoning | 0.040 | 0.067 | 0.120 | 0.000 | 0.000 | 0.127 | 0.010 | 0.364 | 0.033 | 0.073 | 0.106 | 0.235 |
| GPT-o3 | Reasoning | 0.050 | 0.083 | 0.130 | 0.000 | 0.003 | 0.137 | 0.010 | 0.413 | 0.030 | 0.067 | 0.097 | 0.255 |
| QwQ | Reasoning | 0.110 | 0.150 | 0.280 | 0.003 | 0.003 | 0.223 | 0.013 | 0.779 | 0.070 | 0.093 | 0.163 | 0.471 |

Several experiments were conducted to evaluate the hallucination of the selected models on the Arabic GQA and summarization tasks. More information about the experimental setup and prompt selection is available in Appendix 10.

3.0.0.1 Models Hallucination.

Tables 1 and 2 show the results of the evaluated LLMs on the Arabic GQA and text summarization tasks, respectively. The average hallucination score is computed as the mean of the total factual and faithfulness hallucination scores for each model.
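For clarity, this computation can be written compactly, with \(\text{Total}_{\text{fact}}\) and \(\text{Total}_{\text{faith}}\) denoting a model’s total factual and faithfulness scores:

\[
\text{Average} = \frac{\text{Total}_{\text{fact}} + \text{Total}_{\text{faith}}}{2}
\]

For example, for Allam on GQA, \((0.727 + 0.037)/2 \approx 0.382\), matching the value reported in Table 1.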

Both tables show a clear contrast in performance across Arabic and multilingual LLMs. As shown in Table 1, the best-performing model, Allam, achieved the lowest average hallucination score of 0.382, with low factual and faithfulness error rates. These low error rates indicate strong adherence to real-world knowledge and user instructions. In contrast, models such as Jais, Noon, and Bloom exhibit significantly higher hallucination scores, with averages of 0.777, 0.763, and 0.730, respectively. The high error rates of these models are driven primarily by factual contradictions and by named-entity, value, and generic errors, consistent with the general trend that value and named-entity hallucinations dominate in GQA outputs. These errors can be attributed to the models’ difficulty in handling time-sensitive or fact-specific questions, compounded by the absence of grounding in real-world temporal knowledge. Faithfulness errors, including instruction inconsistency and code-switching, are relatively rare across models, with Jais being a notable exception; this indicates that the bilingual model may struggle to maintain language consistency and adhere to instructions.

Table 2 shows the hallucination error rates of the evaluated models on the text summarization task. For this task, we used ten indicators to measure the hallucination of each LLM; more details about these indicators are available in Section [sec_hall_indicators]. As the table shows, hallucination patterns diverge significantly, with fabrication and context inconsistency being the most prevalent error types across all models. This highlights the models’ tendency to introduce fabricated content or deviate from the original document’s context, a major issue in summarization, where the resulting summary must remain close to the source.

Similar to the GQA task, Allam obtained the lowest average hallucination score of 0.215 and achieved the best human rating of 5. These results confirm that its outputs are both factual and faithful. In contrast, Fanar and Gemma exhibit high total factual hallucination scores of 1.215 and 1.000, respectively. Bloom-7b also received the lowest human rating (1), which indicates a substantial discrepancy between its output and the context of the original text and could be attributed to the presence of noisy or low-quality data in Bloom’s pretraining corpus.

3.0.0.2 Hallucination Indicators.

Figure 2 presents the distribution of hallucination types for each LLM on the Arabic GQA and text summarization tasks. In the GQA task (Figure 2a), factual contradiction hallucinations are the most frequent, followed by generic, value, and named-entity hallucinations. These factual errors dominate the other indicators, reflecting the models’ difficulty in answering time-sensitive and entity-centric questions. Faithfulness errors, such as instruction inconsistency and code-switching, are also observed but to a lesser extent.

In contrast, the summarization task (Figure 2b) shows a different pattern. Context inconsistency and fabrication are the most frequent hallucination types generated by the LLMs, which highlights summarization’s susceptibility to content invention and divergence from context. Errors such as inference, value, and named-entity remain common but are less dominant. These differences emphasize how hallucination types vary across NLG tasks and reinforce the need for task-specific evaluation criteria.


Table 2: Hallucination scores on the Arabic summarization task. NE = Named-entity errors, Val = Value errors, Fabric. = Fabrications, Infer. = Inference errors, Gramm. = Grammar errors, Density = Hallucination density, Instr. = Instruction inconsistency, Context = Context inconsistency, and CSw. = Code-switching.
| Model | Lang. | NE | Val | Fabric. | Infer. | Gramm. | Fact. Total | Density | Instr. | Context | CSw. | Faith. Total | Average | Human Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Allam | Arabic | 0.030 | 0.060 | 0.010 | 0.110 | 0.000 | 0.210 | 0.066 | 0.000 | 0.200 | 0.020 | 0.220 | 0.215 | 5 |
| Fanar | Arabic | 0.270 | 0.250 | 0.455 | 0.230 | 0.010 | 1.215 | 0.486 | 0.260 | 0.750 | 0.120 | 1.130 | 1.172 | 3 |
| Jais | Arabic | 0.150 | 0.130 | 0.210 | 0.130 | 0.000 | 0.620 | 0.344 | 0.230 | 0.420 | 0.010 | 0.660 | 0.638 | 3 |
| Noon | Arabic | 0.192 | 0.121 | 0.313 | 0.172 | 0.030 | 0.828 | 0.277 | 0.010 | 0.576 | 0.071 | 0.675 | 0.743 | 4 |
| Gemma | Multilingual | 0.240 | 0.200 | 0.430 | 0.130 | 0.000 | 1.000 | 0.410 | 0.210 | 0.610 | 0.030 | 0.850 | 0.925 | 3 |
| Bloom-7b | Multilingual | 0.120 | 0.140 | 0.510 | 0.010 | 0.000 | 0.780 | 0.545 | 0.420 | 0.590 | 0.010 | 1.020 | 0.783 | 1 |
| Llama | Multilingual | 0.060 | 0.090 | 0.190 | 0.100 | 0.040 | 0.480 | 0.212 | 0.110 | 0.370 | 0.070 | 0.550 | 0.515 | 3 |
| Qwen2.5 | Multilingual | 0.070 | 0.040 | 0.100 | 0.180 | 0.000 | 0.390 | 0.128 | 0.110 | 0.370 | 0.083 | 0.563 | 0.477 | 4 |
| DeepSeek-r1 | Reasoning | 0.030 | 0.040 | 0.030 | 0.080 | 0.020 | 0.200 | 0.075 | 0.080 | 0.170 | 0.040 | 0.290 | 0.245 | 5 |
| GPT-4o | Reasoning | 0.010 | 0.010 | 0.010 | 0.070 | 0.000 | 0.100 | 0.021 | 0.000 | 0.100 | 0.010 | 0.110 | 0.105 | 5 |
| GPT-o3 | Reasoning | 0.000 | 0.050 | 0.020 | 0.080 | 0.010 | 0.160 | 0.032 | 0.000 | 0.120 | 0.010 | 0.130 | 0.145 | 5 |
| QwQ | Reasoning | 0.080 | 0.060 | 0.080 | 0.180 | 0.020 | 0.420 | 0.147 | 0.190 | 0.390 | 0.460 | 1.040 | 0.730 | 4 |

Figure 2: Frequency of hallucination types (log\(_{10}\)-scaled) generated by evaluated LLMs across (a) GQA and (b) text summarization tasks.

Figure 3: Distribution of hallucination density across Arabic and multilingual LLMs using the summarization task.

3.0.0.3 Arabic vs. Multilingual LLMs.

Figure 3 shows the hallucination density distribution of the evaluated Arabic and multilingual LLMs on the text summarization task. While the difference in hallucination density between Arabic and multilingual models did not reach statistical significance (t = -1.41, p = 0.161), the trend indicates that Arabic models may produce fewer hallucinations on average; the lack of significance can be attributed to the small size of the dataset and the limited number of evaluated LLMs. A paired t-test between the best-performing Arabic model (Allam) and the best-performing multilingual model (Qwen2.5-7b) revealed a statistically significant difference at the 5% level (p = 0.0186), indicating that Allam produces significantly fewer hallucinations than Qwen2.5-7b. The negative t-statistic further supports this finding, showing that Allam consistently generates summaries with lower hallucination density. This confirms the superior factual faithfulness of Allam in Arabic summarization.

For GQA, we conducted a Mann-Whitney U test to compare factual hallucination rates between models. When comparing all Arabic models against all multilingual models, the difference was statistically significant, with a U-statistic of 649,023.5 and a p-value of \(8.19 \times 10^{-6}\) (p < 0.01). These findings indicate that Arabic models are generally more robust at reducing factual hallucinations in the Arabic GQA task than their multilingual counterparts. More details about the selection of significance tests are presented in Appendix 12.

Table 3: Hallucination rates of the reasoning-based models and Allam on Arabic and English outputs using the TruthfulQA dataset.
| Language | Model | Hallucination Rate |
|---|---|---|
| Arabic | Allam | 0.666 |
| Arabic | DeepSeek R1 | 0.519 |
| Arabic | GPT-4o | 0.448 |
| Arabic | GPT-o3 | 0.649 |
| Arabic | QwQ | 0.524 |
| English | Allam | 0.616 |
| English | DeepSeek R1 | 0.482 |
| English | GPT-4o | 0.425 |
| English | GPT-o3 | 0.548 |
| English | QwQ | 0.497 |
| | t-statistic | 3.37 |
| | p-value | 0.028 |

3.0.0.4 Reasoning-based models.

Tables 1 and 2 show the performance of the four reasoning-based models on the Arabic GQA and summarization tasks, respectively. As shown in Table 1, GPT-4o achieves the lowest total factual error rate (0.364) and the lowest average hallucination score (0.235), whereas QwQ exhibits the highest total factual error rate (0.779) and average score (0.471) among the reasoning-based models. Notably, the Arabic pre-trained model Allam rivals the reasoning-based models, achieving an average hallucination score of 0.382 that is competitive with QwQ and DeepSeek-r1, which underscores the effectiveness of language-specific pretraining in mitigating hallucinations.

A similar trend is shown in Table 2, where GPT-4o attains the best average hallucination score of 0.105, followed by GPT-o3, whereas QwQ exhibits the highest average hallucination score of 0.730. The Arabic pre-trained model Allam outperforms DeepSeek-r1 and QwQ with a factual density of 0.066 and a total faithfulness score of 0.220, which again underscores the effectiveness of language-specific pretraining.

Table 3 shows the hallucination rates of four reasoning-based LLMs (DeepSeek R1, GPT-4o, GPT-o3, and QwQ) and the best-performing Arabic-centric model, Allam, when responses are generated in Arabic and English using the TruthfulQA dataset. We used the coarse-grained definition of hallucination introduced with this dataset, in which the generated responses are compared against the ground truth and responses that do not match it are considered hallucinations. Using this definition, we computed the hallucination rates reported in Table 3. As shown, the hallucination rate is consistently higher in Arabic outputs than in English outputs across all reasoning-based models. For instance, GPT-o3 demonstrates a hallucination rate of 0.649 in Arabic compared to 0.548 in English. Likewise, DeepSeek-r1 and QwQ exhibit higher hallucination rates in Arabic, 0.519 and 0.524, respectively, compared to 0.482 and 0.497 in English. A two-tailed paired-samples t-test indicates a statistically significant difference, with a t-statistic of 3.37 and a p-value of \(2.81 \times 10^{-2}\). These findings suggest that reasoning-based LLMs are more prone to generating hallucinations when responding in Arabic, which underscores the need for further study and targeted enhancements in Arabic.
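The following sketch illustrates how such a coarse-grained rate can be computed. The matching criterion shown here (normalized string overlap with any gold answer) is an illustrative assumption; the paper only states that non-matching responses count as hallucinations.

```python
# Minimal sketch of a coarse-grained TruthfulQA-style hallucination rate.
# The normalization and overlap test are illustrative assumptions, not the study's exact procedure.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for a rough comparison."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def is_hallucination(response: str, gold_answers: list) -> bool:
    """A response counts as a hallucination if it matches none of the gold answers."""
    resp = normalize(response)
    return not any(normalize(g) in resp or resp in normalize(g) for g in gold_answers)

def hallucination_rate(responses: list, gold: list) -> float:
    """Fraction of responses judged as hallucinations (the quantity reported in Table 3)."""
    flags = [is_hallucination(r, g) for r, g in zip(responses, gold)]
    return sum(flags) / len(flags)

# Toy usage example (hypothetical data):
rate = hallucination_rate(["Paris is the capital of France."], [["Paris"]])
print(f"Hallucination rate: {rate:.3f}")  # -> 0.000
```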

4 Conclusion↩︎

In this study, we presented the first comprehensive evaluation of hallucination in Arabic across Arabic and multilingual LLMs using two NLG tasks: GQA and summarization. We proposed a multi-dimensional hallucination evaluation framework that incorporates both factuality and faithfulness, tailored specifically to the challenges of Arabic GQA and summarization. Furthermore, we evaluated the performance of reasoning-based LLMs using the TruthfulQA benchmark with parallel Arabic and English questions and gold answers. Our findings reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Arabic models consistently produced fewer hallucinations compared to their multilingual counterparts. Future work will focus on expanding the evaluation to include additional open-source models and a broader range of NLG tasks with larger, more diverse datasets, including culturally grounded questions, to further validate and generalize these findings. Moreover, the provided annotations can serve as a valuable resource for future research, as they may be directly used to fine-tune or train hallucination detection models.

5 Limitations↩︎

Despite presenting the first comprehensive hallucination evaluation across Arabic and multilingual LLMs, our study has some limitations. First, the evaluation was conducted on a relatively small dataset, which may constrain the statistical power and generalizability of the results; in addition, the reasoning-based models should ideally be compared on the same evaluation set used for the other models. Second, our analysis does not cover the full landscape of NLG tasks and diverse benchmarks. Third, our hallucination annotations rely on manual labeling, which, despite following structured guidelines, remains subject to human interpretation and inconsistency. Finally, our evaluation was limited to computationally feasible models with sizes not exceeding 13B parameters, which prevents us from observing performance trends for larger models.

Acknowledgment↩︎

We would like to sincerely thank Malak Alkhorasani and Reem Aljunaid for their valuable help in the annotation process. Their contributions were essential in ensuring the quality and reliability of our study.

6 Annotation Guidelines↩︎

Annotations were carried out by three native Arabic speakers with backgrounds in NLP and linguistic analysis. Before annotation, they underwent a training session to ensure a consistent understanding of the categories.

We developed annotation guidelines by listing the hallucination factors with their definitions and examples. The examples were written by GPT-4o and revised by the authors to ensure clarity. Moreover, we provided the annotators with counterexamples to clarify what is and is not considered hallucination, particularly since certain criteria, such as grammatical errors, may be ambiguous. We ensured that only grammatical errors that cause misunderstanding are counted as hallucinations, since our study does not aim to assess fluency.

We also conducted a pilot study to test and refine the guidelines. Based on the annotators’ feedback, definitions were adjusted for clarity, and borderline cases were clarified with additional counterexamples. Moreover, we revised the hallucination factors to better capture the nuanced forms of hallucination in each task. For the GQA task, we added a new criterion (Knowledge Source Conflict) to flag cases where the model’s output could not be confidently verified due to the presence of multiple conflicting sources, even if the answer appeared plausible. For the summarization task, we incorporated two additional indicators: a faithfulness rating scale ranging from 1 (completely unfaithful) to 5 (fully faithful), and a hallucination density score, calculated as the proportion of correct and incorrect facts in each summary; this ensures that the evaluation is fair across summaries of different lengths and levels of detail (a formula sketch follows the scale below). Figure [fig:example] shows the guidelines given to the annotators after refinement. The points below explain the 5-point scale used.

  1. Completely Unfaithful: Major hallucinations or contradictions; summary is misleading or factually incorrect.

  2. Mostly Unfaithful: Many incorrect or missing facts; key details are distorted or omitted.

  3. Partially Faithful: Contains some correct information, but with notable omissions or distortions that affect meaning.

  4. Mostly Faithful: All major facts are correct; only minor inaccuracies or stylistic issues.

  5. Fully Faithful: Completely accurate and faithful to the source content; no factual errors or omissions.
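As a clarifying sketch of the density indicator described above, one natural reading is the share of incorrect (hallucinated) facts among all facts identified in a summary; this is our interpretation of the guideline wording rather than a verbatim definition:

\[
\text{Density} = \frac{n_{\text{incorrect}}}{n_{\text{correct}} + n_{\text{incorrect}}}
\]

where \(n_{\text{correct}}\) and \(n_{\text{incorrect}}\) denote the numbers of correct and incorrect facts in the summary. Under this reading, the score is normalized by the number of facts, so longer and more detailed summaries are not penalized merely for containing more content.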

7 Annotation Platform↩︎

To facilitate the annotation process, we developed an annotation platform using Gradio. It presents the instance number, the model name, the input (source text or question), and the gold reference (summary or answer). The platform enables annotators to label multiple types of hallucinations through a structured and interactive interface, and the annotations are saved in centralized CSV files with predefined column names to ensure consistency. Figure 4 illustrates the annotation platform used for the summarization task.
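For illustration, the snippet below sketches a minimal Gradio interface in the spirit of the platform described above; the field names, hallucination labels, CSV path, and layout are illustrative assumptions rather than the exact configuration we used.

```python
# A minimal sketch of a Gradio annotation interface; labels and paths are placeholders.
import csv
import gradio as gr

CSV_PATH = "annotations.csv"  # hypothetical centralized output file
HALLUCINATION_TYPES = [        # example labels; the real platform uses the task-specific indicators
    "Named-entity error", "Value error", "Fabrication",
    "Context inconsistency", "Instruction inconsistency", "Code-switching",
]

def save_annotation(instance_id, model_name, labels, rating):
    """Append one annotation row to the shared CSV file with fixed column order."""
    with open(CSV_PATH, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([instance_id, model_name, ";".join(labels), rating])
    return "Saved."

with gr.Blocks() as demo:
    instance_id = gr.Textbox(label="Instance number")
    model_name = gr.Textbox(label="Model")
    source = gr.Textbox(label="Source text / question", lines=6)
    gold = gr.Textbox(label="Gold summary / answer", lines=3)
    labels = gr.CheckboxGroup(choices=HALLUCINATION_TYPES, label="Hallucination types")
    rating = gr.Slider(1, 5, step=1, label="Faithfulness rating (summarization only)")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Save annotation").click(
        save_annotation, inputs=[instance_id, model_name, labels, rating], outputs=status
    )

if __name__ == "__main__":
    demo.launch()
```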

Figure 4: The annotation platform

8 Annotation Examples↩︎

Figure 5 provides examples for each error type based on model-generated answers, whereas the summarization annotation results are provided in the Github link.

Figure 5: Examples of Hallucination annotation in GQA

9 TruthfulQA Translation↩︎

The initial translation was generated by GPT-4o. To ensure correctness, the authors went through the whole dataset and manually edited the translated text. Questions that could not be translated correctly were removed from the dataset. The final version contains 737 instances. Figure 6 shows a subset of the translated questions.

Figure 6: Examples of English questions and their Arabic translations

10 Experiments↩︎

10.1 Experimental Setup↩︎

In our experiments, we used the HuggingFace platform to download the non-reasoning-based models. For deployment, we leveraged the AutoModelForCausalLM and AutoTokenizer classes to load each model and generate outputs efficiently. For the reasoning-based models, we used two APIs: Together.ai to access the DeepSeek-r1 and QwQ models, and the official OpenAI platform for GPT-4o and GPT-o3. More details about the inference setup are presented in Appendix 11.
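A minimal sketch of this loading procedure is shown below; the model identifier is a placeholder for each evaluated checkpoint.

```python
# Sketch of loading an open model via Hugging Face, as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model-name-on-huggingface"  # placeholder for an Arabic or multilingual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision, as noted in Appendix 11
    device_map="auto",          # place layers on the available GPU automatically
)
```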

10.2 Prompt Selection↩︎

Our main focus in this study is to evaluate hallucination rather than to apply prompt engineering. Accordingly, we intentionally used simple, straightforward prompts to minimize prompt-induced variability. For summarization, we used a direct instruction asking the model to summarize the input text into a single sentence. Similarly, for GQA, we asked the model to respond concisely to the given question. Figure 7 illustrates the prompts.
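The templates below paraphrase these instructions for illustration only; the exact prompt wording used in our experiments is the one shown in Figure 7.

```python
# Illustrative prompt templates matching the descriptions above (paraphrases, not the actual prompts).
SUMMARIZATION_PROMPT = "Summarize the following text in a single sentence:\n\n{document}"
GQA_PROMPT = "Answer the following question concisely:\n\n{question}"

def build_prompt(task: str, text: str) -> str:
    """Fill the appropriate template for the given task ('summarization' or 'gqa')."""
    if task == "summarization":
        return SUMMARIZATION_PROMPT.format(document=text)
    return GQA_PROMPT.format(question=text)
```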

Figure 7: Prompts used for GQA and summarization

11 Inference Details↩︎

To ensure a fair comparison across all models, all outputs were produced using consistent decoding hyperparameters. We used greedy decoding with a temperature of 0.0, disabling top-k and top-p sampling to produce deterministic outputs. We set the maximum number of generated tokens to 128 for summarization and 64 for GQA. A repetition penalty of 1.2 was applied, and no beam search or sampling heuristics were used. After generation, the models’ outputs were post-processed to save only the generated response into a text file for annotation. All models were loaded using transformers with torch_dtype=torch.float16 and device_map="auto" to optimize for GPU (A100) execution in Google Colab Pro. These choices ensured consistent, reproducible, and efficient inference across the full evaluation pipeline. For the reasoning-based models, we followed the approach used by [38].
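The sketch below illustrates this decoding configuration with the transformers generate API, reusing the model and tokenizer loaded as in Appendix 10.1; the prompt and output path are placeholders.

```python
# Sketch of the decoding setup described above; `model` and `tokenizer` come from the loading sketch.
import torch

prompt = "..."  # a GQA or summarization prompt (see Figure 7)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=False,          # greedy decoding (equivalent to temperature 0.0; top-k/top-p disabled)
        max_new_tokens=64,        # 64 for GQA, 128 for summarization
        repetition_penalty=1.2,
        num_beams=1,              # no beam search
    )

# Keep only the newly generated tokens, then save the response for annotation.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
with open("response.txt", "w", encoding="utf-8") as f:
    f.write(response)
```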

12 Significance Tests↩︎

To assess whether differences in hallucination rates between models and language groups were statistically meaningful, we conducted a series of significance tests tailored to each task. For the summarization task, we used paired t-tests to compare hallucination density between Arabic and multilingual models. The t-test was chosen because hallucination density is a continuous variable, and preliminary inspection showed an approximately normal distribution within each group. In the GQA task, we assessed the factual hallucination tendencies of Arabic LLMs versus multilingual LLMs. Each model’s answer was annotated with binary labels (“Yes”/“No”) across nine hallucination types, and we computed a hallucination density score by averaging the number of hallucination types marked “Yes” for each response. We then applied the Mann-Whitney U test to compare the hallucination density distributions between the two groups. This non-parametric test was selected due to the binary nature of the annotations and the non-normal distribution of the resulting density scores, allowing us to determine whether the differences in hallucination behavior were statistically significant. For TruthfulQA, we conducted a paired t-test between the hallucination rates of Arabic and English outputs for the same questions. For each question, we computed the average hallucination rate across all models in Arabic and compared it to the corresponding English outputs. This setup allowed us to control for content variability by directly comparing paired outputs for the same input.
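For reference, the sketch below shows how these tests map onto SciPy calls; the arrays are toy placeholders standing in for the per-instance densities and rates produced by our annotation pipeline.

```python
# Sketch of the three significance tests described above, using SciPy; all arrays are toy data.
import numpy as np
from scipy import stats

# Summarization: paired t-test on per-summary hallucination densities of two models.
allam_density = np.array([0.1, 0.0, 0.2, 0.1, 0.0])
qwen_density  = np.array([0.3, 0.1, 0.4, 0.2, 0.1])
t_stat, p_val = stats.ttest_rel(allam_density, qwen_density)

# GQA: Mann-Whitney U test comparing per-answer densities of Arabic vs. multilingual models.
arabic_scores = np.array([0.0, 0.1, 0.1, 0.2, 0.0, 0.1])
multi_scores  = np.array([0.2, 0.1, 0.3, 0.2, 0.4, 0.1])
u_stat, p_gqa = stats.mannwhitneyu(arabic_scores, multi_scores, alternative="two-sided")

# TruthfulQA: paired t-test on per-question average hallucination rates (Arabic vs. English).
ar_rates = np.array([0.7, 0.5, 0.6, 0.4, 0.6])
en_rates = np.array([0.6, 0.5, 0.5, 0.4, 0.5])
t_tqa, p_tqa = stats.ttest_rel(ar_rates, en_rates)

print(f"Summarization: t={t_stat:.2f}, p={p_val:.3f}")
print(f"GQA: U={u_stat:.1f}, p={p_gqa:.3f}")
print(f"TruthfulQA: t={t_tqa:.2f}, p={p_tqa:.3f}")
```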

13 Ethical Considerations↩︎

This study evaluates hallucination behaviors in LLMs across Arabic and multilingual outputs using publicly available datasets and open-source models. No personal, sensitive, or private data was used. All hallucination annotations were performed manually using clearly defined guidelines; however, we acknowledge the inherent subjectivity of manual annotation. To reduce annotator bias, multiple hallucination types were defined explicitly, and consistency checks were conducted throughout the annotation process.

Models were executed on Google Colab under its Pro tier. Due to hardware limitations, we excluded very large models (e.g., >13B parameters), which may affect the generalizability of our findings to higher-capacity models. It is important to note that our analysis does not assess the harmfulness, bias, or cultural sensitivity of the hallucinated content. Finally, the findings are intended to inform safer model development, not to endorse or certify any specific model as hallucination-free or ethically robust.

References↩︎

[1]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, and 1 others. 2024. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3):1–45.
[2]
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.
[3]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and 1 others. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55.
[4]
M Saiful Bari, Yazeed Alnumay, Norah A Alzahrani, Nouf M Alotaibi, Hisham A Alyahya, Sultan AlRashed, Faisal A Mirza, Shaykhah Z Alsubaie, Hassan A Alahmed, Ghadah Alabduljabbar, and 1 others. 2024. Allam: Large language models for arabic and english. arXiv preprint arXiv:2407.15390.
[5]
Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, and 1 others. 2023. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149.
[6]
Ali Farghaly and Khaled Shaalan. 2009. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4):1–22.
[7]
Nizar Y Habash. 2010. Introduction to Arabic natural language processing. Morgan & Claypool Publishers.
[8]
Hamdy Mubarak, Hend Al-Khalifa, and Khaloud Suliman Alkhalefah. 2024. https://aclanthology.org/2024.lrec-main.705/. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8008–8015, Torino, Italia. ELRA and ICCL.
[9]
Samir Abdaljalil, Hasan Kurban, and Erchin Serpedin. 2025. https://arxiv.org/abs/2503.07833. Preprint, arXiv:2503.07833.
[10]
Serry Taiseer Sibaee, Abdullah I. Alharbi, Samar Ahmed, Omar Nacar, Lahouri Ghouti, and Anis Koubaa. 2024. https://aclanthology.org/2024.osact-1.17/. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 130–134, Torino, Italia. ELRA and ICCL.
[11]
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38.
[12]
Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.
[13]
Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017.
[14]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, and 1 others. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 9459–9474.
[15]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
[16]
Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
[17]
OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. https://arxiv.org/abs/2410.21276. Preprint, arXiv:2410.21276.
[18]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[19]
Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Firoj Alam, Enes Altinisik, Ehsannedin Asgari, Yazan Boshmaf, Sabri Boughorbel, Sanjay Chawla, Shammur Chowdhury, and 1 others. 2025. Fanar: An arabic-centric multimodal generative ai platform. arXiv preprint arXiv:2501.13944.
[20]
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
[21]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[22]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, and 1 others. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.
[23]
Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346.
[24]
Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3592–3603.
[25]
Philippe Laban, Tobias Schnabel, Paul N Bennett, and Marti A Hearst. 2022. Summac: Re-visiting nli-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.
[26]
Alexander Richard Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. Qafacteval: Improved qa-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601.
[27]
Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 39–53.
[28]
Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia Chilton, and Kathleen Mckeown. 2024. Storysumm: Evaluating faithfulness in story summarization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9988–10005.
[29]
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554.
[30]
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
[31]
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024. Finesure: Fine-grained summarization evaluation using llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 906–922.
[32]
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
[33]
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.
[34]
Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 14379–14391.
[35]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
[36]
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and 1 others. 2024. Freshllms: Refreshing large language models with search engine augmentation. In Findings of the Association for Computational Linguistics ACL 2024, pages 13697–13720.
[37]
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464.
[38]
Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman, and Saad Ezzini. 2025. Arareasoner: Evaluating reasoning-based llms for arabic nlp. arXiv preprint arXiv:2506.08768.