April 02, 2024
With the rising human-like precision of Large Language Models (LLMs) on numerous tasks, their use in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is growing interest in the community in using LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset covering 10 languages, containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (Metal). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we analyze the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges.
Recent Large Language Models (LLMs) like GPT-4 [1], GPT-3.5-Turbo [2], PaLM2 [3], Gemini-1.5 [4], Mistral [5], [6] etc. have shown impressive performance on a variety of standard NLP tasks across languages [7]–[12]. However, there are several challenges in fair and accurate assessment of these models, such as the contamination of existing datasets in LLM pre-training data, lack of multilingual datasets [13], lack of benchmarks that represent real-world usage of these models, lack of frameworks for consistent subjective evaluations, and budget and access issues for native speaker evaluation. Therefore, there is a growing need for frameworks and resources that address the above challenges and allow us to systematically evaluate LLMs across several dimensions and languages.
Further, evaluating the text generation capabilities of these models is even more challenging [14]–[16]. Natural Language Generation (NLG) capabilities are traditionally evaluated using automated metrics such as ROUGE [17] or BLEU [18] scores. These metrics have several known drawbacks. First, they rely on exact matches and over-emphasize length. Second, they do not account for subjective dimensions such as quality, coverage, and coherence [19]–[21]. Third, they are reference-based, i.e., they require a comparison baseline, which can be expensive to collect and can have a low correlation with human judgments. This has led to work on reference-free and subjective evaluation [22]–[25].
Using LLMs as evaluators presents several challenges. Recent works [14], [26], [27] have shown that while LLMs can produce evaluations with human-like accuracy, these evaluations are often inconsistent and can easily be influenced. LLMs also show position bias or scale region bias and are unable to distinguish between candidates that are close to each other [28]. LLMs are sensitive to instructions and their capabilities vary for different metrics [27], [29], [30]. Another significant challenge when using LLMs as evaluators is a limited assessment of their abilities in multilingual settings. Studies have shown that LLMs have inferior performance even on some high-resource languages and cannot be assessed extensively on low-resource languages due to a lack of benchmarks [7]. Therefore, it is still unclear if LLMs can replace human evaluations in multilingual settings.
In this paper, we introduce the Metal framework for a robust assessment of LLMs as evaluators in multilingual scenarios. Figure 1 shows an outline of our framework. The Metal framework is an end-to-end pipeline that starts with creating a rich meta-evaluation dataset containing a variety of samples across the metrics of interest. We do this by systematically prompting GPT-4 to generate a wide range of sample data points, which are then evaluated by native speakers. In the next step, we compare LLM judgments with human judgments. For this, we draw on our previous work [31] to prompt LLMs for evaluations and subsequently compare the scores with human judgments. In particular, for the task of summarization, we create a dataset of 1000 summaries covering 10 languages, with human ratings across 5 metrics.1 Next, we obtain LLM evaluations from GPT-3.5-Turbo, GPT-4, and PaLM2 across these 1000 summaries and 5 metrics.
Our findings show that the GPT-3.5-Turbo-based evaluator does not perform well across languages and metrics, while evaluators based on GPT-4 and PaLM2 perform better. We find that the evaluation ability of LLMs varies significantly across languages, motivating the creation of a meticulously crafted meta-evaluation dataset covering all target languages before using LLM-based evaluators. Lastly, our qualitative analysis shows that while GPT-4 and PaLM2 can achieve accuracy close to humans, the reasoning behind their evaluations is often flawed. While we study the applicability of the Metal framework for the task of summarization, it is extensible to other tasks as well, by creating meta-evaluation datasets for those tasks.
Studies by [32]–[34] used Likert scales to assess various dimensions of generated summaries. [35]–[37] perform side-by-side comparisons of summaries produced by different models, using systems such as Elo to rank the models by performance.
Human-verified gold-standard datasets are crucial for evaluating LLMs. [29] release riSum, an English-centric dataset of document-instruction-output triplets, where LLMs generate the instructions and outputs and human evaluation is used to score the triplets, with a focus on “instruction-following”. In our work, we manually curate the instructions and create a dataset covering 10 languages. Seahorse [38] is a multilingual and multifaceted dataset for summarization with 96K summaries and metrics related to grammar and output quality. They fine-tune T5 [39], [40] and PaLM [41] models on the train split of the dataset to generate a spectrum of outputs, whereas we work with black-box models and tune our prompts to generate “good” and “bad” summaries.
Several previous studies have analyzed and evaluated LLMs on new tasks and standard benchmarks [7], [8]. [16] prompt ChatGPT for summarization and other higher-level tasks across various metrics and find a high correlation with human scores, with the caveat that it may be influenced by the way the meta-evaluation datasets are created. We extend this idea in the Metal dataset. Other studies [42]–[45] have also put forth GPT scoring frameworks and prompt-based evaluators; however, they are mostly confined to English or Latin-script languages. [46] employ a Multi-Elo Rating System and advise a similar multi-dimensional assessment of LLM-generated summaries. Previous studies [31], [47], [48] have relied solely on GPT-4 as an LLM judge/evaluator. [49] propose a fine-tuned LLM, comparable to GPT-4, for input-evaluation rubric-output triplets; however, it is fine-tuned only for English. Previously, we performed a multilingual study of LLM meta-evaluation using an internal dataset of human judgments [31]. The dataset used in that work was not curated for the specific purpose of LLM-evaluator calibration and hence suffers from weaknesses such as dataset skew; in this work we focus on building a better dataset for meta-evaluation. We use the prompting strategies from our prior work [31] to evaluate our newly curated dataset. To the best of our knowledge, no other study has proposed an end-to-end pipeline from generating a rich evaluation set to assessing the performance of LLMs as evaluators in the multilingual scenario.
The Metal dataset contains a total of 1000 summaries across 10 languages. The dataset is specially curated to investigate the capabilities of LLMs as evaluators in different languages along 5 dimensions. In this section, we describe how the dataset was created and annotated.
The dataset consists of 100 summaries each, for 10 languages: English (En), French (Fr), Chinese Simplified (Zh), Hindi (Hi), Arabic (Ar), Bengali (Bn), Russian (Ru), Turkish (Tr), Japanese (Ja), and Swahili (Sw). We selected the languages to cover a diverse range of scripts and regions. The main text for each summary in our dataset was chosen from XL-Sum [25], and the corresponding summary was generated by prompting GPT-4. A brief overview of our methodology is shown in Figure 1.
For each of the 10 languages, we create a histogram with 20 bins over the number of tokens in the main text of all datapoints in the test set of the XL-Sum dataset [25]. We then choose 100 random samples from the bin with the highest frequency, as sketched below.
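A minimal sketch of this length-based sampling, assuming the XL-Sum test split is available as a list of records with a `"text"` field (the field name and the specific tiktoken encoding are our assumptions; the paper only specifies 20 bins and 100 samples per language):

```python
import random

import numpy as np
import tiktoken


def sample_by_length(passages, n_samples=100, n_bins=20, seed=0):
    """Pick n_samples passages from the most populated token-length bin."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
    lengths = np.array([len(enc.encode(p["text"])) for p in passages])

    # 20-bin histogram over passage token counts; locate the most frequent bin.
    counts, edges = np.histogram(lengths, bins=n_bins)
    top = int(np.argmax(counts))
    in_bin = [p for p, length in zip(passages, lengths)
              if edges[top] <= length <= edges[top + 1]]

    random.seed(seed)
    return random.sample(in_bin, min(n_samples, len(in_bin)))
```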
To investigate the capabilities of LLMs as evaluators, our objective was to create an evaluation set of summaries with varying quality. To this end, for each of the chosen 1000 samples from the above step, we generate two summaries by prompting GPT-4 as follows.
To generate good-quality summaries we provide the main text to GPT-4 and prompt it to return a summary of the main text such that it captures the essence of the main passage. We specifically ask for a summary that is highly rated on the 5 metrics of interest, described in the next section. We keep the temperature at 0 for the generation of good-quality summaries.
To generate bad-quality summaries we provide the main text to GPT-4 and prompt it to act as an adversarial NLP assistant and summarize the main passage badly. We specifically ask for a summary that is rated low on the 5 metrics of interest. For bad-quality summaries we keep the temperature at 1. In our initial experiments with lower temperatures, we observed that GPT-4 does not produce bad summaries even when specifically prompted to do so.
To further ensure the quality of summaries, in both styles of prompting we ask GPT-4 to also justify why the generated summary is good or bad. Once we have 2 summaries per data point, we choose either the good-quality or the bad-quality summary at random. The verbatim prompts are provided in §10.1.
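The following is a hedged sketch of this two-sided generation step, written against the OpenAI Python SDK; the system prompts below are illustrative paraphrases, not the verbatim prompts from §10.1, and the model/deployment name is an assumption:

```python
import random

from openai import OpenAI

client = OpenAI()  # assumes API credentials are configured via environment variables

GOOD_SYS = ("You are an expert summarizer. Write a summary that captures the essence "
            "of the passage and would be rated highly on linguistic acceptability, "
            "content quality, task quality, hallucinations, and problematic content. "
            "Also justify why the summary is good.")
BAD_SYS = ("You are an adversarial NLP assistant. Badly summarize the passage so that "
           "it would be rated low on the five metrics above. Also justify why the "
           "summary is bad.")


def generate_summary(passage: str, good: bool) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",                     # deployment name is an assumption
        temperature=0.0 if good else 1.0,  # 0 for good, 1 for bad summaries (§3.1)
        messages=[{"role": "system", "content": GOOD_SYS if good else BAD_SYS},
                  {"role": "user", "content": passage}],
    )
    return resp.choices[0].message.content


def build_datapoint(passage: str) -> dict:
    # Generate one variant chosen at random, mirroring the dataset construction.
    label = random.choice(["good", "bad"])
    return {"passage": passage,
            "summary": generate_summary(passage, good=(label == "good")),
            "intended_quality": label}
```

Keeping the temperature at 0 for good summaries and at 1 for bad ones mirrors the observation above that GPT-4 rarely produces poor summaries at low temperature.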
For the 1000 summaries selected from the above process, we have each sample annotated by 3 annotators for 5 different metrics. We use the metrics described by [31] in their work:
Linguistic Acceptability (LA): This metric assesses whether the summary is acceptable to a native speaker. Specifically, the annotators are asked to determine whether the text exhibits signs of being translated, misuses words, or includes expressions that are not idiomatic in their language.
Output Content Quality (OCQ): This metric assesses whether the general quality of the output text is good. The annotators are asked to consider flaws such as significant repetition, non-native language elements, or indications that the text has been web-scraped.
Task Quality (TQ): This metric assesses the effectiveness of the summarization. It focuses on assessing the degree to which the summary aligns with key information in the main passage.
Problematic Content (PC): This metric assesses the summary for the presence of any content that may be deemed offensive, inappropriate, or harmful. It serves as a filter against outputs that might perpetuate harmful stereotypes or misinformation.
Hallucinations (H): This metric assesses whether the summary remains anchored to, and consistent with, the main passage. It serves as a check against unwarranted deviations from the ground truth provided in the input.
For LA, OCQ, and TQ, annotators were asked to assign one of the three possible classes: Bad (0), Medium (1), Good (2). For PC and H, annotators were asked to assign one of the two possible classes: Present (1) and Absent (0).2
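For reference, the label schema can be captured in a small lookup table (a convenience structure for downstream analysis, not code from the paper):

```python
# Label schema used for annotation: three classes for LA/OCQ/TQ, two for PC/H.
METRIC_CLASSES = {
    "LA":  {0: "Bad", 1: "Medium", 2: "Good"},   # Linguistic Acceptability
    "OCQ": {0: "Bad", 1: "Medium", 2: "Good"},   # Output Content Quality
    "TQ":  {0: "Bad", 1: "Medium", 2: "Good"},   # Task Quality
    "PC":  {0: "Absent", 1: "Present"},          # Problematic Content
    "H":   {0: "Absent", 1: "Present"},          # Hallucinations
}
```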
Each datapoint was annotated by three annotators for the five metrics. Annotators were native speakers of the respective language and trained professionals contracted through an external annotator services company. The pay was adjusted based on the annotator’s region and experience. Since we wanted to ensure we had a strong evaluation set to study the capabilities of LLMs as evaluators, special attention was given to the quality of annotations. The annotators were specifically trained to perform annotations for this task and a sample of annotations was reviewed for all annotators. Annotations were reviewed for accuracy and guideline consistency. Based on the review, feedback was provided to the annotators, and ambiguous cases were re-annotated.
Table 6 in Appendix §10.3 shows the Fleiss’ Kappa (\(\kappa\)) and pairwise agreement (computed as F1) values among the annotators for the various languages and metrics. All our \(\kappa\) values are \(> 0.6\) (except for H in En, \(\kappa = 0.54\)), and all F1 values are \(> 0.75\), indicating substantial agreement. Some \(\kappa\) values are 0 due to class skew, but high F1 in these cases indicates high reliability. For our experiments, we take the majority vote from the three human annotations per sample as the aggregate class for that sample. In the case of 3 distinct annotations, we take the average value. Figure 2 shows the distribution of the aggregate annotations over various languages and metrics.
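A minimal sketch of this aggregation rule (majority vote over the three annotations, with the average as a fallback when all three labels differ):

```python
from collections import Counter


def aggregate(labels: list[int]) -> float:
    """Aggregate three annotator labels: majority vote, else average."""
    assert len(labels) == 3
    value, count = Counter(labels).most_common(1)[0]
    if count >= 2:              # at least two annotators agree
        return value
    return sum(labels) / 3      # three distinct labels: take the average
```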
As discussed in §3.1, we sample the datapoints from XL-Sum based on the number of tokens in the passage. Specifically, the tiktoken3 tokenizer was used, and the token-length distribution of the passages and summaries is presented in Table 1, along with the number of good and bad instances per language. Table 2 presents the distribution of the classes (0, 1, and 2) over the good and bad summaries, reported as good / bad counts per cell. Notably, the first row shows higher counts of low scores (class 0) for the bad summaries relative to the good ones. The medium scores (class 1) are also more frequent in the bad summaries, although the difference is smaller than for class 0. Surprisingly, for Linguistic Acceptability the bad summaries receive more class 2 scores than the good ones (third row), suggesting that GPT-4 struggles to generate linguistically unacceptable text even under adversarial prompting.
Lang | Passage tokens (mean \(\pm\) std) | Summary tokens (mean \(\pm\) std) | Good | Bad
---|---|---|---|---
AR | 877.39 \(\pm\) 53.00 | 160.70 \(\pm\) 87.29 | 50 | 50
BN | 4161.58 \(\pm\) 534.91 | 339.83 \(\pm\) 160.55 | 53 | 47
EN | 358.29 \(\pm\) 21.09 | 67.71 \(\pm\) 29.57 | 46 | 54
FR | 341.96 \(\pm\) 26.89 | 84.79 \(\pm\) 39.27 | 51 | 49
HI | 1234.82 \(\pm\) 70.28 | 219.08 \(\pm\) 92.38 | 48 | 52
JA | 1327.44 \(\pm\) 61.50 | 136.44 \(\pm\) 81.11 | 52 | 48
RU | 748.26 \(\pm\) 47.52 | 139.09 \(\pm\) 72.28 | 43 | 57
SW | 518.70 \(\pm\) 35.90 | 127.98 \(\pm\) 73.79 | 47 | 53
TR | 625.77 \(\pm\) 40.96 | 136.44 \(\pm\) 68.76 | 42 | 58
ZH | 666.03 \(\pm\) 47.78 | 124.16 \(\pm\) 67.80 | 48 | 52
Class | LA | OCQ | TQ | H | PC |
---|---|---|---|---|---|
0 | 54 / 80 | 78 / 112 | 124 / 202 | 352 / 362 | 457 / 493 |
1 | 113 / 116 | 104 / 121 | 91 / 93 | 128 / 158 | 23 / 27 |
2 | 313 / 324 | 298 / 287 | 265 / 225 | - / - | - / - |
GPT-4-32K [1], GPT-3.5-Turbo [2], and PaLM2 Text-Bison [3] were used as evaluators to score the LLM-generated summaries according to the given metrics.4
The models were prompted using the LangChain5 framework and a structured JSON output format was maintained to parse the generations efficiently. The prompts for evaluation follow the same verbatim as [31].
Based on our previous work [31], we use the simple and detailed prompting strategies for all models, and each metric is evaluated independently in a single call to the API. All prompts were provided in English, as [7] have shown that multilingual instructions lead to worse performance. Further, the temperature is set to \(0\) for reproducibility.
Simple instructions: a rudimentary description of the metric and scoring schema is provided, as shown in Figure 7 in Appendix §10.4.
Detailed instructions: an informative and thorough description of the metric and a case-by-case breakdown of the scoring schema is provided, as shown in Figure 8 in Appendix §10.4.
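Putting the pieces together, a single evaluation call looks roughly as follows. This sketch uses the OpenAI SDK directly rather than the LangChain wrappers used in our setup, and the prompt text is illustrative rather than the verbatim template from [31]:

```python
import json

from openai import OpenAI

client = OpenAI()

EVAL_TEMPLATE = """You are an evaluator for multilingual summarization.
Metric: {metric_description}
Score the summary on a scale of {scale} and reply ONLY with JSON:
{{"score": <int>, "justification": "<one sentence>"}}

Passage:
{passage}

Summary:
{summary}"""


def evaluate(passage: str, summary: str, metric_description: str, scale: str,
             model: str = "gpt-4-32k") -> dict:   # deployment name is an assumption
    prompt = EVAL_TEMPLATE.format(metric_description=metric_description,
                                  scale=scale, passage=passage, summary=summary)
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring, as described above
        messages=[{"role": "user", "content": prompt}],
    )
    # A structured JSON output format is maintained so the generation parses directly.
    return json.loads(resp.choices[0].message.content)
```

The `metric_description` carries either the simple or the detailed wording, while the rest of the call stays identical across the two strategies.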
As described in §3.2 we use the aggregate of the three annotations for our experiments.
We measure the pairwise agreement between the LLM evaluators and human aggregate scores per language and metric. To account for any class imbalance, we report the weighted F1 score instead of accuracy.
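Concretely, these agreement numbers can be computed as below; scikit-learn is assumed for the weighted F1, and the aggregate labels are assumed to be integer classes (the 13 samples with an averaged aggregate would need rounding or exclusion). The same helper also gives the averaged pairwise annotator F1 used for the human rows reported later:

```python
from itertools import combinations

from sklearn.metrics import f1_score


def model_vs_human_f1(model_scores, human_aggregate):
    """Weighted F1 between an LLM evaluator and the human aggregate labels."""
    return f1_score(human_aggregate, model_scores, average="weighted")


def human_pairwise_f1(a1, a2, a3):
    """Average weighted F1 over the annotator pairs A1-A2, A2-A3, A3-A1."""
    pairs = combinations([a1, a2, a3], 2)
    return sum(f1_score(x, y, average="weighted") for x, y in pairs) / 3
```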
We analyze the class distribution of the human aggregate scores and the various model predictions for three cases: when all three annotators agree, when two of the three annotators agree, and when no annotators agree. We do this analysis only for metrics with 3 possible classes: LA, OCQ, and TQ.
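A sketch of this breakdown, assuming each sample carries its three raw annotations plus the aggregate and model scores (the dictionary keys are hypothetical):

```python
from collections import Counter


def agreement_bucket(labels):
    """Bucket a sample by how many of its three annotators agree."""
    top_count = Counter(labels).most_common(1)[0][1]
    return {3: "all_agree", 2: "two_agree", 1: "none_agree"}[top_count]


def class_distribution(samples, score_key):
    """Per-bucket class counts for one scorer (e.g. "human" or "gpt4_detailed")."""
    dist = {}
    for s in samples:
        bucket = agreement_bucket(s["annotations"])
        dist.setdefault(bucket, Counter())[s[score_key]] += 1
    return dist
```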
Seahorse [38] is a dataset akin to Metal, as described in §2. It contains summaries generated using several models for passages from popular summarization datasets such as XL-Sum [25], XSum [50], MLSum [51], and WikiLingua [52]. We use the XL-Sum subset of Seahorse and identify the datapoints it shares with Metal. There are a total of 27 overlapping data points: 1 in English, 10 in Russian, and 16 in Turkish. Each of these datapoints can have one or more summaries in Seahorse, generated by mt5_small (the 300M version of mT5 [40]), mt5_small_250 (the same mt5_small model but using the checkpoint after 250 training steps), mt5_xxl (the 13B mT5 model), palm_1shot (the 540B PaLM model [41] prompted with one in-domain example), and palm_finetuned (the 540B PaLM model finetuned on the training data of the respective dataset).
We use our detailed prompting strategy to evaluate the summaries generated by the various models in Seahorse on our metrics, and compare them with the evaluation of the summaries generated by GPT-4 for the same main passages in Metal. We use PaLM2 and GPT-4 as evaluators.6
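A rough sketch of how the overlap can be recovered: matching on normalized passage text is our simplification (shared XL-Sum identifiers, where available, are a more robust key), and the `"article"`/`"text"` field names are hypothetical:

```python
def normalize(text: str) -> str:
    """Collapse whitespace and case so that identical passages compare equal."""
    return " ".join(text.split()).strip().lower()


def find_overlap(metal_rows, seahorse_rows):
    """Return (METAL row, matching Seahorse rows) pairs that share a passage."""
    seahorse_by_passage = {}
    for row in seahorse_rows:
        seahorse_by_passage.setdefault(normalize(row["article"]), []).append(row)

    overlap = []
    for row in metal_rows:
        matches = seahorse_by_passage.get(normalize(row["text"]), [])
        if matches:
            overlap.append((row, matches))  # one passage, one or more Seahorse summaries
    return overlap
```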
Metric | Prompting Strategy | Model | AR | BN | EN | FR | HI | JA | RU | SW | TR | ZH
---|---|---|---|---|---|---|---|---|---|---|---|---
LA | – | human | 0.89 | 0.81 | 0.86 | 0.99 | 0.87 | 0.95 | 0.98 | 0.82 | 0.97 | 0.94
LA | Simple | GPT-3.5-Turbo | 0.54 | 0.43 | 0.61 | 0.61 | 0.44 | 0.45 | 0.67 | 0.78 | 0.55 | 0.59
LA | Simple | GPT-4 | 0.74 | 0.15 | 0.72 | 0.88 | 0.59 | 0.48 | 0.74 | 0.61 | 0.72 | 0.85
LA | Simple | PaLM2 | 0.74 | 0.11 | 0.54 | 0.73 | 0.64 | 0.38 | 0.77 | 0.82 | 0.69 | 0.84
LA | Detailed | GPT-3.5-Turbo | 0.19 | 0.44 | 0.59 | 0.40 | 0.18 | 0.19 | 0.53 | 0.57 | 0.15 | 0.19
LA | Detailed | GPT-4 | 0.71 | 0.22 | 0.82 | 0.81 | 0.61 | 0.47 | 0.80 | 0.76 | 0.72 | 0.85
LA | Detailed | PaLM2 | 0.71 | 0.21 | 0.54 | 0.75 | 0.59 | 0.34 | 0.78 | 0.88 | 0.64 | 0.84
OCQ | – | human | 0.85 | 0.82 | 0.82 | 0.97 | 0.83 | 0.93 | 0.93 | 0.84 | 0.84 | 0.91
OCQ | Simple | GPT-3.5-Turbo | 0.11 | 0.39 | 0.65 | 0.47 | 0.17 | 0.21 | 0.64 | 0.61 | 0.52 | 0.33
OCQ | Simple | GPT-4 | 0.71 | 0.27 | 0.69 | 0.70 | 0.65 | 0.47 | 0.94 | 0.85 | 0.69 | 0.88
OCQ | Simple | PaLM2 | 0.69 | 0.23 | 0.63 | 0.68 | 0.58 | 0.43 | 0.92 | 0.91 | 0.67 | 0.79
OCQ | Detailed | GPT-3.5-Turbo | 0.23 | 0.54 | 0.59 | 0.50 | 0.31 | 0.33 | 0.64 | 0.58 | 0.50 | 0.44
OCQ | Detailed | GPT-4 | 0.69 | 0.26 | 0.68 | 0.72 | 0.65 | 0.51 | 0.92 | 0.88 | 0.68 | 0.84
OCQ | Detailed | PaLM2 | 0.68 | 0.29 | 0.57 | 0.65 | 0.66 | 0.41 | 0.92 | 0.91 | 0.69 | 0.86
TQ | – | human | 0.77 | 0.78 | 0.77 | 0.90 | 0.78 | 0.99 | 0.94 | 0.84 | 0.87 | 0.82
TQ | Simple | GPT-3.5-Turbo | 0.63 | 0.53 | 0.52 | 0.84 | 0.58 | 0.81 | 0.83 | 0.82 | 0.65 | 0.77
TQ | Simple | GPT-4 | 0.60 | 0.64 | 0.53 | 0.81 | 0.56 | 0.87 | 0.95 | 0.87 | 0.60 | 0.78
TQ | Simple | PaLM2 | 0.56 | 0.67 | 0.41 | 0.83 | 0.56 | 0.85 | 0.90 | 0.88 | 0.59 | 0.79
TQ | Detailed | GPT-3.5-Turbo | 0.26 | 0.49 | 0.54 | 0.76 | 0.22 | 0.44 | 0.63 | 0.63 | 0.58 | 0.31
TQ | Detailed | GPT-4 | 0.71 | 0.64 | 0.59 | 0.86 | 0.66 | 0.86 | 0.96 | 0.87 | 0.63 | 0.76
TQ | Detailed | PaLM2 | 0.58 | 0.66 | 0.38 | 0.83 | 0.51 | 0.84 | 0.94 | 0.90 | 0.65 | 0.73
H | – | human | 0.89 | 0.97 | 0.85 | 0.97 | 0.90 | 0.99 | 0.99 | 0.93 | 0.84 | 1.00
H | Simple | GPT-3.5-Turbo | 0.54 | 0.27 | 0.81 | 0.75 | 0.36 | 0.63 | 0.72 | 0.66 | 0.57 | 0.59
H | Simple | GPT-4 | 0.93 | 0.74 | 0.85 | 0.91 | 0.89 | 0.94 | 0.93 | 0.90 | 0.87 | 0.90
H | Simple | PaLM2 | 0.94 | 0.77 | 0.78 | 0.92 | 0.90 | 0.82 | 0.72 | 0.80 | 0.76 | 0.87
H | Detailed | GPT-3.5-Turbo | 0.06 | 0.01 | 0.42 | 0.58 | 0.09 | 0.36 | 0.50 | 0.37 | 0.22 | 0.19
H | Detailed | GPT-4 | 0.95 | 0.72 | 0.85 | 0.90 | 0.88 | 0.96 | 0.94 | 0.89 | 0.86 | 0.88
H | Detailed | PaLM2 | 0.91 | 0.73 | 0.76 | 0.90 | 0.86 | 0.94 | 0.87 | 0.87 | 0.86 | 0.91
PC | – | human | 0.93 | 1.00 | 1.00 | 1.00 | 0.94 | 0.99 | 0.99 | 0.86 | 1.00 | 1.00
PC | Simple | GPT-3.5-Turbo | 0.52 | 0.23 | 0.83 | 0.56 | 0.32 | 0.31 | 0.33 | 0.51 | 0.45 | 0.63
PC | Simple | GPT-4 | 0.90 | 0.99 | 1.00 | 0.95 | 0.85 | 1.00 | 0.97 | 0.73 | 1.00 | 0.97
PC | Simple | PaLM2 | 0.89 | 1.00 | 0.97 | 0.85 | 0.86 | 0.95 | 0.92 | 0.71 | 0.99 | 0.96
PC | Detailed | GPT-3.5-Turbo | 0.28 | 0.06 | 0.68 | 0.45 | 0.23 | 0.28 | 0.20 | 0.43 | 0.28 | 0.36
PC | Detailed | GPT-4 | 0.87 | 0.99 | 0.99 | 0.87 | 0.85 | 0.95 | 0.91 | 0.71 | 0.91 | 0.96
PC | Detailed | PaLM2 | 0.89 | 0.84 | 0.97 | 0.88 | 0.86 | 0.79 | 0.88 | 0.80 | 0.92 | 0.95
Table 3 and Figures 9 and 10 in Appendix §10.5 present the distribution of F1 scores of various models with the two prompting strategies on the 10 languages. For “human scores”, we average the pairwise F1 scores of all the annotators, i.e., A1-A2, A2-A3, and A3-A1. For the “model scores” in the plot, the F1 score between the annotator aggregate and model evaluation is computed.
For all the metrics, humans have the best agreement. In the case of LA, for most of the languages, GPT-4 with detailed instructions performs the closest to humans, followed by GPT-4 with simple instructions. GPT-3.5-Turbo performs the worst with detailed instructions, and making the instructions simple leads to a significant improvement, except for English, where no difference is found. For most of the languages, especially Zh, Hi, Ru, and Ar, GPT-4 and PaLM2 perform similarly.
For OCQ, GPT-3.5-Turbo performs the worst, and detailed instruction improves the performance marginally over simple instructions. GPT-4 and PaLM2 perform very closely to humans for Russian, however, there is a gap between the human and LLM scores on the rest of the languages. For TQ, both prompting strategies for GPT-4 and PaLM2 do equally well on most languages except Ar, Hi, Zh, and En. In these cases, GPT-4 with detailed instructions does the best.
For PC and H, all models show scores very similar to humans, except GPT-3.5-Turbo. Simple instructions for GPT-3.5-Turbo improve performance for both metrics, with a higher gain on H. Interestingly, for Bn on the LA and OCQ metrics, both prompting strategies for GPT-3.5-Turbo do better than GPT-4 and PaLM2. For Sw on LA, OCQ, and TQ, the agreement between humans and GPT-4 or PaLM2 is as good as the agreement among humans. GPT-3.5-Turbo with detailed instructions does worse than GPT-3.5-Turbo with simple instructions for all metrics except OCQ.
Overall, we find that the performance of GPT-4, and especially PaLM2, is largely independent of whether instructions are simple or detailed, across all languages. The same holds for GPT-3.5-Turbo only on English, suggesting that it is less sensitive to prompting in English. GPT-4 with detailed instructions comes closest to human evaluation, with marginal improvements over simple instructions in most cases. GPT-4 and PaLM2 are very effective in identifying hallucinations and problematic content for all languages.
Figure 3 shows the distribution of human aggregate score and various models for the three cases. In the case where all annotators agree, as shown in Figure 3 (a) we can see that the class distribution for GPT-4 and PaLM2 with both prompting variations is very close to the class distribution of human aggregate scores. This indicates that when humans have full agreement (perhaps due to easier samples), LLM-based evaluators also perform well.
In the case where two of three annotators agree, we can see in Figure 3 (b) that for both prompting variations GPT-4 and PaLM2 often over-predict class 2, under-predict class 1 and are similar to humans for class 0. Overall, detailed-GPT-4 comes closest to the distribution of human aggregate scores. For both cases, GPT-3.5-Turbo often prefers the middle class, which can be indicative of scale region bias.
In our third case, owing to the high quality of our annotations, there are only 13 out of 3000 samples where no annotators agree. Figure 3 (c) shows the class distribution for this case. We can observe that GPT-3.5-Turbo with simple instructions assigns the different classes with almost equal frequency. GPT-4 with detailed instructions often outputs the middle class, which is also the annotator aggregate. PaLM2 with detailed instructions outputs the highest or the lowest score, and interestingly opts for the middle class very few times.
From this analysis, we conclude that while simple or detailed instructions for both GPT-4 and PaLM2 perform equally well when all human annotators agree, detailed instructions for GPT-4 do best when there is disagreement amongst annotators.
The first five systems are summaries from Seahorse, "reference" is the XL-Sum reference summary, and GPT-4_good / GPT-4_bad are the Metal summaries; each is scored by both the PaLM2 and GPT-4 evaluators.

Lang | Metric | mt5_small_250 (PaLM2) | mt5_small_250 (GPT-4) | mt5_xxl (PaLM2) | mt5_xxl (GPT-4) | mt5_small (PaLM2) | mt5_small (GPT-4) | palm_1shot (PaLM2) | palm_1shot (GPT-4) | palm_finetuned (PaLM2) | palm_finetuned (GPT-4) | reference (PaLM2) | reference (GPT-4) | GPT-4_good (PaLM2) | GPT-4_good (GPT-4) | GPT-4_bad (PaLM2) | GPT-4_bad (GPT-4)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
RU | H | 0.60 | 0.40 | 0.00 | 0.20 | 0.43 | 0.43 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.75 | 0.75
RU | LA | 0.60 | 1.00 | 1.40 | 2.00 | 0.57 | 2.00 | 1.60 | 2.00 | 1.25 | 2.00 | 1.33 | 2.00 | 0.00 | 1.00 | 0.62 | 0.87
RU | OCQ | 0.60 | 0.60 | 1.40 | 1.20 | 0.57 | 0.86 | 1.40 | 1.60 | 1.00 | 1.25 | 1.00 | 1.67 | 0.00 | 0.00 | 0.50 | 0.50
RU | TQ | 0.60 | 0.40 | 1.00 | 1.20 | 0.71 | 0.71 | 1.60 | 1.20 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.50 | 0.50
RU | PC | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.25 | 0.25
TR | H | 0.55 | 0.75 | 0.10 | 0.10 | 0.50 | 0.50 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.28 | 0.55 | 0.55
TR | LA | 0.11 | 1.12 | 1.40 | 2.00 | 0.75 | 1.62 | 1.50 | 2.00 | 1.20 | 2.00 | 1.33 | 2.00 | 1.57 | 1.71 | 0.78 | 1.55
TR | OCQ | 0.11 | 0.62 | 1.20 | 1.50 | 0.62 | 1.00 | 1.37 | 1.62 | 1.00 | 1.60 | 1.33 | 1.67 | 1.57 | 1.57 | 0.78 | 1.00
TR | TQ | 0.11 | 0.50 | 1.20 | 1.30 | 0.50 | 0.75 | 1.37 | 1.25 | 0.80 | 1.20 | 1.17 | 1.33 | 1.57 | 1.57 | 0.78 | 1.11
TR | PC | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.22
Table 4 shows the evaluation values for the overlapping datapoints between Seahorse and Metal for RU and TR, averaged over all datapoints. For Russian, we can observe that palm_1shot does the best, as it is rated highly across all metrics by both evaluator models. Interestingly, palm_1shot is rated better than palm_finetuned. The GPT-4\(_{good}\) summaries received very bad evaluations across all metrics from both models. On further investigation, we found that the GPT-4\(_{good}\) category had only 2 datapoints out of the 10 overlapping Russian datapoints, and coincidentally these 2 generated summaries were of bad quality and were rated poorly by human annotators as well. This indicates that GPT-4 might not always be biased towards its own generations compared to generations from other models. We can also observe that in almost all cases GPT-4 provides a higher rating than PaLM2. For Turkish, GPT-4\(_{good}\) receives the highest ratings from both models. Similar to Russian, palm_1shot receives better ratings than palm_finetuned. As expected, mt5_small_250 receives the lowest ratings since it is an under-trained model. We can also notice a clear difference in ratings between GPT-4\(_{good}\) and GPT-4\(_{bad}\) generations. Overall, from this experiment we conclude that our metrics and prompting methods can be used to compare generations from different models.
No. | Language | Metric | Human Score and Comments | GPT-4 Score and Justification | PaLM2 Score and Justification |
---|---|---|---|---|---|
1 | English | TQ | ‘A1’: 1, ‘Some grammatical errors present. Some misaligned elements and omission of important information.’, ‘A2’: 2, ‘High quality summary, except missing verbs so medium linguistic acceptability’, ‘A3’: 0, ‘Summary contains grammatical errors, has poor discourse and is incoherent.’ | 2, The summary adheres to the instructions and captures the main points of the passage. Therefore, the summary demonstrates a precise understanding and adherence to the prompt’s instructions. | 2, The summary captures the main points of the passage. It correctly identifies the key players in the dispute, the reasons for the dispute, and the potential consequences of the dispute. The summary is concise and well-written. |
2 | Arabic | LA | ‘A1’: 1, ‘Text seems to be copied. Repetition noticed. Summary is too short more like a headline.’, ‘A2’: 2, ‘it is missing a lot of content from the main passage’, ‘A3’: 1, ‘The summary is very brief and does not meet all the points mentioned in the main passage.’ | 2, The vocabulary used is contextually relevant and the sentences reflect the natural rhythm, emphasis, and intonation of spoken Arabic. it does not violate any linguistic norms or rules, and thus scores high on linguistic acceptability. | 2, The summary is grammatically correct and uses contextually relevant vocabulary. However, it is overly verbose and lacks natural rhythm and intonation. |
3 | French | OCQ | ‘A1’: 2, ‘Le résumé, bien que mal conçu, affirme qu’une femme accuse Bill Cosby d’agression sexuelle à cause de l’argent, ce qui n’est pas vrai selon le passage original. Il y a des hallucinations dans le résumé et la qualité de la tâche est mauvaise.", ’A2’: 2, ‘Le résumé fourni contient des informations précises sur l’affaire Bill Cosby, telles que les allégations contre lui et le règlement civil avec l’un des accusateurs.’, ‘A3’: 1, ‘Insted of ’veut’ it would have been better to use “ne voudrait” or “est accusée de ne vouloir que de l’argent". To me, it is a fake news.’ | 0, The summary provided does not accurately reflect the content of the passage. It also inaccurately portrays the accuser as someone who just wants money, without mentioning her testimony or the fact that Cosby settled the case. | 0, The summary does not capture the essence of the passage. It is not clear, coherent, or engaging. It does not convey the main points of the passage and it is not original. |
4 | Hindi | TQ | ‘A1’: 2, ‘The summary is counterfactual in relation to the input. The president condemned the incident and in summary it states that he praised the incident. This is a problematic content because it involves student violence, gun violence, teen violence, etc…. less or no grammatical errors makes the task accuracy good.’, ‘A2’: 2, ‘The summary does not justifies the actual sentence. The quality of the summary is very bad.’, ‘A3’: ‘2, no comments’ | 0, The summary provided does not adhere to the passage at all. It misrepresents the events described in the passage. | 0, The summary does not adhere to the instructions given in the prompt. The prompt asks for a summary of the passage, but the summary is not a summary of the passage. The summary is a made-up story that is not related to the passage. |
5 | Chinese Simplified | OCQ | ‘A1’: 2, ‘Good summary without mistakes’, ‘A2’: 2, ‘The summary fails to mention the important message of the original, which is, many civilians were killed. The summary only indicates that some people were killed, ’A3’: 2, ‘Key information is missing in the summarization. The summarization is too general.’ | 1, The summary is grammatically correct and uses appropriate language. However, it lacks clarity and relevance to the passage. | 1, The summary is concise and captures the main points of the passage. It is also well-written and easy to understand. However, it does not provide any new information or insights beyond what is already in the passage. |
While results in section §5 show that detailed instructions to GPT-4 and PaLM2 give evaluations very close to humans, it is unclear how humans and LLMs reason about their scores. We qualitatively analyze the comments from the annotators for their ratings and the justifications produced by the LLMs at the time of scoring the summaries. An analysis of some interesting examples is discussed in this section. As discussed in section §4.4, the annotations from humans can be divided into three categories: when all annotators agree, when two annotators agree, and when no annotators agree. Table 5 shows examples from each of these categories for different languages and metrics. We specifically analyze cases where LLMs’ scores differ from the annotator aggregate score.
The first example is where no annotators agree on TQ for an English sample. Both GPT-4 and PaLM2 assign a 2 in this case. While all three annotators point out a few problems with the summary, both GPT-4 and PaLM2 ignore some key elements for TQ such as “omission of important information”, and “poor discourse” and say that the summary “captures main points of the passage”.
The next two cases in the table are when two annotators agree. In the second case, two annotators give the sample a score of 1 for LA, however, no annotators point towards any grammatical issues with the summary. Their comments are more relevant for TQ and OCQ. This indicates that for humans their judgment of one metric might affect their judgment of other metrics. Both LLMs give a high score of 2 to the sample, even though the reason from PaLM2 says “lacks natural rhythm and intonation”. This shows that LLMs’ reasons might not always be aligned with their scores, in line with findings from [53]. In the third case, the annotator aggregate for OCQ is 2, however, both LLMs assign a score of 0. Annotators mention problems such as “hallucinations” in their comments, while GPT-4 says the summary is an inaccurate representation of the main text, and PaLM2 complains of incoherence.
The last two examples in the table are cases where all three annotators agree but the LLM scores differ. In the fourth case, all annotators assign a score of 2 for TQ, while both LLMs assign a score of 0. Even though A2 complains about the quality of the summary, they assign a score of 2, indicating some error in judgment. Both LLMs assign a score of 0 and reason that the summary consists of hallucinations. It is interesting that humans still assign the summary a score of 2, indicating that there can be subtle differences in how humans interpret these metrics.
In the last case, all annotators assign the sample a score of 2 for OCQ and do not mention any issues with content quality in their comments. Interestingly, PaLM2 assigns a score of 1 and the justification states “it does not provide any new information or insights beyond what is already in the passage”. Since this was a summarization task no new information is expected in the summary. This again indicates that the judgment and justification might not always be aligned. Table 7 in Appendix §10.6 shows some samples of cases where either of the LLM scores agree with human aggregate scores, but there are some discrepancies in their justification.
Overall, our analysis indicates that there are several challenges in the alignment of human evaluations with LLM evaluations. While the scoring by LLMs on several metrics and languages might come close to humans, it is difficult to understand how they come up with these scores, necessitating further research.
We presented the first framework for end-to-end evaluation of LLMs as evaluators in multilingual scenarios. We created a dataset of 1000 summaries across 10 languages rated by native speakers on 5 different metrics. Our dataset covers a range of summaries in terms of linguistic acceptability, output quality, task quality, and others. We do this by systematically prompting GPT-4 to generate summaries of varying quality. The human ratings obtained for these summaries are of high quality with \(\kappa > 0.6\) and \(F1 > 0.75\). We plan to make the Metal dataset available to the research community. Using our dataset, we investigate the capabilities of three LLMs as evaluators, GPT-3.5-Turbo, GPT-4, and PaLM2, using two prompting strategies, and compare their evaluations with the Metal human evaluations. Our results show that GPT-4 with detailed instructions performs closest to humans, while GPT-3.5-Turbo is not a suitable multilingual evaluator but surprisingly does better than GPT-4 and PaLM2 on some metrics for Bengali. We also show that GPT-4 with detailed instructions does best when there is disagreement amongst human annotators. We compare the overlapping summaries between Seahorse and Metal and show how our metrics and prompting methods can be used to compare generations from different models. Finally, we analyze human and LLM reasoning and observe that LLMs often provide incorrect justifications for their scores, thus showing that more research is needed to be able to use LLM-based evaluators with confidence in the multilingual setting.
We prompt GPT-4 to generate good and bad-quality summaries. As noted in §3.1, for lower temperature values we observed that GPT-4 did not generate bad summaries. We use a temperature of 1 and observe some variation in quality across all our metrics except problematic content. This could be due to the content filter applied to these models; therefore, it is difficult to study the capability of such models on this metric. We evaluate the generations from GPT-4 using GPT-3.5-Turbo, GPT-4, and PaLM2. Recent work has shown that LLMs prefer their own outputs. Although this might have affected our evaluations, exploring it is beyond the scope of our work. In our work, we mainly focused on investigating how well LLM ratings align with human ratings across various metrics and languages. All summaries generated and evaluated in our study come from the same model; we do not compare them against human-written summaries or summaries generated by other models. Lastly, LLMs have also been shown to have scale region bias, and we do not calibrate for this in our study, expecting it to be consistent across all their ratings. In the future, it would be interesting to explore its impact on our evaluation.
We use the framework by [54] to discuss the ethical considerations for our work.
Our dataset was annotated by an external company that has long-standing contracts with the organization and is employed by the organization regularly to do this work. Therefore, the annotation company only accepts work that is covered under the purview of their contract.
To generate the summaries in our dataset we use the main text from the publicly available test set of XL-Sum [25]. Our summaries are generated in 10 languages: En, Fr, Hi, Zh, Ar, Bn, Tr, Ja, Ru, and Sw. We do this by prompting GPT-4. We release the dataset publicly for future research. Our dataset was created such that it covers a range of quality for summaries. Therefore, some summaries in our dataset are deliberately incoherent. Our ratings on problematic content show that \(<5\)% of our data had problematic text in them.
Annotators were recruited through an external annotator services company. All annotators were native speakers of the language of the data points they annotated. The pay was adjusted after discussion with the company, based on the annotator’s region and experience. No demographic information is available about the annotators. The annotators are governed by their company’s and our organization’s privacy policy.
We draw inspiration from the community standards set for similar tasks. These guidelines were created following best practices after careful research. Annotators were asked to rate the summaries across 5 metrics. A detailed explanation was given for each of the metrics. For 3 metrics annotators had to choose from 3 classes, and for 2 metrics they had to choose from 2 classes. Annotators were allowed to give feedback for any data point via an optional comments text box. Annotators received training for this task. Annotator identity was hidden from the task reviewers to limit any bias.
In this study, we explore methods to generate summaries by prompting GPT-4. We deliberately prompt GPT-4 to generate some bad summaries. All summaries generated were evaluated by 3 LLMs: GPT-3.5-Turbo, GPT-4, and PaLM2. We explore several ways to calibrate LLM judgment with human judgments for various metrics and languages. While these methods can be easily misused, our intent with this study is to highlight the gap between the two and urge the community to proceed with caution.
Figures 4 and 5 show the general prompting schema for summary generation. Notably, we use the guidance7 framework for these generations.
Figure 6 shows detailed instructions provided to the annotators. The metrics are explained in §3.2.
Table 6 shows the Fleiss’ Kappa \(\kappa\) and pairwise agreement (F1) values for various metrics and languages.
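A sketch of how these statistics can be computed, assuming statsmodels for Fleiss' \(\kappa\) and scikit-learn for the pairwise F1 (the paper does not state which implementation or F1 averaging was used; weighted averaging is assumed here):

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def agreement_stats(annotations: np.ndarray):
    """annotations: (n_samples, 3) integer labels from the three annotators."""
    table, _ = aggregate_raters(annotations)   # samples x categories count matrix
    kappa = fleiss_kappa(table)

    # Average pairwise F1 over A1-A2, A2-A3, A3-A1.
    pairs = [(0, 1), (1, 2), (0, 2)]
    f1 = np.mean([f1_score(annotations[:, i], annotations[:, j], average="weighted")
                  for i, j in pairs])
    return kappa, f1
```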
Lang | H (\(\kappa\) / F1) | LA (\(\kappa\) / F1) | OCQ (\(\kappa\) / F1) | PC (\(\kappa\) / F1) | TQ (\(\kappa\) / F1)
---|---|---|---|---|---
AR | 0.65 / 0.89 | 0.66 / 0.89 | 0.61 / 0.85 | 0.65 / 0.93 | 0.61 / 0.77
BN | 0.83 / 0.97 | 0.64 / 0.81 | 0.62 / 0.82 | 0.00 / 1.00 | 0.64 / 0.78
EN | 0.54 / 0.85 | 0.73 / 0.86 | 0.63 / 0.82 | 1.00 / 1.00 | 0.61 / 0.77
FR | 0.94 / 0.97 | 0.93 / 0.99 | 0.91 / 0.97 | 1.00 / 1.00 | 0.84 / 0.90
HI | 0.68 / 0.90 | 0.69 / 0.87 | 0.62 / 0.83 | 0.78 / 0.94 | 0.60 / 0.78
JA | 0.97 / 0.99 | 0.92 / 0.95 | 0.89 / 0.93 | 0.00 / 0.99 | 0.98 / 0.99
RU | 0.99 / 0.99 | 0.97 / 0.98 | 0.88 / 0.93 | 0.90 / 0.99 | 0.89 / 0.94
SW | 0.85 / 0.93 | 0.71 / 0.82 | 0.73 / 0.84 | 0.62 / 0.86 | 0.72 / 0.84
TR | 0.66 / 0.84 | 0.95 / 0.97 | 0.76 / 0.84 | 0.00 / 1.00 | 0.80 / 0.87
ZH | 1.00 / 1.00 | 0.68 / 0.94 | 0.65 / 0.91 | 1.00 / 1.00 | 0.65 / 0.82
Figures 7 and 8 show examples of simple and detailed instructions for Linguistic Acceptability, respectively. The rest of the prompts can be found in [31].
Figures 9 and 10 are a pictorial representation of Table 3.
Table 7 shows some samples of cases where either of the LLM scores agree with human aggregate scores, but there are some discrepancies in their justification.
No. | Lang. | Metric | Human Scores and Comments | GPT-4 Scores and Justification | PaLM2 Scores and Justification |
---|---|---|---|---|---|
1 | Bengali | TQ | ‘A1’: 0, ‘The summary part is very short.’, ‘A2’: 1, ‘This summary section was not completed. Language construction was good and some flaws but otherwise okay and summary section more improvement was needed.’, ‘A3’: 2, ‘This content has good co-relation between different thoughts and ideas. The summary of the content is not complete.’ | 1, However, it misses out on finer details and nuances such as the reason behind the boycott, the reactions of different countries, and the political implications of the boycott. Therefore, the summary partially aligns with the passage.. | 1, The summary is a good summary of the passage. It captures the main points of the passage and provides a brief overview of the topic. However, the summary does not go into as much detail as the passage and does not provide as much information about the topic. |
2 | Bengali | OCQ | ‘A1’: 0, ‘Difficult to understand from beginning to end.’, ‘A2’: 1, ’ summary section can be more improvement is needed.’, ‘A3’: 1, "In summary part as well, the sentences are irrelevant with other sentences of the paragraph. " | 1, the summary is somewhat accurate in terms of content relevance but lacks clarity and cohere | 1, The summary is not in the same language as the passage |
3 | Swahili | LA | ‘A1’: 1, ‘The summary is inaccurate and incomplete and thus not a true reflection of the original passage. Some crucial information is missing while other information is not accurate.’, ‘A2’: 0, NA, ‘A3’: 0, NA | 1, The summary provided in Swahili is grammatically correct and uses appropriate words. However, it is overly simplified and does not capture the essence of the passage | 0, The summary is not fluent and does not reflect natural rhythm, emphasis, and intonation of spoken language. It contains several grammatical errors and uses words that are not appropriate for the context |
4 | Russian | LA | ‘A1’: 0, ‘Summary with no syntax no punctuation, just a set of words. Not even key words. Hallucinations: (Not present in the text)’, ‘A2’: 0, ‘Summary is not what the text is about.’, ‘A3’: 0, NA | 0, The summary provided does not accurately reflect the content of the passage. It is not clear, coherent, or contextually relevant | 0, The summary is not in Russian. It is in English |
5 | Turkish | LA | ‘A1’: 1, NA, ‘A2’: 1, ‘Irrelevant summary context.’, ‘A3’: 1, ‘Totally meaningless utterances.’ | 1, The summary uses correct grammar and sentence structure, and there are no non-existent or incorrect words. However, the summary is overly verbose and uses stilted phrasing. | 1, The summary is not linguistically acceptable. It contains several grammatical errors, such as the use of the word ‘gok tasl’ (meteorite) to refer to a handbag, and the use of the word ‘muze’ (museum) to refer to a planet |
Metal dataset and code available at https://aka.ms/METAL
We include the detailed annotation instructions in Appendix §10.2.
Both GPT models were accessed through Azure and PaLM2 via VertexAI.
We do not consider the 1 overlapping datapoint in English for our experiment.
https://github.com/guidance-ai/guidance (Version 0.0.64)