TravelBench: Exploring LLM Performance in Low-Resource Domains

Srinivas Billa
Expedia Group
The Angel Building, 407 St John St
London, EC1V 4EX, UK
nbilla@expediagroup.com
Xiaonan Jing
Expedia Group
1111 Expedia Group Wy W
Seattle, WA 98119, USA
xijing@expediagroup.com


Abstract

Results on existing LLM benchmarks convey little about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address this challenge, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed performance across a range of LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs across a variety of tasks. Our results confirm that general benchmark scores are insufficient for understanding model performance on low-resource tasks. Regardless of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs, making them better judges on certain tasks.

1 Introduction

The rapid advancement of large language models (LLMs) in recent years has significantly facilitated the prototyping of downstream natural language processing (NLP) tasks. However, it has also introduced new challenges in selecting the most suitable LLM from an ever-growing pool of state-of-the-art (SOTA) models. To address this issue, a number of benchmarking datasets have been proposed to evaluate the general capabilities of LLMs, including but not limited to MMLU [1], ARC-C [2], GSM8K [3], HumanEval [4], and MGSM [5]. However, these benchmarks provide limited insight into model performance in low-resource domains such as the travel industry. In addition, a prior study [6] has shown that the effectiveness of an LLM's decoding strategy can vary with the type of generation task. Without in-depth domain evaluation, streamlining LLM development becomes more difficult.

Traditionally, opinion mining has been the most prominent task in the travel domain, as it is often studied over customer reviews. For instance, hotel reviews have been widely used in sentiment classification [7], rating prediction [8], and opinion spam detection [9]. Similarly, restaurant reviews have long been leveraged for benchmarking overall and aspect-based sentiment analysis, particularly in SemEval-2014 Task 4 [10]. More recently, with the emergence of LLM agents, trip planning has gained attention, and benchmarks such as TravelPlanner [11] have been introduced to evaluate LLMs on tool usage and complex planning. To further understand LLM performance in the travel domain beyond opinion mining and to broaden the scope of real-world benchmarking, we curated a comprehensive benchmark of 14 travel datasets across 7 task categories for model evaluation. An overview of the datasets is presented in Table 1.

By applying LLMs to generate predictions, we treat each task as an autoregressive modelling problem. The prediction then reflects the model's next-token generation capability given a specific context, so the output is characterized by a combination of the model's internal knowledge and instruction-following ability.

Table 1: Benchmark datasets categorised by NLP task. Each task is treated as a generation problem using an auto-regressive LLM. Avg Input Tokens denotes the average number of tokens in the prompt, which consists of both instruction and data tokens. For classification tasks, the number of labels is reported.

| Task | # Samples | Avg Input Tokens | # Labels |
|---|---|---|---|
| Aspect Based Sentiment Analysis | \(\sim\)500 | \(\sim\)700 | 6 |
| Overall Review Sentiment Analysis | | \(\sim\)300 | 3 |
| Aspect Based Review Segmentation | | \(\sim\)900 | Open Ended Generation |
| Review Topic Classification | | \(\sim\)500 | 17 |
| Faithfulness | | \(\sim\)1500 | 5 |
| Relevance | | \(\sim\)700 | 5 |
| Inclusiveness | | \(\sim\)1000 | 2 |
| Compliance | | \(\sim\)900 | 2 |
| Review Moderation | | \(\sim\)2000 | 2 |
| Manager Response Moderation | | \(\sim\)1400 | 2 |
| Customer Service Intent Prediction | | \(\sim\)800 | 9 |
| Review Summarisation | | \(\sim\)2500 | Open Ended Generation |
| Review Translation EN -> XX | \(\sim\)1000 | \(\sim\)100 | Open Ended Generation |
| Review Translation XX -> EN | | | Open Ended Generation |

In the rest of this paper, we present in-depth evaluations of LLMs on our benchmark, aiming to bridge the gap between low-resource, under-explored tasks in the travel domain and the demands of real-world applications. It should be noted that all of our datasets were collected in an anonymised form from real-world usage scenarios, making the evaluation results more representative of actual model performance in practice. Our contributions are as follows:

  • We introduce a wide range of curated datasets in the travel domain for LLM evaluation, expanding the scope of current resources beyond opinion mining.

  • We present an in-depth analysis of various LLMs, verifying that out-of-the-box LLMs have performance limitations when adapted to the travel domain.

  • We share insights on LLMs' scaling and reasoning behaviours to shed light on the effects of a model's training FLOPs and size. 1

2 Dataset

In this section, we describe the datasets, the data collection process, and the evaluation metrics. The selection of tasks was derived from real-world downstream tasks in the travel domain. We sourced data from real-world scenarios, i.e., publicly deployed systems. The data was anonymised to remove Personally Identifiable Information (PII). All of the data was human annotated or verified without assistance from LLMs, minimizing implicit bias in the dataset towards any model. For instance, we omitted sample responses from the summary-generation annotation guidelines to reduce the influence of pre-defined writing styles on the summaries. Dataset curation follows three steps:

  1. Stratified Random Sampling. From a large collection of real-world usage, we randomly sample approximately 500 to 1000 rows per task following the source distribution.

  2. Rubric Creation. We developed annotation guidelines in collaboration with human experts to provide precise instructions and ensure label consistency.

  3. Annotation. We employed human coders for annotation. It should be noted that although the number of coders could vary across datasets, the guidelines were designed to minimize variance introduced by these differences.

2.1 Aspect-based Sentiment Analysis (ABSA)

The purpose of this task is to identify granular sentiment toward specific topics within a text. Given a pre-defined set of topics (e.g., WiFi, pool, parking), we randomly sampled an equal number of hotel reviews per topic. Annotators labelled the sentiment of each aspect as one of positive, negative, mixed, neutral, not mentioned, or wished for. Besides the widely used sentiment labels, we introduced two custom labels: not mentioned (the topic is not present in the text) and wished for (the text expresses a wish for the aspect, e.g., "I wish the breakfast was free"). We compute the F1-score [12] between predictions and ground truth for evaluation.
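Most classification tasks in the benchmark are scored with F1. Below is a minimal sketch, assuming scikit-learn and macro averaging (the averaging scheme is not specified here), of how ABSA predictions could be scored; the example labels are illustrative.

```python
# Hedged sketch: macro-averaged F1 for ABSA labels, assuming scikit-learn.
# The averaging scheme and the example data are illustrative assumptions.
from sklearn.metrics import f1_score

# Full ABSA label set used in this task.
LABELS = ["positive", "negative", "mixed", "neutral", "not mentioned", "wished for"]

ground_truth = ["positive", "not mentioned", "wished for", "negative"]
predictions = ["positive", "neutral", "wished for", "negative"]

score = f1_score(ground_truth, predictions, labels=LABELS, average="macro", zero_division=0)
print(f"macro F1: {score:.3f}")
```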

2.2 Overall Sentiment Classification

The purpose of this task is to identify overall sentiment within a text. The data mainly consists of hotel reviews and customer support conversations. The labels are positive, negative, neutral. We use F1-score for evaluation.

2.3 Aspect-based Text Segmentation

The purpose of this task is to extract the segments of a text that are relevant to a given aspect of interest (e.g., "cleanliness", "service"). Using the same data as the ABSA task, human annotators were asked to label the start and end indices of the relevant text span. We use the BLEU score [13] to measure exact n-gram matches between the predicted segment and the ground truth.

2.4 Topic Classification

The purpose of this task is to identify the topics that a text references. We use the same hotel review dataset as the ABSA task. Given a review, the model is expected to identify all topics mentioned from the following list: amenities, bar, beach, breakfast, cleanliness, comfort, location, noise, pool, price, restaurant, room, service, spa or gym, view, wifi. We use F1-score for evaluation.

2.5 HELM

Holistic Evaluation of Language Models (HELM) [14] defined an LLM-as-a-judge [15] evaluation paradigm for natural language generation (NLG) dimensions such as faithfulness, relevance, compliance, and inclusiveness. We employed two types of evaluation setups: 1) scoring on a 1-to-5 scale, with 5 being the most desired output [16], [17]; and 2) binary classification with True or False labels [18], [19].
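To make the first setup concrete, the sketch below shows how a 1-to-5 judge prompt could be assembled; the system message, rubric wording, and function name are illustrative assumptions rather than the prompts used in the benchmark.

```python
# Illustrative sketch of a 1-to-5 LLM-as-a-judge prompt for faithfulness.
# The rubric wording and message structure are assumptions, not the
# benchmark's actual prompts.
def build_faithfulness_judge_messages(source_text: str, generated_text: str) -> list[dict]:
    rubric = (
        "Rate the faithfulness of the GENERATED TEXT to the SOURCE TEXT on a 1-5 scale:\n"
        "5 = highly faithful (no unsupported claims)\n"
        "1 = highly unfaithful (mostly unsupported or fabricated content)\n"
        "Answer with a single integer."
    )
    return [
        {"role": "system", "content": "You are a strict evaluation judge."},
        {"role": "user", "content": f"{rubric}\n\nSOURCE TEXT:\n{source_text}\n\nGENERATED TEXT:\n{generated_text}"},
    ]
```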

2.5.1 HELM Faithfulness

The faithfulness metric measures the absence of hallucinations, i.e., untrue facts derived from the LLM's own knowledge. A score of 5 means "highly faithful" and a score of 1 means "highly unfaithful". Root Mean Square Error (RMSE) [20] is used for evaluation.

2.5.2 HELM Compliance

Compliance is a binary measure of whether the content complies with legal policy. Example violations include toxic language, defamation, and infringement of intellectual property. F1-score is used for evaluation.

2.5.3 HELM Inclusiveness

Inclusiveness is a binary metric that measures whether the generated text contains hate speech, disrespectful language, harassment, or threats. F1-score is used for evaluation.

2.5.4 HELM Relevance

Relevance measures, on a 1-to-5 scale, how relevant the generated information is to the purpose of the task, for example, how relevant the generated text is for the review summarisation task. Each use case has its own measure of relevance, and the rubric changes per use case. RMSE is used for evaluation.

2.6 Review Moderation

This task aims to identify whether a review is appropriate for publication. We evenly sampled hotel reviews and asked human annotators to label them as APPROVED or REJECTED. We use F1-score for evaluation.

2.7 Manager Response Moderation

Similarly to review moderation, this task aims to classify property manager responses to customer reviews as APPROVED or REJECTED. Common reasons for rejection include exposing sensitive personal information, being irrelevant to the review, or being overly promotional in order to bypass platform policies. We use F1-score for evaluation.

2.8 Intent Prediction

This task aims to determine the intent behind a customer’s utterance with a customer service agent. We instructed human annotators to label each utterance as one of the following intents: book, cancel, change, contest_payment, feedback, help, retrieve, small_talk, unknown_scenario. We use F1-score for evaluation.

2.9 Review Summarisation

Using the same hotel review data as the ABSA task, we grouped the reviews by property and topic. Human annotators were instructed to summarise the reviews within each group into a short summary of fewer than 100 characters. We used the METEOR score [21] for evaluation, which is robust to paraphrasing and synonyms.

Figure 1: Performance \(P_m\) against training compute (FLOPs). While performance generally improves with scale, there are significant diminishing returns past \(0.5\times10^{16}\).

2.10 Translation

We split translation into two sub-tasks: 1) translating from English to a target language; and 2) translating from a target language to English. To obtain the annotations, we randomly sampled hotel reviews in non-English languages and instructed human annotators to translate the target texts into English. The METEOR score was used for evaluation.

3 Benchmark Setup

We leveraged prompt engineering by treating each task as a chat completion problem. We used the OpenAI chat completion template via the Python SDK to ensure response accuracy. For each task, we designed a prompt template and used the same template consistently across all LLMs. Each template consists of 1) a description of the task, 2) step-by-step instructions for solving the specific problem, and 3) zero-shot or few-shot question/answer pairs. For better comparison, we limited our scope to instruction-tuned models and applied a temperature of 0 and a top-p of 1 to obtain deterministic outputs. For models with reasoning capabilities, we adopted the default settings from their model cards. To serve open-source LLMs, we utilized the vLLM framework, and for proprietary LLMs such as GPT-4o, we leveraged the official APIs through a secure proxy. We focused on two evaluations: 1) an overall performance comparison of 67 open- and closed-source LLMs, and 2) the effect of reasoning on 8 hybrid-reasoning models from the Qwen3 family.
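As a concrete sketch of this setup, the snippet below sends one overall-sentiment example as a chat completion with deterministic decoding via the OpenAI Python SDK; the model name, prompt wording, and vLLM endpoint are placeholders rather than the exact templates used in our runs.

```python
# Minimal sketch of querying one benchmark example, assuming the OpenAI
# Python SDK (>=1.0). For open-source models served with vLLM's
# OpenAI-compatible server, point base_url at that server instead.
from openai import OpenAI

client = OpenAI()  # e.g. OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY") for vLLM

def classify_sentiment(review: str, model: str = "gpt-4o") -> str:
    messages = [
        {"role": "system", "content": "You classify the overall sentiment of a hotel review."},
        {"role": "user", "content": (
            "Label the review as positive, negative, or neutral. "
            "Answer with the label only.\n\nReview:\n" + review
        )},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # deterministic decoding, as in the benchmark setup
        top_p=1,
    )
    return response.choices[0].message.content.strip()
```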

3.1 Scaled performance across metrics

In order to ensure a fair comparison of models across tasks with different evaluation metrics, we devised an overall performance metric \(P_m\). First, we calculate \(P_{t,m}\), the scaled performance of model \(m\) on task \(t\), as \[P_{t,m} = \begin{cases} 1 - \dfrac{\text{clamp}(x_{t,m}, a_t, b_t) - a_t}{b_t - a_t}, & \text{if } \text{invert}_t = \text{True} \\[1.5ex] \dfrac{\text{clamp}(x_{t,m}, a_t, b_t) - a_t}{b_t - a_t}, & \text{if } \text{invert}_t = \text{False} \end{cases}\] where \(x_{t,m}\) is the original score of model \(m\) on task \(t\), \(a_t\) is the theoretical minimum bound for task \(t\), \(b_t\) is the theoretical maximum bound for task \(t\), \(\text{invert}_t\) indicates whether lower is better for task \(t\), and \(\text{clamp}(x, a, b) = \min(\max(x, a), b)\) bounds the score within \([a, b]\).

Then, we can calculate the overall performance \(P_m\) by averaging over all tasks as, \[P_m = \frac{1}{N_m} \sum_{t} P_{t,m}\] where \(P_m\) represents the overall performance of model \(m\), and \(N_m\) represents the total number of valid (included) tasks for model \(m\).
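A direct implementation sketch of the two formulas above follows; the task bounds \(a_t\), \(b_t\) and the example scores are illustrative, not the bounds used in the benchmark.

```python
# Sketch of the scaled task score P_{t,m} and the overall score P_m.
# Task bounds and example values are illustrative assumptions.
def clamp(x: float, a: float, b: float) -> float:
    return min(max(x, a), b)

def scaled_task_score(x: float, a: float, b: float, invert: bool) -> float:
    """Map a raw task score into [0, 1]; flip when lower is better (e.g. RMSE)."""
    p = (clamp(x, a, b) - a) / (b - a)
    return 1.0 - p if invert else p

def overall_performance(task_scores: dict[str, tuple[float, float, float, bool]]) -> float:
    """task_scores maps task name -> (raw score x, lower bound a, upper bound b, invert flag)."""
    scaled = [scaled_task_score(x, a, b, inv) for x, a, b, inv in task_scores.values()]
    return sum(scaled) / len(scaled)

# Example: an F1-based task (higher is better) and an RMSE-based task (lower is better).
print(overall_performance({
    "review_moderation_f1": (0.82, 0.0, 1.0, False),
    "faithfulness_rmse": (1.1, 0.0, 4.0, True),  # the largest possible error on a 1-5 scale is 4
}))
```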

Figure 2: The effect of enabling reasoning across the Qwen3 model family. While smaller models show some improvement, the same does not apply to larger models; performance degradation can be seen for the 235B model.

4 Results and Analysis

4.1 Scaling Laws

We are interested in the correlation between model performance and model scale. Using the performance \(P_m\) calculated above and the compute approximation from [22], we define a model's scale as its training compute in floating-point operations (FLOPs), \[FLOPs \approx 6NT\] where \(T\) is the number of training tokens and \(N\) is the number of model parameters.
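For illustration, a minimal sketch of this compute approximation; the parameter and token counts below are hypothetical and do not correspond to any specific model in the benchmark.

```python
# Sketch of the training-compute approximation FLOPs ≈ 6 * N * T.
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# Hypothetical example: a 7B-parameter model trained on 2T tokens.
print(f"{approx_training_flops(7e9, 2e12):.1e}")  # ~8.4e+22
```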

As illustrated in Figure 1, we observe a positive correlation between model performance and scale across all tasks. However, the gains rapidly diminish as FLOPs increase. Models trained with compute budgets larger than \(\approx 0.5\times10^{16}\) FLOPs exhibit slow, non-linear improvements, suggesting that adaptation to unseen domains remains a challenge even for larger models with higher generalizability. This suggests that, when adopting out-of-the-box LLMs for a new domain, lightweight models can sometimes be the better option given the trade-off between performance, compute cost, and inference latency.

4.2 Effects of Reasoning

Recent work on unified training frameworks with toggleable reasoning provides an opportunity to study the impact of reasoning within a single model family. We tested models from the Qwen3 [23] family to measure the performance difference with and without internal thinking enabled.

Figure 2 illustrates the performance differences for models of the same size. As the parameter count increases, judging with reasoning does not always outperform judging without it. In addition, reasoning provides little to no performance improvement for larger models and, in some cases, slightly degrades performance, which is counter-intuitive.

4.3 Performance Variance

Figure 3: Variance of model performance across tasks for the Qwen3 family. While overall performance increases with model size, the spread of performance does not follow the same pattern, indicating that performance is highly task-dependent and no single model is best at every task.

Figure 3 demonstrates the variance in performance across the Qwen3 family. We observe that larger models are on average better than their smaller counterparts when pre-trained on the same dataset. However, improvements in performance consistency are insignificant. We also notice that, while enabling reasoning improves the average performance of smaller models, it can introduce more fluctuation in model scores across tasks; this behaviour is less prevalent in larger models. Our assumption is that larger models store more abundant internal knowledge with greater context-retrieval ability. When a task is complex enough to overwhelm the model's natural retrieval capability, adding reasoning helps improve recall. Yet reasoning does not provide the model with new knowledge, and thus performance caps out at a certain level.

5 Related Work

Scaling laws. Kaplan et al. [22] and Hoffmann et al. [24] presented scaling laws for LLMs of varying sizes to understand the relationship between model size and training data. They compared test loss as well as performance on general benchmarks such as MMLU [1] and Big-Bench [25].

Domain-specific benchmarking. Within the travel sector, TravelPlanner [11] introduced a benchmark for evaluating LLMs on tool use and planning with synthetically generated queries. The authors compared open- and closed-source LLMs and showed that even the best-performing model, GPT-4, has a success rate of only 0.6%. ChinaTravel [26] extends this work with open-ended, human-written queries for the Chinese travel market and reports similar results, with DeepSeek V3 being the best model at a 5% pass rate; the authors note this may be because DeepSeek has seen more Chinese data. SemEval-2014 Task 4 [10] introduced an aspect-based sentiment analysis dataset of restaurant and laptop reviews, which has been widely used for benchmarking ABSA.

Effects of reasoning. Work such as DeepSeek-R1 [27] and OpenAI o1 [28], which trains models with reinforcement learning to enable native reasoning, shows prominent improvements in model performance through test-time compute scaling. We extend this line of research to the hybrid-thinking models proposed in [23] and show the effect of enabling reasoning traces. Prior research consistently reports notable performance increases from introducing native reasoning.

6 Conclusion

We presented an in-depth evaluation of LLM-as-a-judge on a series of annotated datasets in the low-resource travel domain, expanding the task variety from the previously prominent opinion mining to 7 common NLP tasks. Our results indicate that out-of-the-box LLMs, regardless of parameter count and training tokens, reach performance bottlenecks on domain-specific tasks. Furthermore, internal reasoning tends to provide a more significant performance boost for smaller LLMs than for larger models. Lastly, the score fluctuations across the Qwen3-family models indicate that although reasoning helps with retrieval, there remains a gap between human annotation and model judgement on unseen knowledge.

7 Limitations

  • Due to resource limitations, some of the larger models (>200B parameters) were only tested in their quantised FP8/INT4 versions. This may have caused a slight degradation in performance at the higher end of the FLOPs budget in Figure 1, but not enough to change the analysis drawn from it.

  • We used a single prompt per task across all models. While this may not be optimal, since different models might perform better with tailored prompts, we kept the prompt identical as a measure of ease of use. Future work could measure how much the prompt matters for downstream performance.

  • The internal company proxy service we used for closed-source models has built-in guardrails that protect against accidental misuse. However, this caused issues with some tasks, such as review moderation, where the harmful-content filter was triggered and the service failed to return a response. While these tasks test whether the model can successfully filter out harmful content, the proxy service would not allow the requests, so we removed the failing rows from our testing for those models.

  • For closed-source models, we could not find details on training data size, quantisation, or parameter count, which meant we could not include them in the FLOPs scaling figures.

References

[1]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
[2]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
[3]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[5]
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.
[6]
Gian Wiher, Clara Meister, and Ryan Cotterell. 2022. On decoding strategies for neural text generators. Transactions of the Association for Computational Linguistics, 10:997–1012.
[7]
Md Hijbul Alam, Woo-Jong Ryu, and SangKeun Lee. 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences, 339:206–223.
[8]
Hongning Wang, Yue Lu, and Chengxiang Zhai. 2010. Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 783–792.
[9]
Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. arXiv preprint arXiv:1107.4557.
[10]
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35. 8th International Workshop on Semantic Evaluation August 23-24, 2014. ; Conference date: 23-08-2014 Through 24-08-2014.
[11]
Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. Travelplanner: a benchmark for real-world planning with language agents. In Proceedings of the 41st International Conference on Machine Learning, pages 54590–54613.
[12]
Cornelis J Van Rijsbergen. 1979. Information Retrieval, 2nd edn. Newton, MA.
[13]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135
[14]
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
[15]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.
[16]
Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into using large language models for automatic evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942.
[17]
Xiaonan Jing, Srinivas Billa, and Danny Godbout. 2025. On a scale from 1 to 5: Quantifying hallucination in faithfulness evaluation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7765–7780, Albuquerque, New Mexico. Association for Computational Linguistics.
[18]
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is chatgpt a good nlg evaluator? a preliminary study. In Proceedings of EMNLP Workshop, page 1.
[19]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522.
[20]
Tianfeng Chai, Roland R Draxler, et al. 2014. Root mean square error (rmse) or mean absolute error (mae). Geoscientific model development discussions, 7(1):1525–1534.
[21]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics. https://aclanthology.org/W05-0909/
[22]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[23]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. 2025. https://arxiv.org/abs/2505.09388. Preprint, arXiv:2505.09388.
[24]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
[25]
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. 
Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. 
Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. 2023. https://arxiv.org/abs/2206.04615. Preprint, arXiv:2206.04615.
[26]
Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, and Yu feng Li. 2025. ChinaTravel: A real-world benchmark for language agents in Chinese travel planning. arXiv preprint arXiv:2412.13682.
[27]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. https://arxiv.org/abs/2501.12948. Preprint, arXiv:2501.12948.
[28]
OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Y. 
Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, and Zhuohan Li. 2024. https://arxiv.org/abs/2412.16720. Preprint, arXiv:2412.16720.

  1. We are in the process of obtaining approval to release the datasets for further research.