Are large language models superhuman chemists?


Abstract

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously.

However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce “ChemBench,” an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists.

We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals’ safety profiles.

These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.

1 Introduction↩︎

llms are ml models trained on massive amounts of text to complete sentences. Aggressive scaling of these models has led to a rapid increase in their capabilities,[1], [2] with the leading models now being able to pass in some evaluations the United States Medical Licensing Examination.[3] They also have been shown to design and autonomously perform chemical reactions when augmented with external tools such as web search and synthesis planners.[4][6] While some see “sparks of agi” in them,[7] others consider them as “stochastic parrots”—i.e., systems that only regurgitate what they have been trained on.[8] Nevertheless, the promise of these models is that they have shown the ability to solve a wide variety of tasks they have not been explicitly trained on.[9], [10] This has led to tremendous economic interest and investment in such generative models, with an expected market of more than $1.3 trillion (almost $30 billion for drug discovery applications) by 2032.[11]

Chemists and materials scientists have quickly caught on to the mounting attention given to llms, with some voices even suggesting that “the future of chemistry is language”.[12] This statement is motivated by a growing number of reports that use llms to predict properties of molecules or materials,[2], [13][17] optimize reactions[18], [19] generate materials,[20][22] extract information,[23][28] or to even prototype systems that can autonomously perform reactions in the physical world based on commands provided in natural language.[5], [6], [29]

In addition, since a lot—if not most—of the information about chemistry is currently stored and communicated in text, there is a strong reason to believe that there is still a lot of untapped potential in llms for chemistry and materials science.[30] For instance, most insights in chemical research do not directly originate from data stored in databases but rather from the scientists and their ability to interpret data. Many of these insights are in the form of text in scientific publications. Thus, operating on such texts might be our best way of unlocking these insights and learning from them. This might ultimately lead to general copilot systems for chemists that can provide answers to questions or even suggest new experiments based on vastly more information than a human could ever read. Such a usage mode is, in particular, interesting in the face of recent advances in autonomous laboratories.[5], [6], [29], [31][35]

However, the rapid increase in capabilities of chemical ml models led (even before the recent interest in llms) to concerns about the potential for dual use of these technologies, e.g., for the design of chemical weapons.[36][41] To some extent, this is not surprising as any technology that, for instance, is used to design non-toxic molecules can also be used inversely to predict toxic ones (even though the synthesis would still require access to controlled physical resources and facilities). Still, it is essential to realize that such models’ user base is wider than that of chemistry and materials science experts who critically reflect on every output such models produce. For example, many students frequently consult these tools—perhaps even to prepare chemical experiments.[42] This also applies to users from the general public, who might consider using llms to answer questions about the safety of chemicals. Thus, for some users, misleading information—especially about safety-related aspects—might lead to harmful outcomes. However, even for experts, chemical understanding and reasoning capabilities are essential as they will determine the capabilities and limitations of their models in their work, e.g., in copilot systems for chemists. Unfortunately, apart from anecdotal reports, there is little evidence on how llms perform compared to expert (human) chemists.

Thus, to better understand what llms can do for the chemical sciences and where they might be improved with further developments, evaluation frameworks are needed to allow us to measure progress and mitigate potential harms systematically. For the development of llms, evaluation is currently primarily performed via standardized benchmark suites such as BigBench[43] or the LM Eval Harness.[44] Among 204 tasks (such as linguistic puzzles), the former contains only two tasks classified as “chemistry related”, whereas the latter contains no specific chemistry tasks. Due to the lack of widely accepted standard benchmarks, the developers of chemical language models[14], [45][48] frequently utilize language-interfaced[49] tabular datasets such as the ones reported in MoleculeNet,[50] Therapeutic Data Commons[51] or MatBench.[52] In these cases, the models are evaluated on predicting very specific properties of molecules (e.g., solubility, toxicity, melting temperature or reactivity) or on predicting the outcome of specific chemical reactions. This, however, only gives a very limited view of the general chemical capabilities of the models.

While some benchmarks based on university entrance exams[53], [54] or automatic text mining[55][57] have been proposed, none of them have been widely accepted. This is likely because they cannot automatically be used with black box (or tool-augmented) systems, do not cover a wide range of topics, or are not carefully validated by experts. On top of that, the existing benchmarks are not designed to be used with models that support special treatment of molecules or equations and do not provide insights on how the models compare relative to experts.

In this work, we report a novel benchmarking framework (), which we call ChemBench, and use it to reveal limitations of current frontier models for use in the chemical sciences. Our benchmark consists of question-answer pairs manually () or semi-automatically () compiled from diverse sources. Our corpus covers a large fraction of the topics taught in undergraduate and graduate chemistry curricula. It can be used to evaluate any system that can return text (i.e., including tool-augmented systems).

To contextualize the scores, we also surveyed experts in chemistry on a subset of the benchmark corpus to be able to compare the performance of current frontier models with (human) chemists of different specializations. Our results indicate that current frontier models perform “superhuman” on some aspects of chemistry but, in many cases, including safety-related ones, might be very misleading. We find that there are still numerous limitations for current models to overcome, such that they can be directly applied in autonomous systems for chemists. Moreover, we conclude that carefully created broad benchmarks represent an important stepping stone for progress in this field.

2 Results and Discussion↩︎

Figure 1: Overview of the ChemBench framework. The figure shows the different components of the ChemBench framework. The framework’s foundation is the benchmark corpus that consists of many questions and answers that we manually or semi-automatically compiled from various sources. We then used this corpus to evaluate the performance of various models and tool-augmented systems using a custom framework. To provide a baseline, we built a web application that we used to survey experts in chemistry. The results of the evaluations are then compiled in publicly accessible leaderboards, which we propose as a foundation for evaluating future models.

2.1 Benchmark corpus↩︎

To compile our benchmark corpus, we utilized a broad list of sources (see ), ranging from university exams to semi-automatically generated questions based on curated subsets of data in chemical databases. For quality assurance, all questions have been reviewed by at least one scientist in addition to the original curator and automated checks.

Importantly, our large pool of questions encompasses a wide range of topics. This can be seen, for example, in in which we compare the number of questions in different subfields of the chemical sciences (see for details on how we assigned topics). The distribution of topics is also evident from in which we visualize the questions in a two-dimensional space using a pca on the embeddings of the questions. In this representation, semantically similar questions are close to each other, and we color the points based on classification into topics. It is clear that a focus of ChemBench(by design) lies on safety-related aspects, which in appear as a large distinct clusters across the embedding space.

Figure 2: Number of questions for different topics. The topics have been assigned using a combination of a rule-based system (mostly based on the source the question has been sampled from) and a classifier operating on word embeddings of the questions. The figure shows that not all aspects of chemistry are equally represented in our corpus. The ChemBench corpus, by design, currently focuses on safety-related aspects, which is also evident in . This figure represents the combined count of and open-ended questions.

Figure 3: Principal component projection of embeddings of questions in the ChemBench corpus. To obtain this figure, we embedded questions and answers using the BART model[58] (using other embeddings, such as of OpenAI’s ada model, leads to qualitatively similar results). We then project the embeddings into a two-dimensional space using pca. We color the points based on a classification into topics. Safety-related aspects cover a large part of the figure that is not covered by questions from other topics.

While many existing benchmarks are designed around mcq, this does not reflect the reality of chemistry education and research. For this reason, ChemBench samples both mcq and open-ended questions ( mcq questions and open-ended questions).

2.1.0.1 “Tiny” subset

It is important to note that a smaller subset of the corpus might be more practical for routine evaluations.[59] For instance, [60] report costs of more than $10,000 for api calls for a single evaluation on the widely used helm benchmark. To address this, we also provide a subset ( questions) of the corpus that was curated to be a diverse and representative subset of the full corpus in which topics are more balanced than in the complete corpus (see for details on the curation process). We also used this subset to seed the web application for the human baseline study.

2.2 Model evaluation↩︎

2.2.0.1 Benchmark suite design

Because the text used in scientific settings differs from typical natural language, many models have been developed that deal with such text in a particular way. For instance, the Galactica model[61] uses special tokenization or encoding procedures for molecules and equations. Current benchmarking suites, however, do not account for such special treatment of scientific information. To address this, ChemBench encodes the semantic meaning of various parts of the question or answer. For instance, molecules represented in smiles are enclosed in [START_SMILES][\END_SMILES] tags. This allows the model to treat the smiles string differently from other text. ChemBench can seamlessly handle such special treatment in an easily extensible way because the questions are stored in an annotated format.

Since many widely utilized systems only provide access to text completions (and not the raw model outputs), ChemBench is designed to operate on text completions. This is also important given the growing number of tool-augmented systems that are deemed essential for building chemical copilot systems. Such systems can augment the capabilities of llms through the use of external tools such as search apis or code executors.[62][64] In those cases, the llm that returns the probabilities for various tokens (that are often used for model evaluations[65]) is only a part of the whole system, and it is not clear how to interpret the probabilities in the context of the whole system. The text completions, however, are the system’s final outputs, which would also be used in a real-world application. Hence, we use them for our evaluations.

Figure 4: Performance of models and humans on the “tiny” subset of the ChemBench corpus. The figure shows the percentage of questions that the models answered completely correctly. We use horizontal bars to indicate the performance of various models and highlight statistics of the human performance. Since the humans did not answer all the questions, this plot is based on the subset of questions that most humans answered. The evaluation we use here is very strict as it only considers a question answered completely correctly or completely incorrectly, partially correct answers are also considered incorrect. provides an overview of the performance of various models on the entire corpus. Systems with “ReAct” in the name are tool augmented, i.e., they can call external tools such as web search or Python code executors to better answer the questions. However, we limit those systems to a maximum of ten calls to the llm. This constraint led the systems to often not find the correct answer within the specified number of calls. In this case, we consider the answer as incorrect.

2.2.0.2 System performance

To understand the current capabilities of llms in the chemical sciences, we evaluated a wide range of leading models[66] on the ChemBench corpus, including systems augmented with external tools. An overview of the results of this evaluation is shown in . In this figure, we show the percentage of questions that the models answered correctly. Moreover, we show the worst, best, and average performance of the human experts in our study, which we obtained via a custom web application (chembench.org) that we used to survey the experts. Remarkably, the figure shows that the leading llm, Claude 3, outperforms the best human in our study in this overall metric and vastly, by more than a factor of two, exceeds the average performance of the experts in our study. Many other models also outperform the average human performance. Interestingly, the Galactica model, which was trained specifically for scientific applications, underperformed compared to many advanced commercial and some open-source models and barely exceeds the random baseline.

Given the considerable interest in tool-augmented systems, the mediocre performance of these systems (GPT-3.5 and Claude 2 augmented with tools) in our benchmark is striking. Their lack of performance, however, is partially because we limited the system to a maximum of ten llm calls. With the default tool augmented setup (using the so-called ReAct method[64]), however, this often did not allow the system to identify the correct answer (e.g., because it repeatedly tried to search for the answer on the web and did not find a solution within ten calls to the llm). This observation highlights the importance of communicating and measuring not only predictive performance but also computational cost (e.g., in terms of api calls) for tool-augmented systems.

2.2.0.3 Performance per topic

To obtain a more detailed understanding of the performance of the models, we also analyzed the performance of the models in different subfields of the chemical sciences. For this analysis, we defined a set of topics (see ). We classified all questions in the ChemBench corpus into these topics based on hand-crafted rules and classifier models operating on the question texts. We then computed the percentage of questions the models or humans answered correctly for each topic.

Figure 5: Performance of the models and humans on the different topics of the “tiny” subset of the ChemBench corpus. The radar plot shows the performance of the models and humans on the different topics of “tiny” subset of the ChemBench corpus. The performance is measured as the fraction of questions that were answered completely correctly by the models. The best score for every dimension is one (all questions answered correctly), and the worst is zero (no question answered completely correctly). A larger colored area indicates a better performance. This figure shows the performance on the subset of questions that were answered by humans. The performance of models on the entire corpus is shown in .

In this spider chart, the worst score for every dimension is zero (no question answered correctly), and the best score is one (all questions answered correctly). Thus, a larger colored area indicates a better performance. One can observe that this performance varies widely across models and topics. While macromolecular chemistry and biochemistry receive relatively high scores for many models, this is not the case for topics such as chemical safety or analytical chemistry. In the subfield of analytical chemistry the prediction of the number of signals observable in a nmr spectrum proved difficult for the models (e.g., percent correct answers for GPT-4) while this question appeared easier ( percent correct) for trained humans. Importantly, the human experts are given a drawing of the compounds, whereas models are only shown the smiles string of a compound and have to use this to reason about the symmetry of the compound (i.e., to identify the number of diasterotopically distinct protons, which requires reasoning about the topology and structure of a molecule). These findings also shine an interesting light on the value of textbook-inspired questions. A subset of the questions in the ChemBench are based on textbooks targeted at undergraduate students. On those questions, the models tend to perform better than on some of our semi-automatically constructed tasks (see ). For instance, while the overall performance in the chemical safety topic is low, the models would pass the certification exam according to the German Chemical Prohibition Ordinance based on a subset of questions we sampled from the corresponding question bank (e.g., % correct answers for GPT-4, % for Claude 3, and % for the human experts). While those findings are impacted by the subset of questions we sampled, the results still highlight that good performance on such question bank or textbook questions does not necessarily translate to good performance on other questions that require more reasoning.

We also gain insight into the models’ struggles with chemical reasoning tasks when we examine their performance as a function of molecular descriptors. If the model would answer questions after reasoning about the structures, one would expect the performance to depend on the complexity of the molecules. However, we find that the models’ performance does not correlate with complexity indicators but rather trivially with the size of the compounds (see ). This indicates that the models may not be able to reason about the structures of the molecules (in the way one might expect) but instead rely on retrieval of fragments of the training data. Those observations mirror recent findings that llms struggle to “reverse” facts they have seen in training (i.e., generalize from “A has feature B” to “B is a feature of A”).[67][70]

It is important to note that the model performance for some topics, however, is underestimated in the current evaluation. This is because models provided via apis typically have safety mechanisms that prevent them from providing answers that the provider deems unsafe. For instance, models might refuse to provide answers about cyanides. To overcome this, direct access to the model weights would be required, and we strive to collaborate with the developers of frontier models to overcome this limitation in the future. This is facilitated with the tooling ChemBench provides, thanks to which contributors can automatically add new models in an open science fashion.

2.2.0.4 Confidence estimates

One might wonder whether the models can estimate if they can answer a question correctly. If they could do so, incorrect answers would be less problematic as one could detect when an answer is incorrect. To investigate this, we prompted[71] some of the top-performing models to estimate, on an ordinal scale, their confidence in their ability to answer the question correctly.

Figure 6: Relationship between confidence of the model in the answer and correctness. For this analysis, we used verbalized confidence estimates from the model. We prompted the models to return a confidence score on an ordinal scale to obtain those estimates. We then plot the correctness of the answers (which is calculated as the mean value of all answers being either completely correct (1) or completely incorrect (0)) against the confidence score. The stripplots show the individual data points, a strong density indicating a higher number of points. The lines show the mean, and the error bars show the standard error of the mean. The figure shows that the reliability of the confidence estimates varies widely across models. While the ones for GPT-4 and Claude 2 are not particularly reliable, the ones for Claude 3 follow the expected trend better.

In , we show that for some models, there is no significant correlation between the estimated difficulty and whether the models answered the question correctly or not. For applications in which humans might rely on the models to provide answers, this is a concerning observation highlighting the need for critical reasoning in the interpretation of the model’s outputs.[30], [72] For example, for the questions about the safety profile of compounds, GPT-4 reported an average confidence of (on a scale of 1–5) for the questions it answered correctly and for the questions it answered incorrectly. While, on average, the verbalized confidence estimates from Claude 3 seem better calibrated (), they are still misleading in some cases. For example, for the questions about ghs pictograms Claude 3 returns an average score of for correct answers and for incorrect answers.

3 Conclusions↩︎

On the one hand, our findings underline the impressive capabilities of llms in the chemical sciences: Leading models outperform domain experts in specific chemistry questions on many topics. On the other hand, there are still striking limitations. For very relevant topics the answers models provide are wrong. On top of that, many models are not able to reliably estimate their own limitations. Yet, the success of the models in our evaluations perhaps also reveals more about the limitations of the exams we use to evaluate models—and chemists—than about the models themselves. For instance, while models perform well on many textbook questions, they struggle with questions that require some more reasoning. Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry. Critical reasoning is increasingly essential, and rote solving of problems or memorization of facts is a domain in which llms will continue to outperform humans.

Our findings also highlight the nuanced trade-off between breadth and depth of evaluation frameworks. The analysis of model performance on different topics shows that models’ performance varies widely across the subfields they are tested on. However, even within a topic, the performance of models can vary widely depending on the type of question and the reasoning required to answer it.

The current evaluation frameworks for chemical llms are primarily designed to measure the performance of the models on specific property prediction tasks. They cannot be used to evaluate reasoning or systems built for scientific applications. Thus, we had little understanding of the capabilities of llms in the chemical sciences. Our work shows that carefully curated benchmarks can provide a more nuanced understanding of the capabilities of llms in the chemical sciences. Importantly, our findings also illustrate that more focus is required in developing better human-model interaction frameworks, given that models cannot estimate their limitations.

While our findings indicate many areas for further improvement of llm-based systems, it is also important to realize that clearly defined metrics have been the key to the progress of many fields of ml, such as computer vision. Although current systems are far from reasoning like a chemist, our ChemBench framework will be a stepping stone for developing systems that might come closer to this goal.

4 Methods↩︎

4.1 Curation workflow↩︎

For our dataset, we curated questions from existing exams or exercise sheets but also programmatically created new questions. Questions were added via Pull Requests on our GitHub repository and only merged into the corpus after passing manual review () as well as automated checks (e.g., for compliance with a standardized schema).

To ensure that the questions do not enter a training dataset, we use the same canary string as the BigBench project. This requires that llm developers filter their training dataset for this canary string.[4], [43]

Figure 7: Overview of the workflow for the assembly of the ChemBench corpus. To assemble the ChemBench corpus, we first collected questions from various sources. Some tasks were manually curated, others semi-programmatically. We added semantic annotations for all questions to make them compatible with systems that use special processing for modalities that are not conventional natural text. We reviewed the questions using manual and automatic methods before adding them to the corpus.

4.1.0.1 Manually curated questions

Manually curated questions were sourced from various sources, including university exams, exercises, and question banks. provides an overview of the sources of the manually curated questions.

p3.5 cmp6.5 cmp.5cmX


source & description & question count
NMR spectroscopy & Questions are based on images shared via the Twitter account NMRspectroscopy &&
Analytical chemistry & Questions are based on exercises in master student examination at the FSU Jena, Germany &&
& Questions originate from examinations for grade 10 at the Marist Comprenhensive Academy Uturu, Nigeria &&
& Questions originate from the Bachelor of Science (for chemistry) courses at the TUM, Germany && +
Functional materials and nanomaterials & Questions are based on exercises in a seminar conducted at the FSU Jena, Germany &&
Combustion engineering & Based on a Master of Science examination paper at the University of Magdeburg, Germany &&
Materials’ synthesis & Questions are based on seminars on material synthesis from FSU Jena, Germany &&
Chemistry olympiad & National and International Olympiads held in US, UK and Moldova &&
Organic reactivity & Based on exams in the Bachelor of Science in Chemistry at the TUM, Germany &&
Periodic table & Manually created &&
Polymer chemistry & Based on examinations at the University of Hamburg, Germany &&
Textbook questions & Various textbooks covering biomolecular science, drug synthesis, molecular structure, organic chemistry, and X-ray crystallography &&
Safety & Based on the question bank published by the Federal/State Working Group on Chemical Safety (BLAC), which is used for the expertise examination (“Sachkundeprüfung”) &&
& Based on toxicology exams at the LMU Munich, Germany &&
& Based on toxicology exams at the University of Vienna, Austria &&
& Based on toxicology exams at WWU Munster, Germany &&
& Based on exercises for lectures in high-energy density materials at LMU Munich, Germany &&
& Based on pharmacology exams at the University of Vienna, Austria &&
& Lab safety quizzes based on various sources && + + + +

4.1.0.2 Semi-programmatically generated questions

In addition to the manually curated questions, we also generated questions programmatically. An overview of the sources of the semi-programmatically generated questions is provided in .

p3.5 cmp6.5 cmp.5cmX


source & description & question count
Number of isomers & MAYGEN[73] was used to compute the number of isomers for a set of smiles extracted from the ZINC dataset[74] & &
Total electron count of molecules & Electron counts based on the data from https://www.cheminfo.org/ & &
Oxidation states & Oxidation states questions based on the data from https://www.cheminfo.org/ &&
Chemical reactivity & Questions are framed based on the information from the Cameo Chemicals website &&
Number of nmr signals & Molecules are sampled from the ZINC database[74], OpenChemLib[75] is used to compute the number of diasterotopically distinct hydrogen atoms &&
Point group of molecules & Our ChemCaption tool is used to assign the point group using spglib,[76] and then each case was manually checked to select well-defined cases &&
IUPAC-SMILES pairs & Sampled from the PubChem [77] database && +
& Daily allowable intakes according to the World Health Organization &&
& Definitions of hazard statements &&
& GHS classification of chemicals mined through the API& &
& Materials’ compatibility &&
& Chemical compatibility &&

4.2 Model evaluation workflow↩︎

4.2.0.1 Prompting

We employ distinct prompt templates tailored for completion and instruction-tuned models to maintain consistency with the training. As explained below, we impose constraints on the models within these templates to receive responses in a specific format so that robust, fair, and consistent parsing can be performed. Certain models are trained with special annotations and LaTeX syntax for scientific notations, chemical reactions, or symbols embedded within the text. For example, all the SMILES representations are encapsulated within [START_SMILES][\END_SMILES] in Galactica[61]. Our prompting strategy consistently adheres to these details in a model-specific manner by post-processing LaTeX syntax, chemical symbols, chemical equations, and physical units (by either adding or removing wrappers). This step can be easily customized in our codebase.

4.2.0.2 Parsing

Our parsing workflow is multistep and primarily based on regular expressions. In the case of instruction-tuned models, we first identify the [ANSWER]``[\ANSWER] environment we prompt the model to report the answer in. In the case of completion models, this step is skipped. From there, we attempt to extract the relevant enumeration letters (for multiple-choice questions) or numbers. In the case of numbers, our regular expression was engineered to deal with various forms of scientific notation. As initial tests indicated that models sometimes return integers in the form of words, e.g., “one” instead of “1”, we also implemented a word-to-number conversion using regular expressions. If these hard-coded parsing steps fail, we use a llm, e.g., Claude 2, to parse the completion.

Manual validation of parsing performance

As stated, we tried to account for the variation in outputs with custom regular expressions. We selected a large, diverse subset of questions (10 per topic for all model reports) and manually investigated where the parsed output does not match the actual answer intended by the model. We found that for questions, the parsing was accurate in 99.76% of the cases, while for floating point questions, the parsing was accurate in 99.17% of the cases. The models most frequently generating errors are pplx-7b-chat and Mixtral-8x7b.

4.2.0.3 Models

4.2.0.3.1 Completion models

We used Galactica (120b)[61] with the default settings.

4.2.0.3.2 Instruction-tuned models

In addition, we used Claude 2, Claude3 (Opus),[78] GPT-4,[4] GPT-3.5-turbo,[1] Gemini Pro,[79] Mixtral-8x7b[80] LLaMA2 (7B, 13B, 70B),[81] as well as the 7B chat model from Perplexity.AI.

4.2.0.3.3 Tool augmented models

In addition to directly prompting llms, we also investigated the performance of tool-augmented systems. For this, we investigated the 7B parameter “online” model of Perplexity.AI. In addition, we utilized gpt-3.5-turbo as well as Claude 2, with ReAct-style tool augmentation.[64] The latter two models had access to WolframAlpha, the ArXiv api, a Python interpreter, and web search (using DuckDuckGo). We implemented the systems using Langchain with the default prompts and constrained the system to a maximum of ten llm calls.

4.3 Confidence estimate↩︎

To estimate the models’ confidence, we prompted them with the question (and answer options for mcq) and the task to rate their confidence to produce the correct answer on a scale from 1 to 5. We decided to use verbalized confidence estimates[71] since we found those closer to current practical use cases than other prompting strategies, which might be more suitable when implemented in systems.

4.4 Human baseline↩︎

4.4.0.1 Question selection

Since we anticipated that we would not be able to collect enough responses for every question to allow for a meaningful statistical analysis, we decided to show a relevant subset of all questions to the human scorers. For selecting the subset, we decided to address two questions:

  • Are the questions for which the models scored poorly just too difficult or unanswerable?

  • Are there areas in which the performance of humans is very different from the ones of the models?

To answer the first question, we selected up to 13 questions per source that llmsdid not answer correctly in an initial scoring round. In addition, we picked 100 diverse ones using greedy MaxMin sampling on the embeddings on the questions computed using BART. We added five random questions about the number of nmr signals and the point group of molecules.

4.4.0.2 Study design

For our initial study, we wanted to maximize the response rate given our available resources. For this reason, we did not opt for a highly controlled study setting. That is, while users were prompted not to use external tools other than a calculator and not consult with other humans, we do not have any way to verify that the participants complied with those rules. Note that users were also allowed to skip questions.

Another aspect of requiring unsupervised question answering is that in real life, humans have tools and can use them to answer any question of interest. In our current study, we prompted users not to use those tools.

4.4.0.3 Participants

Users were open to reporting about their experience in chemistry. Overall, did so. Out of those, reported to have been awarded a Ph.D. are beyond a first postdoc, have a master’s degree, and have a bachelor’s degree.

4.4.0.4 Comparison with models

To compare the performance of humans (who might have answered only some questions) with the performance of models (which answered all questions), we focussed on questions that at least four humans answered and limited the pool of human scorers to those who answered at least 100 questions (i.e., humans). The latter threshold was chosen to limit it to humans who seriously attempted to answer a part of the questions systematically. This analysis might lead to potential biases, most likely in favor of humans, as they were allowed to skip questions. humans answered more than 200 questions. For the analysis, we treated each human as a model. We computed the topic aggregated averages per human for analyses grouped by topic and then averaged over all humans.

4.5 Classification of questions into topics↩︎

When curating our dataset, we systematically recorded keywords and sources. To allow for analysis of the model performance as a function of the topic, we leverage this information together with the output of sequence classification models. We use this information to make the assignment for questions that can easily be assigned to a topic based on the source (e.g., number of nmr signals, chemical compatibility, toxicology exam questions). For the remaining ones, e.g., from chemistry olympiad questions, we use zero-short sequence classification[82] using the BART model[58], [83], which our preliminary analysis found to be more robust than topic modeling based on embeddings from OpenAI’s ada model or Cohere’s Cohembed-english-v3.0 model.

Data and code availability↩︎

The code and data for ChemBench is available at https://github.com/lamalab-org/chem-bench. The code for the app for our human baseline study is available at https://github.com/lamalab-org/chem-bench-app. To ensure reproducibility, this manuscript was generated using the framework.[84] The code to rebuild the paper (including code for all figures and numbers next to which there is a GitHub icon) can be found at \GitHubURL. To facilitate reproduction, some intermediate analysis results are cached at http://dx.doi.org/10.5072/zenodo.34706.

Acknowledgements↩︎

This work was supported by the Carl Zeiss Foundation, a “Talent Fund” of the “Life” profile line of the Friedrich Schiller University Jena and the FAIRmat consortium. We also thank Stability.AI for the access to its HPC cluster and donations.

M.A.expresses gratitude to the European Research Council (ERC) for evaluating the project with the reference number 101106377 titled “CLARIFIER” and accepting it for funding under the HORIZON TMA MSCA Postdoctoral Fellowships - European Fellowships. Furthermore, M.A.acknowledges the funding provided by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (Grant Reference: EP/Y023447/1; Organization Reference: 101106377).

M.R.and U.S.S.thanks the “Deutsche Forschungsgemeinschaft” for funding under the regime of the priority programme SPP 2363 “Utilization and Development of Machine Learning for Molecular Applications – Molecular Machine Learning” (SCHU 1229/63-1; project number 497115849).

In addition, we thank the OpenBioML.org community and their ChemNLP project team for valuable discussions. Moreover, we thank Pepe Márquez for discussions and support and Julian Kimmig for feedback on the web app. In addition, we acknowledge support from Sandeep Kumar with an initial prototype of the web app. We thank Bastian Rieck for developing the LaTeX-credit package (https://github.com/Pseudomanifold/latex-credits).

Conflicts of interest↩︎

K.M.J.is a paid consultant for OpenAI (as part of the red teaming network). M.P.is an employee of Stability.AI, and A.M.and N.A.are paid contractors of Stability.AI.

Author contributions↩︎

5 Appendix↩︎

5.1 Desired properties of a chemistry benchmark↩︎

  • End-to-end automation. For model development, the evaluations must be run many times (e.g., on regular intervals of a training run). Approaches that rely on humans scoring the answers of a system[85][87] can thus not be used.

  • Careful validation by experts. Manual curation is needed to minimize the number of incorrect or unanswerable questions.[88] This is motivated by the observation that many widely used benchmarks are plagued by noisiness.[89], [90]

  • Usable with models that support special treatment of molecules. Some models, such as Galactica[61], use special tokenization or encoding procedures for molecules or equations. The benchmark system must encode the semantic meaning of various parts of the question or answer to support this.

  • Usable with black box systems. Many relevant systems do not provide access to model weights or raw logits. This might be the case because the systems are proprietary or because they involve not only llms but also external tools such as search apis or code executors.[62][64] Thus, a benchmark should not assume access to the raw model outputs but be able to operate on text completions.

  • Probing capabilities beyond answering of mcqs. In real-world chemistry, as well as higher-level university education, multiple-choice questions are seldom utilized. Yet, most benchmarking frameworks focus on the mcq setting because of the ease of evaluation. Realistic evaluations must measure capabilities beyond answering mcq.

  • Cover a diverse set of topics. Chemistry, as the “central science”, bridges multiple disciplines.[91] To even just approximate “chemistry capabilities” the topics covered by a chemistry benchmark must be very diverse.

5.2 Related work↩︎

Existing benchmarks such as those from [45], [92], [85], [47] fail to comply with most of the requirements stipulated above. While these benchmarks could provide valuable insights in the short term, they cannot follow the rapid additions to the llm space. ChemBench aims to correct this through a set of developments: compatibility with BigBench, end-to-end automation, a particular focus on chemical safety, employment of diverse prompting strategies, and specialized notation for molecules and mathematical symbols. Moreover, our robust framework, including the platform chembench.org, will engage the community in open-source contributions.

5.3 Benchmark corpus↩︎

To ensure maximal interoperability with existing benchmarks or tools, we curated the data in an extended form of the widely used BigBench format.[43] This also implies that future baselines can be built on top of our infrastructure if saved in the same format.

shows the distribution of the Flesch-Kincaid reading ease scores of the questions. We see that the questions are generally complex to read.

Figure 8: Distribution of Flesch-Kincaid reading ease scores of the questions. The Flesch-Kincaid reading ease score[93] measures how easy a text is to read. It is calculated based on the average number of syllables per word and words per sentence. The higher the score, the easier the text is to read. The distribution of the questions’ scores is shown in the histogram.

shows that most questions in our corpus are mcq. A substantial fraction, in contrast to other benchmarks, is open-ended.

Figure 9: Number of multiple choice questions vs.open-ended questions per topic. The bar plot shows the number of mcq and general questions per topic.

5.4 Model performance↩︎

We also evaluated the model performance on the entire ChemBench corpus. shows the fraction of questions that were answered completely correctly by the models. Note that this ranking differs from the one on the “tiny” subset.

Figure 10: Overall performance of the models on the ChemBench corpus. The bar plot shows the fraction of questions that were answered completely correctly by the models. Scores computed on the entire ChemBench corpus.

shows the performance of the models on the different topics of the ChemBench corpus. The general pattern of performance varies significantly between the different topics and is also observed when the models are evaluated on the entire corpus. However, since some subjects are composed of questions from different sources, the ranking of the models is, in some instances, different from the one on the “tiny” subset.

Figure 11: Performance of the models on the different topics of the ChemBench corpus. The radar plot shows the performance of the models on the different topics of the ChemBench corpus. The performance is measured as the fraction of questions answered completely correctly by the models. A score of 1 indicates that all questions were answered completely correctly, while a score of 0 indicates that none were answered completely correctly.

shows this data as a parallel coordinates plot. This visualization highlights the critical observation that the ranking of the models highly depends on the questions they are evaluated on. Only very broad benchmarks have the potential to provide a comprehensive view of a model’s capabilities. However, even in those cases, the weighting of the different topics is crucial. Hence, we believe that fine-grained analysis of model performance is vital for the development of future benchmarks.

Figure 12: Performance of the models on the different topics of the ChemBench corpus. The parallel coordinates plot shows the performance of the models on the different topics of the ChemBench corpus. The performance is measured as the fraction of questions answered completely correctly by the models.

Figure 13: Performance of the models on the different topics of the “tiny” subset. The parallel coordinates plot shows the performance of the models on the different topics of the “tiny” subset. The performance is measured as the fraction of questions answered completely correctly by the models.

To further investigate the performance of the models, we also compared the performance on different data sources. Compared to topics, this is a more fine-grained analysis, as topics can be composed of questions from different sources. In , we see that the performance of the models varies significantly between the different data sources. Interestingly, the performance of the models on questions sourced based on textbooks seems to be better for our models than some semi-programmatically created tasks, such as questions about the number of signals in an nmr spectrum.

Figure 14: Fraction of completely correctly answered questions per data source. The heatmap shows, in color, the fraction of questions answered completely correctly by different systems for some of our data sources. The performance is measured as the fraction of questions answered completely correctly by the models. A score of one (red) indicates that all questions were answered completely correctly, while a score of zero (blue) indicates that none of the questions were answered completely correctly. We see that the performance of the models varies significantly between the different data sources. For instance, it is interesting to observe that questions sourced based on textbooks seem easier for our leading models than for humans. However, this performance does not correlate with performance on other sources, e.g., semi-programmatically created tasks such as questions about the number of signals in an nmr spectrum.

shows the same analysis on the “tiny” subset.

Figure 15: Fraction of completely correctly answered questions per data source on the “tiny” subset. The heatmap shows, in color, the fraction of questions answered completely correctly by different systems for some of our data sources. The performance is measured as the fraction of questions answered completely correctly by the models. A score of one (red) indicates that all questions were answered completely correctly, while a score of zero (blue) indicates that none were answered completely correctly. We see that the performance of the models varies significantly between the different data sources. For instance, it is interesting to observe that questions sourced based on textbooks seem easier for the leading models than for humans. However, this performance does not correlate with performance on other sources, e.g., semi-programmatically created tasks such as questions about the number of signals in an nmr spectrum.

One might wonder if questions that are more difficult to parse lead to worse performance of the models. shows no clear correlation between the reading ease of the questions and the performance of the models.

Figure 16: Model performance as a function of reading ease. The violin plots show the distribution of reading ease scores for questions answered completely correctly and those not. We do not observe a clear correlation between the reading ease of the questions and the performance of the models.

In addition, we analyzed the performance on questions which requires calculation. For this, we manually labeled questions that require multiple calculation steps. We find that the ranking of models is different for questions with our without calculations ().

Figure 17: Overall model performance for questions with and without calculation steps. We find that the ranking of models changes if we evaluate them on questions that require and do not require calculations, respectively.

5.5 Performance as a function of molecular features↩︎

To better understand if the performance of the models is correlated with specific features of the molecules, we analyzed the performance of the models as a function of the number of atoms and the complexity of the molecules. shows that the performance of the models is not correlated with the complexity of the molecules but rather with the number of atoms (, similar trivial correlation for ). The corresponding Spearman correlation coefficients are listed in .

Figure 18: Dependence of the mean absolute error in predicting the number of NMR signals on the Böttcher complexity of the molecules. The complexity measure proposed by [94] is an information-theoretic additive measure of compound complexity that follows chemical intuitions. The plot shows that for the llms, the predictive performance (measured as the mean absolute error in the prediction of the number of NMR signals) is not correlated with the complexity of the molecules. For inference based on reasoning, one would expect that the complexity of the molecule is a good predictor of the difficulty of the question.

Figure 19: Dependence of the mean absolute error in predicting the number of NMR signals on the number of atoms. The plot shows that for the llms, the predictive performance (measured as the mean absolute error in the prediction of the number of NMR signals) is correlated with the number of atoms in the molecule. For reasoning-based inference, one would expect that the number of atoms in the molecules is not necessarily a good predictor, and certainly worse than complexity measures, of the difficulty of the question.

Figure 20: Dependence of the mean absolute error in predicting total electron counts on the number of atoms. The plot shows that for the llms, the predictive performance (measured as the mean absolute error in the prediction of the total electron counts) is correlated with the number of atoms in the molecule.

Table 1: Spearman correlation coefficients for the correlation of model performance with molecular features. The table shows the Spearman correlation coefficient \(\rho\) for the correlation of the performance of the models with the number of atoms and the complexity of the molecules.
topic molecular descriptor \(\rho\) GPT-4 \(\rho\) Claude 3 \(\rho\) Galactica \(\rho\) humans
number of nmr signals number of atoms
complexity
total electron counts number of atoms
smiles iupac name matching number of atoms
complexity

5.6 Influence of model scale↩︎

To obtain first insights in how the performance llms depends on scale, we tested the llms of the LLaMA series. Interestingly, we find that the 7B and 70B models perform comparably, with the 7B showing lower performance (fraction of correct answers for the 7B, 13B, and 70B models are , , ).

Note that such analyses are difficult as models are typically not directly comparable in terms of dataset and training protocol.[95]

5.7 Human baseline↩︎

5.7.0.1 App

To facilitate the collection of responses, we developed a responsive web application in Typescript using the Next.js[96] app router framework. This application handles serving the user interface and exposes various rest apis for relevant operations. We utilize a MySQL[97] database and Prisma orm[98] for efficient database management. The web application is styled with Tailwind CSS[99] using the shadcn/ui component library and uses NextAuth[100] for easy and secure user authentication and postMark for sending Emails. The application is hosted on the Vercel web hosting platform.

5.7.0.2 Statistics

shows the distribution of scores our human scorers achieved.

Figure 21: Distribution of human scores. The histogram and kernel density estimates show the fraction of questions answered completely correctly. Since the best possible score for each question is one and the worst possible score is zero, the values on this plot are between zero and one.

We also recorded the time humans took to answer the questions. This time is the time from the question being displayed to the human to the human submitting the answer. Interestingly, we found no significant correlation between the experience of the human scorers and the performance on the questions (, Spearman’s \(\rho \approx \variable{output/spearman_experience_score.txt}\) \(\quad\), and \(p \approx \variable{output/spearman_experience_score_p.txt}\)).

Figure 22: Time taken by human scorers to answer questions vs.correctness of their answers. From the plot, it is clear that there is no clear dependence of the correctness of the answers on the time taken by the human scorers to answer the questions.

Additionally, we prompted users to provide additional information about their experience in chemistry. While we recorded fine-grained information, e.g., their specialization, we focused on the number of years since the first university-level chemistry course. shows that the experience of the human scorers was not significantly correlated with the correctness of their answers (, Spearman’s \(\rho \approx \variable{output/spearman_experience_score.txt}\) \(\quad\), and \(p \approx \variable{output/spearman_experience_score_p.txt}\)).

Figure 23: Experience of human scorers vs.correctness of their answers. The experience (in the number of years since the first university-level chemistry course) of the human scorers was not significantly correlated with the correctness of their answers.

5.8 Confidence estimates↩︎

Since it is important to understand if models can provide an indication of whether their answer might likely be incorrect, we prompted some of our top performing llms to return the confidence in providing a correct answer on an ordinal scale. This is similar to the verbalized confidence scores reported by [71]. plots the distribution of those scores. We find that the models show different distributions of confidence scores, which, for some, are skewed to the extremes.

Figure 24: Distribution of confidence scores reported by llms. llms show different distributions of confidence scores. The confidence scores are reported on an ordinal scale from 1 to 5, with 1 indicating low confidence and 5 indicating high confidence. The bar plots show how many questions were answered with each confidence score.

5.9 Leaderboard↩︎

Our leaderboard is based on the tool chain developed for Matbench.[52] Briefly, the ChemBench pipeline produces standardized files in json format that contributors can add via pull requests to the ChemBench repository. The Markdown tables and interactive plots are automatically generated and updated on the ChemBench website.

References↩︎

[1]
T. B. Brown et al., “Language models are few-shot learners,” May 2020, [Online]. Available: https://arxiv.org/abs/2005.14165.
[2]
Z. Zhong, K. Zhou, and D. Mottin, “Benchmarking large language models for molecule prediction tasks,” 2024, [Online]. Available: https://arxiv.org/abs/2403.05075.
[3]
T. H. Kung et al., “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models,” PLoS digit. health, vol. 2, no. 2, p. e0000198, 2023.
[4]
OpenAI et al., “GPT-4 technical report,” Mar. 2024, [Online]. Available: https://arxiv.org/abs/2303.08774.
[5]
D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes, “Autonomous chemical research with large language models,” Nature, vol. 624, no. 7992, pp. 570–578, Dec. 2023, doi: 10.1038/s41586-023-06792-0.
[6]
A. M. Bran, S. Cox, A. D. White, and P. Schwaller, “ChemCrow: Augmenting large-language models with chemistry tools,” 2023, [Online]. Available: https://arxiv.org/abs/2304.05376.
[7]
S. Bubeck et al., “Sparks of artificial general intelligence: Early experiments with GPT-4,” 2023, [Online]. Available: https://arxiv.org/abs/2303.12712.
[8]
E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, 2021, pp. 610–623.
[9]
R. Bommasani et al., “On the opportunities and risks of foundation models,” Aug. 2021.
[10]
M. Anderljung et al., “Frontier AI regulation: Managing emerging risks to public safety,” Jul. 2023, [Online]. Available: https://arxiv.org/abs/2307.03718.
[11]
B. Schmitt, “Transforming qualitative research in phygital settings: The role of generative AI,” Qualitative Market Research: An International Journal, Dec. 2023, doi: 10.1108/qmr-08-2023-0107.
[12]
A. D. White, “The future of chemistry is language,” Nat. Rev. Chem., vol. 7, no. 7, pp. 457–458, May 2023, doi: 10.1038/s41570-023-00502-0.
[13]
K. M. Jablonka et al., “14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon,” Digit. Discov., vol. 2, no. 5, pp. 1233–1250, 2023.
[14]
K. M. Jablonka, P. Schwaller, A. Ortega-Guerrero, and B. Smit, “Leveraging large language models for predictive chemistry,” Nat. Mach. Intell., pp. 1–9, 2024.
[15]
Z. Xie, X. Evangelopoulos, Ö. H. Omar, A. Troisi, A. I. Cooper, and L. Chen, “Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules,” Chem. Sci., vol. 15, no. 2, pp. 500–510, 2024.
[16]
C. Liao, Y. Yu, Y. Mei, and Y. Wei, “From words to molecules: A survey of large language models in chemistry,” 2024, [Online]. Available: https://arxiv.org/abs/2402.01439.
[17]
D. Zhang et al., “ChemLLM: A chemical large language model,” arXiv preprint arXiv:2402.06852, 2024, [Online]. Available: https://arxiv.org/abs/2402.06852.
[18]
M. C. Ramos, S. S. Michtavy, M. D. Porosoff, and A. D. White, “Bayesian optimization of catalysts with in-context learning,” 2023, [Online]. Available: https://arxiv.org/abs/2304.05341.
[19]
A. Kristiadi, F. Strieth-Kalthoff, M. Skreta, P. Poupart, A. Aspuru-Guzik, and G. Pleiss, “A sober look at LLMs for material discovery: Are they actually good for bayesian optimization over molecules?” 2024, [Online]. Available: https://arxiv.org/abs/2402.05015.
[20]
A. N. Rubungo, C. Arnold, B. P. Rand, and A. B. Dieng, “Llm-prop: Predicting physical and electronic properties of crystalline solids from their text descriptions,” arXiv preprint arXiv:2310.14029, 2023.
[21]
D. Flam-Shepherd and A. Aspuru-Guzik, “Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files,” 2023, [Online]. Available: https://arxiv.org/abs/2305.05708.
[22]
N. Gruver, A. Sriram, A. Madotto, A. G. Wilson, C. L. Zitnick, and Z. Ulissi, “Fine-tuned language models generate stable inorganic materials as text,” 2024, [Online]. Available: https://arxiv.org/abs/2402.04379.
[23]
L. Patiny and G. Godin, “Automatic extraction of FAIR data from publications using LLM,” ChemRxiv preprint doi:10.26434/chemrxiv-2023-05v1b-v2, Dec. 2023, doi: 10.26434/chemrxiv-2023-05v1b-v2.
[24]
J. Dagdelen et al., “Structured information extraction from scientific text with large language models,” Nat. Commun., vol. 15, no. 1, Feb. 2024, doi: 10.1038/s41467-024-45563-x.
[25]
Z. Zheng et al., “Image and data mining in reticular chemistry powered by GPT-4V,” Digit. Discov., 2024, doi: 10.1039/d3dd00239j.
[26]
J. Lála, O. O’Donoghue, A. Shtedritski, S. Cox, S. G. Rodriques, and A. D. White, “PaperQA: Retrieval-augmented generative agent for scientific research,” 2023, [Online]. Available: https://arxiv.org/abs/2312.07559.
[27]
J. H. Caufield et al., “Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning,” 2023, [Online]. Available: https://arxiv.org/abs/2304.02711.
[28]
T. Gupta, M. Zaki, N. Krishnan, et al., “DiSCoMaT: Distantly supervised composition extraction from tables in materials science articles,” arXiv preprint arXiv:2207.01079, 2022, [Online]. Available: https://arxiv.org/abs/2207.01079.
[29]
K. Darvish et al., “ORGANA: A robotic assistant for automated chemistry experimentation and characterization,” 2024, [Online]. Available: https://arxiv.org/abs/2401.06949.
[30]
S. Miret and N. Krishnan, “Are LLMs ready for real-world materials discovery?” 2024, [Online]. Available: https://arxiv.org/abs/2402.05200.
[31]
J. M. Granda, L. Donina, V. Dragone, D.-L. Long, and L. Cronin, “Controlling an organic synthesis robot with machine learning to search for new reactivity,” Nature, vol. 559, no. 7714, pp. 377–381, Jul. 2018, doi: 10.1038/s41586-018-0307-8.
[32]
N. H. Angello et al., “Closed-loop optimization of general reaction conditions for heteroaryl suzuki-miyaura coupling,” Science, vol. 378, no. 6618, pp. 399–405, Oct. 2022, doi: 10.1126/science.adc8743.
[33]
C. W. Coley et al., “A robotic platform for flow synthesis of organic compounds informed by AI planning,” Science, vol. 365, no. 6453, p. eaax1566, Aug. 2019, doi: 10.1126/science.aax1566.
[34]
B. Burger et al., “A mobile robotic chemist,” Nature, vol. 583, no. 7815, pp. 237–241, Jul. 2020, doi: 10.1038/s41586-020-2442-2.
[35]
M. Seifrid et al., “Autonomous chemical experiments: Challenges and perspectives on establishing a self-driving lab,” Acc. Chem. Res., vol. 55, no. 17, pp. 2454–2466, 2022.
[36]
A. Gopal et al., “Will releasing the weights of future large language models grant widespread access to pandemic agents?” Oct. 2023, [Online]. Available: https://arxiv.org/abs/2310.18233.
[37]
D. Ganguli et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” Aug. 2022, [Online]. Available: https://arxiv.org/abs/2209.07858.
[38]
F. Urbina, F. Lentzos, C. Invernizzi, and S. Ekins, “Dual use of artificial-intelligence-powered drug discovery,” Nat. Mach. Intell., vol. 4, no. 3, pp. 189–191, Mar. 2022, doi: 10.1038/s42256-022-00465-9.
[39]
Q. L. Campbell, J. Herington, and A. D. White, “Censoring chemical data to mitigate dual use risk,” Apr. 2023.
[40]
R. Moulange, M. Langenkamp, T. Alexanian, S. Curtis, and M. Livingston, “Towards responsible governance of biological design tools,” Nov. 2023.
[41]
F. Urbina, F. Lentzos, C. Invernizzi, and S. Ekins, “A teachable moment for dual-use,” Nat. Mach. Intell., vol. 4, no. 7, pp. 607–607, Jul. 2022, doi: 10.1038/s42256-022-00511-6.
[42]
Intelligent.com, “One-third of college students used CHATGPT for schoolwork during the 2022-23 academic date,” Intelligent. https://www.intelligent.com/one-third-of-college-students-used-chatgpt-for-schoolwork-during-the-2022-23-academic-date/, Oct. 2023.
[43]
A. Srivastava et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Jun. 2022.
[44]
L. Gao et al., “A framework for few-shot language model evaluation.” Zenodo, Dec. 2023, doi: 10.5281/zenodo.10256836.
[45]
T. Guo et al., “What can large language models do in chemistry? A comprehensive benchmark on eight tasks,” May 2023, [Online]. Available: https://arxiv.org/abs/2305.18365.
[46]
W. Ahmad, E. Simon, S. Chithrananda, G. Grand, and B. Ramsundar, “ChemBERTa-2: Towards chemical foundation models,” Sep. 2022, [Online]. Available: https://arxiv.org/abs/2209.01712.
[47]
X. Cai et al., “Comprehensive evaluation of molecule property prediction with ChatGPT,” Methods, vol. 222, pp. 133–141, Feb. 2024, doi: 10.1016/j.ymeth.2024.01.004.
[48]
N. C. Frey et al., “Neural scaling of deep chemical models,” Nat. Mach. Intell., vol. 5, no. 11, pp. 1297–1305, Oct. 2023, doi: 10.1038/s42256-023-00740-3.
[49]
T. Dinh et al., “Lift: Language-interfaced fine-tuning for non-language machine learning tasks,” Adv. Neur. In., vol. 35, pp. 11763–11784, 2022.
[50]
Z. Wu et al., “MoleculeNet: A benchmark for molecular machine learning,” Chem. Sci., vol. 9, no. 2, pp. 513–530, 2018, doi: 10.1039/c7sc02664a.
[51]
K. Huang et al., “Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development,” Feb. 2021, [Online]. Available: https://arxiv.org/abs/2102.09548v2.
[52]
A. Dunn, Q. Wang, A. Ganose, D. Dopp, and A. Jain, “Benchmarking materials property prediction methods: The matbench test set and automatminer reference algorithm,” npj Comp. Mater., vol. 6, no. 1, Sep. 2020, doi: 10.1038/s41524-020-00406-3.
[53]
M. Zaki, Jayadeva, Mausam, and N. M. A. Krishnan, “MaScQA: Investigating materials science knowledge of large language models,” Digit. Discov., vol. 3, no. 2, pp. 313–327, 2024, doi: 10.1039/d3dd00188a.
[54]
D. Arora, H. G. Singh, and Mausam, “Have LLMs advanced enough? A challenging problem solving benchmark for large language models,” May 2023, [Online]. Available: https://arxiv.org/abs/2305.15074.
[55]
Y. Song, S. Miret, H. Zhang, and B. Liu, “HoneyBee: Progressive instruction finetuning of large language models for materials science,” Oct. 2023, [Online]. Available: https://arxiv.org/abs/2310.08511.
[56]
Z. Wei et al., “ChemistryQA: A complex question answering dataset from chemistry.” 2021, [Online]. Available: https://openreview.net/forum?id=oeHTRAehiFF.
[57]
Y. Song, S. Miret, and B. Liu, “MatSci-NLP: Evaluating scientific language models on materials science language tasks using text-to-schema modeling,” in Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), Jul. 2023, pp. 3621–3639, doi: 10.18653/v1/2023.acl-long.201.
[58]
M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” Oct. 2019, doi: 10.48550/ARXIV.1910.13461.
[59]
F. M. Polo, L. Weber, L. Choshen, Y. Sun, G. Xu, and M. Yurochkin, “tinyBenchmarks: Evaluating LLMs with fewer examples,” Feb. 2024, [Online]. Available: https://arxiv.org/abs/2402.14992.
[60]
P. Liang et al., “Holistic evaluation of language models,” Nov. 2022, [Online]. Available: https://arxiv.org/abs/2211.09110.
[61]
R. Taylor et al., “Galactica: A large language model for science,” Nov. 2022, [Online]. Available: https://arxiv.org/abs/2211.09085.
[62]
T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” Adv. Neur. In., vol. 36, 2024.
[63]
E. Karpas et al., “MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,” 2022, [Online]. Available: https://arxiv.org/abs/2205.00445.
[64]
S. Yao et al., “React: Synergizing reasoning and acting in language models,” 2022, [Online]. Available: https://arxiv.org/abs/2210.03629.
[65]
C. Fourrier, N. Habib, J. Launay, and T. Wolf, “What’s going on with the open llm leaderboard?” Hugging Face – The AI community building the future. [Online]. Available: https://huggingface.co/blog/open-llm-leaderboard-mmlu.
[66]
E. Beeching et al., “Open LLM leaderboard.” https://huggingface.co/spaces/HuggingFaceH4/open_llm _leaderboard; Hugging Face, 2023.
[67]
L. Berglund et al., “The reversal curse: LLMs trained on" a is b" fail to learn" b is a",” 2023, [Online]. Available: https://arxiv.org/abs/2309.12288.
[68]
Z. A. Zhu and Y. Li, “Physics of language models: Part 3.1, knowledge storage and extraction,” 2023, [Online]. Available: https://arxiv.org/abs/2309.14316.
[69]
Z. Allen-Zhu and Y. Li, “Physics of language models: Part 3.2, knowledge manipulation,” 2023, [Online]. Available: https://arxiv.org/abs/2309.14402.
[70]
O. Golovneva, Z. Allen-Zhu, J. Weston, and S. Sukhbaatar, “Reverse training to nurse the reversal curse,” arXiv preprint arXiv:2403.13799, 2024.
[71]
M. Xiong et al., “Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs,” 2023, [Online]. Available: https://arxiv.org/abs/2306.13063.
[72]
B. Li et al., “Trustworthy AI: From principles to practices,” ACM Comput. Surv., vol. 55, no. 9, pp. 1–46, Jan. 2023, doi: 10.1145/3555803.
[73]
M. A. Yirik, M. Sorokina, and C. Steinbeck, “MAYGEN: An open-source chemical structure generator for constitutional isomers based on the orderly generation principle,” J. Cheminf., vol. 13, no. 1, Jul. 2021, doi: 10.1186/s13321-021-00529-9.
[74]
J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman, “ZINC: A free tool to discover chemistry for biology,” J. Chem. Inf. Model., vol. 52, no. 7, pp. 1757–1768, Jun. 2012, doi: 10.1021/ci3001277.
[75]
Actelion, “OpenChemLib.” [Online]. Available: https://github.com/actelion/openchemlib.
[76]
A. Togo, K. Shinohara, and I. Tanaka, spglib: A software library for crystal symmetry search,” 2018, doi: 10.48550/ARXIV.1808.01590.
[77]
S. Kim et al., “PubChem 2023 update,” Nucleic Acids Res., vol. 51, no. D1, pp. D1373–D1380, Oct. 2022, doi: 10.1093/nar/gkac956.
[78]
Anthropic, “The claude 3 model family: Opus, sonnet, haiku.” 2024, [Online]. Available: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model \_Card\_Claude\_3.pdf.
[79]
J. Martin, C. Hanner, N. Bolatto, and D. L. Akin, “TRAVELS: A multimodal mobility concept for highly capable planetary traverses,” in ASCEND 2022, Oct. 2022, doi: 10.2514/6.2022-4397.
[80]
A. Q. Jiang et al., “Mixtral of experts,” Jan. 2024.
[81]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023, [Online]. Available: https://arxiv.org/abs/2307.09288.
[82]
W. Yin, J. Hay, and D. Roth, “Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach,” Aug. 2019, doi: 10.48550/ARXIV.1909.00161.
[83]
Facebook, “Facebook/bart-large-mnli · hugging face.” [Online]. Available: https://huggingface.co/facebook/bart-large-mnli/.
[84]
R. Luger, M. Bedell, D. Foreman-Mackey, I. J. M. Crossfield, L. L. Zhao, and D. W. Hogg, “Mapping stellar surfaces III: An efficient, scalable, and open-source doppler imaging model,” p. arXiv:2110.06271, Oct. 2021, [Online]. Available: https://arxiv.org/abs/2110.06271.
[85]
L. Schulze Balhorn, J. M. Weber, S. Buijsman, J. R. Hildebrandt, M. Ziefle, and A. M. Schweidtmann, “Empirical assessment of ChatGPT’s answering capabilities in natural science and engineering,” Sci. Rep., vol. 14, no. 1, Feb. 2024, doi: 10.1038/s41598-024-54936-7.
[86]
M. R. AI4Science and M. A. Quantum, “The impact of large language models on scientific discovery: A preliminary study using GPT-4,” 2023, [Online]. Available: https://arxiv.org/abs/2311.07361.
[87]
C. M. Castro Nascimento and A. S. Pimentel, “Do large language models understand chemistry? A conversation with ChatGPT,” J. Chem. Inf. Model., vol. 63, no. 6, pp. 1649–1655, Mar. 2023, doi: 10.1021/acs.jcim.3c00285.
[88]
C. G. Northcutt, A. Athalye, and J. Mueller, “Pervasive label errors in test sets destabilize machine learning benchmarks,” Mar. 2021, [Online]. Available: https://arxiv.org/abs/2103.14749.
[89]
C. Frye, “PubMedQA noisy.” Dec. 2023, [Online]. Available: https://twitter.com/charles\_irl/status/1731854677711507650.
[90]
Awg, “Broken benchmark: MMLU,” LessWrong. [Online]. Available: https://www.lesswrong.com/posts/rQBaftqKMfG2uMiWb/broken-benchmark-mmlu .
[91]
A. Aspuru-Guzik, R. Lindh, and M. Reiher, “The matter simulation (r)evolution,” ACS Cent. Sci., vol. 4, no. 2, pp. 144–152, Feb. 2018, doi: 10.1021/acscentsci.7b00550.
[92]
L. Sun et al., “SciEval: A multi-level large language model evaluation benchmark for scientific research,” 2023, [Online]. Available: https://arxiv.org/abs/2308.13149.
[93]
R. Flesch, “A new readability yardstick.” J. Appl. Psychol., vol. 32, no. 3, p. 221, 1948.
[94]
T. Böttcher, “An additive definition of molecular complexity,” J. Chem. Inf. Model., vol. 56, no. 3, pp. 462–470, Feb. 2016, doi: 10.1021/acs.jcim.5b00723.
[95]
S. Biderman et al., “Pythia: A suite for analyzing large language models across training and scaling,” in International conference on machine learning, 2023, pp. 2397–2430.
[96]
Vercel, “Nextjs,” The React Framework for the Web. [Online]. Available: https://nextjs.org/.
[97]
Oracle, “MySQL,” The world’s most popular open source database. [Online]. Available: https://www.mysql.com/.
[98]
I. Prisma Data, “Prisma,” Prisma | Simplify working and interacting with databases. [Online]. Available: https://www.prisma.io/.
[99]
tailwindcss, “Tailwind CSS,” Tailwind CSS - Rapidly build modern websites without ever leaving your HTML. [Online]. Available: https://tailwindcss.com/.
[100]
NextAuth.js, “NextAuth.js.” [Online]. Available: https://next-auth.js.org/.