Evaluating Large Language Models Using Contrast Sets: An Experimental Approach

Manish Sanwal
University of Texas at Austin


In the field of Natural Language Inference (NLI), particularly for multi-input text classification tasks, Cross-Entropy Loss is commonly used as a general error metric. Although effective as a training benchmark, this metric does not adequately assess a model’s understanding of language entailments. In this work, we propose a novel approach for creating a contrast set for the Stanford Natural Language Inference (SNLI) dataset [1]. Our method automatically replaces verbs, adverbs, and adjectives with their synonyms while maintaining the original sentence meaning. This approach helps determine whether a model truly comprehends the language or merely identifies recurring patterns for making predictions. We utilized the ELECTRA-small model [2] for our investigation. While the model exhibits 89.9% accuracy on the standard SNLI dataset, its performance drops to 72.5% on our contrast set, a decrease of 17.4 percentage points. This finding prompted an in-depth analysis to understand the underlying learning patterns of the model. Subsequently, we enhanced the model’s robustness by fine-tuning it with a contrast training dataset tailored for SNLI, resulting in an improved accuracy of 85.5% on contrast sets. These experiments underscore the necessity for more balanced datasets in NLI tasks that account for varied linguistic expressions. We anticipate that our findings will inspire further development of comprehensive datasets, fostering the advancement of more nuanced and effective NLI models.

1 Introduction

Language entailment tasks have long stood as pivotal challenges in the field of natural language processing (NLP). With the advent of deep learning techniques, particularly the development of transformer models [3], the landscape of these tasks has undergone a significant transformation. State-of-the-art models are now achieving scores that rival, and in some cases surpass, human evaluators. This remarkable progress, however, predominantly hinges on conventional evaluation metrics, chiefly accuracy against established validation datasets.

Despite these advances, a growing body of literature [4], [5] suggests that these models, while effective, might succeed by capitalizing on superficial patterns prevalent in test examples rather than attaining a profound understanding of language nuances. This phenomenon raises critical questions about the depth and authenticity of the language understanding these models claim to achieve.

In our research, we introduce a novel method for assessing Natural Language Inference (NLI) models. This approach uniquely incorporates the use of a contrast set alongside traditional validation datasets. It is specifically applied to the Stanford Natural Language Inference (SNLI) dataset [1], which challenges models to analyze a given premise and hypothesis, determining if their relationship is one of entailment, contradiction, or neutrality. Our innovative approach, which integrates contrast sets into the evaluation process, adds a new dimension to the scrutiny of NLI models.

Our methodology aims to rigorously test models not just on their ability to predict these relationships in standard settings but also on their robustness and adaptability to nuanced, contextually varied scenarios presented in the contrast set. This dual-layered evaluation framework is designed to probe the models’ deeper language understanding capabilities, moving beyond mere pattern recognition to a more authentic and comprehensive comprehension of linguistic constructs.

Building on the premise of our initial investigation, the second phase of our experiment pivoted towards enhancing the model’s performance on the contrast set. Recognizing the potential limitations in the current training regimes, we hypothesized that a more nuanced and diverse training dataset could significantly bolster the model’s ability to handle the complexity inherent in our contrast set.

To test this hypothesis, we employed a technique known as "Data Augmentation." This approach involved creating an expanded training dataset, which we refer to as the "contrast training set." This new set was designed to mimic the diversity and complexity of the contrast set, thereby exposing the model to a broader spectrum of linguistic scenarios during its training phase. Our objective was to not just train the model on standard examples but to also fine-tune it on examples that closely align with the challenging scenarios presented in the contrast set.

The results of this data augmentation strategy were promising. By training the model on this enriched dataset, we observed a notable improvement in its performance. Specifically, the model’s accuracy on the contrast set rose to 85.5%. This marked increase underscored the effectiveness of our data augmentation approach, highlighting its potential as a powerful tool for enhancing the sophistication and adaptability of NLI models.

This experiment demonstrates the significance of training data diversity in developing models that not only perform well on standard benchmarks but also possess a robust understanding of nuanced language contexts. By integrating these more complex and varied training examples, we have taken a step towards creating models that more closely mirror the depth and flexibility of human language comprehension.

2 Model and the Dataset

2.1 The Model

For our investigation, we utilized the ELECTRA-small model [2]. Originally, ELECTRA was conceived as a discriminator capable of differentiating between "real" input tokens and "fake" ones generated by another neural network, akin to the discriminator in a Generative Adversarial Network (GAN). Remarkably, even at a small scale, ELECTRA demonstrates robust performance on a single GPU. When scaled up, it achieves state-of-the-art results on the SQuAD 2.0 dataset [6]. The model can also be fine-tuned for various downstream tasks, including classification, sequence tagging, and language entailment.

2.1.1 Standard Evaluation Method

For a classification problem with \(C\) classes, the cross-entropy loss for a single instance can be expressed as:

\[\text{CrossEntropy} = - \sum_{c=1}^{C} y_c \log(p_c)\]

In this context, \(y_c\) denotes a binary indicator, assuming values of either 0 or 1, to establish whether the class label \(c\) accurately represents the classification of a given observation. Concurrently, \(p_c\) represents the predicted probability that the observation is correctly classified under class \(c\). Considering a dataset comprising \(N\) instances, the aggregate cross-entropy loss is typically expressed as the mean loss across all instances.

\[\text{Error} = \frac{1}{N} \sum_{n=1}^{N} \text{CrossEntropy}(y_{nc}, p_{nc})\] In this equation, \(y_{nc}\) and \(p_{nc}\) refer to the true label and the predicted probability for class \(c\) in instance \(n\), respectively.
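The per-instance loss and its mean over a dataset can be sketched in a few lines of Python; this is a minimal illustration of the two formulas above, not the training implementation:

```python
import math

def cross_entropy(y, p):
    """Cross-entropy for one instance: y is a one-hot label vector, p the predicted probabilities."""
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p))

def mean_cross_entropy(labels, probs):
    """Mean loss over N instances, i.e. the aggregate Error term."""
    return sum(cross_entropy(y, p) for y, p in zip(labels, probs)) / len(labels)
```

For a correctly confident prediction the loss is small: a one-hot label `[0, 1, 0]` against probabilities `[0.1, 0.8, 0.1]` yields simply `-log(0.8)`.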

2.2 The Dataset

The SNLI corpus, in its 1.0 version [1], comprises a dataset of 570,000 pairs of sentences, all written in English and manually annotated for a balanced classification task. The sentences are categorized into three labels: entailment, contradiction, and neutral, which are fundamental to the study of Natural Language Inference (NLI), also referred to as Recognizing Textual Entailment (RTE). The dataset primarily sources its language from English-speaking Flickr users and crowdworkers on Amazon Mechanical Turk. Each data entry in the corpus includes a premise and a hypothesis, both represented as strings, along with an integer label that indicates the category of the sentence pair.

2.2.1 Data Fields

premise: This is a string that serves as a basis for evaluating the validity of the hypothesis.
hypothesis: A string that is to be assessed for its truthfulness, falsehood, or indeterminacy in relation to the premise.
label: An integer used to categorize the relationship between the premise and the hypothesis. A value of 0 signifies that the hypothesis is entailed by the premise, 1 indicates neither entailment nor contradiction between the premise and hypothesis, and 2 denotes that the hypothesis contradicts the premise.

{
  "premise": "A person on a horse jumps over a broken down airplane.",
  "hypothesis": "A person is training his horse for a competition.",
  "label": 1
}
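For illustration, such an entry can be handled as a plain record with a label lookup; the field names follow the SNLI corpus, while the `LABELS` mapping is our own shorthand for the integer encoding described above:

```python
# Integer encoding used by the SNLI corpus.
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

example = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
}

print(LABELS[example["label"]])  # prints "neutral"
```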

3 Contrast Set Evaluation

3.1 General Framework

A model that depends on basic cues but lacks a deep understanding of language can perform effectively in terms of average accuracy, particularly when these cues are often predictive. [7] suggest that numerous SQuAD questions can be resolved through elementary heuristics, specifically those focusing on type and keyword-matching. To assess whether current models have acquired knowledge beyond these rudimentary patterns, we propose a contrast set designed to challenge inadequate models by modifying test examples. Consider the following example.

Figure 1: image.

In the above case, a human evaluator can easily understand the updated hypothesis and predict that it neither entails nor contradicts the premise, yet the model predicts that the hypothesis entails the premise. Let us define a contrast set generator \(A\) as a function that takes a premise \(p\) and a hypothesis \(h\), and optionally the model \(f\), and produces a contrasting example \((p, h')\).

\[(p, h') = A(p, h, f)\]

Given the model \(f\), we predict the probability of each class on the contrast example

\[p_{nc} = f(p, h')\]

and with the predicted probability \(p_{nc}\) we define the contrast cross-entropy error as

\[\text{Contrast Error} = \frac{1}{N} \sum_{n=1}^{N} \text{CrossEntropy}(y_{nc}, p_{nc})\]
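As a schematic sketch, the contrast error can be computed by threading any perturbation function \(A\) and model \(f\) through the mean cross-entropy; the `examples`, `A`, and `f` arguments here are placeholders, not our actual data pipeline or classifier:

```python
import math

def contrast_error(examples, A, f):
    """Mean cross-entropy of model f on contrast examples (p, h') = A(p, h, f).

    `examples` yields (premise, hypothesis, one_hot_label) triples,
    `A` generates the contrasting hypothesis, and `f` returns class probabilities.
    """
    total = 0.0
    for p, h, y in examples:
        p2, h_prime = A(p, h, f)      # generate the contrasting example
        probs = f(p2, h_prime)        # model's predicted class probabilities
        total += -sum(yc * math.log(pc) for yc, pc in zip(y, probs))
    return total / len(examples)
```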

3.2 Contrast Set Creation

Generating contrast sets in the field of natural language processing presents a significant challenge, a stark contrast to the relatively simpler process in image classification. In image classification, one can create a contrasting example by introducing a minimal, almost imperceptible amount of noise to the original input, as described by [8]. These minute adjustments, while not altering the fundamental semantics or meaning of the image, are capable of influencing the predictions of models, particularly those that are excessively sensitive to changes that do not affect the underlying semantic content.

In the realm of natural language processing, the comparable strategy would involve paraphrasing the original input text, a concept explored by [9]. Paraphrasing aims to rewrite the text while preserving its original meaning, effectively creating a semantic mirror of the original input that differs only in its phrasing or structure. However, this task is fraught with complexities. High-precision paraphrase generation is a sophisticated endeavor, as it requires maintaining the delicate balance of meaning in the face of linguistic modifications. Unlike in image processing, where the alterations are subtle and non-semantic, changes in language, even minor ones, often lead to a shift in meaning. Thus, most attempts to edit or modify a sentence in the pursuit of paraphrase generation inadvertently result in a change in its original meaning, presenting a substantial hurdle in the development of robust natural language processing models that can reliably handle such semantically-preserving transformations.

In our experiment, we employed a straightforward approach, focusing on the automated substitution of select keywords in the hypothesis with their synonyms. This method, while not always yielding grammatically correct sentences, has shown promising results. During our trials, we observed that despite some grammatical inconsistencies, human evaluators were still able to accurately determine the correct label for these sentences.

To create the contrast set, we used the validation set from the Stanford Natural Language Inference (SNLI) database. The SNLI database, with its comprehensive collection of sentence pairs, provided a robust foundation for applying our synonym substitution technique. This choice ensured that our experiment was grounded in a dataset reflective of real-world language usage. We utilized the Natural Language Toolkit (nltk) library to facilitate this process. The initial step involved applying a part-of-speech tagger to the hypothesis from the SNLI validation set. This tagging helped us identify the specific grammatical roles of words within the sentences. Recognizing that altering nouns could significantly change the sentence’s meaning, our strategy specifically targeted Verbs, Adjectives, and Adverbs for synonym replacement. Upon identifying these target words, we leveraged nltk’s integration with the WordNet database to find potential synonyms. For each identified Verb, Adjective, or Adverb, we compiled a list of synonyms and then randomly selected one to replace the original word in the sentence.
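The substitution step can be sketched as follows. Our actual pipeline used nltk’s part-of-speech tagger and WordNet synsets; to keep this sketch self-contained and runnable without the nltk data files, toy lookup tables stand in for both, and their contents are illustrative only:

```python
import random

# Toy stand-ins for nltk.pos_tag and WordNet lookups (Penn Treebank tags:
# V* = verb, J* = adjective, R* = adverb). The real experiment queried
# nltk's tagger and WordNet instead of these tables.
TOY_POS = {"talks": "VBZ", "smiling": "VBG", "plaid": "JJ", "quickly": "RB"}
TOY_SYNONYMS = {"talks": ["babbles", "chats"], "smiling": ["grinning"],
                "plaid": ["tartan"], "quickly": ["rapidly", "swiftly"]}

def perturb_hypothesis(hypothesis, rng=random):
    """Replace verbs, adjectives, and adverbs with a randomly chosen synonym,
    leaving all other words (notably nouns) untouched."""
    out = []
    for word in hypothesis.split():
        tag = TOY_POS.get(word, "NN")  # default: treat unknown words as nouns
        if tag[0] in ("V", "J", "R") and word in TOY_SYNONYMS:
            out.append(rng.choice(TOY_SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)
```

For instance, `perturb_hypothesis("A man talks on a cellphone")` replaces only the verb, yielding "A man babbles on a cellphone" or "A man chats on a cellphone".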

After generating the initial contrast set, we undertook a manual review process. This involved carefully parsing each sentence in the contrast set, making necessary adjustments to ensure grammatical correctness and overall coherence. This manual refinement was essential to ensure that the sentences not only maintained their core meaning but also sounded natural and correct. This method of synonym replacement, augmented by manual review, proved effective for generating variations of the hypothesis. It allowed us to test the robustness of natural language processing models against semantic-preserving linguistic variations, a vital aspect in evaluating these models’ adaptability and accuracy in processing human language.

3.3 Evaluation on Contrast Set

In evaluating our model on the contrast set, we observed some crucial insights into its language processing capabilities. Our approach involved creating a contrast set by automatically replacing verbs, adverbs, and adjectives in the Stanford Natural Language Inference (SNLI) dataset with their synonyms, maintaining the original sentence meaning. This methodology was aimed at determining whether the model genuinely comprehends language or simply identifies recurring patterns for predictions.


Our initial findings using the ELECTRA-small model revealed a significant discrepancy between its performance on the standard SNLI dataset (89.9% accuracy) and the contrast set (72.5% accuracy). This 17.4-percentage-point decrease highlighted the model’s limitations in handling nuanced language variations.

To address this, we fine-tuned the model with a contrast training dataset tailored for SNLI. This enhancement led to an improved accuracy of 85.5% on the contrast sets, emphasizing the need for more balanced datasets in Natural Language Inference (NLI) tasks that account for varied linguistic expressions. These findings suggest that current NLP datasets should increasingly be created with contrast sets in mind to encapsulate the variability and nuance in natural human languages.

4 Error Analysis

To illustrate how our model can be confounded by simple word replacements, consider the following examples:

4.1 Example 1

Figure 2: image.

In this instance, we replaced the word "talks" with "babbles" in the hypothesis. While "babble" does not exactly convey the same meaning as "talk," a human evaluator could still discern the entailment between the premise and the updated hypothesis. However, the model incorrectly predicted a neutral relationship, assigning a label of 1 instead of 0.

This misinterpretation by the model suggests that it relies heavily on pattern matching between the premise and the hypothesis. In the original hypothesis, the phrase "talking on a cellphone" in the premise directly aligns with "talks on a cellphone." However, when "talks" was replaced with "babbles," this direct pattern was disrupted. Although "babbles" is a synonym that preserves the core action of speaking on a cellphone, the model failed to recognize this subtlety.

This example highlights a critical shortcoming in the model’s understanding of natural language. It points to an over-reliance on direct word matching, rather than a deeper comprehension of linguistic nuances and synonyms. Such limitations can lead to incorrect interpretations, especially in cases where synonyms or paraphrased language are used. It underscores the importance of enhancing the model’s ability to recognize and process semantic-preserving variations in language, a key aspect for achieving more accurate and reliable natural language processing.

4.2 Example 2

Figure 3: image.

In this scenario, the original premise describes a woman engaged in painting, specifically adding stars to an American flag painting. The original hypothesis incorrectly suggests that she is sewing, leading to a label of 2, indicating a contradiction. This is because sewing is a distinctly different activity from painting.

However, when the hypothesis is updated to "The woman is a tailor," the model’s prediction changes to a label of 1, indicating a neutral relationship between the premise and hypothesis. This shift in prediction is intriguing. "Tailor" is a profession typically associated with sewing, but the statement "The woman is a tailor" does not directly contradict the premise of her painting. It’s possible that a woman can be a tailor and also engage in painting, hence the model’s classification of the relationship as neutral.

This example demonstrates the model’s challenge in processing contextual information and understanding the subtleties of language. While the updated hypothesis does not directly describe the action depicted in the premise, it also doesn’t outright contradict it, leading the model to a middle ground of neutrality. Such examples underscore the importance of enhancing the model’s capability to discern and interpret nuanced and contextual differences in language.

4.3 Example 3

Figure 4: image.

Here is one more example that shows how subtle changes in language can impact the model’s interpretation and classification. The original premise describes an old woman in a specific type of shirt (plaid), positioned on a crowded sidewalk and smiling. The original hypothesis closely mirrors this description, leading to an entailment label (0). This is because the hypothesis accurately captures the essence of the premise, with both statements aligning in terms of the woman’s attire and action.

However, the updated hypothesis introduces two subtle changes: it replaces "plaid" with "tartan" and changes "smiling" to "grins." Tartan is essentially a type of plaid, so this change doesn’t significantly alter the meaning. Similarly, "grins" is a synonym of "smiling" and conveys the same general expression of happiness. Despite these changes being semantically similar to the original words, the model’s prediction shifts to a neutral label (1).

This shift suggests that the model is sensitive to even minor lexical changes, impacting its ability to recognize semantic equivalence. Although "tartan" and "plaid" are closely related, and "grins" is a form of smiling, the model does not seem to equate these synonyms completely with their counterparts in the original hypothesis. This could indicate a limitation in the model’s understanding of synonyms and contextual language use.

Moreover, these examples underscore the challenge of ensuring that natural language processing models can accurately interpret and respond to the rich variability of human language. They highlight the necessity for models to not only recognize direct word-to-word matches but also to understand the broader context and synonymous relationships within language. Enhancing this capability is crucial for developing more reliable and robust models capable of processing language with the same level of nuance and understanding as humans.

5 Improving the Model’s Predictions

The examples highlighted above demonstrate a significant challenge in natural language processing: the robustness of machine learning models in the face of the inherent variability of human language. This issue is illustrated through traditional evaluation methods, as shown in the provided schematic. These methods often involve training models on well-known datasets, such as the Stanford Natural Language Inference (SNLI) corpus, and subsequently testing them solely on the SNLI dataset. This approach aims to assess the model’s ability to generalize based on the patterns it has learned. However, this method often leads to a superficial assessment of model performance, as it doesn’t consider the model’s capability to handle linguistic variations that are distinct from those in the training set. This limitation highlights the models’ tendency to rely on pattern recognition rather than achieving a deeper, more nuanced understanding of language.

The proposed methodology advocates an iterative fine-tuning process that uses the contrast set after the initial training. This approach, grounded in the hypothesis that exposure to a broader linguistic spectrum during fine-tuning will enhance the model’s grasp of complex language features such as synonyms and word mappings, aims to reform the model’s predictive capabilities on both the SNLI and contrast sets. The anticipated outcomes suggest a more balanced performance, indicative of a model that has transcended mere pattern recognition to encapsulate a more profound linguistic comprehension.


This fine-tuning strategy is not without precedent. It resonates with the findings of previous literature, notably the work documented by [10], which elucidates the benefits of employing contrast sets for model evaluation and enhancement. Such contrastive fine-tuning not only aligns with the continual quest for models that truly comprehend language but also reflects the iterative nature of machine learning, a field where model refinement is as crucial as model development. The integration of contrast sets thus represents a methodological evolution, promising strides toward models that genuinely mirror the complexity and unpredictability of human language.

5.1 Model Improvement Results

To enhance the model’s performance on the contrast dataset, we constructed a targeted contrast training set from the SNLI training corpus, applying a contrast set generation procedure analogous to the one described above. Our objective was to fine-tune the model using a strategically minimal dataset. We began by generating a contrast set from a randomly selected subset of 1,000 examples of the SNLI training data, and then assessed the model’s accuracy on both the newly formed contrast set and the original validation set [10].

This methodical approach allowed us to incrementally expand the contrast training set while monitoring the corresponding shifts in validation and contrast set accuracies. Our intention was to discern the threshold at which additional contrast examples ceased to yield proportional improvements, thereby identifying the optimal size of the contrast training set. Beyond the quantitative exercise, this incremental expansion was a deliberate effort to refine the model’s generalization, aligning its performance more closely with the nuanced complexity of natural language as encountered in the real world and ensuring that each additional example contributed meaningfully to the model’s linguistic acuity.
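The incremental procedure can be sketched as a loop over growing subset sizes; `fine_tune` and `evaluate` are hypothetical stand-ins for the actual training and accuracy routines, supplied by the caller:

```python
def incremental_contrast_finetune(model, contrast_train, val_set, contrast_val,
                                  fine_tune, evaluate, start=1000, step=1000):
    """Fine-tune on progressively larger contrast subsets, recording
    validation and contrast-set accuracy at each size.

    `fine_tune(model, examples)` returns an updated model and
    `evaluate(model, data)` returns an accuracy; both are placeholders here.
    """
    history = []
    for n in range(start, len(contrast_train) + 1, step):
        model = fine_tune(model, contrast_train[:n])
        history.append((n, evaluate(model, val_set), evaluate(model, contrast_val)))
    return history
```

Plotting `history` then reveals the point at which further contrast examples stop improving contrast-set accuracy while validation accuracy holds steady.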

The results of our experiment are as follows:


The observation that the model’s accuracy on the contrast set improved steadily, while the validation accuracy remained constant as the number of contrast training samples increased, is a significant finding. This trend indicates that the model is effectively learning the nuances and variability inherent in natural language. Typically, contrast sets are designed to test a model’s understanding of subtle language differences, often challenging it with examples that are syntactically similar but semantically different from the training data.

In our case, the constant validation accuracy suggests that the model’s general understanding of the language, as represented by the validation set, did not change. This stability is a positive sign, as it implies that the model did not lose its general language understanding while adapting to the nuances captured in the contrast set.

Reaching an 85.5% accuracy on the contrast set is noteworthy. It implies that the model has a strong ability to discern and correctly interpret the more subtle and complex aspects of language that are often presented in these sets. This level of accuracy suggests that the model is not merely memorizing training examples but is instead developing a more nuanced understanding of language.

In conclusion, the increase in contrast set accuracy, coupled with stable validation accuracy, indicates a successful adaptation of the model to more complex language tasks without compromising its general language performance. This balance is crucial for developing robust NLP models capable of handling a wide range of real-world language scenarios.

6 Discussion and Related Work

The performance of current machine learning systems for reading comprehension, though seemingly effective when measured by standard evaluation metrics, is notably deficient under adversarial evaluation. This discrepancy arises because standard evaluations tend to be overly forgiving towards models that depend on surface-level cues, failing to challenge their deeper understanding of the text. In comparison, evaluation using contrast sets demonstrates that these models exhibit excessive stability to changes that significantly alter the text’s meaning, despite appearing minor.

Our evaluation framework was designed to rigorously test models not just on their ability to predict relationships in standard settings but also on their robustness and adaptability to nuanced, contextually varied scenarios presented in the contrast set. This dual-layered evaluation approach probes the models’ deeper language understanding capabilities, moving beyond mere pattern recognition to a more authentic and comprehensive comprehension of linguistic constructs.

This observation strongly suggests that when constructing NLP datasets, there should be a deliberate focus on incorporating contrast sets. These sets are crucial as they encompass the diversity and subtlety inherent in natural human language. By integrating such sets, the datasets will not only challenge but also train models to recognize and understand the finer nuances and variations of language, moving beyond superficial pattern recognition. This approach is essential for developing more sophisticated and genuinely competent NLP systems that can handle the complexity and richness of human language.


Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. CoRR, abs/1707.07328.
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. CoRR, abs/2004.02709.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. CoRR, abs/1806.03822.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Fastqa: A simple and efficient neural architecture for question answering. CoRR, abs/1703.04816.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.
Nitin Madnani and Bonnie J. Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.
Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. CoRR, abs/1904.02668.