Decoders Laugh as Loud as Encoders


Abstract

From the dawn of the computer, Alan Turing dreamed of a machine that could communicate in language like a human being. Recent advances in Large Language Models (LLMs) have stunned the scientific community: a single model can be applied to a wide range of natural language processing (NLP) tasks, sometimes with results that surpass most humans' communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among decoders, the latest to be checked was GPT-2). We address this issue in this paper: we show that a fine-tuned decoder (GPT-4o) performed as well (mean F1-macro score of 0.85) as the best fine-tuned encoder (RoBERTa, with a mean F1-macro score of 0.86).

1 Introduction

1.0.0.1 There are many types of jokes, and humor therefore varies widely. It also diverges from language to language, since there are cultural differences between nations. We therefore decided to stick to English humor only. We divided English humor into five main categories: absurdity, dark, irony, wordplay (including puns and any other deliberate manipulation of words meant to make us laugh), and social commentary (as in [1]). We added another category of negative examples, regular sentences that are not supposed to be funny, and tagged them as no-joke.

1.0.0.2 We used LLMs to classify each joke into its category, under the assumption that good classification performance indicates that the LLM understands the jokes. We used three types of LLMs: encoders, encoder-decoders (zero-shot and few-shot learning only), and decoders (zero-shot, few-shot, and fine-tuned learning). Since the category distribution in the humor dataset was imbalanced, we wanted scarce categories to weigh as much as abundant ones, so we used the F1-macro score as the evaluation metric.
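For reference, the F1-macro score is the unweighted mean of the per-class F1 scores, so a scarce category counts exactly as much as an abundant one:

\[
\mathrm{F1}_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K}\frac{2\,P_k R_k}{P_k + R_k},
\]

where \(K = 6\) is the number of categories and \(P_k\), \(R_k\) denote the precision and recall of class \(k\).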

1.0.0.3 Encoders were trained on a training dataset and evaluated on a validation dataset. Since we lacked data, we fine-tuned the encoders for 20 epochs and picked the iteration with the best F1-macro score. Encoder-decoders were tested only with zero-shot and few-shot learning. The decoder (GPT-4o) was fine-tuned using the OpenAI API.

1.0.0.4 The results were surprising: the fine-tuned decoder model is on par with a fine-tuned encoder. It is important to note that a decoder's mission is to generate text; decoders were not "born" to classify. And yet they perform essentially as well as the best model found so far, RoBERTa [2].

2 Literature Review

Humor serves human society in many ways: it mitigates the pain of mundane problems, offers a better way to communicate and convey a message, makes others like us, or, at the end of the day, simply makes us feel better (as Charlie Chaplin once said: "A day without laughter is a day wasted."). What makes us laugh, however, has been studied for decades, and the ability to "grasp" a joke is considered one of the top enigmas of human cognition. Considering this, it is clear that this "human" ability is hard for a machine to "grasp". According to a review [3], there are two stages in humor classification: feature selection and algorithm selection. These two stages are covered thoroughly in the next two subsections: 2.1 and 2.2.

2.1 Features Extraction

Over time, many approaches have been proposed for extracting good features for humor classification. These approaches fall into three main categories:

  • Feature Engineering - Old-school methods that were heavily explored in the era of classical Machine Learning (ML).

    • Ambiguity Detection - a word that may have more than one meaning can help identify a joke [4]–[13]

    • Incongruity Detection - occurs when one line of thought is interrupted by a different line of thought within the same sentence [4], [9], [12], [14]–[24]

    • Emotion-Based Detection - certain words can imply the narrator's attitudes, feelings, moods, etc. [21], [22], [25]–[27]

    • Unexpectedness Detection - a joke can touch on a taboo issue or describe an utterly absurd situation, which makes us laugh [4], [17], [25]

    • Subjectivity - different people find different jokes funny [21]. This happens when the joke contains language that implies different beliefs, speculations, criticisms, opinions, and evaluations [28]. This kind of feature can predict whether a text contains humor [12], [16], [19]–[23]

    • Negation - funny texts often include negative words such as "not", "no", "isn't", "doesn't", "don't", denials, or other words with negative meaning [9], [22], [27], [29]–[33].

  • Automated Features - features extracted without manual work; however, too many features may lead to the curse of dimensionality, so researchers apply feature selection.

    • Word Embeddings - words are represented as dense vectors that capture their semantic meaning; words with similar meanings yield a higher cosine similarity [34] (see the sketch after this list). Choosing the right features can improve joke recognition [20], [23], [26], [31], [35]–[42].

    • Bag-of-Words - this method counts words in the text, without assigning any importance to word order or grammar [34]. It ends up with long, sparse vectors. It can still be helpful, since certain words are good cues for classifying a text as humorous or not-humorous [43]. The semantics of the text, an important signal for classification, can also be captured through this feature extraction [12], [39], [43]–[45].

    • Part of Speech (POS) - frequencies of nouns, pronouns, verbs, modifiers, etc. can predict whether a text is funny. Interestingly, the number of appearances and the length of the triplet "noun-verb-adjective" have also been good predictors. POS tagging was also found useful because, for example, there is a correlation between the positions of pun words and the locations of their grammatical tags.

    • Acoustic-prosodic - with a voice recording, humorous parts can be detected from the way they are said. However, this paper studies textual information only, so this type of feature extraction is ignored [5], [8]–[11], [15], [23]–[27], [31], [42], [43], [46].

  • Lexical Features - a lexical resource consists of language dictionaries plus additional data that can contribute to the field of NLP. It contains words, sub-words, and collocations, each of which may carry phonological and semantic relations, such as synonyms, spelling, etc. The most commonly used lexical resource is WordNet, although some research findings point out that this dictionary is still lacking [47].

    • Alliteration - the case where two or more adjacent words begin with the same sound, which makes the sentence unexpected and funny [4], [8], [9], [12], [18]–[24], [27], [33], [47].

    • Antonymy - when two or more words with opposite meanings occur in the same sentence, the sentence can become humorous. This feature is usually extracted using a dictionary (typically WordNet antonyms) [9], [22], [26], [27], [30], [36], [47].

    • Frequency - the frequency of certain words, or of certain word combinations, can suggest whether a sentence is humorous, since those words or combinations are common in jokes but rare in regular texts [10], [11], [14], [17], [23], [40], [43], [46], [48]–[51].

    • Polarity - a negative word in a sentence with positive sentiment can make us laugh, and vice versa [9], [12], [19]–[25], [26], [29], [32], [46], [49].

    • Sentiment-based - sentiment is a mental state often shaped by feelings and emotional responses, and it is commonly used to create humorous elements [6], [9], [12], [17], [20], [24], [27], [31], [32], [50].

    • Adult slang - taboo themes or sexual slang can be a good predictor of humor [9], [12], [27], [30], [33], [47], [49].

    • Written-spoken - sometimes colloquial words are used in a written text, which can be a strong characteristic of a humorous text [14], [15], [17].

    • Rhyme - when two words have the same phonetic ending but different meanings, the sentence can be perceived as funny [4], [8], [9], [12], [18]–[22], [24], [33], [44], [47]

    • Stylistic - stylistic characteristics encompass various textual and formatting elements, including word and text length metrics, average word size, connective terms, emojis, hash symbols, web links, referential expressions, abbreviations, question-forming pronouns, capitalized text, bracketed content, highlighted terms, interrogative elements, and punctuation frequency, including exclamation points and quotation marks. These features can predict whether a text is humorous [14], [15], [17], [19], [24], [26], [27], [30], [41], [43], [45], [48]–[52].

    • Human Centeredness - knowledge about persons, social groups, social relations, and personal pronouns can be a good predictive measure for humorous texts [27], [29], [31]–[33], [49].

    • Similarity - two types of similarity may occur in funny text: syntactic similarity, which focuses on in-text POS similarity, and semantic similarity. The similarity score is usually calculated from the distance between concepts and terms [18], [21], [23], [26], [31], [33], [36], [39], [44].
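As a minimal illustration of the Word Embeddings entry above, the sketch below computes the cosine similarity between dense text embeddings. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, neither of which is claimed to be used in this paper:

```python
# Sketch: cosine similarity between dense text embeddings, the kind of
# automated feature described in Section 2.1. Model choice is illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

joke = "Why did Kelly Jones wear sunscreen? Because she wanted to get a sunburn!"
plain = "The weather report predicts sunny skies for the weekend."
e1, e2 = model.encode([joke, plain])
print(cosine(e1, e2))  # semantically distant texts yield a lower score
```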

2.2 Algorithms

There are four types of algorithms for classifying a text as humorous or not humorous:

  • Supervised Learning Approaches - algorithms such as Support Vector Machines (SVM), Random Forests (RF), Naive Bayes (NB), Decision Trees (DT), Logistic Regression (LR), and K-Nearest Neighbors (K-NN) were the building blocks of early classical NLP [15], [30]; the papers [53], [54] found RF to be the best among these algorithms (a minimal baseline in this spirit is sketched after this list).

  • Deep Learning Approaches - Deep Learning (DL) is any type of neural network with more than 3 layers. The big advantage of this family over the previous group is that it extracts features automatically, without human intervention [55]. DL contains various types of algorithms, such as: Convolutional Neural Networks (CNN), Graph Convolutional Networks (GCN), Recurrent Neural Networks (RNN), Bidirectional RNNs (BiRNN), Long Short-Term Memory networks (LSTM), Bidirectional LSTMs (BiLSTM), Gated Recurrent Units (GRU), Bidirectional GRUs (BiGRU), Contextual Memory Fusion Networks (C-MFN), and Multi-Layer Perceptrons (MLP). DL algorithms were found to be very good at comprehension tasks according to [56]. The best algorithm found, according to [3], is the LSTM. Some researchers combined two of the algorithms above to obtain improvements [10], [11], [38], [42], [57], [58].

  • Transfer Learning Approaches - The complexity of humor as a human emotional response necessitates extensive background knowledge and profound contextual awareness. As a result, transfer learning strategies that utilize pre-trained language models (PLMs) have become increasingly prominent in contemporary research, driven by their substantial progress in neural architecture development. This approach is a machine learning framework that adapts existing PLMs to new applications, drawing upon their comprehensive knowledge base derived from common data across various fields [59]. The review [3] checked the following PLMs: Google's BERT (Bidirectional Encoder Representations from Transformers) [60]–[65], Google's ALBERT (A Lite BERT) [66], Google's XLNet (Generalized Autoregressive Pretraining for Language Understanding) [62], Google's Transformer-XL (Attentive Language Models Beyond a Fixed-Length Context) [62], Facebook's RoBERTa (Robustly Optimized BERT Pretraining Approach) [2], [61], [64], [66], Facebook's XLM (Enhancing BERT for Cross-lingual Language Model) [62], [64], Microsoft's CodeBERT, and OpenAI's GPT-2 [62]. BERT was not only the most common model but also the State-of-the-Art (SOTA) at the time. [63] showed that BERT was best at classifying text generated by GPT-2 as humorous or humorless. Also, at SemEval-2020 the Hitachi team showed that RoBERTa and BERT-large were the best models [62].

  • Rule-Based Approaches - Rule-based methodologies identify humorous elements within textual content through predefined criteria that rely on dictionaries and word databases. These criteria describe the discovery of patterns and meaningful connections among data samples. The most commonly explored algorithm is the Lesk algorithm, because it works fairly well and is amenable to more complex approaches built on top of it [67]. Additional computational models employed in the reviewed research include: Pointwise Mutual Information (PMI), dependency networks, Markov modeling systems, Word Sense Disambiguation (WSD) frameworks, Weighted Finite-State Transducers (WFSTs), Bayesian classification, Latent Semantic Analysis (LSA), Gloss Vector representations, and Inverse Document Frequency (IDF) techniques. However, when compared to classic ML algorithms, this technique was found to be inferior [18], [68].
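To make the supervised-learning entry above concrete, the following is a minimal sketch, assuming scikit-learn and toy data (not the cited papers' actual pipelines), of a bag-of-words classifier with a Random Forest, the algorithm [53], [54] found strongest among the classical ones:

```python
# Sketch: classical supervised humor classifier from Section 2.2:
# TF-IDF bag-of-words features fed into a Random Forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "Why did Kelly Jones wear sunscreen? Because she wanted to get a sunburn!",
    "The meeting starts at noon in the main conference room.",
]
labels = ["irony", "no-joke"]  # toy data for illustration only

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # unigram + bigram features
    RandomForestClassifier(n_estimators=300, random_state=42),
)
clf.fit(texts, labels)
print(clf.predict(["A day without laughter is a day wasted."]))
```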

2.3 The Novelty of Our Study

GPT-3 is very limited in generating jokes. According to [69], it probably repeats jokes it memorized from human data (25 of its jokes account for 90 percent of the generated jokes). It can explain a joke, but this ability is lacking: it mostly understands puns/wordplay but not other types of humor. Also, when asked to explain a regular sentence (non-joke), it classifies it as a joke and hallucinates an explanation.

While GPT-4 is supposed to be better at generating jokes, it still struggles to generate a whole stand-up or comedy show, according to [70]; this is because the model is instructed to avoid certain words that could serve as a punchline. It also lacks the context of who the speaker is and who the audience is. Another issue to bear in mind is that creating humor requires the opposite process of Chain-of-Thought (CoT): instead of writing down one step at a time, as in reasoning tasks, we need to skip steps and leave the human a gap of information that he or she should bridge alone. This process is called Leap-of-Thought (LoT), and unlike CoT it cannot be induced by prompting. Creative Leap-of-Thought (CLoT) is an LLM LoRA-tuned on a game of humor and association (the Oogiri-GO dataset, in English, Chinese, and Japanese, with text and images), and it was found to outperform every other model available at the time of that publication [71].

In the last three years, LLMs have bloomed, especially decoders, but also encoders. In this paper we check which models are better at classification into six classes: five types of jokes and one not-funny sentence class. This follows [1], a paper that used jokes generated by GPT-4o and showed that these jokes can be classified fairly well using encoders. We will show that today the same can be done for human ("real") jokes, and by a fine-tuned decoder that is on par with the best encoder so far (RoBERTa). Nevertheless, certain encoders such as RoBERTa-base and RoBERTa-large are slightly better than the fine-tuned decoder (GPT-4o), but not significantly.

3 Methodology

3.1 Data Collection

We collected English humor data categorized into five types, each defined below along with the number of examples and their respective sources.

Figure 1: Data Collection and Processing Pipeline

Absurdity Jokes—humor based on nonsensical or illogical scenarios that defy common sense—were sourced from [72] (58 jokes), [73] (13), [74] (21), and [75] (24).

Dark Jokes—humor involving taboo, morbid, or tragic subjects presented in a humorous way—were scraped from [76] (115).

Irony Jokes—humor that relies on stating the opposite of what’s expected or highlighting contradictions between expectations and reality—were collected from [77] (75) and [78] (73).

Wordplay Jokes—humor derived from clever manipulation, puns, or play on words—were taken from [79] (105) and [80] (273).

Social Commentary Jokes—humor highlighting societal issues, trends, or behaviors in a satirical or critical manner—were gathered from [81] (11), [82] (24), [83] (29), and [84] (2).

3.2 Data Preprocessing

3.2.1 Clean Humor Ambiguity Types

All jokes were checked manually, one after another, and any joke from a non-wordplay category that also contained a wordplay element was excluded from the dataset. This step ensures that the classification task remains a multi-class problem rather than a multi-label one. However, to illustrate the inherent ambiguity of humor categorization, Table 1 presents several representative jokes that feature multiple humor types, especially where wordplay overlaps with others.

Table 1: Examples of different humor types mixed with wordplay humor that were removed from our dataset

Joke: A high school senior visits a psychic... "I’ve applied to 10 different colleges," the student said. "Which ones will accept me? Which one will I attend?" "That is hard to say," said the psychic. "But you will spend an absurd sum of money." "How do you know this?" the student asked. The psychic replied, "It’s mostly intuition."
Humor Types: Absurdity, Wordplay
Explanation: Absurdity: the student consults a psychic for college decisions, an illogical scenario. Wordplay: the punchline hinges on the pun between "intuition" (gut feeling) and "in tuition" (financial cost).

Joke: It turns out a major new study recently found that humans eat more bananas than monkeys. It’s true. I can’t remember the last time I ate a monkey.
Humor Types: Dark, Wordplay
Explanation: Wordplay: the joke plays on the ambiguous sentence structure, twisting the meaning from a comparison of diets into a claim about eating monkeys. Dark: it introduces a taboo subject, eating primates, in a casual and humorous tone.

Joke: It is funny and sad thing how a group of squid is not called a squad.
Humor Types: Irony, Wordplay
Explanation: Wordplay: plays on the phonetic similarity between "squid" and "squad," setting up a pun. Irony: highlights the mismatch between what would be a fitting name and actual naming conventions, provoking humor through unexpected linguistic rules.

Joke: If con is the opposite of pro, then is Congress the opposite of progress?
Humor Types: Social Commentary, Wordplay
Explanation: Wordplay: the joke plays on the prefixes "con" and "pro," twisting "Congress" into an opposite of "progress." Social Commentary: critiques political inefficiency or dysfunction in a satirical way.

3.2.2 Replace Repeating Words

Repetitions of the same word or collocation within a category were replaced by random words. For example, the collocation "the ironic person" repeated itself in the irony jokes, so we replaced it with a random first and last name, e.g.:

Before: “Why did the ironic person wear sunscreen? Because they wanted to get a sunburn!”

After: “Why did Kelly Jones wear sunscreen? Because she wanted to get a sunburn!”
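A minimal sketch of this replacement step is shown below; the faker package and the helper name replace_collocation are illustrative assumptions, not the tooling actually used (and the pronoun fix in the example above was done manually):

```python
# Sketch: replace a category-revealing collocation with a random name,
# as in the "Replace Repeating Words" preprocessing step.
import re
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible names

def replace_collocation(joke: str, collocation: str) -> str:
    return re.sub(collocation, fake.name(), joke, flags=re.IGNORECASE)

before = "Why did the ironic person wear sunscreen? Because they wanted to get a sunburn!"
print(replace_collocation(before, r"the ironic person"))
```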

3.2.3 Replace Category-Type Words with Other Words

Words that give a clue to the category were changed. For example, half of the jokes in the irony category included the word "ironic" or "irony"; we replaced these words with "funny", "surprising", or any other word suiting the context, e.g.:

Before: "What’s ironic about the Bible? In one of the most interesting irony examples, the most shoplifted book in America is The Bible."

After: "What’s funny about the Bible? In one of the most interesting absurd examples, the most shoplifted book in America is The Bible."

3.3 Negative Examples

Regular (not funny / non-joke) sentences were randomly sampled from a Kaggle dataset [85], and we made sure that the number of regular sentences equals the number of all other jokes combined.
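A minimal sketch of this balancing step, assuming the Kaggle CSV exposes text and humor columns and assuming illustrative file names (the actual dataset layout in [85] may differ):

```python
# Sketch: sample exactly as many non-joke sentences as there are jokes,
# per Section 3.3. File and column names are assumptions about [85].
import pandas as pd

jokes = pd.read_csv("jokes_clean.csv")       # the five cleaned joke categories
pool = pd.read_csv("200k_short_texts.csv")   # Kaggle humor-detection corpus
plain = pool[pool["humor"] == False]         # keep only non-humorous rows

negatives = plain.sample(n=len(jokes), random_state=42)
negatives = negatives.assign(label="no-joke")[["text", "label"]]
dataset = pd.concat([jokes, negatives], ignore_index=True)
print(dataset["label"].value_counts())
```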

Table 2: Number of examples for each category after cleaning
Type of Sentence Number of Examples
Absurdity 75
Dark 77
Irony 105
Wordplay 378
Social Commentary 62
No-Joke (regular sentence) 697
Total 1394

3.4 Train, Test, Validation Data Split

The data were split into train (80%), validation (10%), and test (10%). Stratification was used to maintain the same label ratio across the splits.
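A minimal sketch of such an 80/10/10 stratified split, assuming scikit-learn and the dataset frame from the sketch above:

```python
# Sketch: stratified 80/10/10 split; stratify preserves label ratios.
from sklearn.model_selection import train_test_split

train, rest = train_test_split(dataset, test_size=0.20,
                               stratify=dataset["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.50,
                             stratify=rest["label"], random_state=42)
print(len(train), len(val), len(test))  # roughly 80% / 10% / 10%
```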

3.5 Training

Figure 2: Training loss of models over time (steps)

\(^{\ast}\)Trained with batch size 8 instead of 16.
\(^{\ast\ast}\)Trained with batch size 4 instead of 16.
\(^{\circ}\)Model precision was set to bf16 instead of fp32

3.5.0.1 Fine-tuning was applied to all encoder models using HuggingFace and PyTorch (transformers); an example run with seed 42 is shown in Figure 2. The encoders were fine-tuned \(3\) times for \(20\) epochs with batch size 16 (except XLNet-large-cased and NeoBERT, which used batch size 8, and DeBERTa-v3-large, which used batch size 4); all models ran in full fp32 precision (except DeBERTa-v3-large and NeoBERT, which used bf16). The runs used 3 fixed seeds (\(42\), \(1337\), \(2025\)), so the models are reproducible. Whenever the training loss got close enough to zero (epsilon of 0.0001), training was stopped. In each run, the selected encoder checkpoint was the epoch with the maximum F1-macro score (this metric was chosen because the data categories are imbalanced). Finally, the 3 models from the runs were averaged, yielding a mean and standard deviation. We also averaged, across runs, the metrics recorded at the best epoch (according to F1-macro): Recall-Macro, Precision-Macro, and Accuracy.
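A condensed sketch of this encoder fine-tuning setup, assuming HuggingFace Transformers with pre-tokenized train_ds and val_ds datasets; argument names track recent transformers versions, and the exact training script may differ:

```python
# Sketch: fine-tune an encoder for 6-way humor classification and keep
# the checkpoint with the best F1-macro, as described in Section 3.5.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=6)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    eval_strategy="epoch",             # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,       # restore the best epoch...
    metric_for_best_model="f1_macro",  # ...selected by F1-macro
    seed=42,                           # repeated with 1337 and 2025
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
```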

3.5.0.2 BART (an encoder-decoder model) was checked only with zero-shot learning (few-shot learning is not available for this model), while Flan-T5 was checked with both zero-shot and few-shot learning.
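A minimal sketch of the zero-shot check with the HuggingFace zero-shot classification pipeline; the label phrasing is our assumption, and the exact prompts may differ:

```python
# Sketch: zero-shot 6-way classification with BART-large-MNLI (Section 3.5).
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
labels = ["absurdity", "dark", "irony", "wordplay",
          "social commentary", "no-joke"]
result = classifier(
    "It is funny and sad thing how a group of squid is not called a squad.",
    candidate_labels=labels,
)
print(result["labels"][0])  # top-ranked category
```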

Figure 3: Training and validation loss progression for GPT-4o fine-tuning

GPT-4o (the main/big decoder) was also fine-tuned using the OpenAI API; it was trained for 3 epochs with a batch size of 4 and a Learning Rate (LR) multiplier of 2. We found that the fine-tuned GPT-4o reached zero training loss very quickly, which means the model was over-fitting and "hungry" for data (Figure 3).
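A sketch of launching such a job through the OpenAI fine-tuning API; the file name and model snapshot identifier are illustrative assumptions, while the hyperparameters match the text above:

```python
# Sketch: fine-tune GPT-4o via the OpenAI API with the hyperparameters
# from Section 3.5 (3 epochs, batch size 4, LR multiplier 2).
# The JSONL file and model snapshot name are illustrative.
from openai import OpenAI

client = OpenAI()
upload = client.files.create(file=open("humor_train.jsonl", "rb"),
                             purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-2024-08-06",
    hyperparameters={"n_epochs": 3, "batch_size": 4,
                     "learning_rate_multiplier": 2},
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) for status
```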

4 Results

Table 3: Recall-Macro, Precision-Macro, F1-Macro, and Accuracy test scores among many different models
Model Type & Name Recall-Macro Precision-Macro F1-Macro Accuracy
Encoders
BERT-base-uncased \(0.5635 \pm 0.0601\) \(0.6135 \pm 0.0606\) \(0.5808 \pm 0.0588\) \(0.8024 \pm 0.0251\)
BERT-base-multilingual \(0.7994 \pm 0.045\) \(0.7492 \pm 0.0476\) \(0.7674 \pm 0.0481\) \(0.881 \pm 0.027\)
XLNet-base-cased \(0.7455 \pm 0.0471\) \(0.7185 \pm 0.0362\) \(0.7212 \pm 0.0402\) \(0.8642 \pm 0.0214\)
distilBERT-base-uncased \(0.7682 \pm 0.0159\) \(0.7274 \pm 0.03679\) \(0.7358 \pm 0.0333\) \(0.8619 \pm 0.027\)
ALBERT-base \(0.7579 \pm 0.0789\) \(0.7402 \pm 0.0409\) \(0.7201 \pm 0.053\) \(0.8476 \pm 0.023\)
RoBERTa-base \(\boldsymbol{0.8745 \pm 0.0081}^{\dagger}\) \(\boldsymbol{0.8652 \pm 0.0304}^{\dagger}\) \(\boldsymbol{0.8566 \pm 0.0164}^{\dagger}\) \(\boldsymbol{0.9286 \pm 0.0124}^{\dagger}\)
XLM-roBERTa-base \(0.8061 \pm 0.042\) \(0.7735 \pm 0.057\) \(0.7462 \pm 0.0569\) \(0.8691 \pm 0.0393\)
DeBERTa-v3-base \(0.7855 \pm 0.0691\) \(0.784 \pm 0.1056\) \(0.7733 \pm 0.0882\) \(0.8905 \pm 0.0406\)
ModernBERT-base \(0.687 \pm 0.0131\) \(0.722 \pm 0.0484\) \(0.6871 \pm 0.0182\) \(0.8642 \pm 0.0124\)
BERT-large-uncased \(0.7187 \pm 0.0549\) \(0.688 \pm 0.0393\) \(0.6959 \pm 0.0425\) \(0.8381 \pm 0.0165\)
XLNet-large-cased\(^{\ast}\) \(0.7936 \pm 0.0628\) \(0.7451 \pm 0.048\) \(0.7606 \pm 0.0541\) \(0.8714 \pm 0.0247\)
ALBERT-large-v2 \(0.7119 \pm 0.0133\) \(0.7101 \pm 0.0177\) \(0.7057 \pm 0.0129\) \(0.8524 \pm 0.0109\)
RoBERTa-large \(0.8594 \pm 0.0473^{\dagger}\) \(0.8501 \pm 0.0318\)\(^{\dagger}\) \(0.8503 \pm 0.0355^{\dagger}\) \(\boldsymbol{0.9286 \pm 0.0071}^{\dagger}\)
XLM-roBERTa-large \(0.8514 \pm 0.0348\) \(0.8228 \pm 0.0519\) \(0.8308 \pm 0.0441\) \(0.9214 \pm 0.0189\)
DeBERTa-v3-large\(^{\circ}\) \(^{\ast\ast}\) \(0.8441 \pm 0.0271\) \(0.8034 \pm 0.0378\) \(0.8145 \pm 0.0304\) \(0.9048 \pm 0.0109\)
ModernBERT-large \(0.7806 \pm 0.0195\) \(0.7864 \pm 0.0539\) \(0.7611 \pm 0.0151\) \(0.8929 \pm 0.0071\)
NeoBERT\(^{\circ}\) \(^{\ast}\) \(0.7719 \pm 0.0861\) \(0.8191 \pm 0.0274\) \(0.7765 \pm 0.0786\) \(0.9049 \pm 0.0251\)
Encoder-Decoder
BART-large-mnli-zero shot \(0.2488\) \(0.2541\) \(0.1841\) \(0.2071\)
Flan-T5-base-zero shot \(0.2292\) \(0.2843\) \(0.1575\) \(0.3696\)
Flan-T5-base-few shots \(0.1674\) \(0.1619\) \(0.1079\) \(0.3254\)
Decoders
Llama-3.2-3B-Instruct-zero shot \(0.3600\) \(0.2934\) \(0.1729\) \(0.1929\)
Gemma-2-2b-it-zero shot \(0.3697\) \(0.4944\) \(0.3665\) \(0.6571\)
Qwen2-7B-Instruct-zero shot \(0.3746\) \(0.4221\) \(0.2759\) \(0.2595\)
Mistral-7B-Instruct-v0.2-zero shot \(0.4055\) \(0.5071\) \(0.3184\) \(0.3000\)
GPT-4-zero shot \(0.5919\) \(0.5808\) \(0.5047\) \(0.5214\)
Llama-3.2-3B-Instruct-few shots \(0.3895\) \(0.4328\) \(0.2086\) \(0.2143\)
Gemma-2-2b-it-few shots \(0.4234\) \(0.4945\) \(0.3759\) \(0.4786\)
Qwen2-7B-Instruct-few shots \(0.4791\) \(0.4777\) \(0.3452\) \(0.2950\)
Mistral-7B-Instruct-v0.2-few shots \(0.4801\) \(0.6030\) \(0.3865\) \(0.3857\)
GPT-4-few shots \(0.6381\) \(0.6878\) \(0.5955\) \(0.6429\)
GPT-4o-fine-tuned \(0.8618 \pm 0.0025^{\dagger}\) \(0.8464 \pm 0.0089^{\dagger}\) \(0.8522 \pm 0.0056^{\dagger}\) \(0.9238 \pm 0.0041^{\dagger}\)

Bold scores represent the best score in the column.
Underlined scores represent the second-best score in the column.

\(\pm\) denotes one standard deviation around the mean
\(^{\ast}\)Trained with batch size 8 instead of 16.
\(^{\ast\ast}\)Trained with batch size 4 instead of 16.
\(^{\circ}\)Model precision was set to bf16 instead of fp32
\(^{\dagger}\)no significant difference (p-value greater than 0.05) according to Welch's t-test

4.0.0.1 Table 3 shows that larger encoders perform better than their "base" versions (except ALBERT and RoBERTa), with a mean gap of 0.05 F1-macro points (ALBERT and RoBERTa were also included in the average). However, it is unclear why RoBERTa and ALBERT showed the same results across model sizes. We also found RoBERTa-base to be the best model overall.

4.0.0.2 For the encoder-decoders and decoders, we checked zero-shot and few-shot learning. Flan-T5 had a surprising outcome: few-shot learning performed worse than zero-shot learning, probably because digesting a long prompt is a problem for small models. All the models checked with zero-shot and few-shot learning lagged behind the RoBERTa encoder result.

4.0.0.3 At the crux of this study, we ran the fine-tuned GPT-4o 3 times in greedy mode (temperature = 0), with a different seed for each run (42, 1337, 2025), and obtained an F1-macro score of 0.8522, the second-best result, on par with the best encoder, RoBERTa (Welch's t-test showed no statistically significant difference between them).
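The significance check can be reproduced with a standard Welch's t-test over the per-seed F1-macro scores. The sketch below uses SciPy with illustrative three-seed samples chosen to mirror the means and standard deviations of Table 3, not the actual per-run values:

```python
# Sketch: Welch's t-test (unequal variances) comparing per-seed F1-macro
# scores of RoBERTa-base and fine-tuned GPT-4o (Section 4).
from scipy import stats

roberta_f1 = [0.8402, 0.8566, 0.8730]  # illustrative; mean 0.8566, std 0.0164
gpt4o_f1 = [0.8466, 0.8522, 0.8578]    # illustrative; mean 0.8522, std 0.0056

t_stat, p_value = stats.ttest_ind(roberta_f1, gpt4o_f1, equal_var=False)
print(p_value > 0.05)  # True -> no significant difference at the 5% level
```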

5 Conclusions

This study aimed to probe whether various LLMs can understand humor. The test was done by classifying each sentence into one of six categories (five types of humor, or a no-joke regular/negative sentence). The surprising result is that both fine-tuned encoders and a fine-tuned decoder perform equally well. This contradicts the previous view that an encoder (RoBERTa) was better.

6 Limitations

It is important to mention several issues that were not taken into consideration in this paper: heterogeneous data (which may imply different jargon for different categories); data scarcity, with only 1394 examples (half of them, 697, being regular "non-joke" sentences); differing sentence lengths across categories; and semantic cues that can aid classification (e.g., the absurdity category was mainly drawn from the world of political jokes). Also, GPT-5 was introduced only in the past couple of days and is not publicly available as a distinct model for fine-tuning at the time of publication of this paper.

References

[1]
S. K. R. Kasu, S. Biradar, and S. Saumya, “Deceptive humor: A synthetic multilingual benchmark dataset for bridging fabricated claims with humorous content,” arXiv preprint arXiv:2503.16031, 2025.
[2]
D. Faraj and M. Abdullah, “SarcasmDet at SemEval-2021 task 7: Detect humor and offensive based on demographic factors using RoBERTa pre-trained model,” in Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), 2021, pp. 527–533.
[3]
A. Kalloniatis and P. Adamidis, “Computational humor recognition: A systematic literature review,” Artificial Intelligence Review, vol. 58, no. 2, p. 43, 2024.
[4]
A. Morales and C. Zhai, “Identifying humor in reviews using background text sources,” in Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 492–501.
[5]
A. Reyes, D. Buscaldi, and P. Rosso, “An analysis of the impact of ambiguity on automatic humour recognition,” in International conference on text, speech and dialogue, 2009, pp. 162–169.
[6]
M. K. Hasan et al., “Humor knowledge enriched transformer for understanding multimodal humor,” in Proceedings of the AAAI conference on artificial intelligence, 2021, vol. 35, pp. 12972–12980.
[7]
S. Attardo, D. H. Attardo, P. Baltes, and M. J. Petray, “The linear organization of jokes: Analysis of two thousand texts,” 1994.
[8]
C. Bucaria, “Lexical and syntactic ambiguity as a source of humor: The case of newspaper headlines.” HUMOR: International Journal of Humor Research, vol. 17, no. 3, 2004.
[9]
S. van den Beukel and L. Aroyo, “Homonym detection for humor recognition in short text,” in Proceedings of the 9th workshop on computational approaches to subjectivity, sentiment and social media analysis, 2018, pp. 286–291.
[10]
Y. Diao, H. Lin, L. Yang, X. Fan, D. Wu, and K. Xu, “CRGA: Homographic pun detection with a contextualized-representation: Gated attention network,” Knowledge-Based Systems, vol. 195, p. 105056, 2020.
[11]
Y. Diao, H. Lin, L. Yang, X. Fan, D. Wu, and K. Xu, “Homographic pun location using multi-dimensional semantic relationships,” Soft Computing, vol. 24, no. 16, pp. 12163–12173, 2020.
[12]
D. Yang, A. Lavie, C. Dyer, and E. Hovy, “Humor recognition and humor anchor extraction,” in Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 2367–2376.
[13]
J. T. Kao, R. Levy, and N. D. Goodman, “A computational model of linguistic humor in puns,” Cognitive science, vol. 40, no. 5, pp. 1270–1285, 2016.
[14]
R. Mahajan and M. Zaveri, “Svnit@ semeval 2017 task-6: Learning a sense of humor using supervised approach,” in Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), 2017, pp. 411–415.
[15]
R. Mahajan and M. Zaveri, “Humor identification using affect based content in target text,” Journal of Intelligent & Fuzzy Systems, vol. 39, no. 1, pp. 697–708, 2020.
[16]
Y. Ziser, E. Kravi, and D. Carmel, “Humor detection in product question answering systems,” in Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp. 519–528.
[17]
F. Barbieri and H. Saggion, “Automatic detection of irony and humour in twitter.” in ICCC, 2014, pp. 155–162.
[18]
R. Mihalcea, C. Strapparava, and S. Pulman, “Computational models for incongruity detection in humour,” in International conference on intelligent text processing and computational linguistics, 2010, pp. 364–374.
[19]
L. Liu, D. Zhang, and W. Song, “Exploiting syntactic structures for humor recognition,” in Proceedings of the 27th international conference on computational linguistics, 2018, pp. 1875–1883.
[20]
L. Liu, D. Zhang, and W. Song, “Modeling sentiment association in discourse for humor recognition,” in Proceedings of the 56th annual meeting of the association for computational linguistics (volume 2: Short papers), 2018, pp. 586–591.
[21]
D. Zhang, W. Song, L. Liu, C. Du, and X. Zhao, “Investigations in automatic humor recognition,” in 2017 10th international symposium on computational intelligence and design (ISCID), 2017, vol. 1, pp. 272–275.
[22]
D. Zhang, W. Song, X. Liu, L. Liu, and X. Zhao, “Research on humor recognition,” in 2018 IEEE 9th international conference on software engineering and service science (ICSESS), 2018, pp. 152–155.
[23]
A. Kamal and M. Abulaish, “Self-deprecating humor detection: A machine learning approach,” in International conference of the pacific association for computational linguistics, 2019, pp. 483–494.
[24]
R. Zhang and N. Liu, “Recognizing humor on twitter,” in Proceedings of the 23rd ACM international conference on conference on information and knowledge management, 2014, pp. 889–898.
[25]
A. Reyes, P. Rosso, and D. Buscaldi, “From humor recognition to irony detection: The figurative language of social media,” Data & Knowledge Engineering, vol. 74, pp. 1–12, 2012.
[26]
Y. Diao et al., “Homographic puns recognition based on latent semantic structures,” in National CCF conference on natural language processing and chinese computing, 2017, pp. 565–576.
[27]
R. Ortega-Bueno, C. E. Muniz-Cuza, J. E. M. Pagola, and P. Rosso, “UO UPV: Deep linguistic humor detection in spanish social media,” in Proceedings of the third workshop on evaluation of human language technologies for iberian languages (IberEval 2018) co-located with 34th conference of the spanish society for natural language processing (SEPLN 2018), 2018, pp. 204–213.
[28]
J. Wiebe and R. Mihalcea, “Word sense and subjectivity,” in Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, 2006, pp. 1065–1072.
[29]
R. Mihalcea and S. Pulman, “Characterizing humour: An exploration of features in humorous texts,” in International conference on intelligent text processing and computational linguistics, 2007, pp. 337–347.
[30]
S. Castro, M. Cubero, D. Garat, and G. Moncecchi, “Is this a joke? Detecting humor in spanish tweets,” in Ibero-american conference on artificial intelligence, 2016, pp. 139–150.
[31]
D. Shahaf, E. Horvitz, and R. Mankoff, “Inside jokes: Identifying humorous cartoon captions,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1065–1074.
[32]
Z. Yang, B. Hu, and J. Hirschberg, “Predicting humor by learning from time-aligned comments.” in INTERSPEECH, 2019, pp. 496–500.
[33]
J. Sjöbergh and K. Araki, “Recognizing humor without recognizing meaning,” in International workshop on fuzzy logic and applications, 2007, pp. 469–476.
[34]
Y. Goldberg, Neural network methods in natural language processing. Morgan & Claypool Publishers, 2017.
[35]
M. K. Hasan et al., “UR-FUNNY: A multimodal language dataset for understanding humor,” arXiv preprint arXiv:1904.06618, 2019.
[36]
A. Vadehra, “Uwav at semeval-2017 task 7: Automated feature-based system for locating puns,” in Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), 2017, pp. 449–452.
[37]
H.-Y. Chen, Y.-S. Lin, and C.-C. Lee, “Through the words of viewers: Using comment-content entangled network for humor impression recognition,” in 2021 IEEE spoken language technology workshop (SLT), 2021, pp. 1058–1064.
[38]
D. Bertero and P. Fung, “Multimodal deep neural nets for detecting humor in TV sitcoms,” in 2016 IEEE spoken language technology workshop (SLT), 2016, pp. 383–390.
[39]
V. Indurthi and S. R. Oota, “Fermi at semeval-2017 task 7: Detection and interpretation of homographic puns in english language,” in Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), 2017, pp. 457–460.
[40]
Y.-C. Gu, Y.-H. Tseng, W.-L. Hsu, W.-S. Wu, and H.-C. Chen, “Development and classification of a chinese humor corpus,” Advances in Natural Language Processing, p. 117, 2019.
[41]
K. N. Jensen, N. F. Rasmussen, T. Wang, M. Placenti, and B. Plank, “Buhscitu at SemEval-2020 task 7: Assessing humour in edited news headlines using hand-crafted features and online knowledge bases,” in SemEval, Association for Computational Linguistics, 2020.
[42]
L. Ren, B. Xu, H. Lin, and L. Yang, “ABML: Attention-based multi-task learning for jointly humor recognition and pun detection,” Soft Computing, vol. 25, no. 22, pp. 14109–14118, 2021.
[43]
A. Ermilov, N. Murashkina, V. Goryacheva, and P. Braslavski, “Stierlitz meets SVM: Humor detection in russian,” in Conference on artificial intelligence and natural language, 2018, pp. 178–184.
[44]
M. Yatsu and K. Araki, “Comparison of pun detection methods using japanese pun corpus,” in Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), 2018.
[45]
A. Khandelwal, S. Swami, S. S. Akhtar, and M. Shrivastava, “Gender prediction in english-hindi code-mixed social media content: Corpus and baseline system,” Computación y Sistemas, vol. 22, no. 4, pp. 1241–1247, 2018.
[46]
S. Skalicky, C. M. Berger, S. A. Crossley, and D. S. McNamara, “Linguistic features of humor in academic writing.” Advances in Language and Literary Studies, vol. 7, no. 3, pp. 248–259, 2016.
[47]
R. Mihalcea and C. Strapparava, “Making computers laugh: Investigations in automatic humor recognition,” in Proceedings of human language technology conference and conference on empirical methods in natural language processing, 2005, pp. 531–538.
[48]
A. C. Adams, “On the identification of humor markers in computer-mediated communication.” in AAAI fall symposium: Artificial intelligence of humor, 2012, pp. 2–6.
[49]
A. Reyes, P. Rosso, and D. Buscaldi, “Humor in the blogosphere: First clues for a verbal humor taxonomy,” Journal of Intelligent Systems, vol. 18, no. 4, pp. 311–332, 2009.
[50]
C. Westbury and G. Hollis, “Wriggly, squiffy, lummox, and boobs: What makes some words funny?” Journal of Experimental Psychology: General, vol. 148, no. 1, p. 97, 2019.
[51]
Y. Raz, “Automatic humor classification on twitter,” in Proceedings of the NAACL HLT 2012 student research workshop, 2012, pp. 66–70.
[52]
A. Purandare and D. Litman, “Humor: Prosody analysis and automatic recognition for f* r* i* e* n* d* s,” in Proceedings of the 2006 conference on empirical methods in natural language processing, 2006, pp. 208–215.
[53]
A. Jaiswal et al., “Pun detection using soft computing techniques,” in 2019 international conference on machine learning, big data, cloud and parallel computing (COMITCon), 2019, pp. 5–9.
[54]
N. Hossain, J. Krumm, and M. Gamon, “President vows to cut <taxes> hair: Dataset and analysis of creative text editing for humorous headlines,” arXiv preprint arXiv:1906.00274, 2019.
[55]
V. Shukla, M. Sinha, and T. Dasgupta, “Automatic humor detection from code-mixed tweets,” in Proceedings of the 11th annual meeting of the forum for information retrieval evaluation, 2019, pp. 56–59.
[56]
L. Alzubaidi et al., “Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions,” Journal of big Data, vol. 8, no. 1, p. 53, 2021.
[57]
X. Fan et al., “Phonetics and ambiguity comprehension gated attention network for humor recognition,” Complexity, vol. 2020, no. 1, p. 2509018, 2020.
[58]
X. Fan et al., “Humor detection via an internal and external neural network,” Neurocomputing, vol. 394, pp. 105–111, 2020.
[59]
N. Agarwal, A. Sondhi, K. Chopra, and G. Singh, “Transfer learning: Survey and classification,” Smart Innovations in Communication and Computational Sciences: Proceedings of ICSICCS 2020, pp. 145–155, 2020.
[60]
O. Weller and K. Seppi, “Humor detection: A transformer gets the last laugh,” arXiv preprint arXiv:1909.00252, 2019.
[61]
D. Cao, “Self-attention on sentence snippets incongruity for humor assessment,” in Journal of physics: Conference series, 2021, vol. 1827, p. 012072.
[62]
T. Morishita, G. Morio, H. Ozaki, and T. Miyoshi, “Hitachi at SemEval-2020 task 7: Stacking at scale with heterogeneous language models for humor recognition,” in Proceedings of the fourteenth workshop on semantic evaluation, 2020, pp. 791–803.
[63]
N. A. Akbar, I. Darmayanti, S. M. Fati, and A. Muneer, “Deep learning of a pre-trained language model’s joke classifier using GPT-2,” Journal of Hunan University Natural Sciences, vol. 48, no. 8, 2021.
[64]
A. Mittal, P. Jeevan, P. Gandhi, D. Kanojia, and P. Bhattacharyya, “So you think you’re funny?: Rating the humour quotient in standup comedy,” arXiv preprint arXiv:2110.12765, 2021.
[65]
B. N. Patro, M. Lunayach, D. Srivastava, H. Singh, V. P. Namboodiri, et al., “Multimodal humor dataset: Predicting laughter tracks for sitcoms,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 576–585.
[66]
B. Song, C. Pan, S. Wang, and Z. Luo, “Deepblueai at semeval-2021 task 7: Detecting and rating humor and offense with stacking diverse language model-based methods,” in Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), 2021, pp. 1130–1134.
[67]
T. Miller and I. Gurevych, “Automatic disambiguation of english puns,” in Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), 2015, pp. 719–729.
[68]
E. Mikhalkova and Y. Karyakin, “PunFields at SemEval-2017 task 7: Employing roget’s thesaurus in automatic pun recognition and interpretation,” arXiv preprint arXiv:1707.05479, 2017.
[69]
S. Jentzsch and K. Kersting, “ChatGPT is fun, but it is not funny! Humor is still challenging large language models,” arXiv preprint arXiv:2306.04563, 2023.
[70]
P. Mirowski, J. Love, K. Mathewson, and S. Mohamed, “A robot walks into a bar: Can language models serve as creativity supporttools for comedy? An evaluation of llms’ humour alignment with comedians,” in Proceedings of the 2024 ACM conference on fairness, accountability, and transparency, 2024, pp. 1622–1636.
[71]
S. Zhong et al., “Let’s think outside the box: Exploring leap-of-thought in large language models with creative humor generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 13246–13257.
[72]
UpJoke, “Absurd jokes.” Accessed: 2025. [Online]. Available: https://upjoke.com/absurd-jokes.
[73]
Reader’s Digest, “Bad jokes you can’t help but laugh at.” Accessed: 2025. [Online]. Available: https://www.rd.com/list/bad-jokes-cant-help-laugh-at/.
[74]
Reddit, “What’s a joke that’s so stupid it’s funny?” Accessed: 2025. [Online]. Available: https://www.reddit.com/r/AskReddit/comments/a26y06/whats_a_joke_thats_so_stupid_its_funny/.
[75]
Today, “Bad jokes that will make you laugh.” Accessed: 2025. [Online]. Available: https://www.today.com/life/inspiration/bad-jokes-rcna58390.
[76]
TheCoolist, “Dark jokes.” Accessed: 2025. [Online]. Available: https://www.thecoolist.com/humor/dark-jokes/.
[77]
DiscoverJokes, “Jokes about irony.” Accessed: 2025. [Online]. Available: https://discoverjokes.com/jokes-about-irony/.
[78]
BoredPanda, “Ironic jokes.” Accessed: 2025. [Online]. Available: https://www.boredpanda.com/ironic-jokes/.
[79]
iNews, “Best pun-based jokes.” Accessed: 2025. [Online]. Available: https://inews.co.uk/light-relief/jokes/best-pun-based-jokes-170096.
[80]
W. Styler, “A collection of puns.” Accessed: 2025. [Online]. Available: https://wstyler.ucsd.edu/puns/.
[81]
UpJoke, “Commentary jokes.” Accessed: 2025. [Online]. Available: https://upjoke.com/commentary-jokes.
[82]
LaughFactory, “Political jokes.” Accessed: 2025. [Online]. Available: https://www.laughfactory.com/jokes/political-jokes.
[83]
Reader’s Digest, “Political jokes.” Accessed: 2025. [Online]. Available: https://www.rd.com/list/political-jokes/.
[84]
Reader’s Digest, “Presidential jokes.” Accessed: 2025. [Online]. Available: https://www.rd.com/list/presidential-jokes/.
[85]
DeepContractor, “200K short texts for humor detection.” Kaggle. Accessed: 2025. [Online]. Available: https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection.