September 05, 2025
From the dawn of computing, Alan Turing dreamed of a machine that could use language to communicate like a human being. Recent advances in Large Language Models (LLMs) have shocked the scientific community: a single model can be applied to a wide range of natural language processing (NLP) tasks, sometimes producing output that rivals human communication skills. Models such as GPT, Claude, and Grok have left their mark on the field. However, it remains unclear how much these models understand what they produce, especially on a nuanced theme such as humor. The question of whether computers understand humor is still open (among decoders, the latest to be examined was GPT-2). We address this issue in this paper and show that a fine-tuned decoder (GPT-4o, mean F1-macro score of 0.85) performs as well as the best fine-tuned encoder (RoBERTa, mean F1-macro score of 0.86).
Humor serves human society in many ways: it mitigates the pain of mundane problems, offers a better way to communicate and convey a message, makes people like us, and, at the end of the day, simply makes us feel better (as Charlie Chaplin once said: "A day without laughter is a day wasted."). What makes us laugh, however, has been studied for only a few decades, and the ability to "grasp" a joke is considered one of the great enigmas of human cognition. It is therefore no surprise that this "human" ability is considered hard for a machine to "grasp". According to the review in [3], humor classification involves two stages: feature selection and algorithm selection. These two stages are covered thoroughly in the next two sub-sections, 2.1 and 2.2.
Over time, many approaches have been proposed for extracting good features for humor classification. They fall into three main categories:
Feature Engineering - old-school methods that were explored extensively in the era of classical Machine Learning (ML).
Ambiguity Detection - a word that may have more than one meaning can help to identify a joke [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]
Incongruity Detection - one line of thought is interrupted by a different line of thought within the same sentence [4], [9], [12], [14]–[24]
Emotion-Based Detection - certain words can indicate the narrator’s attitudes, feelings, moods, etc. [21], [22], [25]–[27]
Unexpectedness Detection - a joke may touch on a taboo subject or describe an utterly absurd situation, which makes us laugh [4], [17], [25]
Subjectivity - different people find different jokes funny [21]. This happens when the joke contains language that implies beliefs, speculations, criticisms, opinions, and evaluations [28]. This kind of feature can help predict whether a text contains humor or not [12], [16], [19]–[23]
Negation - funny texts often include negative words such as "not", "no", "isn’t", "doesn’t", "don’t", denials, or other words with negative meaning [9], [22], [27], [29]–[33].
Automated Features - features extracted without manual work; however, too many features can lead to the curse of dimensionality, so researchers apply feature selection.
Word Embeddings - words are represented as dense vectors that capture their semantic meaning; words with similar meanings have a higher cosine-similarity score [34] (see the code sketch at the end of this subsection). Choosing the right features can improve joke recognition [20], [23], [26], [31], [35]–[42].
Bag-of-Words - this method counts the words in a text, ignoring word order and grammar [34], which results in long, sparse vectors. It can still be helpful, since certain words are good indicators of whether a text is humorous or not [43]. The semantics of the text, an important signal for classification, can also be captured by this kind of feature extraction [12], [39], [43]–[45].
Part of Speech (POS) - the frequencies of, for example, nouns, pronouns, verbs, and modifiers can predict whether a text is funny. Interestingly, the number of occurrences or the length of the “noun-verb-adjective” triplet was also found to be a good predictor. POS tags are also useful because, for example, the positions of pun words correlate with the locations of the grammatical tags.
Acoustic-Prosodic - when a voice recording is available, humorous parts can be detected from the way they are spoken. Since this paper studies textual information only, this type of feature extraction is not considered further [5], [8]–[11], [15], [23]–[27], [31], [42], [43], [46].
Lexical Features - a lexical resource consists of language dictionaries and additional data that can contribute to NLP. It contains words, sub-words, and collocations, each of which may carry phonological and semantic relations such as synonyms, spelling, etc. The most commonly used lexical resource is WordNet, although some research findings point out that this dictionary is still lacking [47].
Alliteration - occurs when two or more adjacent words share the same initial sound, which makes the sentence unexpected and funny [4], [8], [9], [12], [18]–[24], [27], [33], [47].
Antonymy - occurs when two or more words with opposite meanings appear in the same sentence, which can make it humorous. This feature is usually extracted using a dictionary (typically WordNet antonyms) [9], [22], [26], [27], [30], [36], [47].
Frequency - the frequency of certain words, or of certain word combinations, can suggest whether a sentence is humorous, since these words or combinations are common in jokes but rare in regular texts [10], [11], [14], [17], [23], [40], [43], [46], [48]–[51].
Polarity - a negative word in a positively framed sentence can make us laugh, and vice versa [9], [12], [19]–[26], [29], [32], [46], [49].
Sentiment-based - Sentiment is a mental state often shaped by feelings and emotional response, and is commonly used to create humorous elements [6], [9], [12], [17], [20], [24], [27], [31], [32], [50].
Adult Slang - taboo themes or sexual slang can be good predictors of humor [9], [12], [27], [30], [33], [47], [49].
Written-Spoken - colloquial words are sometimes used in a written text, which can be a strong characteristic of a humorous text [14], [15], [17].
Rhyme - When two words have the same phonetic ending, but they mean different things, the sentence can be perceived as funny [4], [8], [9], [12], [18]–[22], [24], [33], [44], [47]
Stylistic - stylistic characteristics encompass various textual and formatting elements, including word and text length metrics, average word size, connective terms, emojis, hash symbols, web links, referential expressions, abbreviations, question-forming pronouns, capitalized text, bracketed content, highlighted terms, interrogative elements, and punctuation frequency (exclamation points, quotation marks). These features can predict whether a text is humorous or not [14], [15], [17], [19], [24], [26], [27], [30], [41], [43], [45], [48]–[52].
Human Centeredness - knowledge about persons, social groups, social relations, and personal pronouns can be a good predictive measure for humorous texts [27], [29], [31]–[33], [49].
Similarity - two types of similarity may occur in funny texts: syntactic similarity, which focuses on in-text POS similarity, and semantic similarity. The similarity score is usually calculated from the distance between concepts and terms [18], [21], [23], [26], [31], [33], [36], [39], [44].
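To make two of the automated features above concrete (the sketch referenced in the Word Embeddings item), the following minimal Python sketch shows a bag-of-words representation with scikit-learn and a cosine-similarity comparison of dense word vectors. It is an illustration only, not the feature pipeline of any cited work; the toy sentences are invented, and the spaCy `en_core_web_md` model (which ships with word vectors) is assumed to be installed.

```python
# Illustrative sketch of two automated features: sparse bag-of-words counts
# and cosine similarity between dense word embeddings. Toy data only.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Why did the chicken cross the road? To get to the other side.",
    "The committee will meet on Tuesday to review the budget.",
]

# Bag-of-words: word order and grammar are ignored; vectors are long and sparse.
bow = CountVectorizer()
X = bow.fit_transform(texts)
print(X.shape)  # (2, vocabulary size)

# Word embeddings: dense vectors whose cosine similarity reflects closeness in meaning.
nlp = spacy.load("en_core_web_md")
funny, hilarious = nlp("funny").vector, nlp("hilarious").vector
print(cosine_similarity([funny], [hilarious])[0, 0])  # close meanings -> high score
```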
There are four types of algorithms for classifying a text as humorous or not humorous:
Supervised Learning Approaches - algorithms such as Support Vector Machines (SVM), Random Forests (RF), Naive Bayes (NB), Decision Trees (DT), Logistic Regression (LR), and K-Nearest Neighbors (K-NN) were the building blocks of rudimentary classical NLP [15], [30]; the studies [53], [54] found RF to be the best among these algorithms.
Deep Learning Approaches - Deep Learning (DL) refers to any neural network with more than three layers. Its big advantage over the previous group of algorithms is that it extracts features automatically, without human intervention [55]. DL covers many architectures, such as Convolutional Neural Networks (CNN), Graph Convolutional Networks (GCN), Recurrent Neural Networks (RNN), Bidirectional RNNs (BiRNN), Long Short-Term Memory networks (LSTM), Bidirectional LSTMs (BiLSTM), Gated Recurrent Units (GRU), Bidirectional GRUs (BiGRU), Contextual Memory Fusion Networks (C-MFN), and Multi-Layer Perceptrons (MLP). DL algorithms were found to perform very well on comprehension tasks according to [56], and the best algorithm according to [3] was the LSTM. Some researchers combined two of the above algorithms to obtain further improvements [10], [11], [38], [42], [57], [58].
Transfer Learning Approaches - the complexity of humor as a human emotional response requires extensive background knowledge and deep contextual awareness. As a result, transfer learning strategies that build on pre-trained language models (PLMs) have become increasingly prominent in contemporary research, driven by their substantial progress in neural architecture development. This approach is a machine learning framework that adapts existing PLMs to new applications, drawing on the broad knowledge they acquired from general-domain data [59]. The review in [3] examined the following PLMs: Google’s BERT (Bidirectional Encoder Representations from Transformers) [60]–[65], Google’s ALBERT (A Lite BERT) [66], Google’s XLNet (Generalized Autoregressive Pretraining for Language Understanding) [62], Google’s Transformer-XL (Attentive Language Models Beyond a Fixed-Length Context) [62], Facebook’s RoBERTa (Robustly Optimized BERT Pretraining Approach) [2], [61], [64], [66], Facebook’s XLM (Enhancing BERT for Cross-lingual Language Model) [62], [64], Microsoft’s CodeBERT, and OpenAI’s GPT-2 [62]. BERT was the most common model, and also the State-of-the-Art (SOTA) at the time. [63] showed that BERT was the best at classifying text generated by GPT-2 as humorous or humorless. In SemEval-2020, the Hitachi team likewise showed that RoBERTa and BERT-large were the best models [2]. A fine-tuning sketch in this spirit is shown after this list of approaches.
Rule-Based Approaches - rule-based methodologies identify humorous elements in text through predefined criteria that rely on dictionaries and word databases. These criteria describe the discovery of patterns and meaningful connections among data samples. The most commonly explored algorithm is the Lesk algorithm, because it works fairly well and lends itself to more complex approaches built on top of it [67]. Additional computational models employed in the reviewed research include Pointwise Mutual Information (PMI), dependency networks, Markov models, Word Sense Disambiguation (WSD) frameworks, Weighted Finite-State Transducers (WFSTs), Bayesian classification, Latent Semantic Analysis (LSA), Gloss Vector representations, and Inverse Document Frequency (IDF) techniques. However, when compared to classic ML algorithms, this technique was found to be inferior [18], [68]
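As a concrete illustration of the transfer-learning approach (the sketch referenced above), the following minimal Python sketch fine-tunes RoBERTa-base as a six-way joke classifier with Hugging Face Transformers. The CSV paths, column names, and hyperparameters are placeholders and do not necessarily match the training setup reported later in this paper.

```python
# Minimal sketch: fine-tune a pre-trained encoder (RoBERTa-base) for 6-way
# joke classification. File names, columns, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=6)  # 5 humor types + "no-joke"

# Assumed CSVs with a "text" column and an integer "label" column (0-5).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=128),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-humor",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
print(trainer.evaluate())
```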
GPT-3 is very limited in generating jokes. According to [69], it mostly repeats jokes memorized from human data (25 jokes account for 90 percent of its generated jokes). It can explain a joke, but this ability is limited: it mostly understands puns/wordplay and not other types of humor. Moreover, when asked to explain a regular (non-joke) sentence, it classifies it as a joke and hallucinates an explanation.
While GPT-4 is supposed to be better at generating jokes, it still struggles to generate a whole stand-up or comedy show according to [70], because the model is instructed to avoid certain words that could serve as a punchline, and because it lacks the context of who the speaker and the audience are. Another issue to bear in mind is that creating humor requires the opposite of Chain-of-Thought (CoT): instead of writing down each step, as in reasoning tasks, we need to skip steps and leave the listener a gap of information to fill in on their own. This process is called Leap-of-Thought (LoT), and unlike CoT it cannot be induced by prompting alone. Creative Leap-of-Thought (CLoT) is an LLM LoRA-tuned on a game of humor and association (the Oogiri-GO dataset, in English, Chinese, and Japanese) with text and images, and it was found to outperform every other model as of the publication of that paper [71].
Over the last three years, LLMs have bloomed, especially decoders but also encoders. In this paper we check which models are better at classification into six classes: five types of jokes and one not-funny sentence class. This follows [1], a paper that used jokes generated by GPT-4o and showed that such jokes can be classified fairly well by encoders. We show that today this can also be done for human ("real") jokes, and by a fine-tuned decoder whose performance matches the best encoder so far (RoBERTa). Certain encoders, such as RoBERTa-base and RoBERTa-large, are slightly better than the fine-tuned decoder (GPT-4o), but not significantly so.
We collected English humor data categorized into five types, each defined below along with the number of examples and their respective sources.
Absurdity Jokes—humor based on nonsensical or illogical scenarios that defy common sense—were sourced from [72] (58 jokes), [73] (13), [74] (21), and [75] (24).
Dark Jokes—humor involving taboo, morbid, or tragic subjects presented in a humorous way—were scraped from [76] (115).
Irony Jokes—humor that relies on stating the opposite of what’s expected or highlighting contradictions between expectations and reality—were collected from [77] (75) and [78] (73).
Wordplay Jokes—humor derived from clever manipulation, puns, or play on words—were taken from [79] (105) and [80] (273).
Social Commentary Jokes—humor highlighting societal issues, trends, or behaviors in a satirical or critical manner—were gathered from [81] (11), [82] (24), [83] (29), and [84] (2).
All jokes were checked manually, one after another, and any joke from a non-wordplay category that also contained a wordplay element was excluded from the dataset. This step ensures the classification task remains a multi-class problem rather than a multi-label one. However, to illustrate the inherent ambiguity of humor categorization, Table 1 presents several representative jokes that feature multiple humor types, especially where wordplay overlaps with another category.
Joke | Humor Types | Explanation |
---|---|---|
A high school senior visits a psychic... "I’ve applied to 10 different colleges," the student said. "Which ones will accept me? Which one will I attend?" "That is hard to say," said the psychic. "But you will spend an absurd sum of money." "How do you know this?" the student asked. The psychic replied, "It’s mostly intuition." | Absurdity, Wordplay | Absurd: The student consults a psychic for college decisions, an illogical scenario. Wordplay: The punchline hinges on the pun between "intuition" (gut feeling) and "in tuition" (financial cost). |
It turns out a major new study recently found that humans eat more bananas than monkeys. It’s true. I can’t remember the last time I ate a monkey. | Dark, Wordplay | Wordplay: The joke plays on the ambiguous sentence structure, twisting the meaning from a comparison of diet to a claim about eating monkeys. Dark: It introduces a taboo subject—eating primates—in a casual and humorous tone. |
It is funny and sad thing how a group of squid is not called a squad. | Irony, Wordplay | Wordplay: Plays on the phonetic similarity between "squid" and "squad," setting up a pun. Irony: Highlights the mismatch between what would be a fitting name and actual naming conventions, provoking humor through unexpected linguistic rules. |
If con is the opposite of pro, then is Congress the opposite of progress? | Social Commentary, Wordplay | Wordplay: The joke plays on the prefixes "con" and "pro," twisting the meaning of "Congress" into an opposite of "progress." Social Commentary: Critiques political inefficiency or dysfunction in a satirical way. |
Repetitions of the same word or collocation within a category were replaced with random words. For example, the collocation "the ironic person" repeated itself in the irony jokes, so we replaced it with a random first and last name, e.g.:
Before: “Why did the ironic person wear sunscreen? Because they wanted to get a sunburn!”
After: “Why did Kelly Jones wear sunscreen? Because she wanted to get a sunburn!”
Words that gave a clue to the category were also changed. For example, half of the jokes in the irony category included the word "ironic" or "irony"; we replaced these words with "funny", "surprising", or any other word suited to the context (a minimal code sketch of this step follows the example below), e.g.:
Before: "What’s ironic about the Bible? In one of the most interesting irony examples, the most shoplifted book in America is The Bible."
After: "What’s funny about the Bible? In one of the most interesting absurd examples, the most shoplifted book in America is The Bible."
Regular sentences (not funny / non-joke) were randomly sampled from a Kaggle dataset [85], and we made sure that the number of regular sentences equals the total number of jokes.
Type of Sentence | Number of Examples |
---|---|
Absurdity | 75 |
Dark | 77 |
Irony | 105 |
Wordplay | 378 |
Social Commentary | 62 |
No-Joke (regular sentence) | 697 |
Total | 1394 |
The data were split into Train (80%), Validation (10%), and Test (10%). Stratified sampling was used to maintain the same label ratios across the splits.
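A minimal sketch of this split, assuming the combined data sits in a hypothetical jokes.csv with text and label columns, could look as follows:

```python
# Sketch of the 80/10/10 stratified split; "jokes.csv" and its columns are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("jokes.csv")  # columns: "text", "label" (6 classes)

# First carve off 20%, then split that half-and-half into validation and test,
# stratifying on the label each time so class ratios stay constant.
train_df, rest_df = train_test_split(df, test_size=0.2, stratify=df["label"],
                                     random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, stratify=rest_df["label"],
                                   random_state=42)
print(len(train_df), len(val_df), len(test_df))
```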
GPT-4o (the main, large decoder) was also fine-tuned using the OpenAI API. It was trained for 3 epochs with a batch size of 4 and a Learning Rate (LR) multiplier of 2. We found that the fine-tuned GPT-4o reached zero training loss very quickly, which means the model over-fits and is "hungry" for more data (Figure 3).
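For reference, a minimal sketch of launching such a fine-tuning job through the OpenAI Python SDK is given below, using the hyperparameters just described. The training file name and GPT-4o snapshot are placeholders, and the exact job configuration we used may differ.

```python
# Sketch of a GPT-4o fine-tuning job via the OpenAI API (3 epochs, batch size 4,
# LR multiplier 2). The training file and model snapshot name are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The training data is a JSONL file of chat-formatted examples, e.g.
# {"messages": [{"role": "user", "content": "<joke>"},
#               {"role": "assistant", "content": "<category>"}]}
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",        # a fine-tunable GPT-4o snapshot
    training_file=train_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```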
Model Type & Name | Recall-Macro | Precision-Macro | F1-Macro | Accuracy |
---|---|---|---|---|
Encoders | ||||
BERT-base-uncased | \(0.5635 \pm 0.0601\) | \(0.6135 \pm 0.0606\) | \(0.5808 \pm 0.0588\) | \(0.8024 \pm 0.0251\) |
BERT-base-multilingual | \(0.7994 \pm 0.045\) | \(0.7492 \pm 0.0476\) | \(0.7674 \pm 0.0481\) | \(0.881 \pm 0.027\) |
XLNet-base-cased | \(0.7455 \pm 0.0471\) | \(0.7185 \pm 0.0362\) | \(0.7212 \pm 0.0402\) | \(0.8642 \pm 0.0214\) |
distilBERT-base-uncased | \(0.7682 \pm 0.0159\) | \(0.7274 \pm 0.03679\) | \(0.7358 \pm 0.0333\) | \(0.8619 \pm 0.027\) |
ALBERT-base | \(0.7579 \pm 0.0789\) | \(0.7402 \pm 0.0409\) | \(0.7201 \pm 0.053\) | \(0.8476 \pm 0.023\) |
RoBERTa-base | \(\boldsymbol{0.8745 \pm 0.0081}^{\dagger}\) | \(\boldsymbol{0.8652 \pm 0.0304}^{\dagger}\) | \(\boldsymbol{0.8566 \pm 0.0164}^{\dagger}\) | \(\boldsymbol{0.9286 \pm 0.0124}^{\dagger}\) |
XLM-roBERTa-base | \(0.8061 \pm 0.042\) | \(0.7735 \pm 0.057\) | \(0.7462 \pm 0.0569\) | \(0.8691 \pm 0.0393\) |
DeBERTa-v3-base | \(0.7855 \pm 0.0691\) | \(0.784 \pm 0.1056\) | \(0.7733 \pm 0.0882\) | \(0.8905 \pm 0.0406\) |
ModernBERT-base | \(0.687 \pm 0.0131\) | \(0.722 \pm 0.0484\) | \(0.6871 \pm 0.0182\) | \(0.8642 \pm 0.0124\) |
BERT-large-uncased | \(0.7187 \pm 0.0549\) | \(0.688 \pm 0.0393\) | \(0.6959 \pm 0.0425\) | \(0.8381 \pm 0.0165\) |
XLNet-large-cased\(^{\ast}\) | \(0.7936 \pm 0.0628\) | \(0.7451 \pm 0.048\) | \(0.7606 \pm 0.0541\) | \(0.8714 \pm 0.0247\) |
ALBERT-large-v2 | \(0.7119 \pm 0.0133\) | \(0.7101 \pm 0.0177\) | \(0.7057 \pm 0.0129\) | \(0.8524 \pm 0.0109\) |
RoBERTa-large | \(0.8594 \pm 0.0473^{\dagger}\) | \(0.8501 \pm 0.0318\)\(^{\dagger}\) | \(0.8503 \pm 0.0355^{\dagger}\) | \(\boldsymbol{0.9286 \pm 0.0071}^{\dagger}\) |
XLM-roBERTa-large | \(0.8514 \pm 0.0348\) | \(0.8228 \pm 0.0519\) | \(0.8308 \pm 0.0441\) | \(0.9214 \pm 0.0189\) |
DeBERTa-v3-large\(^{\circ}\) \(^{\ast\ast}\) | \(0.8441 \pm 0.0271\) | \(0.8034 \pm 0.0378\) | \(0.8145 \pm 0.0304\) | \(0.9048 \pm 0.0109\) |
ModernBERT-large | \(0.7806 \pm 0.0195\) | \(0.7864 \pm 0.0539\) | \(0.7611 \pm 0.0151\) | \(0.8929 \pm 0.0071\) |
NeoBERT\(^{\circ}\) \(^{\ast}\) | \(0.7719 \pm 0.0861\) | \(0.8191 \pm 0.0274\) | \(0.7765 \pm 0.0786\) | \(0.9049 \pm 0.0251\) |
Encoder-Decoder | ||||
BART-large-mnli-zero shot | \(0.2488\) | \(0.2541\) | \(0.1841\) | \(0.2071\) |
Flan-T5-base-zero shot | \(0.2292\) | \(0.2843\) | \(0.1575\) | \(0.3696\) |
Flan-T5-base-few shots | \(0.1674\) | \(0.1619\) | \(0.1079\) | \(0.3254\) |
Decoders | ||||
Llama-3.2-3B-Instruct-zero shot | \(0.3600\) | \(0.2934\) | \(0.1729\) | \(0.1929\) |
Gemma-2-2b-it-zero shot | \(0.3697\) | \(0.4944\) | \(0.3665\) | \(0.6571\) |
Qwen2-7B-Instruct-zero shot | \(0.3746\) | \(0.4221\) | \(0.2759\) | \(0.2595\) |
Mistral-7B-Instruct-v0.2-zero shot | \(0.4055\) | \(0.5071\) | \(0.3184\) | \(0.3000\) |
GPT-4-zero shot | \(0.5919\) | \(0.5808\) | \(0.5047\) | \(0.5214\) |
Llama-3.2-3B-Instruct-few shots | \(0.3895\) | \(0.4328\) | \(0.2086\) | \(0.2143\) |
Gemma-2-2b-it-few shots | \(0.4234\) | \(0.4945\) | \(0.3759\) | \(0.4786\) |
Qwen2-7B-Instruct-few shots | \(0.4791\) | \(0.4777\) | \(0.3452\) | \(0.2950\) |
Mistral-7B-Instruct-v0.2-few shots | \(0.4801\) | \(0.6030\) | \(0.3865\) | \(0.3857\) |
GPT-4-few shots | \(0.6381\) | \(0.6878\) | \(0.5955\) | \(0.6429\) |
GPT-4o-fine-tuned | \(0.8618 \pm 0.0025^{\dagger}\) | \(0.8464 \pm 0.0089^{\dagger}\) | \(0.8522 \pm 0.0056^{\dagger}\) | \(0.9238 \pm 0.0041^{\dagger}\) |
Bold scores represent the best score in each column.
Underlined scores are the second-best score in each column.
\(\pm\) denotes one standard deviation around the mean.
\(^{\ast}\)Trained with batch size 8 instead of 16.
\(^{\ast\ast}\)Trained with batch size 4 instead of 16.
\(^{\circ}\)Model quantization was set to bf16 precision instead of fp32.
\(^{\dagger}\) - no significant difference (p-value greater than 0.05) according to Welch’s t-test.
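A minimal sketch of this significance check, assuming per-run F1-macro scores for two models are available as plain lists (the numbers below are placeholders, not our actual run results):

```python
# Welch's t-test (unequal variances) between per-run F1-macro scores of two models.
# The score lists are placeholders, not the actual experimental results.
from scipy import stats

roberta_f1 = [0.85, 0.87, 0.86]   # hypothetical per-seed scores, model A
gpt4o_f1 = [0.85, 0.86, 0.85]     # hypothetical per-seed scores, model B

t_stat, p_value = stats.ttest_ind(roberta_f1, gpt4o_f1, equal_var=False)
print(p_value > 0.05)  # True -> difference not significant at the 5% level
```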
This study aimed to probe whether various LLMs can understand humor. The test was done by classifying each sentence into one of six categories (five types of humor or a no-joke regular/negative sentence). The surprising result is that fine-tuned encoders and fine-tuned decoders perform equally well, which contradicts the previous view that an encoder (RoBERTa) is better.
It is important to mention some issues that were not taken into consideration in this paper: heterogeneous data (which may imply different jargon for different categories), data scarcity (only 1394 examples, half of them, 697, being regular "non-joke" sentences), differences in sentence length across categories, and semantic cues that can help classification (e.g., the absurdity jokes were mainly taken from the world of political jokes). Also, GPT-5 was introduced only in the last couple of days, and at the time of publication it is not publicly available as a distinct model for fine-tuning.