July 23, 2025
Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents the MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset’s utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset’s effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges like class imbalance, generalisability and reproducibility.
phishing, dataset, email, artificial intelligence, machine learning, deep learning
Phishing remains one of the most prevalent and damaging forms of cybercrime, with over 3.76 million cyberattacks recorded in 2024, causing widespread financial losses globally and contributing to hundreds of thousands of compromised accounts (mostly email) in the same year [1]–[4]. It also accounts for 15% of attack vectors in data breaches, costing an average of $4.88 million per breach [5]. Email is one of the many media through which these attacks are performed, as it remains a central channel for personal and professional communication. With the ever-evolving attack landscape, automated detection and mitigation strategies have become essential to safeguarding users and infrastructures. Machine Learning (ML) approaches promise scalable and adaptive defence mechanisms against evolving phishing campaigns; however, their effectiveness is highly dependent on the quality and diversity of the training data.
Despite the availability of several open-source phishing datasets, recent literature highlights some limitations in their usability for robust model development [6]. These issues include extreme class imbalance, limited feature engineering, narrow coverage of phishing tactics, outdated samples, and inconsistent preprocessing practices [6]–[8]. Furthermore, many studies rely on private or small-scale datasets, which limits generalisability and reproducibility [9]. These challenges restrain the development of efficient and realistic ML models, as they may lead to overfitting or inflated performance metrics on small or biased test sets.
In this paper, we address these limitations by analysing and comparing several existing open-source email phishing datasets to assess their strengths and limitations, as well as several related works (Section 2). Based on this analysis, we propose combining multiple datasets to increase the diversity of phishing examples and expand the overall sample size, thereby enhancing the robustness and generalisability of ML models. We apply consistent preprocessing and feature engineering across various feature categories, resulting in a comprehensive and ready-to-use dataset tailored for ML applications, named MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus (Section 3).
To verify the utility of this dataset, we conduct different experiments using four distinct ML and Deep Learning (DL) models, each trained on different feature subsets (Section 4). As is discussed in Section 5, the results consistently demonstrate the effectiveness of the proposed dataset in supporting accurate and reliable phishing detection. In conclusion, by offering a standardised, feature-rich, and scalable resource, this dataset serves as the main contribution of our study and a valuable foundation for future research and development in phishing detection, as further detailed in Section 6.
Datasets are sets of data relevant to a specific topic. Data can be artificially generated or collected from different sources. Datasets are labelled if they contain a classifying feature. In the datasets described below, labels identify each email sample as benign or phishing, although labels can have other meanings in other datasets.
There are several publicly available email spam and phishing datasets, some of which have been used for ML-based email phishing detection [10]. Spam datasets are commonly used in phishing detection studies [11], [12] because phishing is a type of spam and these emails share many characteristics, such as being unsolicited, unwanted, deceptive, and attempting to solicit an action [13]. Table 1 provides an overview of these datasets, including their content type, label availability, and sample sizes.
Dataset | Content | Label | Sample Size |
---|---|---|---|
Enron Corpus | Email | No | 517,401 |
Nazario Phishing Corpus | Email phishing | Yes | 11,527 |
SpamAssassin | Email spam | Yes | 6,047 |
Nigerian Fraud | Email phishing | Yes | 3,975 |
TREC-05 | Email spam | Yes | 92,189 |
TREC-06 | Email spam | Yes | 37,822 |
TREC-07 | Email spam | Yes | 75,419 |
CEAS-08 | Email spam | Yes | 39,154 |
Ling-Spam | Email spam | Yes | 2,893 |
Email4S | Email phishing | Yes | 18,650 |
The corpora available for phishing detection vary significantly in origin and content. They range from large, unlabelled corporate archives like the Enron Corpus [14], often used as a source of benign emails, to specialised collections containing only malicious samples, such as the Nazario Phishing Corpus [15] and Nigerian Fraud letters [16]. Many were curated specifically for academic research and benchmarking, including the seminal Ling-Spam [17], the SpamAssassin corpus [18], and collections from standardised competitive events like the TREC datasets [19]–[21] and the CEAS-08 [22]. More recent efforts focus on developing labelled, phishing-specific datasets, for example by extracting and curating samples from larger collections, as demonstrated by initiatives like Email4S [23].
Existing phishing email datasets face significant challenges that restrict the development of robust ML-based detection systems. Recent reviews highlight that many corpora suffer from data imbalance, outdated structures, and limited feature engineering. There is often a trade-off between scope and diversity; classic spam corpora like the SpamAssassin dataset offer high source diversity, whereas dedicated phishing collections, such as the Nazario Phishing Corpus, tend to be smaller and narrower in scope, lacking coverage of varied phishing tactics. These differences can heavily impact study results [7].
Furthermore, much of the current research relies on small public or inaccessible private datasets, which affects generalisability. Ethical concerns, such as the disclosure of sensitive or personally identifiable information, also make it more difficult to acquire new datasets. Although some studies attempt to mitigate these issues by merging multiple corpora, these efforts often remain limited in scale, combining only two or three sources [24]. Addressing these quality issues is crucial for improving the reliability and generalisability of phishing detection research [6], [8].
In the phishing email domain, researchers commonly extract features from six principal sources: email body text, embedded URLs, attachments, message headers, HTML structure, and external domain reputation, to expose both technical anomalies (e.g. mismatched “From” headers or IP-based links) and social-engineering cues (e.g. urgency words or deceptive phrasing) [25]–[28].
Email Body Text Features contain free-form writing that can reveal many social-engineering tactics and tricks used by attackers to exploit the human factor of a system [29], [30]. These cues are evident not only in the direct content of the email but also in its lexicality, syntax, and semantics. Some approaches used to extract these features from text include vectorisation methods such as Bag of Words or TF-IDF [31], counts of trigger words like “verify,” “password,” and “urgent” [32], readability scores such as Flesch–Kincaid and Gunning Fog [33], and sentiment analysis to detect emotional manipulation [34]. Together, these linguistic characteristics help identify deceptive emails crafted to manipulate recipients.
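As a small illustration of the trigger-word counts mentioned above, the following sketch counts whole-word occurrences in an email body. The function name and the three-word list are assumptions made for this example (the words are simply the ones quoted in the text); it is not the feature extractor used in any of the cited works.

```python
import re
from collections import Counter

# Hypothetical trigger-word list, taken from the examples quoted in the text.
TRIGGER_WORDS = ("verify", "password", "urgent")


def trigger_word_counts(body: str) -> dict[str, int]:
    """Count case-insensitive, whole-word occurrences of each trigger word."""
    tokens = re.findall(r"[a-z']+", body.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in TRIGGER_WORDS}
```

A real system would typically combine such counts with the vectorisation, readability, and sentiment features listed above rather than use them alone.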
Embedded URL Features are crucial in identifying phishing emails, as attackers often use deceptive links to lure victims. These features typically include various metrics related to the URL’s lexical structure, such as the total length of the URL, the number of subdomains it contains, the presence of IP-literal hostnames (e.g. https://192.168.1.1/home), the directory depth (measured by counting the segments separated by “/”), token counts derived by splitting the URL on characters like “.”, “-”, and “_”, digit‑to‑letter ratios and the overall character entropy. Beyond these lexical characteristics, other suspicious indicators include the use of dubious top-level domains, the presence of UTF-encoded characters, and the employment of excessive URL-shortening services [35]–[37]. These combined URL features help in detecting potentially harmful links embedded in phishing emails.
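The lexical metrics listed above can be sketched with the Python standard library as follows. The function name and dictionary keys are illustrative assumptions, and the subdomain count is deliberately naive (every label beyond the last two is treated as a subdomain); a careful implementation would consult the Public Suffix List.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def url_lexical_features(url: str) -> dict:
    """Compute a few lexical URL features (illustrative sketch only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # IP-literal hostnames such as https://192.168.1.1/home
    try:
        ipaddress.ip_address(host)
        is_ip = True
    except ValueError:
        is_ip = False

    # Naive assumption: labels beyond "domain.tld" are subdomains.
    labels = host.split(".") if host else []
    subdomain_count = 0 if is_ip else max(len(labels) - 2, 0)

    # Directory depth: non-empty path segments separated by "/".
    path_depth = len([seg for seg in parts.path.split("/") if seg])

    # Tokens obtained by splitting the URL on ".", "-" and "_".
    tokens = [tok for tok in re.split(r"[.\-_]", url) if tok]

    digits = sum(ch.isdigit() for ch in url)
    letters = sum(ch.isalpha() for ch in url)

    # Shannon entropy over the characters of the URL.
    counts = Counter(url)
    entropy = -sum(
        (n / len(url)) * math.log2(n / len(url)) for n in counts.values()
    ) if url else 0.0

    return {
        "length": len(url),
        "is_ip_host": is_ip,
        "subdomain_count": subdomain_count,
        "path_depth": path_depth,
        "token_count": len(tokens),
        # Reported as 0.0 when the URL contains no letters.
        "digit_letter_ratio": digits / letters if letters else 0.0,
        "entropy": entropy,
    }
```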
Attachment Features play a vital role in detecting malicious content within emails, as attachments can carry malicious payloads or impersonate legitimate forms. These features include the number and types of attached files (e.g. executables, Office documents containing macros, and compressed archives), file‐size anomalies, discrepancies between the file extension and the MIME type, embedded macros or scripts, and the count of embedded images. Additionally, statistical flags for unusually large or zero-byte attachments and signature-based checks on macro patterns are effective tools for identifying concealed malware or credential-harvesting forms [38], [39]. Together, these features help uncover malicious attachments that aim to compromise the recipient’s security.
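A minimal sketch of some of these attachment features, using Python's standard `email` package, is shown below. The extension-to-MIME table is a tiny hypothetical stand-in for a real lookup, the function name is an assumption, and macro or signature checks are out of scope here.

```python
import os
from email.message import EmailMessage

# Hypothetical, deliberately tiny extension-to-MIME map for illustration only.
EXPECTED_MIME = {
    ".pdf": "application/pdf",
    ".zip": "application/zip",
    ".png": "image/png",
}


def attachment_features(msg: EmailMessage) -> dict:
    """Summarise attachments of a message parsed with the modern email API."""
    types, zero_byte, mismatched = [], 0, 0
    for part in msg.iter_attachments():
        content_type = part.get_content_type()
        types.append(content_type)

        # Decode the transfer encoding to measure the real payload size.
        if not part.is_multipart():
            payload = part.get_payload(decode=True) or b""
            if len(payload) == 0:
                zero_byte += 1

        # Flag a file extension that disagrees with the declared MIME type.
        extension = os.path.splitext(part.get_filename() or "")[1].lower()
        expected = EXPECTED_MIME.get(extension)
        if expected is not None and expected != content_type:
            mismatched += 1

    return {
        "attachment_count": len(types),
        "attachment_types": types,
        "zero_byte_count": zero_byte,
        "extension_mismatch_count": mismatched,
    }
```

Note that `iter_attachments()` is only available on `EmailMessage` objects, i.e. messages built or parsed with `email.policy.default` rather than the legacy default policy.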
Message Header Features carry an email’s transport and routing metadata, which attackers often spoof or manipulate to conceal their identity or mislead recipients. Notable examples include the number and sequence of the Received header fields, which record each server the email passed through and create a chain with timestamps (abnormal counts or irregular sequences in these hops may indicate suspicious routing or email relaying through unexpected servers); and discrepancies between the From header field, which shows the address the sender claims to use; the Return-Path field, which specifies the destination for bounce messages; and the envelope sender (the actual sender address used during SMTP transmission), which can reveal inconsistencies or attempts to mask the sender identity. Authentication results from protocols like SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail), and DMARC (Domain-based Message Authentication, Reporting & Conformance) also provide strong spoofing indicators by verifying whether emails actually come from who they claim to be from. Additional suspicious indicators include anomalous domains in the Reply-To field (which specifies the address that replies will be directed to) or the List-Unsubscribe field (used for newsletter opt-outs), as attackers might use these fields to redirect victims or avoid detection; and unusual timestamp discrepancies, such as impossible or inconsistent sending times across headers. By quantifying these signals, detection models can effectively learn to identify forged or illicit senders and flag potentially malicious emails [40].
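Some of these header signals can be sketched with Python's standard `email` package, as below. The function and key names are assumptions for illustration. The envelope sender is not part of the message itself, so this sketch uses Return-Path as a proxy, and it does not evaluate SPF, DKIM, or DMARC results.

```python
from email import message_from_string
from email.utils import parseaddr


def _domain(header_value) -> str:
    """Return the lower-cased domain of the address in a header, or ''."""
    address = parseaddr(str(header_value or ""))[1]
    return address.rpartition("@")[2].lower() if "@" in address else ""


def header_features(raw_email: str) -> dict:
    """Extract a few routing and sender-consistency indicators."""
    msg = message_from_string(raw_email)
    from_dom = _domain(msg.get("From"))
    return_dom = _domain(msg.get("Return-Path"))
    reply_dom = _domain(msg.get("Reply-To"))
    return {
        # Number of relay hops recorded in Received header fields.
        "received_count": len(msg.get_all("Received", [])),
        # Mismatches are only flagged when both domains are present.
        "from_return_path_mismatch": bool(
            from_dom and return_dom and from_dom != return_dom
        ),
        "from_reply_to_mismatch": bool(
            from_dom and reply_dom and from_dom != reply_dom
        ),
    }
```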
HTML Structure Features are essential for analysing emails that include HTML content, as attackers often embed hidden forms, scripts, or obfuscated elements to deceive recipients. These features involve counting and examining HTML tags commonly exploited in phishing attempts, such as <form> (which can collect user credentials), <iframe> (often used to load external content invisibly), <script> (which may execute malicious code), <img> (which can track user interaction or load deceptive visuals), and inline CSS (frequently used to hide elements from the recipient). Key structural indicators include the depth of tag nesting (which can suggest attempts to hide content deep within complex layouts), the ratio of hidden elements (e.g. those using display:none) to visible ones, mismatches between anchor text and the actual href targets (which can disguise malicious links), and the presence of JavaScript event handlers like onload or onclick (which can trigger actions when the user interacts with the email). By extracting and analysing these elements, it becomes possible to uncover the client-side techniques used by phishing kits to bypass basic text-based detection methods [41], [42].
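The following sketch shows how several of these structural indicators could be collected with Python's built-in `html.parser`. Class, function, and key names are assumptions for illustration. Nesting depth is omitted for brevity, and the anchor mismatch check only fires when the visible anchor text itself looks like a URL.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HTMLStructureStats(HTMLParser):
    """Collects a few of the structural indicators described above."""

    WATCHED_TAGS = ("form", "iframe", "script", "img")

    def __init__(self):
        super().__init__()
        self.tag_counts = {tag: 0 for tag in self.WATCHED_TAGS}
        self.hidden_elements = 0
        self.event_handlers = 0
        self.link_mismatches = 0
        self._href = None  # href of the <a> currently open, if any
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.tag_counts:
            self.tag_counts[tag] += 1
        # Inline CSS that hides the element from the recipient.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "display:none" in style:
            self.hidden_elements += 1
        # Attributes such as onload or onclick.
        self.event_handlers += sum(1 for name in attributes if name.startswith("on"))
        if tag == "a":
            self._href = attributes.get("href") or ""
            self._anchor_text = []

    def handle_data(self, data):
        if self._href is not None:
            self._anchor_text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        text = "".join(self._anchor_text).strip().lower()
        if re.match(r"(https?://|www\.)", text):
            shown = urlsplit(text if "://" in text else "//" + text).hostname
            actual = urlsplit(self._href).hostname
            if shown and actual and shown != actual:
                self.link_mismatches += 1
        self._href = None


def html_structure_features(html: str) -> dict:
    parser = HTMLStructureStats()
    parser.feed(html)
    parser.close()
    return {
        "tag_counts": parser.tag_counts,
        "hidden_elements": parser.hidden_elements,
        "event_handlers": parser.event_handlers,
        "link_mismatches": parser.link_mismatches,
    }
```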
Reputation Features extend beyond the immediate content of an email to evaluate the trustworthiness of the attacker’s domains. Important reputation indicators include the domain’s registration age, time remaining until domain expiration, domain name registrar and domain nameserver metadata (which can reveal connections to known malicious actors or unreliable service providers), DNS record time-to-live values (which indicate the refresh rate of domain information), and the geographic location of the hosting IP address (which adds further context, as certain regions may be associated with phishing or cybercrime). Additionally, it is crucial to check whether the host IP or Autonomous System Number (ASN) is on known blacklists. Domains that have short lifespans (newly registered or rapidly changing), use obscure top-level domains, or rely on hosting providers with lax abuse policies tend to have a strong association with phishing campaigns. These reputation features provide valuable signals for identifying potentially malicious domains used in attacks [43], [44].
There are many techniques used for phishing email detection. Some of the most traditional ones include reputation-based methods, which involve assessing the trustworthiness of a sender or website based on historical data [45]; heuristic-based methods, which rely on flexible, experience-informed criteria [46]; and rule-based methods, which classify content according to predefined rules [12], [47]. Despite having consistent detection rates for known threats, these methods lack zero-day attack detection and are hard to maintain, which led researchers to search for novel approaches, often based on Natural Language Processing (NLP), ML, and DL.
Omotehinwa et al. [48] evaluated ensemble ML models for spam email detection, emphasising hyperparameter optimisation. They trained RF and Extreme Gradient Boosting (XGB) models on textual features extracted from the Enron-Spam corpus. XGB performed best with an F1 of 98.16%, demonstrating the effectiveness of ensemble learning in identifying email threats.
Naswir et al. [49] investigated how effectively lexical URL features can detect phishing emails. They extracted features like IP addresses in URLs, special characters (e.g. ‘@’, ‘-’, and ‘%’), URL entropy, and length, alongside basic email metadata, from the Nazario Phishing Corpus and the legitimate Enron Corpus. By optimising a Support Vector Machine (SVM) with the Cuckoo Search algorithm, their method achieved 91% accuracy, demonstrating the strong predictive power of URL-based features for phishing detection.
Khalid et al. [11] introduced LogiTriBlend, a stacked ensemble model for phishing email detection, combining SVM, Logistic Regression, RF, and XGB. They used a small, imbalanced "Phishing Email Dataset" (4591 emails), which comprises the Enron Corpus, SpamAssassin, the Nazario Phishing Corpus and other sources, and addressed the imbalance with SMOTE data augmentation. The models were trained on textual features, vectorised using TF-IDF, Word2Vec, and Doc2Vec. The Doc2Vec-based features with LogiTriBlend achieved the best F1 (99.41%). However, the study’s generalisability might be limited due to the small size of the original dataset and its reliance solely on text features. Chanis et al. [50] also used ensemble methods for phishing detection, uniquely combining stylometric features (such as formatting and writing style) with content-based features. They trained separate classifiers for each feature type, then integrated them using a stacked ensemble. This approach captured both the email’s meaning and writing style, achieving a 98.43% F1 on a private dataset, outperforming content-only models by around 2.2%.
Atawneh and Aljehani [51] compared various Deep Learning (DL) architectures (CNNs, RNNs, LSTMs) for phishing detection, using textual features from email bodies from the Nazario and SpamAssassin corpora. Their proposed BERT+LSTM hybrid model achieved a 99.55% F1, outperforming other standalone models. Kaushik et al. [52] proposed another hybrid DL model, combining LSTM and CNN layers, for phishing detection. Their model processes URL-based and textual features (transformed into 2D image-like matrices) from a balanced dataset, constructed by combining phishing URL data from PhishTank [53] and legitimate URL data from Alexa [54] with additional email datasets from various public repositories. It achieved an F1 of around 99%, demonstrating the efficacy of combining sequential and spatial feature learning.
Although most of the previously mentioned studies use text-based features, other information can also be extracted and used as input for the models. For example, Zhang et al. [55] developed a phishing detection framework that uses diverse features beyond just text. Their approach extracts text from embedded images via Optical Character Recognition (OCR) and combines it with URL-based, web-based, and rule-based features. Using these varied inputs, they trained a two-stage Extreme Learning Machine (ELM) model on a dataset of over 22000 webpages collected from PhishTank, APWG, and legitimate sources such as Alexa’s top sites, achieving 99.04% accuracy.
Zhang et al. [56] used 79 static email features (e.g., sender IP, links, attachments) and textual embeddings to train an RF classifier for phishing detection. Their model achieved 99.97% accuracy on a large, custom dataset (660985 emails), combining benign emails from the Enron Corpus (517401 samples), TREC-07 (25220 samples), and the SpamAssassin dataset (6952 samples); and malicious phishing samples from the Nazario Phishing Corpus (9510 samples), the Wooyun XSS dataset [57] (168 samples), TREC-07 (50199 samples), and an additional 49136 samples generated by summarising and simulating novel types of malicious email attacks that exploit email protocol vulnerabilities. While this dataset is extensive, its heavy reliance on the Enron Corpus for benign emails might introduce bias.
Altwaijry et al. [12] compared DL models for phishing email detection, using only email body and subject content from Nazario Phishing Corpus and SpamAssassin. They proposed a lightweight 1D-CNN architecture (1D-CNNPD), enhanced with recurrent layers like Bi-GRU. Their best model (1D-CNNPD + Bi-GRU) achieved a 99.66% F1, outperforming traditional ML models and comparable DL approaches such as THEMIS [58] and DeepAnti-PhishNet [59] and demonstrating the effectiveness of CNNs combined with recurrent mechanisms for phishing detection from text. Another recent approach by Koide et al. [60] introduced ChatSpamDetector, which uses Large Language Models (LLMs), specifically GPT-4, for phishing detection. They prompt GPT-4 directly with raw email text, allowing it to infer phishing characteristics contextually, rather than relying on engineered features. This zero-shot/few-shot approach achieved 99.70% accuracy on a private balanced dataset, showing LLMs’ ability to capture subtle phishing signals.
All these studies consistently show RF and XGB as top performers in phishing detection, highlighting their reliability and effectiveness for real-world applications involving structured email features.
From the previously described datasets, we selected the Nazario Phishing Corpus, Nigerian Fraud, TREC-05, TREC-06, and TREC-07 to be combined into a new dataset, aiming to enhance the diversity of phishing examples and increase the overall sample size. These datasets were chosen because they provide both the original raw email messages, which allow greater flexibility in feature extraction, and the corresponding classification labels, making them suitable for phishing detection tasks. Together, these sources resulted in a raw dataset comprising 220932 email samples.
Feature engineering is the process of extracting and transforming raw email data into a set of informative features that ML models can understand and learn from. These features can be numerical (e.g. the number of links in an email) or categorical (e.g. the type of attachments an email has). Based on our analysis of commonly extracted features in recent research and their respective mapping to key components of phishing emails (such as body content, embedded links, attachments, headers, HTML structure, and domain reputation), we selected a representative subset of features from each category. Table 2 provides a summary of this selection.
Feature Name | Feature Description |
---|---|
source | Original dataset from which the sample came |
sender | Hashed email address of the sender |
sender_domain | Email domain of the sender |
receiver | Hashed email address of the receiver |
receiver_domain | Email domain of the receiver |
date | Date and Time of when the email was sent (RFC 2822) |
subject | The subject of the email |
content_types | The content type of the body |
body | The email body |
urls | URLs extracted from the email body |
url_count | Number of URLs present on the email |
url_length_max | Maximum URL length |
url_length_avg | Average URL length |
url_subdom_max | Maximum subdomain count |
url_subdom_avg | Average subdomain count |
attachment_count | Number of attachments present on the email |
has_attachments | 0 if it does not have any attachments or 1 if it does |
attachment_types | The types of the attachments present on the email |
language | The language (e.g. "en") of the email text |
label | 0 for benign or 1 for phishing |
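To illustrate how the sender and sender_domain columns of Table 2 might be derived, the sketch below hashes the address and keeps the domain in clear text. The choice of SHA-256 and the optional salt are assumptions made for this example; the hashing scheme actually used for the corpus is not stated here.

```python
import hashlib
from email.utils import parseaddr


def sender_fields(from_header: str, salt: bytes = b"") -> tuple[str, str]:
    """Return (hashed address, domain) for a From header value.

    SHA-256 and the salt parameter are illustrative assumptions.
    """
    address = parseaddr(from_header)[1].lower()
    domain = address.rpartition("@")[2] if "@" in address else ""
    digest = hashlib.sha256(salt + address.encode("utf-8")).hexdigest()
    return digest, domain
```

The same function could be applied to the receiver header to obtain the receiver and receiver_domain columns.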
With the features properly defined and the unified dataset comprising 220932 email samples, the subsequent and crucial step to ensure the quality and reliability of our analysis is data cleaning. This process is fundamental, especially when combining distinct phishing datasets from various sources, as the heterogeneous nature of emails can introduce noise, which would negatively impact the development of robust models. Email messages come in diverse formats, character encodings, and content types. So, to preserve the original content, our pipeline is designed to transform each raw email message into a consistent form while maintaining privacy safeguards.
We begin by interpreting the header fields (sender, receiver, date, and subject) by decoding various character encodings into legible text. This ensures that international characters (e.g. Chinese, Arabic, Greek, and Cyrillic) and special symbols are accurately rendered, preserving the nuances of the original message metadata.
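Header decoding of this kind can be sketched with Python's standard `email.header` module, which handles RFC 2047 encoded words. The helper name is an assumption, and a production pipeline would likely add a fallback for unknown or mislabelled charsets.

```python
from email.header import decode_header, make_header


def decode_header_value(raw_value: str) -> str:
    """Decode RFC 2047 encoded words (Base64 or Q encoding) into text."""
    return str(make_header(decode_header(raw_value)))
```

For example, a subject transmitted as `=?iso-8859-1?q?Caf=E9?=` decodes to "Café".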
Emails often consist of multiple parts, each with its own content type and body, as already noted. These parts may be encoded using formats such as Base64 or quoted-printable, requiring individual handling. During preprocessing, any content transfer encoding is automatically detected and decoded. HTML markup is then stripped to isolate the underlying textual content, resulting in a clean and continuous stream of words suitable for further analysis.
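A minimal sketch of this step, assuming Python's standard `email` and `html.parser` modules, is given below. The function names are illustrative. With `policy.default`, `get_content()` undoes the content transfer encoding and charset of each text part; parts with unknown charsets would need extra error handling that is omitted here.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Keeps text nodes and drops the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def extract_body_text(raw_email: bytes) -> str:
    """Decode every text part of a message and strip HTML markup."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        content_type = part.get_content_type()
        if content_type == "text/plain":
            texts.append(" ".join(part.get_content().split()))
        elif content_type == "text/html":
            texts.append(html_to_text(part.get_content()))
    return "\n".join(text for text in texts if text)
```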
Many email threads include quoted replies and forwarded content. Our pipeline locates common reply separators and removes duplicate segments, isolating unique elements. By focusing on individual content, we reduce noise and highlight the elements most relevant to phishing behaviour. Beyond text, emails often carry irrelevant decorations such as horizontal lines, separators, and special characters introduced by various clients. We systematically remove these artefacts. Additionally, we normalise look‐alike characters, such as Unicode homoglyphs, to their standard ASCII counterparts. This normalisation streamlines future text processing and removes elements that could be used to trick detection systems.
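The clean-up steps described above can be approximated as follows. The reply-separator patterns, the decoration pattern, and the homoglyph table are small hypothetical examples, not the rules used to build the corpus. Cutting everything after the first reply separator is a simplification of the duplicate-segment removal described in the text.

```python
import re
import unicodedata

# Hypothetical, tiny homoglyph table (Cyrillic and Greek look-alikes to ASCII).
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic e
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0456": "i",  # Cyrillic i
    "\u03bf": "o",  # Greek omicron
})

# Lines made only of separator characters, e.g. "=====" or "-----".
DECORATION = re.compile(r"^[ \t\-=_*#~]{3,}$", re.MULTILINE)

# Two common reply separators, chosen for illustration.
REPLY_SEPARATOR = re.compile(
    r"^(-----Original Message-----|On .+ wrote:)\s*$", re.MULTILINE
)


def normalise_body(text: str) -> str:
    """Normalise look-alike characters and drop quoted replies and decorations."""
    # NFKC folds compatibility forms such as full-width letters.
    text = unicodedata.normalize("NFKC", text).translate(HOMOGLYPHS)
    match = REPLY_SEPARATOR.search(text)
    if match:
        text = text[: match.start()]
    text = DECORATION.sub("", text)
    return " ".join(text.split())
```

One design caveat: blindly mapping Cyrillic or Greek letters to ASCII would damage genuine text in those scripts, so a real pipeline would probably restrict the mapping to mixed-script words.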
The preprocessing of the email body also includes the removal or masking of sensitive information. This includes email addresses, URLs, file paths, and phone numbers, which are replaced with standardised placeholder tags. This approach conceals actual user data while preserving the structural signals (e.g. the presence of a link) that are essential for phishing detection models. PGP signatures and emojis were also replaced with tags, solely to preserve structural signals. In addition to the text, we also gather metadata on attachments and hyperlinks. Each attachment’s type is recorded, and all links are extracted and analysed for structural complexity. Finally, recognising that email content may span multiple languages or include code snippets, we segment the cleaned text and apply language detection to each part. These language labels can guide language-specific processing and enrich the feature set used by analytical models.
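The masking step can be illustrated with a few regular expressions. The placeholder tag names and the patterns below are assumptions made for this sketch; file paths, PGP signatures, and emojis are left out for brevity, and the patterns are intentionally simple rather than exhaustive.

```python
import re

# (placeholder tag, pattern) pairs; tag names are hypothetical.
MASKING_PATTERNS = [
    ("<URL>", re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)),
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<PHONE>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def mask_sensitive(text: str) -> str:
    """Replace URLs, email addresses and phone numbers with placeholder tags."""
    # URLs are masked first so that addresses inside links are not split.
    for tag, pattern in MASKING_PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

The placeholder keeps the structural signal (a link or an address was present) while discarding the actual value.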
After applying the processing pipeline to the combined dataset, some duplicate and empty samples were identified and removed. The resulting dataset, MeAJOR Corpus, comprised a total of 135894 emails.
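A simple way to drop empty and duplicate samples is sketched below. The duplicate criterion used here (identical subject and whitespace-normalised body) is an assumption for illustration, as the exact criterion is not specified in the text.

```python
import hashlib


def drop_empty_and_duplicates(samples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each non-empty (subject, body) pair."""
    seen = set()
    kept = []
    for sample in samples:
        subject = sample.get("subject", "").strip()
        body = " ".join(sample.get("body", "").split())
        if not subject and not body:
            continue  # empty sample
        key = hashlib.sha256((subject + "\x00" + body).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of an earlier sample
        seen.add(key)
        kept.append(sample)
    return kept
```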
To analyse the performance of the resulting dataset, we used ML models to assess how the selected features impact classification accuracy. The models RF, XGB, Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN) were selected based on their frequent use and good performance in the reviewed literature. Each model was trained and tested using four different feature combinations to investigate the contribution of different types of information to the overall performance.
The first feature set, referred to as the Text Only Set, included only the textual data extracted from the subject and body features of the final dataset. This baseline set provided a foundation to evaluate model performance relying solely on raw text, without incorporating any additional contextual information. Given that phishing emails often exhibit distinctive linguistic patterns, focusing on textual content allowed us to establish an essential reference point for detection effectiveness. The second set, referred to as the Text + URL Set, combined the Text Only Set with URL-based features, namely the total number of URLs, the maximum and average URL length, and the number of subdomains. The third set, called the Text + Attachment Set, appended attachment-related features to the Text Only Set, namely the total number of attachments, a binary flag indicating their presence, and the top-level MIME types observed. Finally, the Text + URL + Attachment Set merges all previous features, offering a holistic representation of each email. This last set was designed to reflect a real-world detection scenario in which multiple phishing indicators may coexist and interact, potentially improving the model’s ability to flag deceptive messages.
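The way the four feature sets build on one another can be sketched as a simple concatenation. The function names, the feature-set identifiers, and the ordering of values are assumptions for this example. The MIME-type features of the attachment sets are omitted because they would need a categorical encoding, and the subdomain count is naive (labels beyond the last two, which also miscounts IP-literal hosts).

```python
from urllib.parse import urlsplit


def url_aggregates(urls: list[str]) -> list[float]:
    """Return [count, max length, avg length, max subdomains, avg subdomains]."""
    if not urls:
        return [0.0] * 5
    lengths = [len(url) for url in urls]
    # Naive assumption: every host label beyond the last two is a subdomain.
    subdomains = [
        max(len((urlsplit(url).hostname or "").split(".")) - 2, 0) for url in urls
    ]
    return [
        float(len(urls)),
        float(max(lengths)),
        sum(lengths) / len(lengths),
        float(max(subdomains)),
        sum(subdomains) / len(subdomains),
    ]


def build_feature_vector(text_vector, urls, attachment_count, feature_set="text"):
    """Concatenate feature groups; feature_set is e.g. 'text+url+att'."""
    vector = list(text_vector)
    if "url" in feature_set:
        vector += url_aggregates(urls)
    if "att" in feature_set:
        vector += [float(attachment_count), float(attachment_count > 0)]
    return vector
```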
All the textual features present on all the sets underwent NLP vectorisation. Specifically, we chose the word embedding technique FastText because it incorporates subword (character n-gram) information, allowing rare or unseen words to automatically get embeddings and making morphologically related words closer in vector space. Before applying this technique, the text was lowercased, tokenised, and trimmed of words with little semantic meaning (stopwords and punctuation marks). Once the email content was standardised and tokenised, a numerical vector was generated for each token using a FastText model pre-trained using large amounts of textual data. Since each email contains several tokens, and each token has a corresponding generated vector, we then average all vectors of an email, resulting in a single vector per email content and subject.
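The averaging step can be illustrated without loading a real model. In the sketch below, `lookup` is a hypothetical callable standing in for the word-vector lookup of a pre-trained FastText model, and the stopword list is a short illustrative one; neither reflects the exact resources used in the experiments.

```python
import re

# Hypothetical, very short stopword list for illustration.
STOPWORDS = {"the", "a", "an", "to", "your", "is", "and", "of"}


def tokenise(text: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, drop stopwords and punctuation."""
    return [tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if tok not in STOPWORDS]


def average_embedding(tokens: list[str], lookup, dim: int) -> list[float]:
    """Average the per-token vectors into a single vector of size dim.

    lookup(token) must return a sequence of dim floats; with FastText this
    would be the model's word-vector lookup, which also covers unseen words
    through subword information.
    """
    vectors = [lookup(token) for token in tokens]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
```

Applied separately to the subject and the body, this yields one fixed-size vector for each, regardless of how many tokens the email contains.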
Model performance was evaluated using Accuracy, Precision, Recall, and F1-score (F1). To ensure fair and optimised comparisons across all models, hyperparameter tuning was conducted for each one using a 5-fold cross-validation procedure. After tuning, the best hyperparameters were selected and the final models were trained and evaluated using an 80/20 holdout validation split, ensuring consistent evaluation across all feature sets. The performance across all models is summarised in Table 3.
Model | Feature Set | Acc | Pre | Recall | F1 |
---|---|---|---|---|---|
RF | Text | 96.02 | 97.19 | 94.64 | 95.90 |
RF | Text+URL | 96.73 | 97.76 | 95.53 | 96.63 |
RF | Text+Att | 95.86 | 96.99 | 94.52 | 95.74 |
RF | Text+URL+Att | 96.70 | 97.70 | 95.55 | 96.61 |
XGB | Text | 97.93 | 97.96 | 97.82 | 97.89 |
XGB | Text+URL | 98.37 | 98.51 | 98.17 | 98.34 |
XGB | Text+Att | 97.85 | 97.87 | 97.75 | 97.81 |
XGB | Text+URL+Att | 98.35 | 98.36 | 98.28 | 98.32 |
MLP | Text | 97.61 | 97.43 | 97.72 | 97.57 |
MLP | Text+URL | 96.70 | 97.98 | 95.25 | 96.59 |
MLP | Text+Att | 97.73 | 97.77 | 97.61 | 97.69 |
MLP | Text+URL+Att | 97.22 | 97.32 | 97.01 | 97.17 |
CNN | Text | 97.49 | 98.02 | 96.85 | 97.43 |
CNN | Text+URL | 97.69 | 97.42 | 97.90 | 97.66 |
CNN | Text+Att | 97.59 | 97.28 | 97.83 | 97.55 |
CNN | Text+URL+Att | 97.65 | 97.28 | 97.96 | 97.62 |
Note: Feature set abbreviations: Text = Text Only Set, Text+URL = Text + URL Set, Text+Att = Text + Attachment Set, Text+URL+Att = Text + URL + Attachment Set
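As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall, which the short sketch below computes. The function name is an illustrative assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, the XGB Text+URL row (precision 98.51, recall 98.17) gives an F1 of about 98.34, in agreement with Table 3.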
The RF model improved when URL features were included alongside textual data, with its F1 rising from 95.90% to 96.63%. This corroborates findings by Naswir et al. [49] regarding the value of URL features, though our implementation uses a more limited feature set than their hybrid approaches. In contrast, attachment features showed a neutral or slightly negative impact on the overall performance, consistent with Zhang et al. [56], who observed minimal contribution from attachments in their large-scale study. Comparing the Text + Attachment and Text + URL + Attachment feature sets with their counterparts without the attachment features confirms that URL information is particularly valuable for phishing detection when using RF, while attachment information has little to no impact.
XGB consistently delivered strong results across all feature combinations, achieving 98.34% F1 with the Text + URL Set. This exceeds Omotehinwa et al. [48] (98.16% F1) and approaches ensemble methods like Chanis and Arampatzis [50] (98.43% F1), despite our simpler feature engineering. The robustness of the Text Only Set (97.89% F1) supports Khalid et al. [11] on text-centric efficacy, though our URL augmentation shows greater gains than their stylometric features. Attachment features again demonstrated neutral-to-negative impacts.
The MLP also exhibited strong results across all feature sets, reaching its best performance at 97.69% F1 with the Text + Attachment Set, validating Altwaijry et al. [12] on neural networks’ text proficiency. Notably, adding URL features lowered F1 by roughly one percentage point (from 97.57% with the Text Only Set to 96.59% with the Text + URL Set), contrasting sharply with ensemble models and highlighting MLP’s ineffectiveness with structured features (a limitation not observed in hybrid architectures like the one from Atawneh and Aljehani [51]).
Finally, the CNN model, evaluated on the same feature sets, achieved its best result when combining textual and URL data, reaching 97.66% F1 with the Text + URL Set. This demonstrated superior structural feature handling versus MLP, though falling short of Kaushik and Rathore’s work [52] (99% F1), where a specialised URL-to-image transformation is used to capture intricate visual patterns within URLs. The CNN model also maintained high precision and recall across configurations, showing its ability to generalise well even on diverse feature inputs. Notably, adding attachment features showed slightly more positive impacts when compared to the ensemble models, suggesting CNN’s adaptability to heterogeneous inputs aligns with Zhang et al. [55] on multi-modal processing.
The results obtained from the baseline experiments are crucial for validating the effectiveness, representativeness, and utility of the newly generated dataset presented in this study. Our results confirm the established trend, highlighted by Alhuzali et al. [7] and Chanis and Arampatzis [50], that ensemble models such as XGB and RF consistently excel with structured email features. XGB emerged as the top performer in our evaluation, achieving a peak F1 of 98.34% using the combined Text + URL Set. This performance is highly competitive with recent state-of-the-art results, as it surpasses the 98.16% F1 reported by Omotehinwa et al. [48] using XGB on the Enron-Spam corpus and approaches the 98.43% F1 achieved by Chanis and Arampatzis [50] using a sophisticated stacked ensemble incorporating stylometric features. Importantly, our result was obtained with comparatively simpler feature engineering, underscoring the quality and richness of the underlying dataset. The strong performance of the Text Only Set (97.89% F1) aligns with findings by Khalid et al. [11] on the efficacy of text-centric approaches. However, our integration of URL features yielded more substantial gains than the stylometric augmentations reported in their LogiTriBlend model.
The DL models, CNN and MLP, also demonstrated robust performance, particularly with enriched feature sets, supporting findings by Atawneh and Aljehani [51] and Kaushik and Rathore [52] on the applicability of DL to textual features. However, our results reveal a nuance: the classical MLP architecture matched or even slightly surpassed CNN performance in some configurations, especially when incorporating structured features like attachments. This contrasts with some state-of-the-art hybrid DL approaches [51], [52], but highlights a potential advantage of simpler neural architectures when robust feature engineering provides discriminative inputs rather than relying solely on end-to-end representation learning. Our CNN model’s peak F1 of 97.66% with the Text + URL Set falls short of the 99% F1 reported by Kaushik and Rathore [52], who leveraged specialised URL-to-image transformations. This gap suggests potential for future work incorporating similar advanced feature representations using our dataset.
The significant performance boost observed when adding URL features aligns strongly with the current state of the art, reinforcing findings by Khalid et al. [11], Naswir et al. [49], and Zhang et al. [55] on the critical importance of lexical and structural URL characteristics for identifying phishing attempts. While attachment features showed a more neutral or context-dependent impact in our experiments (consistent with the minimal contribution noted by Zhang et al. [56]), their inclusion reflects the multi-modal learning trend emphasised in cutting-edge research [41], [56]. Our dataset’s design inherently supports this multi-modal approach by providing diverse feature categories.
This paper presents MeAJOR Corpus, a novel multi-source phishing email dataset designed to overcome critical limitations in existing resources and advance the development of robust ML-based detection systems. By strategically integrating and curating five open-source corpora, we constructed a dataset of 135894 labelled emails. This integration significantly enhances sample diversity, volume, and representativeness, capturing a broader spectrum of phishing tactics and legitimate email structures than isolated or smaller datasets allow.
Our extensive feature engineering constitutes a core strength of the dataset, setting it apart from existing resources. Firstly, it offers great flexibility by providing both broad, foundational features (like raw body text and URL lists) for downstream custom feature extraction and specific, ready-to-use engineered features (like URL counts and attachment flags). Secondly, it integrates signals across technical, structural, and linguistic dimensions, enabling holistic analysis that surpasses the narrow focus of many existing corpora. Thirdly, the consistent preprocessing and anonymisation pipeline ensures feature quality and privacy compliance, addressing inconsistencies found in prior combined datasets.
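To illustrate the distinction between foundational and engineered features, the sketch below derives a URL list, a URL count, and an attachment flag from a raw email record. It is an assumption-laden illustration, not the corpus pipeline: the `RawEmail` type, the `engineer_features` function, the field names, and the simple URL pattern are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple URL matcher (an assumption); a real pipeline would
# likely handle trailing punctuation, obfuscation, and other schemes.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


@dataclass
class RawEmail:
    """Hypothetical record holding foundational fields only."""
    body: str
    attachment_names: list = field(default_factory=list)


def engineer_features(email):
    """Derive ready-to-use features from the foundational fields."""
    urls = URL_PATTERN.findall(email.body)
    return {
        "urls": urls,                                    # foundational: URL list
        "url_count": len(urls),                          # engineered: URL count
        "has_attachment": bool(email.attachment_names),  # engineered: flag
        "attachment_count": len(email.attachment_names),
    }


sample = RawEmail(
    body="Verify at http://example.test/login or https://example.test/reset now",
    attachment_names=["invoice.pdf"],
)
sample_features = engineer_features(sample)
```

Keeping the raw URL list next to the derived count lets downstream users either consume the engineered value directly or compute their own lexical URL features from the same record.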
The utility of the MeAJOR Corpus and its feature engineering was validated through experiments with four classification models (RF, XGB, MLP, CNN) across multiple feature configurations. XGB achieved a peak F1 of 98.34% using the Text + URL Set, closely approaching state-of-the-art benchmarks while using significantly simpler feature engineering than comparable studies. This high performance, replicated across models, confirms the dataset’s capacity to support accurate and reliable phishing detection. The substantial performance boost observed when incorporating URL features underscores the critical value of the structural URL information captured in our feature set. Furthermore, the strong results obtained with the Text Only Set (XGB: 97.89% F1) affirm the quality of the linguistic data preserved through our preprocessing. While attachment features showed a more context-dependent impact, their inclusion reflects the dataset’s inherent support for multi-modal learning, a key trend in advanced detection research.
Future work can leverage the MeAJOR Corpus’ diverse features to explore advanced techniques like LLMs, transformer architectures, multi-modal fusion, or synthetic data generation, driving further innovation in safeguarding users against evolving email threats. Exploration of additional data modalities and feature engineering techniques also presents promising avenues.
This work has been supported by the PC2phish project, which has received funding from FCT with Refª: 2024.07648.IACDC. Furthermore, this work also received funding from the project UIDB/00760/2020.