June 03, 2025
This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision’s expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.
The rise of deep learning has profoundly transformed how we approach complex classification and regression problems. Given its data-hungry nature, the prevailing emphasis has been on acquiring large volumes of training data to boost model performance. Meanwhile, the implementation of deep learning models has become increasingly trivialized, thanks to the proliferation of high-level APIs and libraries, especially in Python, that abstract much of the underlying complexity. Today, even large language models can generate working training pipelines on request, lowering the barrier to entry even further.
However, this convenience comes at a cost. The focus has shifted so heavily toward automation and performance metrics that critical aspects of the modeling pipeline are often overlooked or poorly documented. Researchers frequently delegate key steps to automated tools, sometimes without fully understanding or reporting them. In doing so, we risk putting the cart before the horse—prioritizing model tuning over foundational concerns such as data quality and preparation.
The initial steps that precede model training are often loosely grouped under the term “preprocessing,” yet they receive scant attention in many studies. It is not uncommon for papers to devote pages to re-explaining ubiquitous evaluation metrics like accuracy, precision, recall, or confusion matrices, while providing vague or incomplete details on more foundational questions such as:
How were categorical, ordinal, or fuzzy variables handled?
What strategy was used for data splitting (e.g., random, stratified, time-based)?
When and how was normalization or standardization applied, and to which subsets?
Were oversampling or undersampling techniques limited to the training set, or did they inadvertently affect the test data?
Was feature selection or dimensionality reduction performed before or after data splitting?
This vagueness often extends to the methodological core. Sleek pipeline diagrams are commonly included but tend to omit essential details. For instance, one might encounter the use of 2D Convolutional Neural Networks (CNNs) applied to tabular data without any justification. Was there a compelling reason for employing a spatial model on non-spatial input? If so, how was the data reshaped to accommodate this architecture? Such decisions are non-trivial and require clear, transparent reporting.
These questions are far from minor. They directly affect the reproducibility, interpretability, and trustworthiness of machine learning models. When unaddressed, they may introduce silent data leakage, bias, or misleading performance metrics; ultimately undermining the credibility of the research. Thorough documentation and transparency in preprocessing are not optional; they are essential to rigorous, responsible data science.
In this work, we undertake a critical review of existing literature in light of the concerns previously discussed, using credit card fraud detection as a case study, specifically focusing on a widely referenced benchmark dataset [1]. This domain poses distinctive challenges, with extreme class imbalance being a primary issue. In such contexts, seemingly high accuracy scores are not uncommon, yet they can be misleading. Methodological oversights or missteps, whether inadvertent or otherwise, can easily lead to inflated performance metrics and a skewed perception of model efficacy.
One recurring issue we have observed is the mishandling of resampling techniques - particularly when oversampling or undersampling is performed before the train-test split - leading to data leakage. To underscore this point, we intentionally apply a simple yet flawed MLP-based approach. Despite its simplicity, the model yields impressively high metrics, thereby demonstrating how superficial performance gains can mask deeper methodological flaws.
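To make the distinction concrete, the following minimal sketch shows a leakage-free ordering using the standard scikit-learn and imbalanced-learn APIs: the split happens first, and the scaler and SMOTE are fitted on the training fold only. Here X and y are placeholders for the feature matrix and labels, assumed already loaded.

```python
# Leakage-free ordering: split first, then fit the scaler and SMOTE on the
# training fold only (illustrative sketch, not the pipeline of any cited paper).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)            # statistics from training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_s, y_train)
# The test set keeps its original, imbalanced class distribution.
```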
The rest of this paper is organized as follows: Section [sec_background] provides background on credit card fraud detection and dataset characteristics. Section 3 reviews the literature on deep learning-based fraud detection. Section 4 describes our methodology, followed by experimental results in Section [sec_results]. Finally, Section 5 concludes the paper.
The widespread use of credit and debit cards has greatly enhanced financial convenience, but it has also led to increased fraudulent activity. Payment card fraud typically involves unauthorized transactions intended to obtain goods, services, or cash. While fraud represents a tiny fraction of all transactions, the absolute financial impact is substantial [2], making it a significant area of concern for financial institutions.
We use the popular European credit card fraud dataset [1] for benchmarking. It contains \(284{,}807\) transactions made by European cardholders over two days in September 2013, out of which only \(492\) are fraudulent, just \(0.17\%\), highlighting extreme class imbalance.
The dataset comprises \(30\) features: \(28\) anonymized principal components (V1–V28) derived via PCA, along with Time and Amount. The target variable Class is binary, with \(1\) indicating fraud and \(0\) denoting legitimate transactions. Due to privacy concerns, the original feature labels and semantics were withheld.
Figure 1: Snapshot of the first five rows of the dataset.
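For reference, the imbalance can be verified in a few lines of Python, assuming the benchmark file has been downloaded locally as creditcard.csv:

```python
# Loading the benchmark dataset and confirming the class imbalance.
import pandas as pd

df = pd.read_csv("creditcard.csv")
print(df.shape)                      # (284807, 31): V1-V28, Time, Amount, Class
print(df["Class"].value_counts())    # 0: 284315 legitimate, 1: 492 fraudulent
print(df["Class"].mean() * 100)      # ~0.17% fraud rate
```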
In fraud detection, the misclassification of minority class instances (fraudulent cases) carries significant consequences, as failing to detect fraud can lead to financial losses and security risks. Traditional classifiers often bias toward the majority class due to the skewed distribution of data, leading to poor performance on the minority class. Resampling techniques address this issue by adjusting the class distribution, either by oversampling the minority class, undersampling the majority class, or combining both approaches to improve model performance.
Oversampling techniques aim to increase the representation of the minority class by either duplicating existing samples or generating synthetic samples. These methods help the model learn patterns from the minority class more effectively.
Random Oversampling: Duplicates minority samples without introducing new information. This method is simple but can lead to overfitting due to repeated samples.
SMOTE (Synthetic Minority Oversampling Technique) [3]: Generates synthetic samples between nearest neighbors of minority instances. This method helps to create a more balanced dataset and can improve model performance.
ADASYN (Adaptive Synthetic Sampling) [4]: Focuses on generating synthetic samples for harder-to-classify minority samples through adaptive weighting. This method is particularly useful for improving the classification of difficult minority instances.
Borderline-SMOTE [5]: Creates synthetic samples near class boundaries for better discrimination. This method helps to improve the classification of minority instances near the decision boundary.
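All of the oversamplers above are available in the imbalanced-learn library; a minimal sketch, assuming X_train and y_train hold the already-split training data:

```python
# Illustrative use of the oversampling methods discussed above.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE

samplers = {
    "Random Oversampling": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(k_neighbors=5, random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))   # both classes now (roughly) equal in size
```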
Undersampling techniques aim to reduce the number of majority class samples to balance the class distribution. These methods help to reduce the computational complexity and improve the model’s focus on the minority class.
Random Undersampling (RUS): Randomly removes majority class samples. This method is simple but can lead to loss of important information.
NearMiss [6]: Selects majority samples based on proximity to minority instances. This method helps to retain informative majority samples.
Tomek Links [7]: Removes borderline samples to clarify decision boundaries. This method helps to improve the classification of minority instances by removing ambiguous majority samples.
Cluster Centroids [8]: Applies K-means clustering to condense the majority class. This method helps to reduce the number of majority samples while retaining the overall distribution.
Hybrid methods combine oversampling and undersampling techniques to balance the class distribution and improve model performance. These methods aim to leverage the strengths of both approaches.
SMOTE-Tomek, SMOTE-ENN [9]: Combine oversampling with data cleaning for improved balance. These methods help to generate synthetic minority samples and remove ambiguous majority samples.
SMOTEBoost [10]: Integrates SMOTE with boosting to enhance weak classifiers. This method helps to improve the performance of weak classifiers by generating synthetic minority samples.
SMOTE-SVM [11]: Uses SVM to guide synthetic sample generation. This method helps to generate synthetic minority samples based on the decision boundary of an SVM classifier.
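A corresponding sketch for the undersampling and hybrid methods that are implemented in imbalanced-learn (boosting-based variants such as SMOTEBoost are not part of the library and are omitted), again applied only to pre-split training data:

```python
# Illustrative use of the undersampling and hybrid methods discussed above.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks, ClusterCentroids
from imblearn.combine import SMOTETomek, SMOTEENN

samplers = {
    "RUS": RandomUnderSampler(random_state=0),
    "NearMiss": NearMiss(version=1),
    "Tomek Links": TomekLinks(),              # only removes ambiguous majority samples
    "Cluster Centroids": ClusterCentroids(random_state=0),
    "SMOTE-Tomek": SMOTETomek(random_state=0),
    "SMOTE-ENN": SMOTEENN(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))
```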
Due to privacy-preserving transformations like PCA, meaningful original features are not available. While PCA protects sensitive data, it also results in components with uneven explanatory power, where only the top few capture substantial variance. Applying deep learning models to such transformed, sparse information may be excessive.
The literature often favors deep or complex models (CNNs, RNNs, GANs), yet we argue that focus should instead be on feature (re)engineering. By leveraging PCA components, their polynomial and pairwise combinations, and applying SMOTE for balancing, even simple models such as shallow MLPs can achieve competitive performance.
In fraud detection, minimizing false negatives (missed fraud cases) is vital. Hence, recall is a more appropriate metric than accuracy. SMOTE remains a strong choice for mitigating imbalance, and its integration with meaningful features ensures that even basic models can remain effective and interpretable.
In evaluating classification models, the confusion matrix provides essential metrics such as accuracy, precision, recall (sensitivity), specificity, and the F1-score. Four key terms can be derived from a confusion matrix; keeping the credit card fraud detection use case in perspective, they are defined as:
True Positives (TP): Correctly predicted positive instances (e.g., fraudulent transactions correctly identified as fraud).
True Negatives (TN): Correctly predicted negative instances (e.g., legitimate transactions correctly identified as legitimate).
False Positives (FP): Incorrectly predicted positive instances (e.g., legitimate transactions flagged as fraud; Type I error).
False Negatives (FN): Incorrectly predicted negative instances (e.g., fraudulent transactions missed by the model; Type II error).
The key metrics and their definitions are as follows: \[\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP+TN+FP+FN}}\] \[\text{Precision} = \frac{\text{TP}}{\text{TP+FP}}\] \[\text{Recall or Sensitivity} = \frac{\text{TP}}{\text{TP+FN}}\] \[\text{Specificity} = \frac{\text{TN}}{\text{TN+FP}}\] \[F_1 \text{ Score}= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
In credit card fraud detection [12], where the cost of missing fraudulent transactions (FN) is significantly higher than false alarms (FP), recall is particularly critical to minimize undetected fraud. However, precision must also be balanced to avoid overwhelming analysts with false positives. The F1-score, which harmonizes precision and recall, is thus a key metric for assessing model performance in such imbalanced scenarios.
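As a concrete example, these quantities can be computed directly with scikit-learn; y_test and y_pred below are placeholders for the true and predicted labels:

```python
# Confusion-matrix-based metrics as defined above, computed with scikit-learn.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp)

print("Accuracy   :", accuracy_score(y_test, y_pred))
print("Precision  :", precision_score(y_test, y_pred))
print("Recall     :", recall_score(y_test, y_pred))
print("F1-score   :", f1_score(y_test, y_pred))
print("Specificity:", specificity)
```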
The Precision-Recall Curve (PRC) is a crucial tool for evaluating the performance of classification models, especially in imbalanced datasets where fraudulent transactions are rare. Unlike the ROC curve, which can be overly optimistic in imbalanced scenarios, the PRC provides a clearer view of the trade-off between precision and recall - two metrics that are directly relevant to fraud detection. The PRC helps in selecting an optimal threshold for the model, balancing the need to catch as many fraudulent transactions as possible (high recall) while keeping false positives manageable (high precision). This balance is critical in fraud detection, where the cost of missing fraud (false negatives) is significantly higher than the cost of false alarms (false positives).
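A minimal sketch of threshold selection from the PRC, assuming y_score holds the model's predicted fraud probabilities for the test set:

```python
# Threshold selection from the precision-recall curve (illustrative policy).
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print("Average precision (area under the PRC):",
      average_precision_score(y_test, y_score))

# Example policy: the highest threshold that still keeps recall >= 0.90.
ok = np.where(recall[:-1] >= 0.90)[0]   # thresholds aligns with precision[:-1]/recall[:-1]
idx = ok[-1]
print("threshold:", thresholds[idx],
      "precision:", precision[idx], "recall:", recall[idx])
```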
While specificity is important for maintaining customer trust and operational efficiency, it is often secondary to recall and the \(F_1\) score due to the high cost of missing fraudulent transactions. High specificity ensures that legitimate transactions are not unnecessarily flagged as fraudulent, but the primary focus remains on balancing recall and precision to minimize undetected fraud.
Credit card fraud detection has been extensively studied using both traditional and modern machine learning approaches, with comprehensive reviews available in recent works such as [13]–[15]. While this task might appear straightforward in theory, the field has seen an overapplication of complex methods that may be unnecessarily heavy for payment card fraud detection. This trend has even led to questions about the effectiveness of certain approaches, particularly Multilayer Perceptrons (MLPs), in capturing the temporal dependencies and sequential patterns that are crucial for identifying fraudulent activities [16]. In our view, the focus should shift from immediately applying sophisticated techniques to ensuring the correctness and efficiency of fundamental preprocessing tasks. To this end, we critically examine existing literature, identifying and analyzing potential methodological flaws in a substantial sample of previous works.
No. | Method/Approach | Flaws Identified | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|
1 | SMOTE + ANN [17] | Suspected data leak: dataset balancing before the split; inconsistent results within the article | 0.99 | 0.93 | 0.88 | 0.91 |
2 | UMAP + SMOTE + LSTM (dimensionality reduction via UMAP) [18] | SMOTE applied to the dataset before splitting; dimensionality reduction on data that is already PCA-transformed | 0.967 | 0.988 | 0.919 | 0.952 |
3 | RUS + NMS + SMOTE + DCNN [19] | Vague on the specific order of under- and oversampling and on the final sample count; precision and recall well below 40%, worse than a random classifier in some cases; no explanation for using 1D CNNs with \(3\times 3\) kernels | 0.972 | 0.368 | 0.392 | 0.378 |
4 | SMOTE-ENN + boosted LSTM [20] | | - | - | 0.996 (specificity: 0.998) | - |
5 | SMOTE-Tomek + Bi-GRU [21] | SMOTE-Tomek before the train/test split; BN before activation; AUC too high relative to the reported metrics; readability issues | 0.972 | 0.959 | 0.978 | 0.968 |
6 | Borderline SMOTE + LSTM [22] | Improper data splitting (validation set extracted pre-oversampling) and excessive majority-class oversampling; misleading terminology (e.g., MLP vs. ANN) and undefined model architectures | 0.999 | 0.803 | 0.921 | 0.858 |
7 | SMOTE-Tomek + BPNN (3 hidden layers: 28+28+dropout+28) [23] | Oversamples the entire dataset before splitting; ambiguities in the post-balancing sample count and test size (25% or 30%?); inconsistent metrics: AUC = 1 and AUPR = 0.99 incompatible with F1 = 0.92 | - | 0.855 | 1.00 | 0.922 |
8 | CAE + SMOTE [24] | Claims inverse_transform can reconstruct original data from PCA components alone (it requires the original data and PCA model); no holdout test set, relying solely on CV for final evaluation; SMOTE applied globally (not per-fold), risking data leakage; multiple test evaluations may inflate performance | - | 0.920 | 0.890 | 0.905 |
9 | DAE + SMOTE + DNN (4 hidden layers: 22+15+10+5, 2 output neurons) [25] | Bizarre DNN architecture: 4 hidden layers with aggressive shrinkage (22+15+10+5); output layer has 2 neurons for binary classification (should be 1 + sigmoid); unclear whether normalization/standardization is applied pre-split (leakage risk); high recall at low thresholds (overfitting to SMOTE/AE) plus a sharp recall drop at threshold 0.8 (poor probability calibration) | 0.979 | - | 0.84 | - |
10 | SMOTE ahead of various ML methods; best performance by RF, followed by an MLP with 4 hidden layers (50+30+30+50 neurons) [26] | Vague methodology: lacks details on SMOTE application and feature exclusion; likely SMOTE before the split (data-leakage risk); no justification for excluding 5% of features (critical for MLP performance); inconsistent performance: RF (F1 = 0.964) vs. MLP (F1 = 0.792) suggests overfitting | 0.999 | 0.964 (RF), 0.792 (MLP) | 0.816 | 0.884 (RF), 0.804 (MLP) |
11 | CNN (Conv1D + Flatten + Dropout) [27] | Naive architecture: unnecessary flatten after Conv1D and excessive dropout; poor performance (\(\approx 93\%\) P/R/A) vs. RF (0.99 F1); vague oversampling details; unclear dataset reduction to 984 samples (fraud class size unspecified) | 0.93 | 0.93 | 0.93 | 0.93 |
12 | CNN (Conv1D: \(32\times 2\), \(64\times 2\) + Dropout + Flatten) [28] | No class balancing; misuse of Conv2D on 1D data (PCA-transformed features); ineffective CNN use on PCA-transformed data (disrupts feature order); forced CNN architecture (unnecessary for tabular data); lower recall (90.24%) despite high precision/accuracy | 0.972 | 0.991 | 0.902 | 0.945 |
13 | a) MLP: n inputs and n neurons in each hidden layer [29] | | 0.999 | 0.999 | 0.999 | 0.999 |
 | b) CNN: unspecified, but appears to use 2D kernels [29] | Invalid CNN use on PCA data (no spatial structure); misapplication of 2D kernels to 1D tabular data; lack of evaluation transparency; questionable reliability of results; weak scientific justification; reported metrics (>99.9% for ANN/CNN, 97.3% for LSTM) seem unrealistic | 0.999 | 0.999 | 0.999 | 0.999 |
 | c) LSTM-RNN [29] | | 0.973 | 0.973 | 0.973 | 0.973 |
14 | a) Random Oversampling (RO) + MLP: two dense hidden layers of 65 units each + 50% dropout [30] | | 0.983 | 0.978 | 0.987 | 0.983 |
 | b) RO + CNN: Conv1D(32,2) + dropout(0.2) + BN + Conv1D(64,2) + BN + flatten + dropout(0.2) + dense(64) + dropout(0.4) + dense(1) [30] | CNN architecture may be suboptimal for tabular data; excessive dropout in all models (20-50%); high accuracy without balancing masks poor other metrics; performance varies by balancing technique (SMOTE vs. random oversampling) | 0.992 | 0.996 | 0.987 | 0.992 |
 | c) RO + LSTM: LSTM(50) + dropout(0.5) + dense(65) + dropout(0.5) + dense(1) [30] | | 0.957 | 0.802 | 0.982 | 0.883 |
15 | LSTM-RNN (4×50 units) [31] | Normalization before the split (data-leakage risk); misconceptions about PCA re-application; no class balancing (low recall: 80%); unnecessary implementation details; high accuracy/precision (99.6%) masks poor recall | 0.996 | 0.996 | 0.80 | 0.887 |
16 | Time-Aware Attention RNN [32] | No data balancing (precision: 50.07%); high recall (99.6%) masks poor precision; memory-intensive (performance improves with larger memory); relies on AUC (may obscure class imbalance); unspecified memory units | 0.958 | 0.501 | 0.996 | 0.667 |
17 | SMOTE + AdaBoost (RF/ET/XGB/DT/LR) [33] | SMOTE before the split (data-leakage risk); vague on normalization timing and on stratification in sampling; synthetic data in the dataset may overstate performance | 0.999 | 0.999 | 0.999 | 0.999 |
18 | SMOTE + various classifiers [34] | Overemphasis on SMOTE + LR results; neglect of low precision (<10%) in other methods; publisher's expression of concern; potential reliability issues | 0.970 | 0.999 | 0.970 | 0.984 |
19 | SMOTE-Tomek + RF [35] | Ambiguous resampling order (likely the wrong sequence); SMOTE-Tomek behaves identically to SMOTE (no Tomek effect); under-sampling too aggressive (492 samples); high-recall focus sacrifices precision | 0.99 | 0.92 | 0.94 | 0.93 |
20 | SMOTE + XGBoost [36] | Data leakage (preprocessing before the split); incomplete feature normalization; default classifier parameters; unrealistic perfect recall (100%); no temporal validation | 0.999 | 0.999 | 1.00 | 0.999 |
The study in [36] evaluates four classifiers (LR, LDA, NB, and XGBoost) using their default configurations on SMOTE-balanced data. While the authors report exceptional performance for XGBoost (accuracy: 99.969%, precision: 99.938%, recall: 100%, F1: 99.969%, AUC: 99.969%), these results appear compromised by methodological flaws. Specifically, the preprocessing pipeline suffers from data leakage as both feature scaling/normalization and SMOTE application were performed before the train-test split. Additionally, the study’s normalization approach is questionable, as it only scales the ‘Time’ feature without addressing the temporal nature of transaction data.
The ANN described in [17] has 4 hidden layers, which is still fairly deep for the dataset in question. The data was balanced by applying SMOTE before the train/valid/test split and then fed to the ANN, and the authors report \(93\%\) precision and \(88\%\) recall despite this leaky approach. These results are also inconsistent with those claimed later in the conclusion section.
Another set of methods, described in [18], relies on swarm intelligence for feature selection, an attention mechanism for classifying relevant data items, UMAP for dimensionality reduction, SMOTE for addressing data imbalance, and an LSTM for modeling long-term dependencies in transaction sequences. The claimed results are again suspect because the authors applied SMOTE to the dataset before splitting it.
The authors in [19] propose a 1D Dilated Convolutional Neural Network (DCNN) for credit card fraud detection, combining SMOTE with Random Under-Sampling (RUS) and Near Miss (NM) under-sampling, though the specific order of these techniques is not explicitly stated. While their model achieves a reported accuracy of 97.27% on the European credit card fraud dataset, this high accuracy is misleading given the dataset's extreme class imbalance (fraud prevalence of 0.172%), where a trivial "no fraud" classifier would already achieve 99.8% accuracy. Additionally, the paper's description of 1D CNNs with \(3\times 3\) kernels and batch normalization over "feature maps" suggests a potential misapplication of 2D CNN architectures to flattened or PCA-reduced features.
Another approach [20] balances the data by combining SMOTE with edited nearest neighbors (SMOTE-ENN) and then employs a strong deep learning ensemble that uses a long short-term memory (LSTM) network as the base learner of the adaptive boosting (AdaBoost) technique. The authors claim a recall of \(0.996\) and a specificity of \(0.998\). While everything else is described in detail, the key steps remain vague, e.g., whether balancing and normalization were applied before or after the split. The data balancing algorithm outlined in the article, however, points to balancing of the whole dataset.
Another RNN-based method [21] uses the SMOTE-Tomek technique to address the data imbalance and, after splitting the data, employs a Bidirectional Gated Recurrent Unit (Bi-GRU) for classification. The approach is naive, not only for applying SMOTE-Tomek before the train/test split but also for using BN before the activation function inside the RNN module. The article also has readability issues. The reported metrics (Accuracy = \(97.16\%\), Precision = \(95.98\%\), Recall = \(97.82\%\)) seem inconsistent with the reported AUC of \(99.66\%\), which is implausibly high and should be lower.
The study in [22] exhibits methodological flaws, particularly in data balancing and train/validation/test splits. The authors oversample the majority class but extract the validation set before oversampling, then draw test cases from the oversampled majority class, likely inflating performance metrics. Additionally, their terminology is inconsistent: one model is correctly labeled as an MLP (32-16-8 neurons), while another is misleadingly called an ANN (50 neurons per hidden layer, unspecified depth). Their top-performing model, a computationally heavy LSTM, relies on borderline SMOTE, casting doubt on the robustness of their results.
A three-step deep learning technique [37] balances the data with borderline SMOTE and then uses a stacked autoencoder to extract key information for SoftMax-based classification. The authors demonstrate a commendable AUC score, but the method remains complex.
The authors of [23] combine SMOTE and Tomek Links to preprocess data before training a Back Propagation Neural Network (BPNN) with 3–6 hidden layers (each containing 28 neurons) and dropout regularization, particularly after the second hidden layer. However, the study suffers from ambiguities—such as the number of samples post-balancing and the test set fraction (25% or 30%) - as well as inconsistencies in reported metrics. Additionally, the approach risks data leakage due to balancing the entire dataset before splitting.
The study in [24] compares PCA and a Convolutional Autoencoder (CAE) for feature extraction, followed by SMOTE and a Random Forest classifier, recommending CAE+SMOTE with an F1-score of 90.5%. However, their claim that inverse_transform can reconstruct original transactions from PCA components alone is incorrect, as it requires the original data and the PCA model. Additionally, their evaluation strategy lacks a holdout test set, risks data leakage from improper SMOTE application, and may inflate performance due to repeated evaluation on the same data.
The work in [25] employs a denoising autoencoder (DAE) for data cleaning and SMOTE-based balancing, followed by a deep fully connected neural network (4 hidden layers: 22+15+10+5) with a 2-neuron output layer for classification. Two variants are tested: (1) without SMOTE/autoencoder (AE) and (2) with SMOTE+AE. The baseline model (w/o SMOTE/AE) achieves 97.9% accuracy but near-zero recall, suggesting it fails to detect fraud cases. The SMOTE+AE model improves recall to 90.5% (at the expense of accuracy) but exhibits poor probability calibration (sharp recall drop at high thresholds) and overfitting risks from synthetic data.
The work presented in [26] utilizes SMOTE prior to applying various machine learning techniques, with the best results achieved by a Random Forest classifier followed by an MLP containing four hidden layers (50+30+30+50 neurons). While the Random Forest model attains an F1-score of 0.964, the MLP’s performance is notably inferior (F1=0.792), potentially indicating overfitting or inadequate training. The methodology exhibits significant weaknesses: potential data leakage from performing the train-test split after SMOTE application, unjustified exclusion of 5% of features which might have enhanced the MLP’s performance, and inconsistent results where the performance gap between RF and MLP could indicate either insufficient MLP training or feature mismatch.
The CNN-based approach in [27] appears methodologically flawed due to its naive architecture, which includes an unnecessary flatten layer following Conv1D layers and excessive dropout layers. This overly complex model with redundant components performs worse (\(\approx 93\%\) precision/recall/accuracy) than the other methods the authors evaluate (with RF achieving a 0.99 F1-score and accuracy). The study also lacks clarity regarding the oversampling process and fails to specify how the dataset was reduced to 984 samples, or how many of these were fraud cases.
In [28], without bothering to balance the dataset, the authors use a CNN with two ReLU-based Conv2D layers (kernel sizes \(32 \times 2\) and \(64 \times 2\), respectively), dropout layers, and eventually a flatten layer (\(64 \times 1\)). The use of filters of size \(k \times 2\) may be an attempt to capture relationships between adjacent PCA features, but this should not be particularly useful since the PCA transformation disrupts any natural feature order. The use of CNNs here seems forced at best rather than necessary. The reported results (Accuracy: \(97.15\%\), Precision: \(99.1\%\), Recall: \(90.24\%\)) boast higher performance than DT, logistic regression, kNN, RF and especially XGBoost; the recall is still on the lower side, though.
The study in [29] evaluates three deep learning approaches combined with SMOTE: ANN, CNN, and LSTM RNN. While the results suggest that CNN and ANN outperform LSTM (with reported accuracy/precision/recall >99.9% for ANN/CNN and 97.3% for LSTM), the methodology suffers from several critical flaws. The application of CNN to PCA-transformed data is fundamentally invalid due to the lack of spatial structure in such data. Moreover, the use of 2D kernels on 1D tabular data represents a misapplication of convolutional techniques. The study lacks transparency in evaluation methodology and presents highly questionable results with weak scientific justification for the core methods employed.
The study in [30] evaluates three deep learning architectures (CNN, MLP, and LSTM) independently, though the authors misleadingly refer to this as "federated learning." The models are trained on data balanced using various undersampling and oversampling techniques. The CNN architecture consists of a Conv1D(32, 2) layer with 20% dropout and batch normalization, followed by a Conv1D(64, 2) layer with batch normalization, a flatten layer, 20% dropout, a 64-unit dense layer with 40% dropout, and a final output layer. The MLP features two dense hidden layers (65 units each) with 50% dropout, while the LSTM comprises one LSTM layer (50 units) with 50% dropout followed by a 65-unit dense layer with 50% dropout before the output. As expected, the models achieve high accuracy without balancing, but other metrics suffer. Performance improves with balancing: LSTM and CNN benefit most from random oversampling, while MLP performs better with SMOTE.
The study in [31] implements a deep recurrent neural network with 4 stacked LSTM layers (each containing 50 hidden units), achieving high accuracy (99.6%) and precision (99.6%) but suffering from low recall (80%). The methodology exhibits several flaws: normalization before train-test splitting (risking data leakage), misconceptions about PCA re-application, lack of class balancing, and inclusion of unnecessary implementation details that obscure the core methodology. While the model demonstrates strong precision, its poor recall indicates a significant bias toward the majority class, likely due to the imbalanced dataset and absence of resampling techniques.
The obsession with RNNs can also be observed in [38], in the form of what the authors call an "ensemble" of LSTM/GRU RNNs. The latter are employed for classification, followed by aggregation via a feed-forward neural network (FFNN) acting as the voting mechanism. Their results on the European Cardholders Dataset (ECD) may be good, but the approach is overly complex. The same holds for [39], which employs an AE for preprocessing, deep learning for classification, and a Bayesian algorithm to optimize the hyperparameters. Such overuse of deep learning methods can be found in about 11 different techniques reported in [40] and strangely attributed to the adversarial approach of [41], which may be a misreporting and seems far-fetched. Similarly, a CNN method is wrongly attributed to [42], which in fact includes an unspecified MLP among its methods, and that too performs very poorly on our dataset of interest, with \(61.4\%\) precision, \(38.5\%\) sensitivity and a \(47.3\%\) F1-score, albeit a better specificity of \(93.2\%\).
The study in [32] proposes a time-aware attention mechanism for RNNs, designed to capture users’ current transactional behavior in relation to their historical patterns. While the approach aims to model behavioral periodicity effectively, it forgoes data balancing, resulting in poor precision (50.07%) despite high recall (99.6%). Performance improves with increased memory size (units unspecified), suggesting memory-intensive requirements. The evaluation relies on AUC rather than balanced metrics, which may obscure class imbalance issues.
The work presented in [33] examines six classification algorithms (SVM, LR, RF, XGBoost, DT, and ET), both in their standard form and with AdaBoost enhancement, employing SMOTE to address class imbalance. The approach shows substantial performance gains when using AdaBoost, most notably in recall (with DT recall improving from 75.57% to 99%), and achieves 99.95% accuracy with RF-AdaBoost. However, the study presents several methodological limitations: (1) data leakage risk, since applying SMOTE before the train-test split may contaminate the test set with synthetic samples; and (2) preprocessing ambiguity, with no clear specification of when normalization occurs relative to the data split.
The comparative analysis presented in [34] exhibits fundamental methodological flaws, most notably an overemphasis on the SMOTE + Logistic Regression results while neglecting to investigate the abnormally low precision (<10%) observed in other methods. The study’s credibility is further compromised by the publisher’s expression of concern over the language and readability.
The study in [35] evaluates three sampling techniques (under-sampling, SMOTE, and SMOTE-Tomek) combined with four classifiers (KNN, Logistic Regression, Random Forest, and SVM), though the methodology raises several concerns. First, there’s ambiguity regarding the order of resampling and train/test splitting, which likely occurs in the wrong sequence. More critically, both SMOTE and SMOTE-Tomek produce identical sample counts (227,845 per class), an unusual outcome suggesting either Tomek link misconfiguration or ineffectiveness due to PCA-transformed features. Additionally, the aggressive under-sampling to just 492 samples per class risks underfitting despite appearing to yield good metrics. Finally, the evaluation’s focus on recall over precision may lead to high false positive rates, as evidenced by Logistic Regression’s 0.52 precision despite 0.92 recall when using SMOTE. While Random Forest achieves the highest F1-scores (0.94 with under-sampling, 0.92 with SMOTE, and 0.93 with SMOTE-Tomek), these methodological concerns undermine the reliability of the reported performance improvements.
Based on our comprehensive review of credit card fraud detection methodologies, we have identified several persistent flaws that significantly undermine the reliability and practical applicability of many studies:
Data Leakage in Preprocessing: Numerous studies perform critical preprocessing steps (normalization, SMOTE etc.) before train-test splitting, artificially inflating performance metrics through information leakage.
Intentional Vagueness in Methodology: Many works deliberately omit crucial implementation details, making replication difficult and raising questions about result validity. This includes unspecified parameter settings, ambiguous preprocessing sequences, silence about stratified sampling, and unexplained architectural choices.
Inadequate Temporal Validation: Most approaches fail to account for the time-dependent nature of transaction data, neglecting temporal splitting, which is essential for real-world deployment (a minimal temporal split is sketched after this list).
Unjustified Method Complexity: There’s a tendency to apply unnecessarily sophisticated techniques without first ensuring proper data preparation and validation, often obscuring fundamental methodological flaws.
Overemphasis on Recall: Many works prioritize recall metrics at the expense of precision, leading to models with high false positive rates that would be impractical in production environments.
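To make the temporal-validation point concrete, the sketch below shows a minimal time-based hold-out split for this dataset, assuming df is the DataFrame loaded earlier; the split uses the Time column rather than random shuffling.

```python
# Minimal temporal hold-out sketch: train on the earliest transactions and
# test on the most recent ones (df as loaded earlier).
df_sorted = df.sort_values("Time")
cut = int(len(df_sorted) * 0.8)                 # first 80% of the timeline for training

train, test = df_sorted.iloc[:cut], df_sorted.iloc[cut:]
X_train, y_train = train.drop(columns="Class"), train["Class"]
X_test, y_test = test.drop(columns="Class"), test["Class"]
# Any scaling or resampling is then fitted on (X_train, y_train) only.
```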
These persistent issues - particularly the concerning trend of intentional vagueness - highlight the urgent need for more rigorous evaluation protocols and complete methodological transparency in fraud detection research. Addressing these common pitfalls, especially the lack of full disclosure in implementation details, would significantly improve both the validity and practical utility of future studies in this domain.
The findings of the last section suggest that methodological rigor matters more than algorithmic sophistication. To demonstrate this, we will present a deliberately flawed MLP implementation with SMOTE applied before train-test splitting - a clear violation of proper evaluation protocol. Despite this fundamental flaw, we anticipate this method will outperform many existing approaches, underscoring how data leakage can overshadow algorithmic advantages. This serves as a cautionary example that sophisticated techniques cannot compensate for basic methodological failures in fraud detection research.
The flawed methodology used in this investigation to differentiate deceptive transactions from genuine ones focuses on the most prevalent vice in the literature: balancing the data before splitting it into train/val/test partitions. It consists of two modules central to the detection of fraudulent card transactions, viz. a SMOTE module and a multilayer perceptron (MLP) network.
To address the unbalanced data, we employ oversampling using SMOTE, which is particularly effective in mitigating class imbalance issues in classification tasks. SMOTE generates synthetic samples of the minority class to balance the class distribution [3].
The synthetic sample is generated using the following rule: \[x_{new} = x_{i} + \delta \times (x_{k} - x_{i}),\] where \(x_{i}\) is a minority-class sample, \(x_{k}\) is one of its \(k\) nearest minority-class neighbors, and \(\delta\) is a random number drawn uniformly from \([0, 1]\).
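As a small worked example of this rule (values chosen purely for illustration):

```python
# A synthetic point is placed on the line segment between a minority sample
# and one of its minority-class nearest neighbours.
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([0.2, 1.5])        # minority-class sample
x_k = np.array([0.6, 1.1])        # one of its k nearest minority neighbours
delta = rng.uniform(0.0, 1.0)     # random scalar in [0, 1]

x_new = x_i + delta * (x_k - x_i)
print(delta, x_new)               # x_new lies between x_i and x_k
```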
As far as the Python implementation is concerned, the SMOTE-related module is imported from the imblearn (imbalanced-learn) library. The imblearn package includes techniques for handling unbalanced datasets and is built on scikit-learn, an open-source Python library that provides simple and efficient tools for data mining and data analysis. This approach helps in creating a more balanced dataset, thereby improving the performance of classification models.
An Artificial Neural Network (ANN) is a computational model inspired by the structure and function of biological neural networks, designed to process inputs and learn patterns akin to human cognition. A typical MLP, a type of ANN (as illustrated in Fig. 3), consists of interconnected layers of nodes (neurons), where each connection is associated with a weight that is adjusted during training. Due to their ability to model complex, non-linear relationships, ANNs are widely employed in tasks such as pattern recognition, regression, and classification.
Figure 3: A generic MLP architecture
An MLP typically comprises three key components:
Input Layer: Receives raw data and distributes it to the subsequent layer. Each neuron in this layer corresponds to a feature or attribute of the input data.
Hidden Layers: Perform computations and feature extraction, enabling the network to capture intricate patterns in the data.
Output Layer: Produces the final prediction or classification result. The number of neurons in this layer is determined by the specific task (e.g., one neuron for binary classification, multiple neurons for multi-class classification).
The fundamental building blocks of an ANN are neurons, which process inputs by applying an activation function to the weighted sum of their inputs and producing an output. This output is then propagated to subsequent neurons, facilitating the network’s ability to learn and adapt [43].
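Formally, each neuron computes \[y = \varphi\!\left(\sum_{j=1}^{d} w_j x_j + b\right),\] where \(x_j\) are its inputs, \(w_j\) the learned weights, \(b\) the bias, and \(\varphi\) an activation function such as ReLU or the sigmoid.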
Figure 4: The Flawed MLP Model
To emphasize our argument, we use a very simple MLP architecture; the core point is that, with a casually leaky approach, there is no need for sophistication. Our MLP module consists of only one hidden layer with an arbitrary number of neurons (\(N\)), utilizing the rectified linear unit (ReLU) activation function. The input layer has \(16\) (arbitrary) neurons, while the output layer has a single neuron with a sigmoid activation function, suitable for binary classification. Assuming that the data is subjected to SMOTE and then split into training and testing parts, the model takes the form of the code given in Figure 4.
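Since the listing in Figure 4 is not reproduced here, the following sketch approximates the described setup, assuming a TensorFlow/Keras implementation and the df DataFrame loaded earlier; it is deliberately leaky, applying SMOTE to the full dataset before the split, and is not necessarily the exact Figure 4 code.

```python
# Deliberately flawed pipeline: SMOTE is applied to the *entire* dataset
# before the train/test split, then a tiny MLP is trained on the result.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.metrics import Precision, Recall

X = df.drop(columns="Class")                      # df as loaded earlier
y = df["Class"]
N = 16                                            # hidden-layer width; N = 0 removes the layer

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)     # <-- leakage happens here
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

layers = [Input(shape=(X.shape[1],)), Dense(16, activation="relu")]
if N > 0:
    layers.append(Dense(N, activation="relu"))
layers.append(Dense(1, activation="sigmoid"))

model = Sequential(layers)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", Precision(), Recall()])
model.fit(np.asarray(X_train), np.asarray(y_train),
          epochs=10, batch_size=256, validation_split=0.1)
model.evaluate(np.asarray(X_test), np.asarray(y_test))       # inflated by synthetic leakage
```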
To further demonstrate our argument about the disproportionate impact of methodological flaws versus algorithmic sophistication, we conducted an extreme simplification test: eliminating the hidden layer entirely (\(N=0\)) while maintaining the data leakage from SMOTE application before train-test splitting. Figure 5 presents the Precision-Recall Curve (PRC) and Receiver Operating Characteristic Curve (ROC) for this minimal configuration.
Figure 5: Test results after applying the MLP with no hidden layer (\(N=0\)).
Despite this extreme simplification, Table 1 shows we achieved remarkably high performance metrics: 94% recall and 97.6% precision. This demonstrates that:
A single output neuron with data leakage can outperform many sophisticated models
The data leakage from improper SMOTE application provides more benefit than architectural complexity
Such results are artificially inflated and would not generalize to real-world scenarios
Table 1: Test results of the flawed MLP for varying hidden-layer size \(N\).

No. | N | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
1 | 0 | 0.958 | 0.976 | 0.939 | 0.958 |
2 | 1 | 0.959 | 0.985 | 0.932 | 0.957 |
3 | 2 | 0.967 | 0.976 | 0.958 | 0.967 |
4 | 4 | 0.982 | 0.980 | 0.983 | 0.982 |
5 | 6 | 0.982 | 0.985 | 0.979 | 0.982 |
6 | 8 | 0.986 | 0.988 | 0.985 | 0.986 |
7 | 10 | 0.992 | 0.989 | 0.994 | 0.992 |
8 | 12 | 0.992 | 0.991 | 0.992 | 0.992 |
9 | 16 | 0.996 | 0.992 | 0.999 | 0.996 |
We then systematically increased the complexity by introducing a hidden layer and varying the number of neurons (\(N=1\) to \(N=16\)). As shown in Table 1, performance metrics improved incrementally with more neurons, reaching near-perfect scores (99.9% recall) at \(N=16\). Figure 6 illustrates the PRC and ROC curves for selected configurations.
Figure 6: Pairwise PRC (left) and ROC (right) test results of the MLP method with \(N\) neurons in a single hidden layer.
Following are the important points to note:
These results demonstrate how data leakage can overshadow architectural improvements.
Performance gains from adding neurons are marginal compared to the initial boost from data leakage.
The near-perfect metrics at \(N=16\) (even beyond \(N=8\)) are statistically implausible for real-world fraud detection.
This experiment underscores that proper evaluation methodology matters more than model complexity.
This case study vividly illustrates how fundamental methodological flaws can produce deceptively impressive results that would fail in real-world deployment. The key takeaway is that evaluation rigor must precede architectural sophistication in fraud detection research.
Our analysis reveals that methodological rigor is far more critical than algorithmic sophistication in fraud detection research. Through deliberate experimentation with flawed evaluation protocols, we demonstrated how even simple models can achieve deceptively impressive results when fundamental methodological principles are violated. The extreme case of a single-neuron model outperforming sophisticated architectures - solely due to data leakage from improper SMOTE application - serves as a stark illustration of how evaluation flaws can overshadow algorithmic advantages.
Beyond these technical considerations, our findings must be contextualized within the broader academic ecosystem. The peer review system faces significant challenges including reviewer fatigue, time constraints, and overwhelming submission volumes. These issues are exacerbated by publication pressures, financial incentives tied to rapid dissemination, and citation practices that may inadvertently inflate impact metrics. The current academic reward structure, which heavily weights publication quantity and citation counts for career advancement, can sometimes incentivize quantity over quality. While understandable given institutional pressures, this emphasis may occasionally compromise research rigor and originality.
These systemic factors contribute to a growing disconnect between academic research and industrial practice. As researchers, we often find ourselves playing catch-up to industry innovation, with many academic publications essentially formalizing concepts that have already gained practical traction. The review process itself may sometimes prioritize presentation quality over substantive contribution, allowing incremental or ambiguous work to pass through.
It is crucial to emphasize that this critique is general in nature and not directed at any specific work cited in this study. All references were selected based on their relevance and alignment with our discussion, and should not be construed as examples of the concerns raised. Rather, our analysis aims to highlight systemic issues that affect the field as a whole, with the goal of fostering more rigorous and impactful research practices.
Moving forward, we advocate for:
Stricter evaluation protocols that prioritize methodological soundness over novel architectures.
Enhanced transparency in reporting preprocessing pipelines and evaluation methodologies.
Better alignment between academic research and industrial needs.
Reform of incentive structures to reward quality and impact over publication quantity.
By addressing these challenges, we can bridge the gap between academic research and practical applications, ultimately advancing the field of fraud detection in more meaningful ways.