June 03, 2025
This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision’s expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.
The rise of deep learning has profoundly transformed how we approach complex classification and regression problems. Given its data-hungry nature, the prevailing emphasis has been on acquiring large volumes of training data to boost model performance. Meanwhile, the implementation of deep learning models has become increasingly trivialized, thanks to the proliferation of high-level APIs and libraries, especially in Python, that abstract much of the underlying complexity. Today, even large language models can generate working training pipelines on request, lowering the barrier to entry even further.
However, this convenience comes at a cost. The focus has shifted so heavily toward automation and performance metrics that critical aspects of the modeling pipeline are often overlooked or poorly documented. Researchers frequently delegate key steps to automated tools, sometimes without fully understanding or reporting them. In doing so, we risk putting the cart before the horse—prioritizing model tuning over foundational concerns such as data quality and preparation.
The initial steps that precede model training are often loosely grouped under the term “preprocessing,” yet they receive scant attention in many studies. It is not uncommon for papers to devote pages to re-explaining ubiquitous evaluation metrics like accuracy, precision, recall, or confusion matrices, while providing vague or incomplete details on more foundational questions such as:
How were categorical, ordinal, or fuzzy variables handled?
What strategy was used for data splitting (e.g., random, stratified, time-based)?
When and how was normalization or standardization applied, and to which subsets?
Were oversampling or undersampling techniques limited to the training set, or did they inadvertently affect the test data?
Was feature selection or dimensionality reduction performed before or after data splitting?
This vagueness often extends to the methodological core. Sleek pipeline diagrams are commonly included but tend to omit essential details. For instance, one might encounter the use of 2D Convolutional Neural Networks (CNNs) applied to tabular data without any justification. Was there a compelling reason for employing a spatial model on non-spatial input? If so, how was the data reshaped to accommodate this architecture? Such decisions are non-trivial and require clear, transparent reporting.
These questions are far from minor. They directly affect the reproducibility, interpretability, and trustworthiness of machine learning models. When unaddressed, they may introduce silent data leakage, bias, or misleading performance metrics; ultimately undermining the credibility of the research. Thorough documentation and transparency in preprocessing are not optional; they are essential to rigorous, responsible data science.
In this work, we undertake a critical review of existing literature in light of the concerns previously discussed, using credit card fraud detection as a case study, specifically focusing on a widely referenced benchmark dataset [1]. This domain poses distinctive challenges, with extreme class imbalance being a primary issue. In such contexts, seemingly high accuracy scores are not uncommon, yet they can be misleading. Methodological oversights or missteps, whether inadvertent or otherwise, can easily lead to inflated performance metrics and a skewed perception of model efficacy.
One recurring issue we have observed is the mishandling of resampling techniques - particularly when oversampling or undersampling is performed before the train-test split - leading to data leakage. To underscore this point, we intentionally apply a simple yet flawed MLP-based approach. Despite its simplicity, the model yields impressively high metrics, thereby demonstrating how superficial performance gains can mask deeper methodological flaws.
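To make the distinction concrete, the following minimal sketch shows a leakage-free ordering using the standard scikit-learn and imbalanced-learn APIs: the split happens first, and the scaler and SMOTE are fitted on the training fold only. Here X and y are placeholders for the feature matrix and labels, assumed already loaded.

```python
# Leakage-free ordering: split first, then fit the scaler and SMOTE on the
# training fold only (illustrative sketch, not the pipeline of any cited paper).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)            # statistics from training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_s, y_train)
# The test set keeps its original, imbalanced class distribution.
```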
The rest of this paper is organized as follows: Section [sec_background] provides background on credit card fraud detection and dataset characteristics. Section 3 reviews the literature on deep learning-based fraud detection. Section 4 describes our methodology, followed by experimental results in Section [sec_results]. Finally, Section 5 concludes the paper.
The widespread use of credit and debit cards has greatly enhanced financial convenience, but it has also led to increased fraudulent activity. Payment card fraud typically involves unauthorized transactions intended to obtain goods, services, or cash. While fraud represents a tiny fraction of all transactions, the absolute financial impact is substantial [2], making it a significant area of concern for financial institutions.
We use the popular European credit card fraud dataset [1] for benchmarking. It contains \(284{,}807\) transactions made by European cardholders over two days in September 2013, out of which only \(492\) are fraudulent, just \(0.17\%\), highlighting extreme class imbalance.
The dataset comprises \(30\) features: \(28\) anonymized principal components (V1–V28) derived via PCA, along with Time and Amount. The target variable Class is binary, with \(1\) indicating fraud and \(0\) denoting legitimate transactions. Due to privacy concerns, the original feature labels and semantics were withheld.
Figure 1: Snapshot of the first five rows of the dataset.
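For reference, the imbalance can be verified in a few lines of Python, assuming the benchmark file has been downloaded locally as creditcard.csv:

```python
# Loading the benchmark dataset and confirming the class imbalance.
import pandas as pd

df = pd.read_csv("creditcard.csv")
print(df.shape)                      # (284807, 31): V1-V28, Time, Amount, Class
print(df["Class"].value_counts())    # 0: 284315 legitimate, 1: 492 fraudulent
print(df["Class"].mean() * 100)      # ~0.17% fraud rate
```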
In fraud detection, the misclassification of minority class instances (fraudulent cases) carries significant consequences, as failing to detect fraud can lead to financial losses and security risks. Traditional classifiers often bias toward the majority class due to the skewed distribution of data, leading to poor performance on the minority class. Resampling techniques address this issue by adjusting the class distribution, either by oversampling the minority class, undersampling the majority class, or combining both approaches to improve model performance.
Oversampling techniques aim to increase the representation of the minority class by either duplicating existing samples or generating synthetic samples. These methods help the model learn patterns from the minority class more effectively.
Random Oversampling: Duplicates minority samples without introducing new information. This method is simple but can lead to overfitting due to repeated samples.
SMOTE (Synthetic Minority Oversampling Technique) [3]: Generates synthetic samples between nearest neighbors of minority instances. This method helps to create a more balanced dataset and can improve model performance.
ADASYN (Adaptive Synthetic Sampling) [4]: Focuses on generating synthetic samples for harder-to-classify minority samples through adaptive weighting. This method is particularly useful for improving the classification of difficult minority instances.
Borderline-SMOTE [5]: Creates synthetic samples near class boundaries for better discrimination. This method helps to improve the classification of minority instances near the decision boundary.
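All of the oversamplers above are available in the imbalanced-learn library; a minimal sketch, assuming X_train and y_train hold the already-split training data:

```python
# Illustrative use of the oversampling methods discussed above.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE

samplers = {
    "Random Oversampling": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(k_neighbors=5, random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))   # both classes now (roughly) equal in size
```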
Undersampling techniques aim to reduce the number of majority class samples to balance the class distribution. These methods help to reduce the computational complexity and improve the model’s focus on the minority class.
Random Undersampling (RUS): Randomly removes majority class samples. This method is simple but can lead to loss of important information.
NearMiss [6]: Selects majority samples based on proximity to minority instances. This method helps to retain informative majority samples.
Tomek Links [7]: Removes borderline samples to clarify decision boundaries. This method helps to improve the classification of minority instances by removing ambiguous majority samples.
Cluster Centroids [8]: Applies K-means clustering to condense the majority class. This method helps to reduce the number of majority samples while retaining the overall distribution.
Hybrid methods combine oversampling and undersampling techniques to balance the class distribution and improve model performance. These methods aim to leverage the strengths of both approaches.
SMOTE-Tomek, SMOTE-ENN [9]: Combine oversampling with data cleaning for improved balance. These methods help to generate synthetic minority samples and remove ambiguous majority samples.
SMOTEBoost [10]: Integrates SMOTE with boosting to enhance weak classifiers. This method helps to improve the performance of weak classifiers by generating synthetic minority samples.
SMOTE-SVM [11]: Uses SVM to guide synthetic sample generation. This method helps to generate synthetic minority samples based on the decision boundary of an SVM classifier.
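A corresponding sketch for the undersampling and hybrid methods that are implemented in imbalanced-learn (boosting-based variants such as SMOTEBoost are not part of the library and are omitted), again applied only to pre-split training data:

```python
# Illustrative use of the undersampling and hybrid methods discussed above.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks, ClusterCentroids
from imblearn.combine import SMOTETomek, SMOTEENN

samplers = {
    "RUS": RandomUnderSampler(random_state=0),
    "NearMiss": NearMiss(version=1),
    "Tomek Links": TomekLinks(),              # only removes ambiguous majority samples
    "Cluster Centroids": ClusterCentroids(random_state=0),
    "SMOTE-Tomek": SMOTETomek(random_state=0),
    "SMOTE-ENN": SMOTEENN(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))
```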
Due to privacy-preserving transformations like PCA, meaningful original features are not available. While PCA protects sensitive data, it also results in components with uneven explanatory power, where only the top few capture substantial variance. Applying deep learning models to such transformed, sparse information may be excessive.
The literature often favors deep or complex models (CNNs, RNNs, GANs), yet we argue that focus should instead be on feature (re)engineering. By leveraging PCA components, their polynomial and pairwise combinations, and applying SMOTE for balancing, even simple models such as shallow MLPs can achieve competitive performance.
In fraud detection, minimizing false negatives (missed fraud cases) is vital. Hence, recall is a more appropriate metric than accuracy. SMOTE remains a strong choice for mitigating imbalance, and its integration with meaningful features ensures that even basic models can remain effective and interpretable.
In evaluating classification models, the confusion matrix provides essential metrics such as accuracy, precision, recall (sensitivity), specificity, and the F1-score. Four key terms can be derived from a confusion matrix; keeping the credit card fraud detection use case in perspective, they are defined as:
True Positives (TP): Correctly predicted positive instances (e.g., fraudulent transactions correctly identified as fraud).
True Negatives (TN): Correctly predicted negative instances (e.g., legitimate transactions correctly identified as legitimate).
False Positives (FP): Incorrectly predicted positive instances (e.g., legitimate transactions flagged as fraud; Type I error).
False Negatives (FN): Incorrectly predicted negative instances (e.g., fraudulent transactions missed by the model; Type II error).
The key metrics and their definitions are as follows: \[\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP+TN+FP+FN}}\] \[\text{Precision} = \frac{\text{TP}}{\text{TP+FP}}\] \[\text{Recall or Sensitivity} = \frac{\text{TP}}{\text{TP+FN}}\] \[\text{Specificity} = \frac{\text{TN}}{\text{TN+FP}}\] \[F_1 \text{ Score}= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
In credit card fraud detection [12], where the cost of missing fraudulent transactions (FN) is significantly higher than false alarms (FP), recall is particularly critical to minimize undetected fraud. However, precision must also be balanced to avoid overwhelming analysts with false positives. The F1-score, which harmonizes precision and recall, is thus a key metric for assessing model performance in such imbalanced scenarios.
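As a concrete example, these quantities can be computed directly with scikit-learn; y_test and y_pred below are placeholders for the true and predicted labels:

```python
# Confusion-matrix-based metrics as defined above, computed with scikit-learn.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp)

print("Accuracy   :", accuracy_score(y_test, y_pred))
print("Precision  :", precision_score(y_test, y_pred))
print("Recall     :", recall_score(y_test, y_pred))
print("F1-score   :", f1_score(y_test, y_pred))
print("Specificity:", specificity)
```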
The Precision-Recall Curve (PRC) is a crucial tool for evaluating the performance of classification models, especially in imbalanced datasets where fraudulent transactions are rare. Unlike the ROC curve, which can be overly optimistic in imbalanced scenarios, the PRC provides a clearer view of the trade-off between precision and recall - two metrics that are directly relevant to fraud detection. The PRC helps in selecting an optimal threshold for the model, balancing the need to catch as many fraudulent transactions as possible (high recall) while keeping false positives manageable (high precision). This balance is critical in fraud detection, where the cost of missing fraud (false negatives) is significantly higher than the cost of false alarms (false positives).
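A minimal sketch of threshold selection from the PRC, assuming y_score holds the model's predicted fraud probabilities for the test set:

```python
# Threshold selection from the precision-recall curve (illustrative policy).
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print("Average precision (area under the PRC):",
      average_precision_score(y_test, y_score))

# Example policy: the highest threshold that still keeps recall >= 0.90.
ok = np.where(recall[:-1] >= 0.90)[0]   # thresholds aligns with precision[:-1]/recall[:-1]
idx = ok[-1]
print("threshold:", thresholds[idx],
      "precision:", precision[idx], "recall:", recall[idx])
```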
While specificity is important for maintaining customer trust and operational efficiency, it is often secondary to recall and the \(F_1\) score due to the high cost of missing fraudulent transactions. High specificity ensures that legitimate transactions are not unnecessarily flagged as fraudulent, but the primary focus remains on balancing recall and precision to minimize undetected fraud.
Credit card fraud detection has been extensively studied using both traditional and modern machine learning approaches, with comprehensive reviews available in recent works such as [13]–[15]. While this task might appear straightforward in theory, the field has seen an overapplication of complex methods that may be unnecessarily heavy for payment card fraud detection. This trend has even led to questions about the effectiveness of certain approaches, particularly Multilayer Perceptrons (MLPs), in capturing the temporal dependencies and sequential patterns that are crucial for identifying fraudulent activities [16]. In our view, the focus should shift from immediately applying sophisticated techniques to ensuring the correctness and efficiency of fundamental preprocessing tasks. To this end, we critically examine existing literature, identifying and analyzing potential methodological flaws in a substantial sample of previous works.
No. | Method/Approach | Flaws Identified | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|
1 | SMOTE + ANN [17] | Suspected data leak: dataset balancing before the split; inconsistent results within the article | 0.99 | 0.93 | 0.88 | 0.91 |
2 | UMAP + SMOTE + LSTM (dimensionality reduction via UMAP) [18] | SMOTE applied to the dataset before splitting; dimensionality reduction on data that is already PCA-transformed | 0.967 | 0.988 | 0.919 | 0.952 |
3 | RUS + NMS + SMOTE + DCNN [19] | Vague on the specific order of under- and oversampling and on the final sample count; precision and recall well below 40%, worse than a random classifier in some cases; no explanation for using 1D CNNs with \(3\times 3\) kernels | 0.972 | 0.368 | 0.392 | 0.378 |
4 | SMOTE-ENN + boosted LSTM [20] | | - | - | 0.996 (specificity: 0.998) | - |
5 | SMOTE-Tomek + Bi-GRU [21] | SMOTE-Tomek before the train/test split; BN before activation; AUC too high relative to the reported metrics; readability issues | 0.972 | 0.959 | 0.978 | 0.968 |
6 | Borderline SMOTE + LSTM [22] | Improper data splitting (validation set extracted pre-oversampling) and excessive majority-class oversampling; misleading terminology (e.g., MLP vs. ANN) and undefined model architectures | 0.999 | 0.803 | 0.921 | 0.858 |
7 | SMOTE-Tomek + BPNN (3 hidden layers: 28+28+dropout+28) [23] | Oversamples the entire dataset before splitting; ambiguities in the post-balancing sample count and test size (25% or 30%?); inconsistent metrics: AUC = 1 and AUPR = 0.99 incompatible with F1 = 0.92 | - | 0.855 | 1.00 | 0.922 |
8 | CAE + SMOTE [24] | Claims inverse_transform can reconstruct original data from PCA components alone (it requires the original data and PCA model); no holdout test set, relying solely on CV for final evaluation; SMOTE applied globally (not per-fold), risking data leakage; multiple test evaluations may inflate performance | - | 0.920 | 0.890 | 0.905 |
9 | DAE + SMOTE + DNN (4 hidden layers: 22+15+10+5, 2 output neurons) [25] | Bizarre DNN architecture: 4 hidden layers with aggressive shrinkage (22+15+10+5); output layer has 2 neurons for binary classification (should be 1 + sigmoid); unclear whether normalization/standardization is applied pre-split (leakage risk); high recall at low thresholds (overfitting to SMOTE/AE) plus a sharp recall drop at threshold 0.8 (poor probability calibration) | 0.979 | - | 0.84 | - |
10 | SMOTE ahead of various ML methods; best performance by RF, followed by an MLP with 4 hidden layers (50+30+30+50 neurons) [26] | Vague methodology: lacks details on SMOTE application and feature exclusion; likely SMOTE before the split (data-leakage risk); no justification for excluding 5% of features (critical for MLP performance); inconsistent performance: RF (F1 = 0.964) vs. MLP (F1 = 0.792) suggests overfitting | 0.999 | 0.964 (RF), 0.792 (MLP) | 0.816 | 0.884 (RF), 0.804 (MLP) |
11 | CNN (Conv1D + Flatten + Dropout) [27] | Naive architecture: unnecessary flatten after Conv1D and excessive dropout; poor performance (\(\approx 93\%\) P/R/A) vs. RF (0.99 F1); vague oversampling details; unclear dataset reduction to 984 samples (fraud class size unspecified) | 0.93 | 0.93 | 0.93 | 0.93 |
12 | CNN (Conv1D: \(32\times 2\), \(64\times 2\) + Dropout + Flatten) [28] | No class balancing; misuse of Conv2D on 1D data (PCA-transformed features); ineffective CNN use on PCA-transformed data (disrupts feature order); forced CNN architecture (unnecessary for tabular data); lower recall (90.24%) despite high precision/accuracy | 0.972 | 0.991 | 0.902 | 0.945 |
13 | a) MLP: n inputs and n neurons in each hidden layer [29] | | 0.999 | 0.999 | 0.999 | 0.999 |
 | b) CNN: unspecified, but appears to use 2D kernels [29] | Invalid CNN use on PCA data (no spatial structure); misapplication of 2D kernels to 1D tabular data; lack of evaluation transparency; questionable reliability of results; weak scientific justification; reported metrics (>99.9% for ANN/CNN, 97.3% for LSTM) seem unrealistic | 0.999 | 0.999 | 0.999 | 0.999 |
 | c) LSTM-RNN [29] | | 0.973 | 0.973 | 0.973 | 0.973 |
14 | a) Random Oversampling (RO) + MLP: two dense hidden layers of 65 units each + 50% dropout [30] | | 0.983 | 0.978 | 0.987 | 0.983 |
 | b) RO + CNN: Conv1D(32,2) + dropout(0.2) + BN + Conv1D(64,2) + BN + flatten + dropout(0.2) + dense(64) + dropout(0.4) + dense(1) [30] | CNN architecture may be suboptimal for tabular data; excessive dropout in all models (20-50%); high accuracy without balancing masks poor other metrics; performance varies by balancing technique (SMOTE vs. random oversampling) | 0.992 | 0.996 | 0.987 | 0.992 |
 | c) RO + LSTM: LSTM(50) + dropout(0.5) + dense(65) + dropout(0.5) + dense(1) [30] | | 0.957 | 0.802 | 0.982 | 0.883 |
15 | LSTM-RNN (4×50 units) [31] | Normalization before the split (data-leakage risk); misconceptions about PCA re-application; no class balancing (low recall: 80%); unnecessary implementation details; high accuracy/precision (99.6%) masks poor recall | 0.996 | 0.996 | 0.80 | 0.887 |
16 | Time-Aware Attention RNN [32] | No data balancing (precision: 50.07%); high recall (99.6%) masks poor precision; memory-intensive (performance improves with larger memory); relies on AUC (may obscure class imbalance); unspecified memory units | 0.958 | 0.501 | 0.996 | 0.667 |
17 | SMOTE + AdaBoost (RF/ET/XGB/DT/LR) [33] | SMOTE before the split (data-leakage risk); vague on normalization timing and on stratification in sampling; synthetic data in the dataset may overstate performance | 0.999 | 0.999 | 0.999 | 0.999 |
18 | SMOTE + various classifiers [34] | Overemphasis on SMOTE + LR results; neglect of low precision (<10%) in other methods; publisher's expression of concern; potential reliability issues | 0.970 | 0.999 | 0.970 | 0.984 |
19 | SMOTE-Tomek + RF [35] | Ambiguous resampling order (likely the wrong sequence); SMOTE-Tomek behaves identically to SMOTE (no Tomek effect); under-sampling too aggressive (492 samples); high-recall focus sacrifices precision | 0.99 | 0.92 | 0.94 | 0.93 |
20 | SMOTE + XGBoost [36] | Data leakage (preprocessing before the split); incomplete feature normalization; default classifier parameters; unrealistic perfect recall (100%); no temporal validation | 0.999 | 0.999 | 1.00 | 0.999 |
The study in [36] evaluates four classifiers (LR, LDA, NB, and XGBoost) using their default configurations on SMOTE-balanced data. While the authors report exceptional performance for XGBoost (accuracy: 99.969%, precision: 99.938%, recall: 100%, F1: 99.969%, AUC: 99.969%), these results appear compromised by methodological flaws. Specifically, the preprocessing pipeline suffers from data leakage as both feature scaling/normalization and SMOTE application were performed before the train-test split. Additionally, the study’s normalization approach is questionable, as it only scales the ‘Time’ feature without addressing the temporal nature of transaction data.
The ANN described in [17] has 4 hidden layers, which is still fairly deep for the dataset in question. The data was balanced by applying SMOTE before the train/valid/test split and then fed to the ANN, and the authors report \(93\%\) precision and \(88\%\) recall despite this leaky approach. These results are also inconsistent with those claimed later in the conclusion section.
Another set of methods, described in [18], relies on swarm intelligence for feature selection, an attention mechanism for classifying relevant data items, UMAP for dimensionality reduction, SMOTE for addressing data imbalance, and an LSTM for modeling long-term dependencies in transaction sequences. The claimed results are again suspect because the authors applied SMOTE to the dataset before splitting it.
The authors in [19] propose a 1D Dilated Convolutional Neural Network (DCNN) for credit card fraud detection, combining SMOTE with Random Under-Sampling (RUS) and Near Miss (NM) under-sampling, though the specific order of these techniques is not explicitly stated. While their model achieves a reported accuracy of 97.27% on the European credit card fraud dataset, this high accuracy is misleading given the dataset's extreme class imbalance (fraud prevalence of 0.172%), where a trivial "no fraud" classifier would already achieve 99.8% accuracy. Additionally, the paper's description of 1D CNNs with \(3\times 3\) kernels and batch normalization over "feature maps" suggests a potential misapplication of 2D CNN architectures to flattened or PCA-reduced features.
Another approach [20] balances the data by combining SMOTE with edited nearest neighbors (SMOTE-ENN) and then employs a strong deep learning ensemble that uses a long short-term memory (LSTM) network as the base learner of the adaptive boosting (AdaBoost) technique. The authors claim a recall of \(0.996\) and a specificity of \(0.998\). While everything else is described in detail, the key steps remain vague, e.g., whether balancing and normalization were applied before or after the split. The data balancing algorithm outlined in the article, however, points to balancing of the whole dataset.
Another RNN-based method [21] uses the SMOTE-Tomek technique to address the data imbalance and, after splitting the data, employs a Bidirectional Gated Recurrent Unit (Bi-GRU) for classification. The approach is naive, not only for applying SMOTE-Tomek before the train/test split but also for using BN before the activation function inside the RNN module. The article also has readability issues. The reported metrics (Accuracy = \(97.16\%\), Precision = \(95.98\%\), Recall = \(97.82\%\)) seem inconsistent with the reported AUC of \(99.66\%\), which is implausibly high and should be lower.
The study in [22] exhibits methodological flaws, particularly in data balancing and train/validation/test splits. The authors oversample the majority class but extract the validation set before oversampling, then draw test cases from the oversampled majority class, likely inflating performance metrics. Additionally, their terminology is inconsistent: one model is correctly labeled as an MLP (32-16-8 neurons), while another is misleadingly called an ANN (50 neurons per hidden layer, unspecified depth). Their top-performing model, a computationally heavy LSTM, relies on borderline SMOTE, casting doubt on the robustness of their results.
A three-step deep learning technique [37] balances the data with borderline SMOTE and then uses a stacked autoencoder to extract key information for SoftMax-based classification. The authors demonstrate a commendable AUC score, but the method remains complex.
The authors of [23] combine SMOTE and Tomek Links to preprocess data before training a Back Propagation Neural Network (BPNN) with 3–6 hidden layers (each containing 28 neurons) and dropout regularization, particularly after the second hidden layer. However, the study suffers from ambiguities—such as the number of samples post-balancing and the test set fraction (25% or 30%) - as well as inconsistencies in reported metrics. Additionally, the approach risks data leakage due to balancing the entire dataset before splitting.
The study in [24] compares PCA and a Convolutional Autoencoder (CAE) for feature extraction, followed by SMOTE and a Random Forest classifier, recommending CAE+SMOTE with an F1-score of 90.5%. However, their claim that inverse_transform can reconstruct original transactions from PCA components alone is incorrect, as it requires the original data and the PCA model. Additionally, their evaluation strategy lacks a holdout test set, risks data leakage from improper SMOTE application, and may inflate performance due to repeated evaluation on the same data.
The work in [25] employs a denoising autoencoder (DAE) for data cleaning and SMOTE-based balancing, followed by a deep fully connected neural network (4 hidden layers: 22+15+10+5) with a 2-neuron output layer for classification. Two variants are tested: (1) without SMOTE/autoencoder (AE) and (2) with SMOTE+AE. The baseline model (w/o SMOTE/AE) achieves 97.9% accuracy but near-zero recall, suggesting it fails to detect fraud cases. The SMOTE+AE model improves recall to 90.5% (at the expense of accuracy) but exhibits poor probability calibration (sharp recall drop at high thresholds) and overfitting risks from synthetic data.
The work presented in [26] utilizes SMOTE prior to applying various machine learning techniques, with the best results achieved by a Random Forest classifier followed by an MLP containing four hidden layers (50+30+30+50 neurons). While the Random Forest model attains an F1-score of 0.964, the MLP’s performance is notably inferior (F1=0.792), potentially indicating overfitting or inadequate training. The methodology exhibits significant weaknesses: potential data leakage from performing the train-test split after SMOTE application, unjustified exclusion of 5% of features which might have enhanced the MLP’s performance, and inconsistent results where the performance gap between RF and MLP could indicate either insufficient MLP training or feature mismatch.
The CNN-based approach in [27] appears methodologically flawed due to its naive architecture, which includes an unnecessary flatten layer following Conv1D layers and excessive dropout layers. This overly complex model with redundant components performs worse (\(\approx 93\%\) precision/recall/accuracy) than the other methods the authors evaluate (with RF achieving a 0.99 F1-score and accuracy). The study also lacks clarity regarding the oversampling process and fails to specify how the dataset was reduced to 984 samples, or how many of these were fraud cases.
In [28], without bothering to balance the dataset, the authors use a CNN with two ReLU-based Conv2D layers (kernel sizes \(32 \times 2\) and \(64 \times 2\), respectively), dropout layers, and eventually a flatten layer (\(64 \times 1\)). The use of filters of size \(k \times 2\) may be an attempt to capture relationships between adjacent PCA features, but this should not be particularly useful since the PCA transformation disrupts any natural feature order. The use of CNNs here seems forced at best rather than necessary. The reported results (Accuracy: \(97.15\%\), Precision: \(99.1\%\), Recall: \(90.24\%\)) boast higher performance than DT, logistic regression, kNN, RF and especially XGBoost; the recall is still on the lower side, though.
The study in [29] evaluates three deep learning approaches combined with SMOTE: ANN, CNN, and LSTM RNN. While the results suggest that CNN and ANN outperform LSTM (with reported accuracy/precision/recall >99.9% for ANN/CNN and 97.3% for LSTM), the methodology suffers from several critical flaws. The application of CNN to PCA-transformed data is fundamentally invalid due to the lack of spatial structure in such data. Moreover, the use of 2D kernels on 1D tabular data represents a misapplication of convolutional techniques. The study lacks transparency in evaluation methodology and presents highly questionable results with weak scientific justification for the core methods employed.
The study in [30] evaluates three deep learning architectures (CNN, MLP, and LSTM) independently, though the authors misleadingly refer to this as "federated learning." The models are trained on data balanced using various undersampling and oversampling techniques. The CNN architecture consists of a Conv1D(32, 2) layer with 20% dropout and batch normalization, followed by a Conv1D(64, 2) layer with batch normalization, a flatten layer, 20% dropout, a 64-unit dense layer with 40% dropout, and a final output layer. The MLP features two dense hidden layers (65 units each) with 50% dropout, while the LSTM comprises one LSTM layer (50 units) with 50% dropout followed by a 65-unit dense layer with 50% dropout before the output. As expected, the models achieve high accuracy without balancing, but other metrics suffer. Performance improves with balancing: LSTM and CNN benefit most from random oversampling, while MLP performs better with SMOTE.
The study in [31] implements a deep recurrent neural network with 4 stacked LSTM layers (each containing 50 hidden units), achieving high accuracy (99.6%) and precision (99.6%) but suffering from low recall (80%). The methodology exhibits several flaws: normalization before train-test splitting (risking data leakage), misconceptions about PCA re-application, lack of class balancing, and inclusion of unnecessary implementation details that obscure the core methodology. While the model demonstrates strong precision, its poor recall indicates a significant bias toward the majority class, likely due to the imbalanced dataset and absence of resampling techniques.
The obsession with RNNs can also be observed in [38], in the form of what the authors call an "ensemble" of LSTM/GRU RNNs. The latter are employed for classification, followed by aggregation via a feed-forward neural network (FFNN) acting as the voting mechanism. Their results on the European Cardholders Dataset (ECD) may be good, but the approach is overly complex. The same holds for [39], which employs an AE for preprocessing, deep learning for classification, and a Bayesian algorithm to optimize the hyperparameters. Such overuse of deep learning methods can be found in about 11 different techniques reported in [40] and strangely attributed to the adversarial approach of [41], which may be a misreporting and seems far-fetched. Similarly, a CNN method is wrongly attributed to [42], which in fact includes an unspecified MLP among its methods, and that too performs very poorly on our dataset of interest, with \(61.4\%\) precision, \(38.5\%\) sensitivity and a \(47.3\%\) F1-score, albeit a better specificity of \(93.2\%\).
The study in [32] proposes a time-aware attention mechanism for RNNs, designed to capture users’ current transactional behavior in relation to their historical patterns. While the approach aims to model behavioral periodicity effectively, it forgoes data balancing, resulting in poor precision (50.07%) despite high recall (99.6%). Performance improves with increased memory size (units unspecified), suggesting memory-intensive requirements. The evaluation relies on AUC rather than balanced metrics, which may obscure class imbalance issues.
The work presented in [33] examines six classification algorithms (SVM, LR, RF, XGBoost, DT, and ET), both in their standard form and with AdaBoost enhancement, employing SMOTE to address class imbalance. The approach shows substantial performance gains when using AdaBoost, most notably in recall (with DT recall improving from 75.57% to 99%), and achieves 99.95% accuracy with RF-AdaBoost. However, the study presents several methodological limitations: (1) data leakage risk, since applying SMOTE before the train-test split may contaminate the test set with synthetic samples; and (2) preprocessing ambiguity, with no clear specification of when normalization occurs relative to the data split.
The comparative analysis presented in [34] exhibits fundamental methodological flaws, most notably an overemphasis on the SMOTE + Logistic Regression results while neglecting to investigate the abnormally low precision (<10%) observed in other methods. The study’s credibility is further compromised by the publisher’s expression of concern over the language and readability.
The study in [35] evaluates three sampling techniques (under-sampling, SMOTE, and SMOTE-Tomek) combined with four classifiers (KNN, Logistic Regression, Random Forest, and SVM), though the methodology raises several concerns. First, there’s ambiguity regarding the order of resampling and train/test splitting, which likely occurs in the wrong sequence. More critically, both SMOTE and SMOTE-Tomek produce identical sample counts (227,845 per class), an unusual outcome suggesting either Tomek link misconfiguration or ineffectiveness due to PCA-transformed features. Additionally, the aggressive under-sampling to just 492 samples per class risks underfitting despite appearing to yield good metrics. Finally, the evaluation’s focus on recall over precision may lead to high false positive rates, as evidenced by Logistic Regression’s 0.52 precision despite 0.92 recall when using SMOTE. While Random Forest achieves the highest F1-scores (0.94 with under-sampling, 0.92 with SMOTE, and 0.93 with SMOTE-Tomek), these methodological concerns undermine the reliability of the reported performance improvements.
Based on our comprehensive review of credit card fraud detection methodologies, we have identified several persistent flaws that significantly undermine the reliability and practical applicability of many studies:
Data Leakage in Preprocessing: Numerous studies perform critical preprocessing steps (normalization, SMOTE etc.) before train-test splitting, artificially inflating performance metrics through information leakage.
Intentional Vagueness in Methodology: Many works deliberately omit crucial implementation details, making replication difficult and raising questions about result validity. This includes unspecified parameter settings, ambiguous preprocessing sequences, silence about stratified sampling, and unexplained architectural choices.
Inadequate Temporal Validation: Most approaches fail to account for the time-dependent nature of transaction data, neglecting temporal splitting, which is essential for real-world deployment (a minimal temporal split is sketched after this list).
Unjustified Method Complexity: There’s a tendency to apply unnecessarily sophisticated techniques without first ensuring proper data preparation and validation, often obscuring fundamental methodological flaws.
Overemphasis on Recall: Many works prioritize recall metrics at the expense of precision, leading to models with high false positive rates that would be impractical in production environments.
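To make the temporal-validation point concrete, the sketch below shows a minimal time-based hold-out split for this dataset, assuming df is the DataFrame loaded earlier; the split uses the Time column rather than random shuffling.

```python
# Minimal temporal hold-out sketch: train on the earliest transactions and
# test on the most recent ones (df as loaded earlier).
df_sorted = df.sort_values("Time")
cut = int(len(df_sorted) * 0.8)                 # first 80% of the timeline for training

train, test = df_sorted.iloc[:cut], df_sorted.iloc[cut:]
X_train, y_train = train.drop(columns="Class"), train["Class"]
X_test, y_test = test.drop(columns="Class"), test["Class"]
# Any scaling or resampling is then fitted on (X_train, y_train) only.
```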
These persistent issues - particularly the concerning trend of intentional vagueness - highlight the urgent need for more rigorous evaluation protocols and complete methodological transparency in fraud detection research. Addressing these common pitfalls, especially the lack of full disclosure in implementation details, would significantly improve both the validity and practical utility of future studies in this domain.
The findings of the last section suggest that methodological rigor matters more than algorithmic sophistication. To demonstrate this, we will present a deliberately flawed MLP implementation with SMOTE applied before train-test splitting - a clear violation of proper evaluation protocol. Despite this fundamental flaw, we anticipate this method will outperform many existing approaches, underscoring how data leakage can overshadow algorithmic advantages. This serves as a cautionary example that sophisticated techniques cannot compensate for basic methodological failures in fraud detection research.
The flawed methodology used in this investigation to differentiate deceptive transactions from genuine ones focuses on the most prevalent vice in the literature: balancing the data before splitting it into train/val/test partitions. It consists of two modules central to the detection of fraudulent card transactions, viz. a SMOTE module and a multilayer perceptron (MLP) network.
To address the unbalanced data, we employ oversampling using SMOTE, which is particularly effective in mitigating class imbalance issues in classification tasks. SMOTE generates synthetic samples of the minority class to balance the class distribution [3].
The synthetic sample is generated using the following rule: \[x_{new} = x_{i} + \delta \times (x_{k} - x_{i}),\] where \(x_{i}\) is a minority-class sample, \(x_{k}\) is one of its \(k\) nearest minority-class neighbors, and \(\delta\) is a random number drawn uniformly from \([0, 1]\).
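As a small worked example of this rule (values chosen purely for illustration):

```python
# A synthetic point is placed on the line segment between a minority sample
# and one of its minority-class nearest neighbours.
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([0.2, 1.5])        # minority-class sample
x_k = np.array([0.6, 1.1])        # one of its k nearest minority neighbours
delta = rng.uniform(0.0, 1.0)     # random scalar in [0, 1]

x_new = x_i + delta * (x_k - x_i)
print(delta, x_new)               # x_new lies between x_i and x_k
```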
As far as the Python implementation is concerned, the SMOTE-related module is imported from the imblearn (imbalanced-learn) library. The imblearn package includes techniques for handling unbalanced datasets and is built on scikit-learn, an open-source Python library that provides simple and efficient tools for data mining and data analysis. This approach helps in creating a more balanced dataset, thereby improving the performance of classification models.
An Artificial Neural Network (ANN) is a computational model inspired by the structure and function of biological neural networks, designed to process inputs and learn patterns akin to human cognition. A typical MLP, a type of ANN (as illustrated in Fig. 3), consists of interconnected layers of nodes (neurons), where each connection is associated with a weight that is adjusted during training. Due to their ability to model complex, non-linear relationships, ANNs are widely employed in tasks such as pattern recognition, regression, and classification.
Figure 3: A generic MLP architecture
An MLP typically comprises three key components:
Input Layer: Receives raw data and distributes it to the subsequent layer. Each neuron in this layer corresponds to a feature or attribute of the input data.
Hidden Layers: Perform computations and feature extraction, enabling the network to capture intricate patterns in the data.
Output Layer: Produces the final prediction or classification result. The number of neurons in this layer is determined by the specific task (e.g., one neuron for binary classification, multiple neurons for multi-class classification).
The fundamental building blocks of an ANN are neurons, which process inputs by applying an activation function to the weighted sum of their inputs and producing an output. This output is then propagated to subsequent neurons, facilitating the network’s ability to learn and adapt [43].
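Formally, each neuron computes \[y = \varphi\!\left(\sum_{j=1}^{d} w_j x_j + b\right),\] where \(x_j\) are its inputs, \(w_j\) the learned weights, \(b\) the bias, and \(\varphi\) an activation function such as ReLU or the sigmoid.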
Figure 4: The Flawed MLP Model
To emphasize our argument, we use a very simple MLP architecture; the core point is that, with a casually leaky approach, there is no need for sophistication. Our MLP module consists of only one hidden layer with an arbitrary number of neurons (\(N\)), utilizing the rectified linear unit (ReLU) activation function. The input layer has \(16\) (arbitrary) neurons, while the output layer has a single neuron with a sigmoid activation function, suitable for binary classification. Assuming that the data is subjected to SMOTE and then split into training and testing parts, the model takes the form of the code given in Figure 4.
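Since the listing in Figure 4 is not reproduced here, the following sketch approximates the described setup, assuming a TensorFlow/Keras implementation and the df DataFrame loaded earlier; it is deliberately leaky, applying SMOTE to the full dataset before the split, and is not necessarily the exact Figure 4 code.

```python
# Deliberately flawed pipeline: SMOTE is applied to the *entire* dataset
# before the train/test split, then a tiny MLP is trained on the result.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.metrics import Precision, Recall

X = df.drop(columns="Class")                      # df as loaded earlier
y = df["Class"]
N = 16                                            # hidden-layer width; N = 0 removes the layer

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)     # <-- leakage happens here
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

layers = [Input(shape=(X.shape[1],)), Dense(16, activation="relu")]
if N > 0:
    layers.append(Dense(N, activation="relu"))
layers.append(Dense(1, activation="sigmoid"))

model = Sequential(layers)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", Precision(), Recall()])
model.fit(np.asarray(X_train), np.asarray(y_train),
          epochs=10, batch_size=256, validation_split=0.1)
model.evaluate(np.asarray(X_test), np.asarray(y_test))       # inflated by synthetic leakage
```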
To further demonstrate our argument about the disproportionate impact of methodological flaws versus algorithmic sophistication, we conducted an extreme simplification test: eliminating the hidden layer entirely (\(N=0\)) while maintaining the data leakage from SMOTE application before train-test splitting. Figure 5 presents the Precision-Recall Curve (PRC) and Receiver Operating Characteristic Curve (ROC) for this minimal configuration.
Figure 5: Test results after applying the MLP with no hidden layer (\(N=0\)).
Despite this extreme simplification, Table 1 shows we achieved remarkably high performance metrics: 94% recall and 97.6% precision. This demonstrates that:
A single output neuron with data leakage can outperform many sophisticated models
The data leakage from improper SMOTE application provides more benefit than architectural complexity
Such results are artificially inflated and would not generalize to real-world scenarios
Table 1: Test results of the flawed MLP for varying hidden-layer size \(N\).

No. | N | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
1 | 0 | 0.958 | 0.976 | 0.939 | 0.958 |
2 | 1 | 0.959 | 0.985 | 0.932 | 0.957 |
3 | 2 | 0.967 | 0.976 | 0.958 | 0.967 |
4 | 4 | 0.982 | 0.980 | 0.983 | 0.982 |
5 | 6 | 0.982 | 0.985 | 0.979 | 0.982 |
6 | 8 | 0.986 | 0.988 | 0.985 | 0.986 |
7 | 10 | 0.992 | 0.989 | 0.994 | 0.992 |
8 | 12 | 0.992 | 0.991 | 0.992 | 0.992 |
9 | 16 | 0.996 | 0.992 | 0.999 | 0.996 |
We then systematically increased the complexity by introducing a hidden layer and varying the number of neurons (\(N=1\) to \(N=16\)). As shown in Table 1, performance metrics improved incrementally with more neurons, reaching near-perfect scores (99.9% recall) at \(N=16\). Figure 6 illustrates the PRC and ROC curves for selected configurations.
Figure 6: Pairwise PRC (left) and ROC (right) test results of the MLP method with \(N\) neurons in a single hidden layer.
Following are the important points to note:
These results demonstrate how data leakage can overshadow architectural improvements.
Performance gains from adding neurons are marginal compared to the initial boost from data leakage.
The near-perfect metrics at \(N=16\) (even beyond \(N=8\)) are statistically implausible for real-world fraud detection.
This experiment underscores that proper evaluation methodology matters more than model complexity.
This case study vividly illustrates how fundamental methodological flaws can produce deceptively impressive results that would fail in real-world deployment. The key takeaway is that evaluation rigor must precede architectural sophistication in fraud detection research.
Our analysis reveals that methodological rigor is far more critical than algorithmic sophistication in fraud detection research. Through deliberate experimentation with flawed evaluation protocols, we demonstrated how even simple models can achieve deceptively impressive results when fundamental methodological principles are violated. The extreme case of a single-neuron model outperforming sophisticated architectures - solely due to data leakage from improper SMOTE application - serves as a stark illustration of how evaluation flaws can overshadow algorithmic advantages.
Beyond these technical considerations, our findings must be contextualized within the broader academic ecosystem. The peer review system faces significant challenges including reviewer fatigue, time constraints, and overwhelming submission volumes. These issues are exacerbated by publication pressures, financial incentives tied to rapid dissemination, and citation practices that may inadvertently inflate impact metrics. The current academic reward structure, which heavily weights publication quantity and citation counts for career advancement, can sometimes incentivize quantity over quality. While understandable given institutional pressures, this emphasis may occasionally compromise research rigor and originality.
These systemic factors contribute to a growing disconnect between academic research and industrial practice. As researchers, we often find ourselves playing catch-up to industry innovation, with many academic publications essentially formalizing concepts that have already gained practical traction. The review process itself may sometimes prioritize presentation quality over substantive contribution, allowing incremental or ambiguous work to pass through.
It is crucial to emphasize that this critique is general in nature and not directed at any specific work cited in this study. All references were selected based on their relevance and alignment with our discussion, and should not be construed as examples of the concerns raised. Rather, our analysis aims to highlight systemic issues that affect the field as a whole, with the goal of fostering more rigorous and impactful research practices.
Moving forward, we advocate for:
Stricter evaluation protocols that prioritize methodological soundness over novel architectures.
Enhanced transparency in reporting preprocessing pipelines and evaluation methodologies.
Better alignment between academic research and industrial needs.
Reform of incentive structures to reward quality and impact over publication quantity.
By addressing these challenges, we can bridge the gap between academic research and practical applications, ultimately advancing the field of fraud detection in more meaningful ways.