October 22, 2025
Artificial Intelligence (AI) has rapidly expanded across various domains, including the medical field [1]. Recent examples include the use of machine learning algorithms for the diagnosis of COVID-19 [2] and the detection of skin lesions [3], highlighting the potential of these technologies in automating decision-making processes. However, many of these models, particularly those based on deep learning, operate as black boxes, providing high accuracy but low transparency regarding the contribution of input variables to the final outcome [4]. This lack of interpretability is especially critical in high-risk contexts such as clinical diagnosis, with bladder cancer (BCa) being a prominent example. Classified as the tenth most common type of neoplasm and the thirteenth leading cause of cancer-related death worldwide [5], with more than 90% of cases belonging to the urothelial carcinoma subtype [6], BCa presents both high incidence and heterogeneous clinical behavior. In this scenario, improving early diagnostic methods is essential, making the application of transparent and interpretable AI models crucial to reducing the morbidity and mortality associated with this neoplasm.
Smoking is identified as the main risk factor, being associated with approximately half of the cases and 37% of disease-related deaths [5]. Occupational exposure to carcinogenic substances represents the second most relevant factor, accounting for about 10% of cases [5], which contributes to an incidence up to four times higher in men compared to women. Lifestyle and environment-related factors also play a role, as the urothelial epithelium is continuously exposed to potentially mutagenic substances present in the urine [7]. The most frequently reported symptom is hematuria (blood in the urine), which may be macroscopic (visible) or microscopic (detected through laboratory tests), typically intermittent and painless [8], [9].
Beyond its clinical impact, bladder cancer imposes a significant financial burden on healthcare systems due to its high recurrence rate and the costs of diagnostic procedures, such as periodic cystoscopies, intravesical treatments, and more complex surgeries [6], [7]. In developed regions, where population aging is increasing, the global burden of BCa is projected to rise in the coming decades, underscoring the need for improved strategies for prevention, diagnosis, and treatment [5].
In this context, the development of more accurate and interpretable models and analytical methods may play a relevant role in early detection and therapeutic management. Approaches that enhance the understanding of the determinants of bladder cancer progression have the potential to optimize clinical decision-making and reduce the associated morbidity and mortality.
Clinical and epidemiological databases often present high dimensionality, with many variables and, at the same time, a relatively small number of samples [10]. Under these conditions, conventional machine learning algorithms may produce models with limited generalization power. Dimensionality reduction and feature selection techniques, such as Principal Component Analysis (PCA) [11] and SHAP (SHapley Additive exPlanations) [12], help mitigate this issue but do not, by themselves, fully address the challenge of interpretability [4].
This study proposes the application of SHAP for dimensionality reduction. The aim is to identify potential relationships between variables with high predictive importance and the outcome of interest. The analysis uses a dataset of 1,336 clinical and laboratory samples from patients with different urinary tract diseases, collected by the authors of [13] between 2017 and 2022.
Although SHAP was originally developed for feature interpretation, it can also be applied to dimensionality reduction. Kumar et al. (2020) [14] employed this technique on two datasets, demonstrating that, in addition to assigning importance to variables, it enables the simplification of complex models without compromising the reliability of predictions.
Liu et al. (2022) [15] employed machine learning models combined with SHAP for feature selection in the diagnosis of Parkinson’s disease, integrating algorithms such as Deep Forest (gcForest), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF). The results indicated superior performance compared to conventional techniques, achieving accuracy above 91% and an F1-score of 0.945.
The structure of this article is organized as follows: Section 2 presents in detail the data used in the study, the machine learning models employed, and the experimental methodology adopted; Section 3 describes the experimental setup, the results obtained, and the associated discussion. Subsequently, Section 4 addresses the critical analysis of the results and relevant interpretations. Finally, Section 5 provides the study’s conclusions and highlights possible directions for future work.
The dataset used consists of clinical and laboratory records, including urinalysis and biochemical tests, collected at Mackay Memorial Hospital between January 2017 and February 2020 [13]. It comprises 1,336 samples distributed across five classes: 591 cases of bladder cancer, 201 cases of prostate cancer, 200 cases of kidney cancer, 200 cases of uterine cancer, and 144 cases of cystitis, as shown in Table 1. The dataset contains a total of 39 variables, a considerable number that reinforces the need for dimensionality reduction.
Table 1: Distribution of samples and percentage of missing data per disease.

| Disease | Number of samples | Missing data (%) |
|---|---|---|
| Bladder cancer | 591 | 42.27 |
| Prostate cancer | 201 | 15.04 |
| Kidney cancer | 200 | 14.97 |
| Uterine cancer | 200 | 14.97 |
| Cystitis | 144 | 10.78 |
The treatment of missing values is a critical step in data preprocessing, as failures in extraction or collection may generate gaps that compromise model performance [16]. Proper imputation helps reduce biases and preserve the consistency of subsequent analyses. In this study, the KNNImputer, based on the K-Nearest Neighbors (KNN) algorithm, was employed [17]. The method estimates missing values from the nearest samples in the feature space, assuming that similar instances exhibit correlated patterns [16]. The procedure follows three main steps: (i) for each instance with missing values, a set of \(k\) nearest neighbors is identified according to a distance metric, such as Euclidean distance; (ii) the known values of these neighbors are used to estimate the missing value; (iii) the imputation is carried out using the mean, median, or mode, depending on the variable type and the adopted configuration.
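For illustration, a minimal sketch of this imputation step with scikit-learn's KNNImputer follows; the data and the value of \(k\) are placeholders (in this study, \(k\) was tuned with Optuna), and note that scikit-learn's implementation imputes with the (optionally distance-weighted) mean of the neighbors' known values.

```python
# Minimal sketch of KNN-based imputation (illustrative data and k).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Neighbors are found with a NaN-aware Euclidean distance; each missing
# value is replaced by the mean (uniform or distance-weighted) of the
# corresponding values of its k nearest neighbors.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
```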
Imbalanced datasets, in which some classes are underrepresented, can hinder the generalization of machine learning algorithms for those classes [18]. To address this issue, the Synthetic Minority Over-sampling Technique (SMOTE) [19] was applied, which generates new synthetic samples instead of simply replicating minority instances. The method selects a minority class instance, identifies its \(k\) nearest neighbors, randomly chooses one of them, and creates a new sample at a random point along the line segment connecting the original instance to the neighbor; this process is repeated until the desired class proportion is achieved [20].
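A minimal sketch of this balancing step with imbalanced-learn is shown below, on a synthetic dataset; all names and parameter values are illustrative.

```python
# Minimal sketch of SMOTE oversampling on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
print(Counter(y))  # imbalanced, roughly 4:1

# k_neighbors sets how many minority-class neighbors are candidates for
# the interpolation that creates each synthetic sample.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # classes now balanced
```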
In high-dimensional datasets, the presence of irrelevant or redundant variables can degrade the performance of machine learning algorithms, increase computational cost, and reduce model interpretability [15], [21]. To mitigate these effects, feature selection methods such as SHapley Additive exPlanations (SHAP) [12] allow the identification and prioritization of variables with greater predictive impact. SHAP decomposes each prediction into the sum of the individual contributions of the variables, providing explainability and supporting feature selection. Although the exact calculation based on Shapley values is infeasible in high-dimensional problems, SHAP employs computational approximations, such as weighted linear regression for generic models and specific optimizations for tree-based models [22]. The SHAP value of a variable \(i\) indicates its average marginal contribution to the prediction, considering all possible orders of inclusion of the variables in the model.
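A minimal sketch of this computation with the shap library and a tree-based model is shown below; the dataset is synthetic and the model configuration illustrative. For binary XGBoost models, `shap_values` returns one contribution per sample and feature.

```python
# Minimal sketch of SHAP value computation with TreeSHAP (synthetic data).
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)   # tree-specific, efficient approximation
shap_values = explainer.shap_values(X)  # shape (n_samples, n_features) for binary XGBoost

# The mean absolute SHAP value per feature yields a global importance ranking.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]
```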
eXtreme Gradient Boosting (XGBoost) [23] is a tree-based Gradient Boosting algorithm, recognized for its high performance, scalability, and efficient regularization for overfitting control. The boosting technique consists of iteratively combining weak models (typically shallow decision trees) to form a robust model.
Categorical Boosting (CatBoost) [24] is a Gradient Boosting algorithm based on decision trees that directly handles categorical variables, eliminating the need for manual encoding. It employs Ordered Target-Based Encoding to prevent target leakage and incorporates regularization to reduce overfitting, making it particularly efficient in problems with many categorical variables, such as recommendation systems and demand forecasting.
The Light Gradient Boosting Machine (LightGBM) [25] is a Gradient Boosting algorithm optimized for large-scale data and high dimensionality. It employs Gradient-Based One-Side Sampling (GOSS) to prioritize instances with larger gradients and reduce sample size without information loss. In addition, it adopts a leaf-wise growth strategy, which expands the leaf that yields the greatest error reduction, requiring regularization to prevent overfitting. These optimizations make training faster and more efficient, making LightGBM particularly suitable for large-scale applications such as fraud detection and recommendation systems.
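For reference, the three algorithms can be instantiated as follows; the hyperparameter values shown are illustrative defaults, not the tuned configurations of Table 3.

```python
# Illustrative instantiation of the three Gradient Boosting variants.
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    # Depth-wise trees with L1/L2 regularization (reg_alpha / reg_lambda).
    "XGBoost": XGBClassifier(max_depth=6, n_estimators=100, learning_rate=0.1),
    # Leaf-wise growth: num_leaves, not depth alone, bounds model complexity.
    "LightGBM": LGBMClassifier(num_leaves=31, n_estimators=100, learning_rate=0.1),
    # Handles categorical features natively via ordered target statistics.
    "CatBoost": CatBoostClassifier(depth=6, iterations=100, learning_rate=0.1, verbose=0),
}
```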
The experimental methodology adopted was inspired by the study of Tsai et al. (2022) [13], who used clinical and laboratory data for bladder cancer prediction. Similarly, this work employs a binary classification strategy, defining multiple experimental scenarios, each designed to discriminate between a specific pair of pathological conditions. The scenarios analyzed were: (1) Bladder Cancer vs. Prostate Cancer, (2) Bladder Cancer vs. Cystitis, (3) Bladder Cancer vs. Kidney Cancer, (4) Bladder Cancer vs. Uterine Cancer, (5) Bladder Cancer vs. Others, and (6) Prostate Cancer vs. Others. In experiments 5 and 6, the goal was to identify cases of bladder or prostate cancer against all other conditions in the dataset. The class distribution of each scenario is presented in Table 2.
Table 2: Class distribution in each experimental scenario.

| Experiment | Groups | Number of samples |
|---|---|---|
| 1 | Bladder Cancer vs. Prostate Cancer | 591 vs. 201 |
| 2 | Bladder Cancer vs. Cystitis | 591 vs. 144 |
| 3 | Bladder Cancer vs. Kidney Cancer | 591 vs. 200 |
| 4 | Bladder Cancer vs. Uterine Cancer | 591 vs. 200 |
| 5 | Bladder Cancer vs. Others | 591 vs. 745 |
| 6 | Prostate Cancer vs. Others | 201 vs. 1135 |
A pipeline was developed to integrate all preprocessing steps and the classification model, serving as the central object of optimization and training. Figure 1 illustrates its structure. Data preprocessing involved imputing missing values for numerical variables using the KNNImputer, with the number of neighbors \(k\) defined through optimization with Optuna, followed by standardization with StandardScaler. For categorical variables, imputation was performed using the mode with SimpleImputer, followed by One-Hot Encoding, resulting in a total of 56 variables for the complete model. Class balancing was applied exclusively to the training data during cross-validation, using the SMOTE technique. The classification step employed machine learning algorithms, including XGBoost, LightGBM, and CatBoost.
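A hedged sketch of such a pipeline is given below. Column lists and hyperparameter values are placeholders; the imbalanced-learn Pipeline is used so that SMOTE resamples only the data seen during fit, i.e., the training folds in cross-validation.

```python
# Sketch of the preprocessing + classification pipeline (placeholder names).
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

numeric_cols = [...]      # placeholder: names of the numeric laboratory variables
categorical_cols = [...]  # placeholder: names of the categorical variables

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),  # k is tuned with Optuna
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

pipeline = ImbPipeline([
    ("preprocess", preprocess),
    ("smote", SMOTE(random_state=42)),  # applied to training data only
    ("clf", LGBMClassifier()),          # XGBoost / CatBoost are swapped in analogously
])
```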
To evaluate the performance of the classification algorithms, five commonly used metrics in the literature were employed. Accuracy, Balanced Accuracy, Recall, Precision, and F-score are calculated as follows:
\[ACC = \frac{TP + TN}{TP + FP + TN + FN} \label{ACC}\tag{1}\]
\[BACC = \frac{\frac{TP}{TP + FN} + \frac{TN}{TN + FP}}{2} \label{BACC}\tag{2}\]
\[Recall = \frac{TP}{TP + FN} \label{RE}\tag{3}\]
\[Precision = \frac{TP}{TP + FP} \label{PR}\tag{4}\]
\[\textit{F-score} = 2 \cdot \frac{Recall \cdot Precision}{Recall + Precision} \label{FS}\tag{5}\]
The variables TP, TN, FP, and FN correspond to True Positive, True Negative, False Positive, and False Negative, respectively. In the context of medical classifications, it is particularly important that the classifier correctly identifies cancer cases (TP), since the patient’s condition may be critical and medical intervention must be immediate.
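These definitions map directly onto code; the sketch below recomputes Eqs. (1)–(5) from a confusion matrix and cross-checks two of them against scikit-learn's helpers (the labels are illustrative).

```python
# Metrics of Eqs. (1)-(5) computed from a binary confusion matrix.
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + fp + tn + fn)                   # Eq. (1)
bacc = (tp / (tp + fn) + tn / (tn + fp)) / 2            # Eq. (2)
recall = tp / (tp + fn)                                 # Eq. (3)
precision = tp / (tp + fp)                              # Eq. (4)
fscore = 2 * recall * precision / (recall + precision)  # Eq. (5)

assert abs(bacc - balanced_accuracy_score(y_true, y_pred)) < 1e-12
assert abs(fscore - f1_score(y_true, y_pred)) < 1e-12
```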
The hyperparameter search was conducted using Optuna, with the goal of maximizing the mean Balanced Accuracy (BACC) obtained through stratified 5-fold cross-validation. In this procedure, the dataset of each scenario was divided into five subsets (folds) of approximately equal size, preserving the original class distribution within each fold. In each iteration, four folds were used to train the model and the remaining fold to validate the hyperparameters, repeating the process until each fold had served once as the validation set. The mean BACC across the five folds provided an estimate of model performance for the tested hyperparameter configuration. A total of 100 optimization trials per scenario and per model were performed to identify the parameter combination that maximized BACC. The hyperparameter search space is listed in Table 3, and a minimal sketch of the optimization loop is shown after the table.
Table 3: Hyperparameter search space.

| Algorithm | Hyperparameter | Search space |
|---|---|---|
| XGBoost | max depth | [3 \(\sim\) 15] |
| | number of estimators | [50 \(\sim\) 250] |
| | learning rate | [0.01 \(\sim\) 0.2] |
| | subsample | [0.5 \(\sim\) 1.0] |
| | colsample by tree | [0.6 \(\sim\) 1.0] |
| | min child weight | [1.0 \(\sim\) 10] |
| | gamma | [0.0 \(\sim\) 1.0] |
| | reg alpha | [0.01 \(\sim\) 2.0] |
| | reg lambda | [0.01 \(\sim\) 5.0] |
| | scale pos weight | [0.5 \(\sim\) 5.0] |
| LightGBM | max depth | [3 \(\sim\) 15] |
| | number of leaves | [20 \(\sim\) 50] |
| | number of estimators | [50 \(\sim\) 200] |
| | learning rate | [0.01 \(\sim\) 0.2] |
| | subsample | [0.5 \(\sim\) 1.0] |
| | colsample by tree | [0.6 \(\sim\) 1.0] |
| | min child samples | [5 \(\sim\) 100] |
| | reg lambda | [0.01 \(\sim\) 5.0] |
| | lambda l1 | [0.0 \(\sim\) 2.0] |
| | lambda l2 | [0.0 \(\sim\) 5.0] |
| | min split gain | [0.0 \(\sim\) 1.0] |
| CatBoost | depth | [3 \(\sim\) 15] |
| | learning rate | [0.01 \(\sim\) 0.2] |
| | iterations | [50 \(\sim\) 200] |
| | l2 leaf reg | [1 \(\sim\) 10] |
| | bagging temperature | [0.1 \(\sim\) 5.0] |
| | border count | [32 \(\sim\) 255] |
| | random strength | [0.5 \(\sim\) 2.0] |
| KNNImputer | number of neighbors | [3 \(\sim\) 30] |
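Below is the minimal sketch of the optimization loop referenced above, assuming the `pipeline` object from the previous section and training data `X_train`, `y_train` already in memory; only a few of the Table 3 hyperparameters are shown.

```python
# Sketch of the Optuna search: maximize mean BACC over stratified 5-fold CV.
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    # A subset of the Table 3 search space, addressed via pipeline parameter paths.
    pipeline.set_params(**{
        "clf__max_depth": trial.suggest_int("max_depth", 3, 15),
        "clf__n_estimators": trial.suggest_int("n_estimators", 50, 250),
        "clf__learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
        "preprocess__num__impute__n_neighbors": trial.suggest_int("k_impute", 3, 30),
    })
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train,
                             scoring="balanced_accuracy", cv=cv)
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # 100 trials per scenario and model
```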
Six binary classification experiments were conducted under two approaches: using the complete dataset and using the dataset after dimensionality reduction. The experiments included comparisons between bladder cancer and different urological and oncological conditions, as well as broader scenarios involving multiple pathologies, including comparisons with the baseline [13]. For each binary scenario, a standardized computational pipeline was employed, implemented in Python using the Scikit-learn [26], Imbalanced-learn [27], and Optuna [28] libraries.
The preprocessing stage included the treatment of missing data, where a threshold of 45% was defined for the removal of variables with high rates of missingness. Variables exceeding this threshold were excluded from the analysis, as in the case of the attribute Calcium in the BC vs. UC experiment, which presented 46.4% missing values. After this cleaning step, the variable with the highest percentage of remaining missing data in the dataset had 44.2%. Following data cleaning, the dataset was partitioned into training (80%) and testing (20%). To ensure class representativeness in both partitions, stratified sampling was applied.
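A sketch of this cleaning and partitioning step is shown below, assuming a pandas DataFrame `df` with a `target` column (illustrative names).

```python
# Drop variables above the 45% missingness threshold, then split 80/20
# with stratification (illustrative names; df is a pandas DataFrame).
from sklearn.model_selection import train_test_split

missing_frac = df.drop(columns=["target"]).isna().mean()
keep = missing_frac[missing_frac <= 0.45].index  # e.g. Calcium (46.4%) is removed
X, y = df[keep], df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)  # preserves class proportions
```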
For both approaches, three classification pipelines were constructed, corresponding to the evaluated models (XGBoost, LightGBM, and CatBoost). After hyperparameter optimization with Optuna and evaluation on the test set, the model with the highest Balanced Accuracy (BACC) was selected for each experiment. In the reduced-data approach, an additional step was performed: keeping the same train-test split, SHAP was applied to compute feature importance and assess the impact of dimensionality reduction. The number \(N\) of variables with the highest SHAP values was defined as the one that yielded the best BACC, determined through a sensitivity test ranging from 2 up to the total number of available variables in each experiment. The results obtained, with and without feature reduction, are presented in Table 4, while the 20 variables with the highest SHAP values are illustrated in Figure 2.
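A hedged sketch of this sensitivity test is given below, assuming an already-preprocessed training/test split (`Xtr`, `Xte`, `ytr`, `yte`) and a fitted tree-based classifier `clf` (e.g., a tuned XGBoost model); all names are illustrative.

```python
# Rank features by mean |SHAP| and scan N = 2 .. total for the best test BACC.
import numpy as np
import shap
from sklearn.base import clone
from sklearn.metrics import balanced_accuracy_score

# For binary XGBoost models shap_values is an (n_samples, n_features) array;
# other libraries may return per-class lists instead.
shap_values = shap.TreeExplainer(clf).shap_values(Xtr)
ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]

bacc_by_n = {}
for n in range(2, Xtr.shape[1] + 1):
    idx = ranking[:n]  # top-N features by mean |SHAP|
    model_n = clone(clf).fit(Xtr[:, idx], ytr)
    bacc_by_n[n] = balanced_accuracy_score(yte, model_n.predict(Xte[:, idx]))

best_n = max(bacc_by_n, key=bacc_by_n.get)  # N maximizing test BACC
```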
The results indicate that dimensionality reduction had a positive impact in most experiments. In the BC vs. PC experiment, the variables with the highest SHAP values were Urine epithelium and Gender (Figure 2 (a)). Since prostate cancer occurs exclusively in male individuals, the model’s attribution of a high SHAP value to this variable is consistent. The best BACC was 95.81% for the complete model compared to 95.40% for the model with \(N=26\). When compared with the baseline, both the reduced and complete models achieved better accuracy (ACC), with the reduced model yielding the best performance at 95.60% against 84.80% from the baseline.
In the BC vs. Cystitis experiment, two features stood out: Urine epithelium and A/G Ratio (Figure 2 (b)). A BACC of 97.03% was obtained with \(N=18\), compared to 93.59% for the complete model. When compared with the baseline, the reduced model achieved an ACC of 95.24% against 87.60%. It is also noteworthy that both precision and specificity reached 100%. In the BC vs. KC experiment, the feature Urine Occult Blood showed a high SHAP value (Figure 2 (c)), being a relevant marker in the diagnosis of bladder cancer [8], [9]. A BACC of 72.02% was obtained with \(N=21\), compared to 68.66% for the complete model. Here, however, the baseline remained superior in ACC (84.50% versus 77.00%) and Specificity (82.90% versus 60.00%).
In the BC vs. UC experiment, Urine epithelium once again showed a high SHAP value (Figure 2 (d)), reflecting the influence of gender in diseases exclusive to females. The two main features were the same as in the BC vs. PC experiment. The best BACC of 96.25% was obtained with \(N=5\), compared to 95.00% for the complete model. When compared with the baseline, the reduced model achieved an ACC of 98.11% versus 86.90%.
In the BC vs. All experiment, the best BACC was 83.46% for the complete model compared to 83.37% for the model with \(N=43\), once again highlighting the feature Urine epithelium (Figure 2 (e)). In the PC vs. All experiment, the best BACC was obtained with \(N=38\), reaching 94.58% compared to 92.46% for the complete model, with Gender being the most relevant variable according to SHAP values (Figure 2 (f)). For these final two experiments, no baseline values were available, and the complete model was used as the reference.






Figure 2: SHAP values for the top 20 features of the dataset across all scenarios..
Table 4: Classification results with and without SHAP-based feature reduction, compared with the baseline [13].

| Exp. | Model | Alg. | N | ACC (%) | BACC (%) | Prec. (%) | Sens. (%) | Spec. (%) | F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| BC vs. PC | Reduced | CatBoost | 26 | 95.60 | 95.40 | 98.28 | 95.80 | 95.00 | 97.02 |
| | Entire | CatBoost | 57 | 94.97 | 95.81 | 99.12 | 94.12 | 97.50 | 96.55 |
| | Baseline [13] | LightGBM | - | 84.80 | - | 86.60 | 84.40 | 85.10 | 85.10 |
| BC vs. Cystitis | Reduced | LightGBM | 18 | 95.24 | 97.03 | 100.00 | 94.07 | 100.00 | 96.94 |
| | Entire | LightGBM | 57 | 93.88 | 93.59 | 98.23 | 94.07 | 93.10 | 96.10 |
| | Baseline [13] | LightGBM | - | 87.60 | - | 86.30 | 89.50 | 85.50 | 87.70 |
| BC vs. KC | Reduced | CatBoost | 21 | 77.00 | 72.02 | 86.21 | 84.03 | 60.00 | 85.11 |
| | Entire | CatBoost | 57 | 72.96 | 68.66 | 85.19 | 77.31 | 60.00 | 81.06 |
| | Baseline [13] | LightGBM | - | 84.50 | - | 83.00 | 86.80 | 82.90 | 84.50 |
| BC vs. UC | Reduced | LightGBM | 5 | 98.11 | 96.25 | 97.54 | 100.00 | 92.50 | 98.76 |
| | Entire | LightGBM | 57 | 97.48 | 95.00 | 96.75 | 100.00 | 90.00 | 98.35 |
| | Baseline [13] | LightGBM | - | 86.90 | - | 87.10 | 87.80 | 86.70 | 87.30 |
| BC vs. All | Reduced | LightGBM | 43 | 83.58 | 83.37 | 81.51 | 81.51 | 85.23 | 81.51 |
| | Entire | LightGBM | 57 | 83.58 | 83.46 | 80.99 | 82.35 | 84.56 | 81.67 |
| | Baseline [13] | - | - | - | - | - | - | - | - |
| PC vs. All | Reduced | XGBoost | 38 | 92.54 | 94.58 | 67.24 | 97.50 | 91.67 | 79.59 |
| | Entire | XGBoost | 57 | 90.67 | 92.46 | 62.30 | 95.00 | 89.91 | 75.25 |
| | Baseline [13] | - | - | - | - | - | - | - | - |
In several experiments, the obtained metrics surpassed those reported by Tsai et al. (2022) [13], particularly Balanced Accuracy (BACC), precision, and specificity. In most scenarios, dimensionality reduction either maintained or improved the metrics obtained with the complete dataset.
In some cases, the removal of redundant variables or those with low SHAP importance resulted in improved performance: in the BC vs. Cystitis experiment, the use of only 18 features increased BACC from 93.59% to 97.03%, while in the BC vs. UC experiment the best performance was achieved with only 5 variables, maintaining 100% sensitivity. This effect can be explained by the mitigation of overfitting risk, since simpler models tend to generalize better on datasets with a limited number of samples. However, in more complex scenarios or those with strong feature overlap between classes, such as the BC vs. All experiment, dimensionality reduction did not provide significant gains.
It is noteworthy that certain variables consistently proved important across the experiments. The feature Urine Epithelium (UL), for example, ranked among the two features with the highest SHAP values in all experiments except the third. The presence of epithelial cells in urinary cytology, although not a consolidated biomarker for bladder cancer, may be associated with pathological alterations in the urinary tract [29]. In all scenarios involving gender-specific diseases, the feature Gender was correctly assigned a high SHAP value. In the scenario involving bladder cancer and kidney cancer, the feature Urine Occult Blood obtained the highest SHAP value, consistent with hematuria (blood in the urine), the most frequent symptom of bladder cancer [7]. In this dataset, the manifestation was microscopic hematuria.
The application of explainable methods such as SHAP made it possible to identify factors with a potential direct relationship to diagnosis, including laboratory and demographic variables that maintained high predictive weight. The presence of epidemiologically plausible relationships reinforces the usefulness of this approach in assisting specialists with the prioritization of examinations and the formulation of diagnostic hypotheses. It is worth noting that the dataset presents class imbalance, which, despite the application of SMOTE, may still introduce biases. Moreover, although SHAP is effective in assigning importance, it does not explicitly capture dependency relationships between variables, making it necessary to combine it with methods designed to model such interactions.
This study proposed a methodology for feature selection to support the diagnosis of bladder cancer, integrating explainability through SHAP (SHapley Additive exPlanations) within a pipeline that combined preprocessing, data imputation, class balancing with SMOTE, and hyperparameter optimization via Optuna, using three machine learning algorithms (XGBoost, LightGBM, and CatBoost). Six binary classification scenarios were conducted, comparing bladder cancer with different urological and oncological conditions as well as broader pathological contexts. Feature selection based on SHAP values effectively reduced dimensionality without significant loss of performance and, in some cases, improved metrics such as balanced accuracy, precision, and specificity. Despite limitations related to limited data availability in certain scenarios and heterogeneity in clinical attribute completion, the results demonstrate that combining explainable methods with robust pipelines is a promising strategy for clinical decision support systems, enhancing both model transparency and interpretability.
R.A. Krohling thanks the Brazilian research agency Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil - grant no. 302021/2025-6.
Funding The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author contribution All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.