Enhancing Diversity in Multi-objective Feature Selection
6

Sevil Zanjani Miyandoab\(^{*1}\), Shahryar Rahnamayan\(^{*2}\), SMIEEE, Azam Asilian Bidgoli\(^{*3}\),
Sevda Ebrahimi\(^{*1}\), Masoud Makrehchi\(^{1}\)
1 2 3 4 5


Abstract

Feature selection plays a pivotal role in the data preprocessing and model-building pipeline, significantly enhancing model performance, interpretability, and resource efficiency across diverse domains. In population-based optimization methods, the generation of diverse individuals holds utmost importance for adequately exploring the problem landscape, particularly in highly multi-modal multi-objective optimization problems. Our study reveals that, in line with findings from several prior research papers, commonly employed crossover and mutation operations lack the capability to generate high-quality diverse individuals and tend to become confined to limited areas around various local optima. This paper introduces an augmentation to the diversity of the population in the well-established multi-objective scheme of the genetic algorithm, NSGA-II. This enhancement is achieved through two key components: the genuine initialization method and the substitution of the worst individuals with new randomly generated individuals as a re-initialization approach in each generation. The proposed multi-objective feature selection method undergoes testing on twelve real-world classification problems, with the number of features ranging from 2,400 to nearly 50,000. The results demonstrate that replacing the last front of the population with an equivalent number of new random individuals generated using the genuine initialization method and featuring a limited number of features substantially improves the population’s quality and, consequently, enhances the performance of the multi-objective algorithm.

1 INTRODUCTION↩︎

Feature selection involves discarding as many features or variables as possible without compromising classification accuracy. The removal of irrelevant and redundant information not only reduces computational requirements but also enhances the classifier’s performance by mitigating the curse of dimensionality and facilitates model interpretation [1][4].

In recent years, machine learning problems have expanded to include thousands, millions, or even billions of variables, making feature selection a crucial preprocessing step [3]. This practice diminishes data dimensions, simplifying tasks for classifiers and concurrently diminishing memory and computational costs in large-scale datasets. However, excessive feature elimination can lead to classification failure, presenting a delicate balance between the number of features and classification accuracy [1], [2], [5].

Feature selection finds broad applications across various domains, including machine learning and data science [6], [7], bioinformatics and computational biology [8], [9], text mining [10], image analysis and medical diagnosis [11], [12], natural language processing (NLP) [13], [14], financial analysis and economics [15], [16], and many other fields.

The mainstay of classical feature selection methods falls into three categories: wrapper, filter, and embedded methods [3], [8], [17][19]. Filter-based methods analyze data characteristics and assess features without engaging learning models [19]. Embedded methods incorporate variable selection into the training process, varying for each learning machine. Wrappers score variable subsets based on predictive power under the black-box of the machine [17], [18].

In the realm of multi-objective feature selection algorithms, the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [20], a widely recognized algorithm, has been employed in several research studies [1], [2], [21]. This algorithm, a multi-objective scheme of the evolutionary genetic algorithm (GA), comprises three fundamental components: initialization, generating new candidate solutions, and selection. The effectiveness of population-based evolutionary algorithms heavily relies on these components, particularly in providing diverse solutions during optimization. Initialization is a critical step, as confirmed by Kazimipour et al. [22], who reviewed popular techniques introduced before 2014 and classified them based on randomness, compositionality, and generality. Rahnamayan et al. [23] proposed an opposition-based initialization method, accelerating optimizer convergence. Burke et al. [24] demonstrated that heuristic initialization methods enhance diversity and performance.

Additionally, the substitution of individuals in the population acts as a form of re-initialization, commonly employed when an evolutionary optimization method converges [25]. Ghosh et al. [25] highlighted that substituting the worst individuals, and re-entering a dropped-out individual, significantly improves population diversity and, consequently, GA performance. Cobb et al. [26] introduced Random Immigrants, replacing a fraction of the population randomly in each generation to enhance GA performance. These insights motivated our investigation into the effect of substitution in the multi-objective GA scheme, NSGA-II, applied to feature selection.

In this study, we present a strategy to augment population diversity in multi-objective feature selection algorithms. Our findings underscore a critical point: commonly used crossover and mutation operations lack the potency to generate valuable diverse individuals across the search space. Injecting random individuals from outside the population proves essential for boosting algorithm exploration. In other words, auxiliary re-initialization components are necessary to enhance population diversity.

The organization of this paper is as follows: Section II provides a detailed background review, Section III outlines our proposed method, illuminating the steps and techniques employed, Section IV presents results and analysis from our study, and Section V concludes with remarks.

2 Background Review↩︎

2.1 Multi-objective Optimization↩︎

Algorithms designed to address multi-objective problems typically navigate trade-off decisions, yielding a set of solutions rather than a singular one. This array of solutions is commonly referred to as the Pareto front or non-dominated solutions. The establishment of the Pareto front is based on the principle of dominance, a criterion used for the comparison of solutions.

Definition 1. Multi-objective Optimization [27] \[\begin{align} \begin{aligned} & Min/Max\; F(\pmb x)=[f_{1}(\pmb x),f_{2}(\pmb x),...,f_{M}(\pmb x)] \\ &s.t. \quad L_{i}\leq x_{i}\leq U_{i}, i=1,2,...,d \end{aligned} \end{align}\] Here, \(\pmb x\) represents the feature set (i.e., candidate solution), \(M\) denotes the number of objectives, \(d\) equals the number of variables (i.e., dimension), and each variable’s value, denoted as \(x_{i}\), falls within the interval \([L_{i}, U_{i}]\). However, in binary multi-objective optimization problems like feature selection, \(x_{i}\) is restricted to only two values, either from the set \(\{0, 1\}\) or \(\{True, False\}\). The objective function, denoted as \(f_{i}\), aims for either minimization or maximization [27]. In such problems, a widely adopted approach for comparing candidate solutions involves the concept of dominance.

Definition 2. Dominance Concept
If \(\pmb x=(x_{1},x_{2},...,x_{d})\) and \(\acute{\pmb x}=(\acute{x}_{1},\acute{x}_{2},...,\acute{ x}_{d})\) are two vectors in a minimization problem search space, \(\pmb x\) dominates \(\acute{\pmb x}\) (\(\pmb x\prec\acute{\pmb x}\)) if and only if \[\begin{align} \begin{aligned} &\forall i\in{\{1,2,...,M\}}, f_i(\pmb x)\leq f_i(\acute{\pmb x}) \wedge\\ &\exists j \in{\{1,2,...,M\}}: f_j(\pmb x)<f_j(\acute{\pmb x}) \end{aligned} \end{align}\] This concept delineates the optimality of a solution within a multi-objective space. A candidate solution, \(\pmb x\), is considered superior to another solution, \(\acute{\pmb x}\), if it is not inferior in any of the objectives and, at a minimum, exhibits superior performance in one of the objectives. Solutions that are not dominated by any other form the Pareto front and are termed non-dominated solutions [28].

Another prevalent metric for comparing candidate solutions in multi-objective problems is the crowding distance. In computing the crowding distance, individuals are sorted based on each objective, and boundary individuals (minimums and maximums) receive an infinite distance assignment. Other individuals are assigned the sum of the normalized Euclidean individual distances from their neighbors with the same rank across all objectives. This assesses an individual’s importance relative to the density of surrounding individuals [20].

Multi-objective algorithms strive to identify the Pareto front through the utilization of generating strategies/operators and selection schemes. The non-dominated sorting (NDS) algorithm [20] stands out as a popular selection strategy that operates on the dominance concept. It categorizes the population’s solutions into different levels of optimality. The algorithm initiates by identifying all non-dominated solutions in the first rank. To determine the second rank of individuals, non-dominated candidates are removed, and the process repeats for the remaining ones. This iterative procedure continues until all individuals are grouped into distinct Pareto levels. Deb et al. [20] introduced the renowned multi-objective optimization method, NSGA-II, based on this concept.

2.2 Multi-objective Feature Selection↩︎

To transform feature selection into a binary optimization problem, each feature’s status is denoted by a binary variable. With the dataset’s features having the option of being retained or removed, each variable assumes only two states. Consequently, each individual, or solution, represents a distinct feature configuration. The classification accuracy, calculated by masking the dataset’s features and employing a regular classifier such as \(k\)-Nearest Neighbor (\(k\)-NN), Support Vector Machine (SVM), or Decision Tree (DT), serves as one objective value for each individual. Additionally, recognizing the significance of the remaining features, especially in large-scale databases, we define the ratio of retained features as the second objective in our optimization problem.

Jiao et al. [29] introduced a multi-objective optimization approach for feature selection in classification, named PRDH. In this work, the authors have devised a duplication handling method to augment diversity in both the objective and search spaces. Their method transforms the multi-objective feature selection problem into a constrained optimization problem, emphasizing classification performance. A novel constraint-handling method is employed to select feature subsets featuring more informative and strongly relevant features.

Cheng et al. [30] demonstrated that the variable granularity search-based multi-objective evolutionary algorithm, VGS-MOEA, significantly narrows down the search space for large-scale feature selection problems. In this method, each bit of the individuals represents a feature subset, with the subset’s granularity starting larger and gradually refining, aiming for higher-quality features. This technique markedly enhances feature selection efficiency and accuracy.

In efforts to improve the evaluation and selection of solutions for multi-objective optimization problems, [31] introduces a novel fitness evaluation mechanism (FEM) utilizing fuzzy relative entropy (FRE). To effectively tackle the problem, a hybrid genetic algorithm is employed within this framework. Deb et al. introduced NSGA-III [32] as a reference-point based many-objective optimization algorithm that prioritizes non-dominated population members in proximity to specified reference points, particularly effective when dealing with four or more objectives.

More recently, Nikbakhtsarvestani [33] proposed a multi-objective coordinate search optimization, outperforming NSGA-II, especially in computationally expensive applications with a limited number of function calls. Zanjani et al. [1] introduced a method for binary multi-objective optimization using coordinate search for feature selection. In this approach, a variable of each individual on the Pareto front is flipped to generate new individuals in each generation. The presented results markedly surpass those of NSGA-II. They also proposed binary Compact NSGA-II [2], employing Probability Vectors (PVs) to preserve the distribution of solutions, alongside the Pareto front, instead of maintaining two populations during the process.

3 PROPOSED METHOD↩︎

3.1 Objectives↩︎

Enhancing classification accuracy and minimizing data dimensionality constitute the primary objectives of feature selection. Consequently, the calculation of two objectives for each candidate solution (or feature set) becomes imperative. The first objective function computes the classification accuracy, achieved through the utilization of a \(k\)-NN classifier to assess both the training set (during the optimization phase) and the test set (post-optimization for the evaluation of final solutions).

Given that our optimization problem is of a minimization nature, we calculate the classification error as the first objective, relying on the predictions made by the \(k\)-NN model:

\[\label{f1} \mathit{Classification \:Error}= 1 - \frac{\# Correct\: Predictions}{Total \:\# Predictions}\tag{1}\]

The second objective is the diminution of the selected features ratio. Its computation involves tallying the count of features with a value of 1 (or True) for each feature set, followed by dividing this count by the total number of features:

\[\label{f2} \mathit{Ratio\: of\: Selected\: Features}=\frac{Number \:of \: Trues }{Total \:\# Features }\tag{2}\]

Both of these objectives possess real values within the interval \([0, 1]\). Prior research has indicated their inherent conflict, so they are suitable candidates for multi-objective optimization, as highlighted in [5].

3.2 Proposed Algorithm: Binary Diverse NSGA-II↩︎

Despite the effectiveness of NSGA-II, it frequently becomes stuck in local optima and prematurely converges. Our observations suggest that the average pairwise distance among individuals in the feature space is notably low, and as the algorithm progresses, this value declines rapidly. Therefore, a plausible reason for the algorithm’s failure to discover the global optimum lies in its insufficient exploration of the search space and the production of diverse solutions.

Based on our experiments and literature findings, the most effective technique for population initialization to enhance diversity and explore the search space is genuine uniform initialization [5], [34]. The commonly used Bit-string Uniform (BU) binary population initialization method fails to provide chromosome-wise uniformity in the population, impacting the quality of the population in the long run. In contrast, the genuine uniform approach or Uniform Covering (UC) binary initialization results in a more diverse population, featuring individuals from various regions of the search space with both chromosome-wise and gene-wise uniformity. In this method, instead of having 50% True variables in nearly all individuals containing a substantial number of variables, the number of True variables is uniformly distributed across feature sets.

Subsequently, the objective values of each individual are calculated based on the utilized classifier and Equations 1 , 2 . Following this, a non-domination sorting determines the rank of each solution in the population. After generating new individuals - children - through crossover and mutation operations, a non-dominated sorting is executed, and the best individuals survive based on their ranking and crowding distance, akin to the original NSGA-II algorithm.

To address diversity degradation, the proposed idea involves replacing the worst-ranked candidate solutions with newly-generated solutions. At the end of each generation, we replace the last front with new random subsets. These new individuals are generated using the UC initialization method [34], and the number of selected features (i.e., True variables) falls within the range of the size of the smallest and largest feature subsets in the current population.

Suppose the size of the smallest (individual with the lowest number of True variables) and largest (individual with the highest number of True variables) feature subsets in the population is given by \(\alpha\) and \(\beta\), respectively. \([\alpha, \beta]\) determines the range of sizes for reinitialized individuals in the population. For each individual on the last front that must be replaced, a random integer (e.g., \(TrueIndices\)) within the range of \([\alpha, \beta]\), representing the number of features in the new individual, is generated. Then, \(TrueIndices\) number of variables of the new individual are randomly set to True, while the others are set to False. This constraint, applied to the number of selected features of the new solutions, enhances the algorithm’s performance by incorporating prior knowledge from the population and the process, avoiding the generation of entirely random candidate solutions.

Note that in any generation where the entire population is located on a single front, the reinitialization process is bypassed for that generation.

In contrast to the members generated using crossover and mutation operators, which are in the neighborhood of the previous members and have a small distance from their parents, the new random individuals potentially originate from unseen parts of the landscape and aid in exploring and discovering new areas. A sufficient diversity of members and exploration is particularly critical in vast search spaces.

The termination condition is defined based on the total number of function calls (i.e., evaluations) so that in problems with a demanding evaluation function, such as classification, algorithms can be compared fairly within a specific number of function calls. The number of generations is not an appropriate termination condition because the proposed method’s number of evaluations and computations may vary based on the number of individuals on the last front in each generation.

The pseudocode of the proposed binary diverse NSGA-II is presented in Algorithm [alg-one]. Additionally, Algorithm [alg-two] provides details on the genuine initialization method or UC.

\(population =\) GenuineInitialization (\(N\), \(NVars\), 1, \(NVars\)) Evaluate (\(population\))

\(NewIndividuals\) = 2d array size [\(N\), \(NVars\)] filled with \(False\)

4 EXPERIMENTAL RESULTS AND ANALYSIS↩︎

4.1 Datasets↩︎

We have evaluated the efficacy of our proposed approach (i.e., diverse NSGA-II), NSGA-II, and the genuinely initialized version of NSGA-II, across twelve extensive datasets within the domains of microarray and image analysis. The objective is to identify the optimal feature set. These datasets are characterized by a substantial number of features paired with a relatively limited number of instances. Consequently, the reduction of features is imperative to enhance classification accuracy. The properties of the employed datasets are outlined in Table 1.

Table 1: Datasets Description
Number Dataset #features #samples #classes
D1 warpAR10P [35] 2400 130 10
D2 warpPIE10P [35] 2420 210 10
D3 LUNG [36] 3312 203 5
D4 GLIOMA [37] 4434 50 4
D5 TOX-171 [35] 5748 171 4
D6 LEUKEMIA [38] 7070 72 2
D7 CARCINOM [39] 9182 174 11
D8 pixraw10P [35] 10000 100 10
D9 CLL-SUB-111 [35] 11340 111 3
D10 SMK-CAN-187 [35] 19993 187 2
D11 GLI-85 [35] 22283 85 2
D12 GLA-BRA-180 [35] 49151 180 4

4.2 Experimental Settings↩︎

We conducted multiple runs of the algorithms, repeating the process 31 times with different seeds to mitigate the impact of stochasticity. In each iteration, a distinct twenty percent of instances from each dataset were randomly chosen as a test set. So, the method of validation is Repeated Random Subsampling validation or Monte Carlo Cross-Validation. Recognizing the computational expense associated with feature selection on large-scale datasets, we maintained a fixed number of function calls at 15,000 for the mentioned algorithms, ensuring a fair comparison. Our experiments primarily rely on the pymoo framework [40]. Further details on hyperparameter settings can be found in Table 2.

Table 2: Parameter Settings for both proposed method and NSGA-II algorithms
Population size 100
Number of function calls (NFC) 15,000
Selection method Tournament Selection
Mutation method Bit-flip Mutation
Mutation probability 0.01
Crossover Single Point Crossover
Survival method NDS and crowding distance
Duplicate Elimination TRUE
Number of runs 31

Our objective is to minimize both the classification error and the ratio of selected features to all features, both of which fall within the real value interval [0,1]. A suitable multi-objective evaluation metric for comparing algorithm performance is hypervolume (HV), with the reference point set at \((1, 1)\). Furthermore, to calculate the classification error, we utilize the \(k\)-NN classifier with a fixed parameter \(k\) set to 5 across all datasets and experiments.

4.3 Numerical Results and Analysis↩︎

Table 3 shows that across all datasets, our method notably surpasses regular NSGA-II in terms of the HV performance metric. Average HV values during optimization for NSGA-II with genuine initialization (red) and the proposed binary diverse NSGA-II method (blue) for sample datasets are depicted in Fig. 1. The genuine initialization establishes a markedly superior starting point for the HV value compared to the regular NSGA-II, and the substitution of low-ranked individuals accelerates HV growth by promoting exploration of the search space. The combination of these techniques facilitates a rapid HV increase, resulting in an average value of 0.97, as shown in Table 3. Notably, for half of the datasets, the mean train HV exceeds 0.99, with none falling below 0.89. Conversely, NSGA-II fails to attain a train HV of 0.67 or higher for any dataset.

Furthermore, we observed that, in most cases, the diverse NSGA-II generates a greater number of candidate solutions on the Pareto front, all featuring fewer features compared to solutions on NSGA-II’s Pareto front. Table 3, which includes the average final HV values for the train and test Pareto fronts after 15,000 function calls, confirms the statistically significant superiority of the proposed method’s numerical results against NSGA-II across all cases. These results have been validated using a statistical t-test at a 95% confidence level. Notably, the HV value on the test set consistently exceeds 0.70 for all datasets when employing our proposed method, while the maximum test HV value for the regular NSGA-II is 0.58. Moreover, in 9 out of 12 cases, the test HV values obtained from our proposed method surpasses that of the genuinely initialized NSGA-II, and we see three statistical ties.

This table also underscores the substantial impact of the initialization approach on the final HV value, surpassing the effect of solution substitution. Specifically, employing the genuine initialization enhances the train set HV by approximately 0.42 and the test set HV by about 0.34. The addition of a straightforward method to diversify the population further, such as the replacement technique, yields an average improvement of approximately 0.04 on the test set HV.

Figure 1: Comparing HV plots during optimization for the traditional and proposed diverse NSGA-II on some datasets

Table 3: Final train HV values of NSGA-II, NSGA-II with Genuine Initialization, anddiverse NSGA-II. # w/t/l stands for number of wins/ties/losses based on the t-test statistical test with a confidence level of 95%.
Train HV Test HV
Dataset NSGA-II NSGA-II with Genuine Initialization ****Proposed Diverse NSGA-II**** NSGA-II NSGA-II with Genuine Initialization ****Proposed Diverse NSGA-II****
D1 0.5294 0.9572 0.9643 0.3470 0.6472 0.7052
D2 0.6692 0.9937 0.9985 0.4566 0.7354 0.7674
D3 0.6267 0.9878 0.9964 0.5833 0.9017 0.9044
D4 0.5420 0.9689 0.9965 0.4176 0.7058 0.7546
D5 0.5543 0.9829 0.9889 0.4175 0.7673 0.7927
D6 0.5630 0.9916 0.9998 0.4942 0.8416 0.9439
D7 0.5159 0.9581 0.9748 0.4176 0.7475 0.7815
D8 0.5517 0.9914 0.9999 0.3973 0.7239 0.7080
D9 0.4663 0.9414 0.9603 0.3814 0.7524 0.8151
D10 0.4430 0.8863 0.9183 0.4178 0.7974 0.8328
D11 0.5066 0.9751 0.9978 0.4586 0.8634 0.9045
D12 0.4122 0.8572 0.8957 0.3876 0.7534 0.7959
Average 0.5317 0.9576 0.9743 0.4314 0.7698 0.8088
# w/t/l 0/0/12 0/1/11 - 0/0/12 0/3/9 -

Fig. 2 illustrates the average pairwise Hamming distance plots obtained during optimization using NSGA-II with and without the substitution component. The plots highlight that substituting low-ranked individuals significantly increases the distance between individuals during the process, thereby enhancing diversity. Both methods employed random BU initialization for a fair comparison in this experiment.

Figure 2: Average pairwise Hamming distance plots during optimization forNSGA-II with and without the substitution component

Table 4 presents the average of the maximum accuracy over the 31 runs and the ratio of the features selected in the most accurate solutions. These results have also been validated using a statistical t-test at 95% confidence level. Across all datasets, except for LUNG and pixraw10P (D3 and D8), our method significantly enhances the classifier’s accuracy compared to the regular NSGA-II. For the mentioned two datasets, we see statistical ties. In essence, looking at it from a singular objective perspective, diverse NSGA-II method consistently achieves superior accuracy as well. This heightened accuracy is accomplished using less than 1% of the total features, a substantial reduction compared to the number of features selected by NSGA-II. The average number of selected features in the most accurate solution found by NSGA-II across various datasets is approximately 5508, whereas the improved scheme identifies better solutions with about 164 times less memory requirement, averaging 33.5 variables. Consequently, we benefit from both improved accuracy and reduced memory consumption, potentially expediting pipelined processes.

Furthermore, Table 4 illustrates that the impact of substitution in improving algorithm performance to find a more accurate solution is almost comparable to the initialization method. On average, it elevates maximum accuracy by over 3% compared to NSGA-II with genuine initialization, concurrently reducing the number of features by over three times.

Table 4: Average of the maximum accuracy of the individuals and the ratio of features on the most accurate solutions. # w/t/l stands for number of wins/ties/losses based on the t-test statistical test with a confidence level of 95%.
Maximum Accuracy #Features in the Most Accurate Solution
Dataset NSGA-II NSGA-II with Genuine Initialization ****Proposed Diverse NSGA-II**** NSGA-II NSGA-II with Genuine Initialization ****Proposed Diverse NSGA-II****
D1 0.5347 0.6514 0.7060 846.2258 22.8065 10.1290
D2 0.6759 0.7389 0.7680 792.1935 15.1290 7.9032
D3 0.9111 0.9072 0.9048 1191.5161 22.5484 6.5806
D4 0.6774 0.7097 0.7548 1700.1935 25.4516 2.3548
D5 0.7152 0.7733 0.7935 2400.0323 59.8387 25.9355
D6 0.8473 0.8473 0.9441 2951.6452 47.6452 1.8387
D7 0.7429 0.7539 0.7825 4032.3226 99.9677 39.3548
D8 0.6952 0.7290 0.7081 4284.9355 75.6452 1.7097
D9 0.6886 0.7588 0.8163 5075.7419 111.7419 28.3548
D10 0.7725 0.8048 0.8345 9190.0323 192.9355 57.4194
D11 0.8482 0.8710 0.9051 10241.7419 199.7742 20.2903
D12 0.7392 0.7608 0.7984 23394.9355 513.2258 200.4516
Average 0.7374 0.7755 0.8097 5508.4600 115.5591 33.5269
# w/t/l 0/2/10 0/3/9 - 0/0/12 0/0/12 -

Table 5 displays the 95% confidence intervals for the number of selected features for the solutions on the Pareto front. According to these results, the average value of the interval for the second objective obtained from the proposed method is significantly superior to that of the other models. This finding further confirms the effectiveness of both employed techniques in enhancing feature reduction.

2.5pt

Table 5: 95% Confidence Interval of the objective values on the Pareto front solutions of each algorithm. # w/t/l stands for number of wins/ties/losses.
#Features
Dataset NSGA-II NSGA-II with Genuine Initialization ****Proposed Diverse NSGA-II****
D1 (834, 851) (13.17, 19.17) (4.10, 6.28)
D2 (779, 811) (8.19, 11.33) (3.51, 4.69)
D3 (1181, 1198) (13.69, 23.17) (3.17, 4.51)
D4 (1687, 1703) (21.66, 45.25) (1.56, 2.18)
D5 (2380, 2406) (31.15, 46.06) (9.37, 16.11)
D6 (2939, 2969) (35.63, 62.43) (1.40, 1.81)
D7 (4014, 4049) (50.99, 77.58) (13.13, 21.49)
D8 (4276, 4299) (58.18, 100.40) (1.36, 1.69)
D9 (5068, 5115) (68.23, 115.15) (10.72, 32.41)
D10 (9166, 9207) (155.30, 246.32) (23.41, 59.59)
D11 (10222, 10261) (121.84, 228.92) (6.52, 22.72)
D12 (23392, 23450) (286.43, 490.89) (92.23, 197.59)
Average (5495, 5526) (72.04, 122.22) (14.21, 30.92)
# w/t/l 0/0/12 0/0/12 -

The average ratio of replaced solutions over the generations in the proposed method is also presented in Table 6. On average, 11% of the solutions are positioned on the last front and replaced in each generation. Remarkably, this percentage remains almost consistent across all twelve datasets, showing no significant variations. Fig. 3 includes several sample plots depicting the ratio of replaced solutions over the generations. As shown in the plots, the number of solutions on the last front exhibits no discernible correlation with the generation number or the dataset overall, with the curve fluctuating throughout the generations. However, the initial points of the curves are generally lower, indicating higher diversity in the initial population. The curve occasionally drops to zero when all solutions are concentrated on a single front, resulting in no solution replacements in that generation.

Figure 3: Sample plots of the ratio of the replaced individuals in each generation of diverse NSGA-II over generations.

Table 6: Average Ratio of the Replaced Solutions
Dataset Replaced Solutions %
D1 10.98
D2 11.80
D3 11.80
D4 11.02
D5 11.58
D6 11.58
D7 11.81
D8 11.00
D9 10.88
D10 10.09
D11 11.21
D12 10.06
Average 11.15

5 CONCLUSION REMARKS↩︎

With the growing importance of feature selection in large-scale real-world datasets, there’s an increasing demand for efficient binary multi-objective optimization algorithms. These algorithms should not only minimize memory and computational demands but also reduce classification errors or enhance other processing tasks. In this research, we improved the population diversity of the NSGA-II algorithm through a novel approach, resulting in enhanced performance. Our method replaced the Bit-string Uniform binary population initialization with a more effective genuine initialization strategy, significantly increasing population diversity. Additionally, we proposed substituting candidate solutions on the last front in each generation to further diversify the population and efficiently explore the search space. Our findings demonstrated that these techniques not only substantially reduce the number of selected features but also maintain or boost classification accuracy, thereby increasing the Hypervolume (HV) of both train and test sets. The complexity of this method matches that of NSGA-II. One drawback of replacing solutions is that we may lose some information and potentially beneficial solutions.

Our proposed method is tailored for binary multi-objective optimization in general, with feature selection serving as a specific case study. However, its utility extends far beyond feature selection to address a wide range of binary optimization challenges. Moreover, the demonstrated benefits of dimension reduction observed in this study, particularly in enhancing classification and \(k\)-NN performance, can also be harnessed to improve efficiency and effectiveness across various other processing tasks. Furthermore, given that crossover and mutation operations in genetic algorithms like NSGA-II do not generate new individuals from diverse parts of the landscape, we recommend exploring alternative techniques to enhance population diversity as a promising avenue for future research.

References↩︎

[1]
S. Z. Miyandoab, S. Rahnamayan, and A. A. Bidgoli, “Multi-objective binary coordinate search for feature selection,” in 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC).1em plus 0.5em minus 0.4emIEEE, 2023, pp. 4176–4183.
[2]
——, “Compact NSGA-II for multi-objective feature selection,” in 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC).1em plus 0.5em minus 0.4emIEEE, 2023, pp. 3868–3875.
[3]
G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[4]
M. F. Ghalwash, X. H. Cao, I. Stojkovic, and Z. Obradovic, “Structured feature selection using coordinate descent optimization,” BMC bioinformatics, vol. 17, no. 1, pp. 1–14, 2016.
[5]
A. A. Bidgoli, H. Ebrahimpour-Komleh, and S. Rahnamayan, “An evolutionary decomposition-based multi-objective feature selection for multi-label classification,” PeerJ Computer Science, vol. 6, p. e261, 2020.
[6]
M. A. Hall, “Correlation-based feature selection for machine learning,” Ph.D. dissertation, The University of Waikato, 1999.
[7]
J. Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new perspective,” Neurocomputing, vol. 300, pp. 70–79, 2018.
[8]
Y. Saeys, I. Inza, and P. Larranaga, “A review of feature selection techniques in bioinformatics,” bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9]
P. Bertolazzi, G. Felici, and G. Lancia, “Application of feature selection and classification to computational molecular biology,” in Biological Data Mining.1em plus 0.5em minus 0.4emChapman and Hall/CRC, 2009, pp. 277–314.
[10]
M. H. Aghdam, N. Ghasem-Aghaee, and M. E. Basiri, “Text feature selection using ant colony optimization,” Expert systems with applications, vol. 36, no. 3, pp. 6843–6853, 2009.
[11]
A. A. Bidgoli, S. Rahnamayan, T. Dehkharghanian, A. Riasatian, and H. Tizhoosh, “Evolutionary computation in action: Hyperdimensional deep embedding spaces of gigapixel pathology images,” IEEE Transactions on Evolutionary Computation, 2022.
[12]
A. Asilian Bidgoli, S. Rahnamayan, T. Dehkharghanian, A. Grami, and H. R. Tizhoosh, “Bias reduction in representation of histopathology images using deep feature selection,” Scientific reports, vol. 12, no. 1, p. 19994, 2022.
[13]
M. Anand, K. B. Sahay, M. A. Ahmed, D. Sultan, R. R. Chandan, and B. Singh, “Deep learning and natural language processing in computation for offensive language detection in online social networks by feature selection and ensemble classification techniques,” Theoretical Computer Science, vol. 943, pp. 203–218, 2023.
[14]
J. Suzuki, H. Isozaki, and E. Maeda, “Convolution kernels with feature selection for natural language processing tasks,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004, pp. 119–126.
[15]
C.-F. Tsai, “Feature selection in bankruptcy prediction,” Knowledge-Based Systems, vol. 22, no. 2, pp. 120–127, 2009.
[16]
D. Liang, C.-F. Tsai, and H.-T. Wu, “The effect of feature selection on financial distress prediction,” Knowledge-Based Systems, vol. 73, pp. 289–297, 2015.
[17]
P. Agrawal, H. F. Abutarboush, T. Ganesh, and A. W. Mohamed, “Metaheuristic algorithms on feature selection: A survey of one decade of research (2009-2019),” IEEE Access, vol. 9, pp. 26 766–26 791, 2021.
[18]
I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003.
[19]
H. Liu, H. Motoda, R. Setiono, and Z. Zhao, “Feature selection: An ever evolving frontier in data mining,” in Feature selection in data mining.1em plus 0.5em minus 0.4emPMLR, 2010, pp. 4–13.
[20]
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
[21]
T. M. Hamdani, J.-M. Won, A. M. Alimi, and F. Karray, “Multi-objective feature selection with nsga ii,” in Adaptive and Natural Computing Algorithms: 8th International Conference, ICANNGA 2007, Warsaw, Poland, April 11-14, 2007, Proceedings, Part I 8.1em plus 0.5em minus 0.4emSpringer, 2007, pp. 240–247.
[22]
B. Kazimipour, X. Li, and A. K. Qin, “A review of population initialization techniques for evolutionary algorithms,” in 2014 IEEE congress on evolutionary computation (CEC).1em plus 0.5em minus 0.4emIEEE, 2014, pp. 2585–2592.
[23]
S. Rahnamayan, H. R. Tizhoosh, and M. M. Salama, “A novel population initialization method for accelerating evolutionary algorithms,” Computers & Mathematics with Applications, vol. 53, no. 10, pp. 1605–1614, 2007.
[24]
E. K. Burke, J. P. Newall, and R. F. Weare, “Initialization strategies and diversity in evolutionary timetabling,” Evolutionary computation, vol. 6, no. 1, pp. 81–103, 1998.
[25]
A. Ghosh, S. Tsutsui, H. Tanaka, and D. Corne, “Genetic algorithms with substitution and re entry of individuals,” 2000.
[26]
H. G. Cobb, J. J. Grefenstette et al., “Genetic algorithms for tracking changing environments.” in ICGA, vol. 1993, 1993, pp. 523–530.
[27]
A. Asilian Bidgoli, S. Rahnamayan, B. Erdem, Z. Erdem, A. Ibrahim, K. Deb, and A. Grami, “Machine learning-based framework to cover optimal pareto-front in many-objective optimization,” Complex & Intelligent Systems, vol. 8, no. 6, pp. 5287–5308, 2022.
[28]
A. A. Bidgoli, H. Ebrahimpour-Komleh, and S. Rahnamayan, “Reference-point-based multi-objective optimization algorithm with opposition-based voting scheme for multi-label feature selection,” Information Sciences, vol. 547, pp. 1–17, 2021.
[29]
R. Jiao, B. Xue, and M. Zhang, “Solving multi-objective feature selection problems in classification via problem reformulation and duplication handling,” IEEE Transactions on Evolutionary Computation, 2022.
[30]
F. Cheng, J. J. Cui, Q. J. Wang, and L. Zhang, “A variable granularity search based multi-objective feature selection algorithm for high-dimensional data classification,” IEEE Transactions on Evolutionary Computation, 2022.
[31]
L. He, R. Chiong, W. Li, S. Dhakal, Y. Cao, and Y. Zhang, “Multiobjective optimization of energy-efficient job-shop scheduling with dynamic reference point-based fuzzy relative entropy,” IEEE Transactions on Industrial Informatics, vol. 18, no. 1, pp. 600–610, 2021.
[32]
K. Deb and H. Jain, “An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part i: solving problems with box constraints,” IEEE transactions on evolutionary computation, vol. 18, no. 4, pp. 577–601, 2013.
[33]
F. Nikbakhtsarvestani, A. A. Bidgoli, M. Ebrahimi, and S. Rahnamayan, “Multi-objective coordinate search optimization,” in 2023 IEEE Congress on Evolutionary Computation (CEC).1em plus 0.5em minus 0.4emIEEE, 2023, pp. 1–9.
[34]
S. Ebrahimi, S. Rahnamayan, and A. Asilian Bidgoli, “Thesis: Improving binary optimization algorithms using genuine uniform initialization.”
[35]
Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu, “Advancing feature selection research,” ASU feature selection repository, pp. 1–28, 2010.
[36]
A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette et al., “Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of the National Academy of Sciences, vol. 98, no. 24, pp. 13 790–13 795, 2001.
[37]
C. L. Nutt, D. Mani, R. A. Betensky, P. Tamayo, J. G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M. E. McLaughlin, T. T. Batchelor et al., “Gene expression-based classification of malignant gliomas correlates better with survival than histological classification,” Cancer research, vol. 63, no. 7, pp. 1602–1607, 2003.
[38]
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” science, vol. 286, no. 5439, pp. 531–537, 1999.
[39]
A. I. Su, J. B. Welsh, L. M. Sapinoso, S. G. Kern, P. Dimitrov, H. Lapp, P. G. Schultz, S. M. Powell, C. A. Moskaluk, H. F. Frierson Jr et al., “Molecular classification of human carcinomas by use of gene expression signatures,” Cancer research, vol. 61, no. 20, pp. 7388–7393, 2001.
[40]
J. Blank and K. Deb, “pymoo: Multi-objective optimization in python,” IEEE Access, vol. 8, pp. 89 497–89 509, 2020.

  1. *Nature-Inspired Computational Intelligence (NICI) Lab↩︎

  2. \(^1\)Department of Electrical, Computer, and Software Engineering, Ontario Tech University, Oshawa, ON, Canada↩︎

  3. \(^2\)Department of Engineering, Brock University, St. Catharines, ON, Canada↩︎

  4. \(^3\)Faculty of Science, Wilfrid Laurier University, Waterloo, ON, Canada↩︎

  5. Contact:sevil.zanjanimiyandoab@ontariotechu.net↩︎

  6. This research is supported by the Vector Scholarship in Artificial Intelligence, provided through the Vector Institute. We utilized ChatGPT 3.5 solely for enhancing the writing and performing grammatical checks.↩︎