Deep Learning as a Competitive Feature-Free Approach for Automated Algorithm Selection

on the Traveling Salesperson Problem

June 29, 2020

In this work we focus on the well-known Euclidean Traveling Salesperson Problem (TSP) and two highly competitive inexact heuristic TSP solvers, EAX and LKH, in the context of per-instance algorithm selection (AS). We evolve instances with \(1\,000\) nodes where the solvers show strongly different performance profiles. These instances serve as a basis for an exploratory study on the identification of well-discriminating problem characteristics (features). Our results in a nutshell: we show that even though (1) promising features exist, (2) these are in line with previous results from the literature, and (3) models trained with these features are more accurate than models adopting sophisticated feature selection methods, the advantage is not close to the virtual best solver in terms of penalized average runtime and so is the performance gain over the single best solver. However, we show that a feature-free deep neural network based approach solely based on visual representation of the instances already matches classical AS model results and thus shows huge potential for future studies.

The *Traveling Salesperson Problem* (TSP) is a classical \(\mathcal{NP}\)-hard optimization problem of utmost relevance, e.g., in transportation logistics, bioinformatics or circuit board fabrication. The goal is to route a salesperson through a set of cities such that each city is visited exactly once and the tour is of minimal length. In the past decades tremendous progress has been made in the development of high-performing heuristic TSP solvers. The local search-based *Lin-Kernigham Heuristic* (LKH) [1] and the genetic algorithm *Edge-Assembly-Crossover* (EAX) [2], along with their respective restart versions introduced in Kotthoff et al. [3], undeniably pose the state-of-the-art in inexact TSP solving.

*Automated Algorithm Selection* (AS), originally proposed by Rice [4] back in 1976, is a powerful framework to predict the best-performing solver(s) from a portfolio of candidate solvers by means of machine learning. It has been successfully applied to a wide spectrum of challenging optimization problems in both the combinatorial [5]–[9] and continuous domain [10], [11] with partly astonishing performance gains – see the recent survey by Kerschke et al. [12] for a comprehensive overview. In particular, the TSP was subject to several successful AS-studies [3], [13]–[16] which exploited the complementary performance profiles of simple heuristics on the one hand and the state-of-the-art solvers LKH and EAX on classical TSP benchmark sets on the other hand.

In the classic setting, AS relies on characteristic problem instance properties, termed *(instance) features*. These features are used as predictor variables for classical machine learning algorithms, e.g., random forests or support vector machines. The key idea – and ideal outcome – is that these features can easily be used to automatically derive decision rules that are well-suited to partition the instance space into ideally disjoint sub-spaces of instances, and which then are uniquely solved best by different solvers. However, features have many drawbacks: they are usually hand-crafted in a tedious process [17], partly require time-consuming calculations (which need to be taken into account by the model fitting step) and are problem-tailored (or at least specific to a problem domain, e.g., graph problems). Moreover, we usually prefer light-weight models with few features. Hence, training AS models is frequently combined with automated feature selection methods [18]–[20] or dimensionality reduction techniques [21].

Recently, Alissa et al. [22] took a promising new path after exploring first approaches to avoid manual feature-crafting [23], [24]. The authors proposed a deep learning based approach which does not rely on any a-priori calculated feature profiles. Instead, in their study on the 1D-Bin-Packing Problem (BPP) the neural network is given temporal sequence data of the BPP instance as the only input. They were able to achieve drastic improvements. In this paper we adopt and adapt this idea for the TSP. To this end, we evolve a set of instances where LKH and EAX show strongly different behaviour in terms of *Penalized Average Runtime*^{1} (PAR10; [25]). We show that with classical AS we can clearly beat the *Single Best Solver* (SBS; the solver with best average performance). However, the gap to the *Virtual Best Solver* (VBS; perfectly predicting oracle) can only be reduced slightly with much room for improvement. This holds true even in the case when we enrich the machine learning pipeline with (1) hand-selected feature subsets (based on exploratory data analysis), (2) different feature selection methods, or (3) a combination of both. After that, we propose a feature-free deep learning approach where the neural networks are trained on the plain image representations of Euclidean TSP instances. This approach achieves competitive performance, but drops the need for manual feature derivation and calculation.

The remainder of this paper is structured as follows. We describe the benchmark set (generation) and pre-selection of feature subsets in Sections 2 and 3 respectively. In Section 4 we present the results that we achieve using classical feature-based AS approaches. Next, in Section 5, we detail our feature-free deep learning based approaches, and compare the results with the classical models. We close with a discussion and an outlook in Section 6.

Our benchmark requires a set of Euclidean TSP instances that show strong differences in algorithmic performance. To this end, we adopt an *evolutionary algorithm* (EA) and creative mutation operators recently proposed by Bossek et al. [26]. Their method allows for the tailored generation^{2} of TSP instances that (1) have the desired performance difference (in terms of the ratio of PAR10-scores), (2) show multifarious topologies in terms of point arrangement in the Euclidean plane^{3}, and (3) are well distributed in the space of instances characteristics/features; in particular the latter two properties were not achieved by evolutionary instance generation methods before [14], [15]. For sake of brevity we refer the reader to Bossek et al. [26] for more details.

We generated a balanced data set of \(1\,000\) TSP instances with \(n=1\,000\) nodes per instance using the EA parameters from [26]; each \(500\) being uniquely faster solved to optimality by either EAX or LKH.^{4} All generated data is available in a public GitHub repository (https://github.com/mvseiler/PPSN_TSP_DL).

Fig. 1 depicts the performance – measured by means of PAR10 – of EAX and LKH on the entire benchmark set. The plot highlights apparent potential for automated algorithm selection due to strong performance differences. On the other hand the data reveals the general superiority of the EAX solver since it is much harder to evolve instances that are hard for EAX (shown as orange points). There are just two instances for which the EAX hits the cutoff time \(T=3\,600\) seconds (1 hour) at least once in all of its ten independent runs on that instance. In contrast, LKH frequently gets stuck in local optima – see the cluster of green points between \(T=3\,600\) and \(10 \cdot T\) in the top left corner.

The state-of-the-art TSP-related feature sets [14], [16], [28] consist of hundreds of hand-crafted features. Features range from statistics (mean, variance etc.) of edge lengths, angles of nearest neighbors, to more sophisticated features based on Minimum Spanning Trees (MST) or \(k\)-Nearest-Neighbor-Graphs (\(k\)-NNG). Fig. 2 depicts visual impressions of MSTs and \(k\)-NNGs on two evolved instances; these images will be a key ingredient to the neural network in Section 5.

Due to the size of feature sets, in the context of algorithm selection, automated feature selection methods have shown their suitability for automatically filtering a (small) subset of discriminating features [13]. Regardless of the sophistication of the feature selection method at hand, feature selection needs to cope with an exponentially sized search space of the underlying subset-selection problem. Hence, in order to assist our feature-based model fitting we conduct a simple univariate exploratory data analysis in order to identify a lucid subset of adequate features *a-priori*. To this end we adopt a simple greedy heuristic. First, all features \(f\) are scaled to \([0,1]\) to allow for a fair comparison across the features. Then, we perform a two-sided non-parametric Wilcoxon-Mann-Whitney test [29] at significance level \(\alpha=0.05\) per feature \(f\), to check the null hypothesis that the distributions of the EAX instances and the LKH instances with respect to \(f\) differ by a location shift equal to zero. We extract the 15 most relevant features according to the smallest \(p\)-values. This procedure was done once for the entire benchmark set of salesperson features as the most comprehensive feature set (see Sec. 4) and was repeated for subsets of “hardest” instances of decreasing size. Hardest in this context relates to (a) each 300 and 150 hardest instances for each solver with respect to mean PAR10-score, or (b) the ratio of mean PAR10-scores. Both methods pursue to reduce the benchmark set to instances with maximal performance differences.

Fig. 3 shows the distribution of selected features both across the whole benchmark set and the subset of each \(300\) most difficult instances with respect to mean PAR10-score for each solver (marked with crosses in Fig. 1). Noticeably, the 15 most relevant features are identical for both sets. They are mainly composed of summary statistics on strong connected components of the nearest neighbor graph (nng_) and properties based on minimum spanning trees (mst_). This is very much in line with crucial features identified in most TSP-related AS-studies [13]–[16], [30], [31] by sophisticated variable importance measurement. These features seem plausible since both MSTs and NNGs capture the global structure, e.g., existence of clusters etc., very well. A close look at Fig. 3 suggests, that instances that are easy for EAX cover a wider range of values which is derived from wider (green) boxes for the features while easy instances for LKH show much more narrow (orange) boxes, with many strong outliers though. The right hand plot, however, shows that the hardest instances seem to be better separable with the features. For instance, for the features in the top 4 rows we observe that \(75\%\) of feature values for LKH-friendly instances are higher than \(75\%\) of the respective values for EAX-friendly instances.

Fig. 1 reveals very complementary performances of EAX and LKH, which gives us reason to assume that automated AS might work well in this setting. Further, as outlined in the univariate analysis of the TSP features (see Sec. 3), the features also indicate their potential for distinguishing instances that are beneficial for EAX from instances for which LKH is preferable. As previous works [3], [13] already confirmed the effectivity of feature-based AS, we adopted their experimental setup – and only slightly modified it to the scenario at hand. Below, we will outline the considered machine learning algorithms, feature sets, as well as feature selection strategies, which have been used for training our final selectors.

All our candidate AS models are trained using PAR10 [32] as performance measure and assessed with a 10-fold cross-validation (CV). As indicated by Fig. 1, it is much more difficult for LKH to perform well on the instances that were evolved in favor of EAX (green points), rather than vice versa (orange). Therefore, the selectors will likely have a bias towards EAX instances. To adjust for this bias, we additionally tune the classification threshold for all trained models.

For training the potential automated AS models, we considered four different classifiers [33] using the R-package `mlr`

[34]: decision trees [35], random forests [36], support vector machines [37] and gradient boosting [38]. Each of them is trained using three different feature sets: the UBC features from Hutter et al. [28], a large set of features by Pihera and Musliu [16], as well as the TSP features from the R-package `salesperson`

^{5} [39]. The salesperson features provide the up-to-now most comprehensive collection of features; in fact, they are a strict superset of the Pihera and tspmeta features [15]. On the other hand, the UBC and Pihera features led to the best performing algorithm selectors in previous works [3], [13] – which did not consider the salesperson features as the package did not exist back then. Note that there is a large overlap across the three considered feature sets as outlined in [12]. To reduce the noise within and redundancy between the features, we additionally created five small subgroups from the `salesperson`

feature set, consisting of 15 features each (see Sec. 3 for details).

In addition to the 32 potential AS models described above (8 feature sets \(\times\) 4 learners), the respective feature sets were further reduced using three automated feature selection strategies: sequential floating forward selection (sffs), sequential floating backward selection (sfbs), and – for the reduced feature sets – exhaustive search of the 15 features [12], [40]. This resulted in 84 further candidate selectors.

Tab. **¿tbl:tab:overview-95par10-95classical-95as?** summarizes the averaged PAR10 performances of all 116 considered AS models, with the best achieved scores highlighted in red. Of course, all shown PAR10-scores already include the costs for the computation of the TSP features. On average those costs account for merely 0.7s at most. According to the listed performances of the best models (61.21s), the SBS (67.47s) and the VBS (4.92s), the best classical AS approaches are able to reduce the SBS-VBS-gap by 10%.

The best found selectors are random forests, which reduced the top 15 features that they were given initially, to the following four features: the sum, arithmetic mean, median and coefficient of variation of the distances of the MST. Moreover, the tuned thresholds varied from 4% to 33% across the ten folds, implying that EAX has always been selected once the model predicted EAX with a probability of at least 33%. In consequence, out of all instances, in which LKH was actually the faster solver, it has only been selected 151 times (corresonding to roughly 30%). On the other hand, LKH has only been (wrongly) picked in 4% of the cases, in which EAX would have been the correct choice.

When starting with the full feature sets from Pihera, UBC and salesperson, a support vector machine based on a subset of the UBC features achieved the best performance (printed in bold in Tab. **¿tbl:tab:overview-95par10-95classical-95as?**). In fact, the PAR10-score of 56.67s is only slightly worse than the one of our best selector(s). Noticeably, the SVM also relied on MST features only: the arithmetic mean and standard deviation of the lengths of the edges in the MST, as well as the skewness of its node degrees.

```
**Method**
**Algor.** **All Features** **Top 15 Features**
```

(l2ptr2pt)3-5 (l2ptr2pt)6-10 **Pihera** **UBC** **Sales.** **all** **r150** **r300** **s150** **s300** none rpart 67.25 69.35 67.70 67.32 67.32 67.52 68.12 67.32 rf 59.09 61.56 62.22 61.21 61.99 60.14 61.51 62.30 xgboost 65.40 67.00 67.57 65.53 65.53 64.51 65.78 65.53 ksvm 61.34 61.55 63.09 61.98 61.98 60.54 61.26 61.98 sffs rpart 66.20 67.92 67.26 66.82 66.82 66.82 66.82 66.82 rf 325.27 58.42 62.77 60.70 59.74 **56.29** **56.29** **56.29** xgboost 65.45 66.60 62.61 64.20 64.20 64.20 64.20 64.20 ksvm 66.42 **56.67** 60.57 62.38 62.38 62.22 62.06 62.38 sfbs rpart 67.16 66.10 65.61 67.12 67.12 67.19 68.12 67.12 rf 60.01 61.56 63.24 62.05 61.95 59.62 62.19 62.17 xgboost 65.47 67.00 67.57 65.88 63.87 64.77 61.87 65.88 ksvm 61.25 63.07 62.88 60.66 60.62 59.40 62.75 60.47 exh. rpart 67.19 67.19 66.82 67.10 67.19 rf 86.70 59.68 **56.29** 60.57 63.58 xgboost 61.75 64.28 63.02 64.09 63.02 ksvm 62.16 62.16 62.16 62.07 62.16

: Overview of all PAR10 results using the classical feature-based AS approach. The best PAR10-scores are highlighted in red. Note that we did not perform exhaustive feature selection on the non-reduced feature sets due to enormous computational costs. {#tbl:tab:overview-95par10-95classical-95as}

Our findings are also confirmed by the left image of Fig. 4 as only few observations (20) are located above the diagonal. However, looking at the right image, it becomes nearly obvious that the selector is still quite far away from the performance of the VBS, as shown by the many misclassifications (381 out of all 1 000 instances) above the diagonal.

In an attempt to better understand the reason for these rather small thresholds, we investigated the misclassification costs in detail. *Prior* to tuning the threshold, LKH was predicted 154 times when EAX would have been the correct choice – and each of those misclassifications caused (on average) an overhead of 4 423.89s. In contrast, the 128 cases in which EAX was predicted instead of LKH only came with an average penalty of 95.95s. *After* tuning the thresholds, each of the 20 wrong predictions of LKH caused only 290.04s – compared to the average penalty of 126.07s for the 361 wrong predictions of EAX. Thus, by being rather conservative and only predicting LKH in cases, where the model is highly certain, the selector was able to reduce the misclassification costs significantly.

As our results indicate, the common feature-based approaches are able to improve over the SBS. However, it is also noticeable that the currently available features still have a very hard time in extracting sufficient information from the TSP instance to reliably predict the better solver. Therefore, we will test the suitability of deep learning neural networks as an alternative or supporting means for automated algorithm selection.

As demonstrated in the previous section, feature-based AS methods can outperform the SBS. However, these models come with three major drawbacks: they (1) are hand-crafted in a tedious process, (2) partly require time-consuming calculations, and (3) are problem tailored (see Sec. 1). To overcome these issues, we propose a novel, and sophisticated feature-free approach that is based on so-called *Convolutional Neural Networks* (CNN) [41].

To train CNN based AS models that are independent of the commonly used TSP features, we will produce different visual representations of our TSP instances (see Fig. 2) and use them for training the deep learning networks. Those images are created with a resolution of \(512 \times 512\) pixels and the coordinates of the instances are scaled to fill out the entire image as exemplarily shown in the two images in the left column of Fig. 2. In addition to these point clouds, images of corresponding *Minimum Spanning Trees* (MST, second column of Fig. 2) and *\(k\)-Nearest-Neighbor Graphs* (\(k\)-NNG, right column of Fig. 2) with \(k = 5\) were generated. We chose MST and \(5\)-NNG as additional visual representations because we found in Section 3 that the 15 most important features are based almost exclusively on MST and \(5\)-NNG graphs. In the \(5\)-NNGs, not only mutual (strong) connections but also one-sided (weak) links were considered, in which one city belongs to the nearest neighbor set of another city, but not vice versa.

Admittedly, only networks whose generation was based exclusively on point clouds can be described as *feature-free*. For a better comparison, however, we have additionally evaluated *feature-based* networks that were trained with images of the corresponding MST and \(5\)-NNG. Hence, we considered two different scenarios for our network-based approaches. In the first scenario (S1), the networks were trained based on (a) point clouds of the cities (Points), (b) MST images, and (c) \(5\)-NNG images. In scenario (S2), we combined (a) the scatterplots with the MST images (as two input channels), and (b) the scatterplots with the MST and \(5\)-NNG images (as three input channels). As the costs for generating the images are insignificant, we have not taken their generation time into account when computing the PAR10 scores of the deep learning models. For larger instances, though, these times would have to be taken into account.

To process the visual representations of the instances, we used eight stacked convolutional layers (see Fig. 5). Three of them used \(Strides=2\) to reduce the size of the feature maps and four of them used \(Dilation=\{2,3\}\)-Kernels to enlarge the receptive fields and, thus, gain a larger view of the instances (see Fig. **¿fig:cnn-95working?** to compare the effects of Strides and Dilation). We used *Rectified Linear Unit* (ReLU) [42] as activation function for all layers except for the last linear layer, for which we used a *Softmax* activation. To improve the training speed, the outputs of all convolutional layers are normalized by using *Group Normalization* (GN) [43] with \(G=8\). The GN layers are in-between the convolutional layers and the ReLU activation. For transition from the three-dimensional convolutional (\(Width \times Height \times Channels\)) layers to the one-dimensional linear layer, a *Global Average Pooling Layer* (GPL) [44] is used. After the GPL, a Dropout (DP) [45] layer with 25% dropout is added to improve regularization. The final layer is a single, linear layer with two output neurons – one for EAX and one for LKH (see Fig. 5). Last, we used 10-fold cross-validation to evaluate the performance of the neural networks. The folds were the same as for the classical feature-based approach. All networks were trained using mini-batches of eight, Adam [46] as optimizer and *Cross-Entropy* [47] as loss function. Note that neural networks are most commonly trained using *Stochastic Gradient Descent* [48], which strongly differs from the training methods used in Section 4.

The best performing classical approach achieved a mean PAR10-score of \(56.29\) (see Tab. **¿tbl:tab:overview-95par10-95classical-95as?**). Our feature-free networks, which were trained exclusively using the points, achieved a mean PAR10-score of \(56.31\) after tuning the threshold and thus a similar performance (see Tab. 1) as the feature-based, classical approaches.

As stated before, we additionally investigated whether adding additional features to the networks could improve the models’ performances. Therefore, we also trained feature-based networks using MST and 5-NNG images. As shown in Tab. 1, both variants perform noticeably better. Besides, we found that while the thresholds between the points and the MST models are rather similar, the thresholds of the NNG models are on average \(10\%\) higher. Next, the thresholds of the Points and MST models range from \(13\%\) to \(34\%\), and \(16\%\) to \(35\%\), respectively, while the thresholds of the NNG models range from \(3\%\) to \(73\%\). Thus, networks trained on the NNG images appear to be less stable.

Moreover, the networks based on the Points correctly predicted EAX in 91.8% (and thus 459 times) of the cases, in which EAX was the better solver, compared to only 22.6% (113 cases) for the LKH-friendly cases. This behavior likely results from the fact that a misclassification of an instance, which is favorable for LKH, is cheaper than a misclassification of an instance that is easier for EAX. In contrast, the MST-based networks predict EAX in 91.6% (458) and LKH in 27.2% (136) cases, correctly. Thus, compared to the feature-free networks, which were exclusively based on Points, the MST networks benefit from correctly identifying LKH-friendly instances – without losing accuracy on the EAX-friendly instances. Noticeably, in case of the 5-NNG networks, only 84.6% (423) of the EAX-easy instances are classified correctly, compared to 34.8% (174) among the LKH-easy instances. Thus, despite the improvements among the instances that are favorable for LKH, the PAR10 score of the NNG-based networks is inferior to the MST-based selector, as misclassifying EAX-easy instances is more expensive.

Scenario S1 |
Scenario S2 |
|||||||||

(l2ptr2pt)2-7 (l2ptr2pt)8-11 | Points | MST | NNG | Points + MST | Points+MST+NNG | |||||

(l2ptr2pt)2-3 (l2ptr2pt)4-5 (l2ptr2pt)6-7 (l2ptr2pt)8-9 (l2ptr2pt)10-11 Fold | PAR10 | TH | PAR10 | TH | PAR10 | TH | PAR10 | TH | PAR10 | TH |

1 | 34.40 | 0.19 | 30.91 | 0.27 | 31.35 | 0.64 | 27.13 |
0.21 | 32.95 | 0.14 |

2 | 78.88 | 0.15 | 57.95 | 0.17 | 76.99 | 0.18 | 54.65 |
0.14 | 58.06 | 0.22 |

3 | 67.43 | 0.14 | 64.14 | 0.19 | 57.94 |
0.32 | 62.40 | 0.25 | 62.15 | 0.15 |

4 | 50.50 | 0.13 | 45.89 |
0.19 | 59.35 | 0.03 | 54.92 | 0.19 | 56.58 | 0.13 |

5 | 52.41 |
0.29 | 57.76 | 0.16 | 59.40 | 0.43 | 58.50 | 0.24 | 61.43 | 0.25 |

6 | 58.07 | 0.15 | 45.28 |
0.21 | 56.09 | 0.35 | 64.59 | 0.17 | 54.95 | 0.19 |

7 | 50.32 | 0.40 | 58.91 | 0.29 | 42.49 |
0.05 | 49.47 | 0.25 | 54.76 | 0.22 |

8 | 36.89 | 0.23 | 35.35 | 0.20 | 26.08 |
0.73 | 63.47 | 0.29 | 32.07 | 0.27 |

9 | 93.07 |
0.31 | 95.44 | 0.23 | 95.41 | 0.41 | 96.89 | 0.16 | 93.57 | 0.27 |

10 | 41.10 |
0.34 | 48.79 | 0.35 | 44.52 | 0.13 | 52.09 | 0.42 | 48.66 | 0.14 |

\(\varnothing\) | 56.31 | 0.23 | 54.04 |
0.23 | 54.96 | 0.33 | 58.41 | 0.23 | 55.52 | 0.20 |

To investigate whether the combination of the three different input variants would lead to networks that achieve better performances when predicting EAX- and LKH-easy instances, we combined Points and MST, as well as Points, MST and NNG into two and three input channels (Scenario 2), respectively. However, as shown in Tab. 1, combining the visual representations does not improve the networks’ overall performances. Also, while the network based on Points + MST classifies 31% (155) of the LKH-easy instances correctly, the selector based on Points + MST + NNG only succeeds in 17% (85) of the respective cases. We further observed that the threshold values are quite similar to the ranges of the Points and MST models from scenario S1.

As visualized in Fig. 6, the Points (left) and MST networks (right) prefer EAX over LKH – as there are far more observations below the diagonal. This is also confirmed by the low thresholds (see Tab. 1). Interestingly, the networks solely based on visual representations of the instance, even perform slightly better than the classical feature-based AS models – but are still clearly inferior to the VBS.

The conducted experiments shed light on still existing shortcomings of classical feature-based per-instance algorithm selection on the TSP. While previous studies clearly reported successful approaches, the informative character of existing TSP feature sets reaches its limits for instances specifically evolved for maximum performance difference of the two state-of-the art heuristic solvers. Sophisticated mutation operators here lead to so far unobserved topological structures. Despite outperforming the SBS, the gap to the performance of the oracle-like VBS cannot be closed substantially, even after utilization of sophisticated preprocessing and feature selection approaches.

However, it again becomes obvious that the minimum spanning tree and nearest neighbor structures of the points are most informative in discriminating solver performances. We build on this information and enrich a deep neural network approach based on images of the instance topology by specific images visualizing the minimum spanning tree and the nearest neighbor graph. Most interestingly, our feature-free deep neural network nicely matches the performance of the quite complex classical AS approach (see Tab. 2), despite being solely based on an image of the instance’s points.

This proof-of-concept study thus shows the huge potential of deep learning, feature-free approaches in this domain which we will exploit in future studies by more sophisticated networks and altered loss functions specifically adapted to the designated performance indicators. Moreover, additional image channels will be added, e.g., in terms of heatmaps. Specific investigations have to be conducted with regard to the scaling behaviour of the approach, as image resolutions most probably will have to be carefully adapted to increasing instance sizes.

On the other hand, the observed limitations of classical TSP features show the necessity of enriching the library of TSP features by alternative sets which capture other kinds of — obviously important — instance structures. We will apply our approach to classical feature sets such as RUE or TSPLib as well for a comparison. However, it is specifically noteworthy that, in principle, the deep learning approach nicely generalizes to other graph-based optimization problems while instance features are almost exclusively tailored to the focused domain.

Measure |
Baseline |
Classical AS |
Deep Learning AS |
||||

(l2ptr2pt)2-3 (l2ptr2pt)4-5 (l2ptr2pt)6-8 | VBS | SBS (EAX) | RF | SVM | Points | MST | Points + MST |

PAR10 | 4.92 | 67.47 | 56.29 | 56.67 | 56.31 | 54.04 | 55.52 |

Accuracy | 1.00 | 0.49 | 0.62 | 0.60 | 0.57 | 0.59 | 0.61 |

F1-Score | 1.00 | 0.65 | 0.71 | 0.70 | 0.68 | 0.69 | 0.70 |

The authors acknowledge support by the *European Research Center for Information Systems (ERCIS)*.

[1] K. Helsgaun, “An effective implementation of the lin–kernighan traveling salesman heuristic,” *European Journal of Operational Research*, vol. 126, no. 1, pp. 106–130, 2000.

[2] Y. Nagata and S. Kobayashi, “A Powerful Genetic Algorithm Using Edge Assembly Crossover for the Traveling Salesman Problem,” *INFORMS Journal on Computing*, vol. 25, no. 2, pp. 346–363, 2013.

[3] L. Kotthoff, P. Kerschke, H. H. Hoos, and H. Trautmann, “Improving the State of the Art in Inexact TSP Solving Using Per-Instance Algorithm Selection,” in *Proceedings of the 9th International Conference on Learning and Intelligent Optimization (LION)*, 2015, vol. 8994, pp. 202–217, [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-19084-6_18.

[4] J. R. Rice, “The Algorithm Selection Problem,” *Advances in Computers*, vol. 15, pp. 65–118, 1976, [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0065245808605203.

[5] L. Kotthoff, “Algorithm Selection for Combinatorial Search Problems: A Survey,” *AI Magazine*, vol. 35, no. 3, pp. 48–60, 2014, doi: 10.1609/aimag.v35i3.2460.

[6] T. M. Lindauer, H. H. Hoos, F. Hutter, and T. Schaub, “AutoFolio: An Automatically Configured Algorithm Selector (Extended Abstract),” in *Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)*, 2017, pp. 5025–5029, doi: 10.24963/ijcai.2017/715.

[7] Y. Malitsky, A. Sabharwal, H. Samulowitz, and M. Sellmann, “Algorithm Portfolios Based on Cost-Sensitive Hierarchical Clustering,” in *Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI)*, 2013, vol. 13, pp. 608–614, [Online]. Available: https://www.aaai.org/ocs/index.php/IJCAI/IJCAI13/paper/view/6946.

[8] M. Rizzini, C. Fawcett, M. Vallati, A. E. Gerevini, and H. H. Hoos, “Static and Dynamic Portfolio Methods for Optimal Planning: An Empirical Analysis,” *International Journal on Artificial Intelligence Tools*, vol. 26, no. 1, pp. 1–27, 2017, doi: 10.1142/S0218213017600065.

[9] L. Xu, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Evaluating Component Solver Contributions to Portfolio-Based Algorithm Selectors,” in *Proceedings of the 15th International Conference on Theory and Applications of Satisfiability Testing (SAT)*, 2012, vol. 7317, pp. 228–241.

[10] P. Kerschke and H. Trautmann, PMID: 30365386“Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning,” *Evolutionary Computation*, vol. 27, no. 1, pp. 99–127, 2019, doi: 10.1162/evco\_a\_00236.

[11] B. Bischl, O. Mersmann, H. Trautmann, and M. Preuss, “Algorithm Selection Based on Exploratory Landscape Analysis and Cost-Sensitive Learning,” in *Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO)*, 2012, pp. 313–320, doi: 10.1145/2330163.2330209.

[12] P. Kerschke, H. H. Hoos, F. Neumann, and H. Trautmann, “Automated Algorithm Selection: Survey and Perspectives,” *Evolutionary Computation (ECJ)*, vol. 27, no. 1, pp. 3–45, 2019.

[13] P. Kerschke, L. Kotthoff, J. Bossek, H. H. Hoos, and H. Trautmann, “Leveraging TSP Solver Complementarity through Machine Learning,” *Evolutionary Computation (ECJ)*, vol. 26, no. 4, pp. 597–620, 2018.

[14] O. Mersmann, B. Bischl, J. Bossek, H. Trautmann, M. Wagner, and F. Neumann, “Local Search and the Traveling Salesman Problem: A Feature-Based Characterization of Problem Hardness,” in *Proceedings of 6th International Conference on Learning and Intelligent Optimization (LION)*, 2012, vol. 7219, pp. 115–129, doi: 10.1007/978-3-642-34413-8\_9.

[15] O. Mersmann, B. Bischl, H. Trautmann, M. Wagner, J. Bossek, and F. Neumann, “A Novel Feature-Based Approach to Characterize Algorithm Performance for the Traveling Salesperson Problem,” *Annals of Mathematics and Artificial Intelligence*, vol. 69, no. 2, pp. 151–182, 2013, doi: 10.1007/s10472-013-9341-2.

[16] J. Pihera and N. Musliu, “Application of machine learning to algorithm selection for TSP,” in *26th IEEE international conference on tools with artificial intelligence, ICTAI 2014, limassol, cyprus, november 10-12, 2014*, 2014, pp. 47–54.

[17] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, “Algorithm runtime prediction: Methods & evaluation,” *Artif. Intell.*, vol. 206, pp. 79–111, Jan. 2014, doi: 10.1016/j.artint.2013.10.003.

[18] I. Guyon and A. Elisseeff, “An Introduction to Feature Extraction,” in *Feature Extraction*, Springer, 2006, pp. 1–25.

[19] H. Peng, F. Long, and C. Ding, “Feature Selection Based on Mutual Information Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy,” *IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)*, vol. 27, no. 8, pp. 1226–1238, 2005, doi: 10.1109/TPAMI.2005.159.

[20] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, and J. H. Moore, “Relief-Based Feature Selection: Introduction and Review,” *Journal of biomedical informatics*, vol. 85, pp. 189–203, Sep. 2018, doi: 10.1016/j.jbi.2018.07.014.

[21] W. K. Härdle and L. Simar, *Applied multivariate statistical analysis*, 4th ed. Springer, Berlin, Heidelberg, 2015.

[22] M. Alissa, K. Sim, and E. Hart, “Algorithm selection using deep learning without feature extraction,” in *Proceedings of the genetic and evolutionary computation conference*, 2019, pp. 198–206, doi: 10.1145/3321707.3321845.

[23] P. Ross, S. Schulenburg, J. G. Marı́n-Bläzquez, and E. Hart, “Hyper-heuristics: Learning to combine simple heuristics in bin-packing problems,” in *Proceedings of the 4th annual conference on genetic and evolutionary computation*, 2002, pp. 942–948.

[24] K. Sim, E. Hart, and B. Paechter, “A hyper-heuristic classifier for one dimensional bin packing problems: Improving classification accuracy by attribute evolution,” in *Parallel problem solving from nature - PPSN XII - 12th international conference, taormina, italy, september 1-5, 2012, proceedings, part II*, 2012, vol. 7492, pp. 348–357, doi: 10.1007/978-3-642-32964-7\_35.

[25] B. Bischl *et al.*, “ASlib: A benchmark library for algorithm selection,” *Artif. Intell.*, vol. 237, pp. 41–58, 2016, doi: 10.1016/j.artint.2016.04.003.

[26] J. Bossek, P. Kerschke, A. Neumann, M. Wagner, F. Neumann, and H. Trautmann, “Evolving Diverse TSP Instances by Means of Novel and Creative Mutation Operators,” in *Proceedings of the 15\(^{th}\) acm/sigevo workshop on foundations of genetic algorithms (foga xv)*, 2019, pp. 58–71.

[27] G. Reinelt, “TSPLIB–a traveling salesman problem library,” *ORSA Journal on Computing*, vol. 3, no. 4, pp. 376–384, 1991.

[28] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, “Algorithm Runtime Prediction: Methods & Evaluation,” *Artificial Intelligence Journal (AIJ)*, vol. 206, pp. 79–111, 2014, [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0004370213001082.

[29] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” *Ann. Math. Statist.*, vol. 18, no. 1, pp. 50–60, Mar. 1947, doi: 10.1214/aoms/1177730491.

[30] J. Bossek and H. Trautmann, “Evolving Instances for Maximizing Performance Differences of State-of-the-Art Inexact TSP Solvers,” in *Proceedings of the 10th international conference on learning and intelligent optimization (lion 2016)*, 2016, vol. 10079 LNCS, pp. 48–59.

[31] J. Bossek and H. Trautmann, “Understanding Characteristics of Evolved Instances for State-of-the-Art Inexact TSP Solvers with Maximum Performance Difference,” in *AI*IA 2016 advances in artificial intelligence*, 2016, vol. 10037, pp. 3–12, doi: 10.1007/978-3-319-49130-1\_1.

[32] P. Kerschke, J. Bossek, and H. Trautmann, “Parameterization of State-of-the-Art Performance Indicators: A Robustness Study Based on Inexact TSP Solvers,” in *Proceedings of the 20th genetic and evolutionary computation conference (gecco) companion*, 2018, pp. 1737–1744, doi: 10.1145/3205651.3208233.

[33] T. Hastie, R. Tibshirani, and J. Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd ed. Springer, 2009.

[34] B. Bischl *et al.*, “Mlr: Machine Learning in R,” *Journal of Machine Learning Research (JMLR)*, vol. 17, no. 170, pp. 1–5, 2016, [Online]. Available: http://jmlr.org/papers/v17/15-066.html.

[35] T. Therneau and B. Atkinson, R package version 4.1-15*rpart: Recursive Partitioning and Regression Trees*. 2019.

[36] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” *R News*, vol. 2, no. 3, pp. 18–22, 2002, [Online]. Available: https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf.

[37] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, “kernlab – An S4 Package for Kernel Methods in R,” *Journal of Statistical Software (JSS)*, vol. 11, no. 9, pp. 1–20, 2004, [Online]. Available: http://www.jstatsoft.org/v11/i09/.

[38] T. Chen *et al.*, R package version 0.90.0.2*xgboost: Extreme Gradient Boosting*. 2019.

[39] J. Bossek, R package version 1.0.0*salesperson: Computation of Instance Features and R Interface to the State-of-the-Art Exact and Inexact Solvers for the Traveling Salesperson Problem*. 2017.

[40] R. Kohavi, G. H. John, and others, “Wrappers for feature subset selection,” *Artificial intelligence*, vol. 97, nos. 1-2, pp. 273–324, 1997.

[41] Y. LeCun, Y. Bengio, and others, “Convolutional networks for images, speech, and time series,” *The handbook of brain theory and neural networks*, vol. 3361, no. 10, p. 1995, 1995.

[42] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in *Proceedings of the fourteenth international conference on artificial intelligence and statistics, AISTATS 2011, fort lauderdale, usa, april 11-13, 2011*, 2011, vol. 15, pp. 315–323, [Online]. Available: http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf.

[43] Y. Wu and K. He, “Group normalization,” *Int. J. Comput. Vis.*, vol. 128, no. 3, pp. 742–755, 2020, doi: 10.1007/s11263-019-01198-w.

[44] M. Lin, Q. Chen, and S. Yan, “Network in network,” in *2nd international conference on learning representations, ICLR 2014, banff, ab, canada, april 14-16, 2014, conference track proceedings*, 2014, [Online]. Available: http://arxiv.org/abs/1312.4400.

[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” *The journal of machine learning research*, vol. 15, no. 1, pp. 1929–1958, 2014.

[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in *3rd international conference on learning representations, ICLR 2015, san diego, ca, usa, may 7-9, 2015, conference track proceedings*, 2015, [Online]. Available: http://arxiv.org/abs/1412.6980.

[47] S. Mannor, D. Peleg, and R. Y. Rubinstein, “The cross entropy method for classification,” in *Machine learning, proceedings of the twenty-second international conference (ICML 2005), bonn, germany, august 7-11, 2005*, 2005, vol. 119, pp. 561–568, doi: 10.1145/1102351.1102422.

[48] H. Robbins and S. Monro, “A stochastic approximation method,” *The annals of mathematical statistics*, pp. 400–407, 1951.

The PAR10-score is a common measure in AS for combinatorial optimization problems. For a stochastic algorithm \(A\) and an instances \(I\) it is defined as the average of the running times of \(A\) on \(I\) where runs which did not reach the optimum within a given time limit \(T\) are penalised by a factor of \(10\cdot T\).↩︎

Various TSP benchmark libraries exist, e.g., TSPLIB [27] or classical Random Uniform Euclidean (RUE) instances. However, these instances are either inhomogeneous in size or exhibit very little structural difference. For the purpose of algorithm selection though a balanced and homogeneous benchmark set is highly beneficial.↩︎

The mutation operators are designed to evolve structures that can be observed in real-world TSP instances, e.g.,

*Very Large Scale Integration*(VLSI) and are thus closer to the real-world than the often used random uniform problems.↩︎We work with the restart versions of EAX and LKH which trigger a restart once the internal stopping conditions are met [3] as long as the time limit is not reached.↩︎

The R-package

`salesperson`

is an efficient and more comprehensive extension of the`tspmeta`

feature generator by [15].↩︎