Learning Optimal Objective Values for MILP

Lara Scavuzzo
TU Delft
l.v.scavuzzomontana@tudelft.nl
Karen Aardal
TU Delft
k.i.aardal@tudelft.nl
Neil Yorke-Smith
TU Delft
n.yorke-smith@tudelft.nl


Abstract

Modern Mixed Integer Linear Programming (MILP) solvers use the Branch-and-Bound algorithm together with a plethora of auxiliary components that speed up the search. In recent years, there has been an explosive development in the use of machine learning for enhancing and supporting these algorithmic components [1]. Within this line, we propose a methodology for predicting the optimal objective value, or, equivalently, predicting if the current incumbent is optimal. For this task, we introduce a predictor based on a graph neural network (GNN) architecture, together with a set of dynamic features. Experimental results on diverse benchmarks demonstrate the efficacy of our approach, achieving high accuracy in the prediction task and outperforming existing methods. These findings suggest new opportunities for integrating ML-driven predictions into MILP solvers, enabling smarter decision-making and improved performance.

1 Introduction↩︎

Mixed Integer Linear Programming (MILPs) is a widespread tool for modelling mathematical optimization problems, with applications in numerous real-world scenarios. The Branch-and-Bound (B&B) algorithm, which employs a divide-and-conquer approach, is the preferred method for solving MILPs to global optimality. In recent years, there has been a surge in interest in harnessing the power of machine learning (ML) tools to aid the solution process of MILPs. From solution prediction (e.g. [2][4]) to interventions on the heuristic rules used by the solvers (e.g. [5][7]), several approaches have been studied in the literature (see [1] for an in-depth discussion of this topic). The overarching trend is to build dynamic MILP solvers that can make active use of the large amounts of data produced during the solving process.

Many of the decisions that must be made during the B&B process could be better informed were the optimal solution known from the start. In fact, even knowing the optimal objective value can positively influence the solver behaviour. For example, once a solution is found that matches this value, any effort to find new solutions can be avoided. With perfect information of the optimal objective value, a solver can further do more aggressive pruning of nodes. In general, having this knowledge can allow the solver to adapt its configuration, putting more emphasis on different components. Even in absence of perfect information, a good prediction of the optimal objective value can still be used to change the solver settings or to devise smarter rules, such as node selection policies that account for this predicted value. Inspired by these observations we ask the two following closely-related questions:

  • How well can we predict the optimal objective value?

  • With what accuracy can we predict, during the solution process, whether or not a given solution is optimal?

Our contributions are as follows. First, we propose a methodology to predict optimal objective values, answering (Q1). We then use the output of this predictive model, together with additional data, as input of our proposed classifiers, which give an answer to question (Q2). For this second task, we also propose some metrics that capture the state of the solving process, and that prove to be valuable for our classifier. Our computational study shows the high accuracy of our proposed predictor. Furthermore, when compared to previous methods, our classifiers show better performance. Finally, we provide further insight into how the performance can be tuned to the desired behaviour and into the ways that the classifier makes use of the provided data.

Our discussion is organized as follows. We start by defining some key concepts and notation in Section 2, followed by a discussion of the work most closely related to ours (Section 3). Section 4 describes our methodology in detail. The results of our computational study are presented in Section 5. Finally, we conclude with some final remarks and future work in Section 6. The code to reproduce all experiments is available online [8].

2 Background↩︎

2.0.0.1 Mixed Integer Linear Programming

Given are a matrix \(\boldsymbol{A}\in \mathbb{Q}^{m\times n}\), vectors \(\boldsymbol{c}\in\mathbb{Q}^n\) and \(\boldsymbol{b}\in\mathbb{Q}^m\), and a partition \((\mathcal{I}, \mathcal{C})\) of the variable index set \(\{1,2,...,n\}\). A Mixed Integer Linear Program is the problem of finding \[\label{eq:MILP} \begin{align} z^* = \min\;\; & \boldsymbol{c}^{^\mathsf{T}} \boldsymbol{x}\\ \text{subject to } & A \boldsymbol{x}\geq \boldsymbol{b}, \\ & x_{j}\in \mathbb{Z}_{\geq 0}\quad \forall j\in \mathcal{ I}, \\ & x_{j}\geq 0 \quad \forall j\in \mathcal{C}. \\ \end{align}\tag{1}\] Notice that the variables in \(\mathcal{I}\) are required to be integer. Removing this integrality constraint turns the problem into a Linear Program (LP), which constitutes a relaxation of the original MILP, known as the LP relaxation. While MILP is \(\mathcal{NP}\)-hard, LPs are polynomial solvable.

2.0.0.2 Solving Mixed Integer Linear Programs

The standard approach to solving MILPs is to use the LP-based branch-and-bound (B&B) algorithm. This algorithm sequentially partitions the feasible region, while using LP relaxations to obtain lower bounds on the quality of the solutions of each sub-region. This search can be represented as a binary1 tree. At a given time \(t\) of the solution process we use \(T_t\) to denote the search tree, i.e. the set of nodes, constructed so far by the B&B algorithm. We denote by \(\boldsymbol{x}^*\) the optimal solution to Problem (1 ) and \(z^*\) its corresponding optimal objective value. For a given node \(i\) of the search tree, let \(z^{LP}_i\) be the optimal objective value of the node’s LP relaxation. We use the notation \(z^{LP}\) for the root node, i.e., the solution to the original problem’s LP relaxation. At any point of the search, an integer feasible solution provides an upper bound on the optimal objective value. Let \(\bar{\boldsymbol{x}}(t)\) be the best known solution at time \(t\) and let \(\bar{z}(t)=\boldsymbol{c}^{^\mathsf{T}} \bar{\boldsymbol{x}}(t)\) denote its objective value (also called the incumbent). Then we can prune any node \(i\) such that \(z^{LP}_i \geq \bar{z}(t)\).

The nodes of \(T_t\) can be classified into three types:

  • \(I_t\) is the set of inner nodes of the tree. This is, nodes that have been processed (its LP relaxation solved) and resulted in branching.

  • \(L_t\) is the set of leaves of the tree. This is, the set of nodes that have been processed and resulted in pruning or in an integer feasible solution.

  • \(O_t\) is the set of open nodes, i.e., nodes that have not been processed yet.

As mentioned before, the incumbent \(\bar{z}(t)\) provides an upper bound on \(z^*\). We can also obtain a global lower bound. Let \(\underline{z}(t):= \min_{i\in O_t}\{z^{LP}_i\}\). Then notice that necessarily \(\underline{z}(t) \leq z^*\).

In practice, MILP solvers implement a plethora of other techniques to accelerate the solution process. Among them, cutting planes and primal heuristics are essential parts of today’s mathematical optimization software.

2.0.0.3 MILP solving phases

The B&B algorithm can solve MILPs to optimality. This means that, if the algorithm terminates, it does so after having obtained a feasible solution and a proof of its optimality (or, on the contrary, proof of infeasibility). Several solver components work together for this goal, each with more or less focus on the feasibility and the optimality parts. [9] point out that, typically, the optimal solution is found well before the solver can prove optimality. Following this, they propose partitioning the search process into phases, according to three target goals. These phases are the following.

  1. Feasibility. This phase encompasses the time spanned from the beginning of the search until the first feasible solution is found.

  2. Improvement. From the moment the first feasible solution is found until an optimal solution is found.

  3. Proving. Spans the time elapsed from the moment the optimal solution is found until the solver terminates with a proof of optimality.

The transition between the first and the second phase happens when a feasible solution is found. In contrast, the moment in which the solver transitions from the second to the third phase is unknown until the search is completed. Notice that if the instance is infeasible the solver terminates while in the first phase. For the purpose of our study we assume that the instances are feasible.

3 Related Work↩︎

3.0.0.1 MILP solution prediction

In recent years, the topic of solution prediction for combinatorial optimization problems has gained momentum [10][12]. For MILPs, the goal is to produce a (partial) assignment of the integer variables via a predictive machine learning model. This prediction can then be used to guide the search in different ways. [2] impose a constraint that forces the search to remain in a neighbourhood of the predicted optimal solution. In this way, by restricting the size of the feasible region, the authors aim to accelerate the solution process. In contrast, the approaches of [3] and [13] consist in fixing a subset of variables to their predicted optimal value, letting the solver optimize over the remaining ones. [13] further propose a solver mode that uses the predicted solution to guide the node processing order. In the present work, we take a different path by aiming to predict the optimal objective value, as opposed to the solution, i.e., the values that each variable takes. This task is easier from a learning perspective, yet still offers several ways in which one can exploit this information.

3.0.0.2 Phase transition predictions

[9] defined the three phases of MILP solving that were introduced in Section 2. Their goal is to adapt the solver’s strategy depending on the phase. For this purpose, they propose two criteria that can be used to predict the transition between phase 2 (improvement) and phase 3 (proving) without knowledge of the optimal solution. These criteria are based on node estimates: for every node \(i\in T_t\), the solver SCIP keeps an estimate \(\hat{c}(i)\) of the objective value of the best solution attainable at that node (see [9] for a formal definition of how this is computed). At time \(t\) of the solving process, let \(\hat{c}^{\min}(t):=\min \{\hat{c}(i) \mid i\in O_t\}\) be the minimum of these estimates among the open nodes. We further define \(d(i)\) to be the depth2 of node \(i\). The first transition criterion, the best-estimate criterion, indicates that the transition moment is the first time the incumbent becomes smaller than \(\hat{c}^{\min}(t)\). Formally, let us define a binary classifier \(C^{\text{est}}\) that indicates if the transition has occurred using the criterion

\[\label{eq:estimate95classifier} C^{\text{est}} = \begin{cases} 1 & \text{if } \min_{s\in [0,t]} \{\bar{z}(s) - \hat{c}^{\min}(s)\} < 0 \\ 0 & \text{otherwise}. \end{cases}\tag{2}\]

The second criterion is called rank-1 and is based on the set of open nodes with better estimate than the processed nodes at the same depth. Formally, let \[R^1(t):=\Big\{i \in O_t \mid \hat{c}(i) \leq \inf \{\hat{c}(j):j\in I_t\cup L_t, d(j)=d(i)\} \Big\} \,\]

This set can be used to define a classifier \(C^{\text{rank-1}}\) that indicates that the transition has occurred once the set becomes empty for the first time. This is,

\[\label{eq:rank195classifier} C^{\text{rank-1}} = \begin{cases} 1 & \text{if } \min_{s\in [0,t]} |R^1(s)| = 0 \\ 0 & \text{otherwise}. \end{cases}\tag{3}\]

The authors use these criteria to switch between different pre-determined solver settings depending on the phase of solving. Their experiments show improved solving time, especially when using the rank-1 criterion. However, it is also clear that both criteria tend to be satisfied before the phase transition actually occurs, and there is some room for improvement in the accuracy of the classifiers, as we shall see from our own computational study.

3.0.0.3 B&B resolution predictions

Closely related to the present work is that of [14], who use a number of solver metrics to predict the final B&B tree size. They use a combination of metrics from the literature, together with their own, as input to a machine learning model that estimates the final tree size dynamically as the tree is being constructed. Their method was incorporated into version 7.0 of the solver SCIP as a progress metric for the user. In a similar fashion, [15] use a number of solver metrics to predict, during the solving process, whether or not the run will end within the given time limit. This prediction can be used to adapt the solver behaviour in the case that the answer is negative.

4 Methodology↩︎

This section details the methodology used to answer questions Q1 (Section 4.1) and Q2 (Section 4.2). We assume we are given a space \(\mathcal{X}\) of instances of interest. For some tasks, we will use the bipartite graph representation of MILPs introduced by [5]. This is, given an MILP instance \(X\in \mathcal{X}\) defined as in Eq. 1 , we build a graph representation as follows: each constraint and each variable have a corresponding representative node. A constraint node is connected to a variable node if the corresponding variable has a non-zero coefficient in the corresponding constraint. Each node has an associated vector of features that describes it. We utilize the same features as Gasse et al., except that we do not include any incumbent information. In short, instead of the raw data in \(X\in\mathcal{X}\) we use the graph representation, which we denote \(X_G\in\mathcal{X}_G\), and is composed of a tuple \(X_G=(\boldsymbol{C}, \boldsymbol{V}, \boldsymbol{A})\), where \(\boldsymbol{C}\in \mathbb{R}^{m\times d_c}\) and \(\boldsymbol{V}\in \mathbb{R}^{n\times d_v}\) represent the constraint and variable features, respectively, and \(\boldsymbol{A}\in \mathbb{R}^{m\times n}\) is the adjacency matrix.

Figure 1: Optimal objective value prediction task. The MILP representation is computed after the root node has been processed. This serves as an input to a GNN that outputs a prediction \(\tilde{z}^*\) of the optimal objective value.

4.1 Optimal value prediction↩︎

The first task we tackle is the one of predicting the optimal objective value (Q1). That is, given an MILP instance \(X \in \mathcal{X}\), we want to predict the optimal objective value \(z^*\). This prediction is computed once and for all at the root node, once the LP solution is available. We frame this as a regression task. This process is depicted in Figure 1.
For this regression task, we utilise the bipartite graph representation of [5] defined above, which is processed using a Graph Neural Network (GNN) that performs two half-convolutions. In particular, the feature matrices \(\boldsymbol{C}\) and \(\boldsymbol{V}\) first go through an embedding layer with two feedforward networks with ReLU activation. Next, one first pass updates the constraint descriptors using the variable descriptors, while a second pass updates the variable descriptors using the (new) constraint descriptors. This is done with message-passing operations, computed as \[\boldsymbol{c}'_i = \boldsymbol{W}^{(11)}\boldsymbol{c}_i + \boldsymbol{W}^{(12)}\sum_{j=1}^n \boldsymbol{A}_{ij} \boldsymbol{v}_j\] \[\boldsymbol{v}'_j = \boldsymbol{W}^{(21)}\boldsymbol{v}_j + \boldsymbol{W}^{(22)}\sum_{i=1}^m \boldsymbol{A}_{ij} \boldsymbol{c}'_i\] where \(\boldsymbol{W}^{(11)}\), \(\boldsymbol{W}^{(12)}\), \(\boldsymbol{W}^{(21)}\) and \(\boldsymbol{W}^{(22)}\) are trainable weights, \(\boldsymbol{c}_i\) is the feature vector of constraint \(i\) and \(\boldsymbol{v}_j\) is the feature vector of variable \(j\). The variable descriptors then go through another feedforward network with ReLU activation. Finally, average pooling is applied to obtain one single output value.

Our goal is to learn a mapping \(f(X_g): \mathcal{X}_G \mapsto \mathbb{R}\) which outputs an approximation \(\tilde{z}^*\) of the optimal objective value \(z^*\). At the moment of this prediction, the solution to the root LP relaxation is known and can be used for further context. In order to exploit that knowledge, we test three potential targets for the machine learning model, namely \[\Theta_1 = z^*\] \[\Theta_2 = \frac{z^*}{z^{LP}}\] \[\Theta_3 = z^* - z^{LP} \, .\] This gives rise to three models \(f_1(X_g)\), \(f_2(X_g)\) and \(f_3(X_g)\), which we later transform into the desired output by setting either \(f(X_g)=f_1(X_g)\), \(f(X_g)=f_2(X_g)\cdot z^{LP}\), or \(f(X_g)=f_3(X_g) + z^{LP}\).

4.2 Prediction of phase transition↩︎

The second task (Q2) is predicting the transition between phases 2 (improvement) and 3 (proving). That is, at any point during the solution process we want to predict whether the incumbent is in fact optimal. We cast this problem as a classification task.
We test the performance of two classifiers. The first one is based on the output of the GNN model discussed in Section 4.1. Given an instance \(X\in\mathcal{X}\) (in fact its associated graph representation \(X_G\)) and the current incumbent \(\Bar{z}\), we obtain a binary prediction \(C^{GNN}_{\epsilon}:\mathcal{X}_G\times \mathbb{R} \mapsto \{0,1\}\) in the following way \[\label{eq:gnn95classifier} C^{GNN}_{\epsilon}(X_G, \Bar{z}) = \begin{cases} 1 & \text{if } \Bar{z} < f(X_G) + \epsilon \cdot |f(X_G)| \\ 0 & \text{otherwise} \end{cases}\tag{4}\] for some \(\epsilon\in [-1,1]\). The \(\epsilon\) parameter allows us to control the confidence in the prediction.

The \(C^{GNN}_{\epsilon}\) classifier is static, in the sense that it does not make use of any information coming from the B&B process. On the contrary, the second predictor we propose, which we call \(C^D\), is based on a set of dynamic metrics that are collected during the solving process. The metrics are the following.

4.2.0.1 Gap

Following SCIP [16], we define the gap as \[\label{eq:gap} g(t) := \begin{cases*} 1 & if no solution has been found yet or \bar{z}(t)\cdot \underline{z}(t)<0,\\ \frac{|\bar{z}(t)-\underline{z}(t)|}{\max \{|\bar{z}(t)|, |\underline{z}(t)|, \epsilon\}} & otherwise. \end{cases*}\tag{5}\]

4.2.0.2 Tree weight

For a given node \(v\in T_t\), let \(d(v)\) denote the node’s depth. Then, the tree weight at time \(t\) is defined as \[\label{eq:tree95weight} \omega(t) := \sum_{v\in L_t} 2^{-d(v)} \, .\tag{6}\] This metric was first defined by [17].

4.2.0.3 Median gap

Let \(m(t) = \text{median}\{z^{LP}_i \mid i\in O_t\}\) and let \(\bar{z}^0\) be the first incumbent found. We define the median gap as \[\label{eq:median95gap} \mu (t) = \frac{|\bar{z}(t) - m(t)|}{|\bar{z}^0- z^{LP}|}\tag{7}\]

4.2.0.4 Trend of open nodes

For a certain window size \(h\), we store the values of \(|O_k|\) for \(k\in \{t-h, t-h+1, ..., t\}\). We then fit a linear function using least squares to compute the trend of this sequence. We denote this trend at time \(t\) as \(\tau(t)\).

4.2.0.5 Ratio to GNN prediction

We make use of the prediction \(f(X_G)\) coming from the GNN model and include the ratio with respect to the current incumbent as a metric. In particular we use \[\rho (t) = \frac{f(X_G)}{\bar{z}(t)}\]

Notice that, while the gap and the tree weight are metrics from the literature, the other three are our own.

The input to the classifier is therefore a tuple \(X_D=(g(t), \omega(t), \mu(t), \tau(t), \rho(t) )\). We train a classifier \(C^D(X_D)\) that makes use of these dynamic features to make a binary prediction on whether we are in phase 2 or 3. We use a simple logistic regression, which will allow us to more easily interpret the resulting model, in contrast to more complex machine learning models.

5 Computational Results↩︎

This section describes our computational setup and results. All experiments were performed with the solver SCIP v.8.0 [16]. Code for reproducing all experiments in this section is available online [8].

5.1 Experimental Set Up↩︎

5.1.0.1 Benchmarks

We use three NP-hard problem benchmarks from the literature: set covering, combinatorial auctions and generalized independent set problem (GISP). We create a fourth benchmark (mixed) that is comprised of instances of the three types, in equal proportion. The method and configuration used for generation of the instances is summarized in Table 1. For each instance type, we generate 10,000 instances for training, 2000 instances for validation and another 2000 for testing.

Table 1: Method and configuration settings used to generate the instances of problem benchmark.
Benchmark Generation method Configuration
Set covering [18] Items: 750
Sets: 1000
Combinatorial [19] Items: 200
auctions with arbitrary relationships Bids: 1000
GISP Nodes: 80
[20] \(p=0.6\)
with Erdos-Renyi graphs \(\alpha=0.75\)
SET2, A

5.1.0.2 Phase analysis

As a first approach to the instances, we run an experiment to analyze the breakdown into solving phases. We solve 100 of the training instances, each with 3 different randomization seeds, which gives us a total of 300 data points per benchmark. During the solution process we record the time when branching starts, the time when the first solution is found, the time when a solution within 5% of the optimal is found, and the time when the optimal solution is found. This allows us to compute the percentage of time spent on each phase, and the percentage of time spent branching versus before branching (i.e., pre-processing the instance and processing the root node). We average these numbers over the 300 samples to obtain a view of the typical behaviour of the solver on each benchmark. We further divide phase 2 (improvement) into two sub-phases: (2a) from the first feasible solution to the first feasible solution with objective value within 5% of the optimal, and (2b) which encompasses the rest of phase 2. The results are shown in Figure 2. We observe the following. For all benchmarks, obtaining a feasible solution is trivial. For set covering instances, the optimal solution is often known by the time that branching starts. In the case of combinatorial auctions, the optimal solution is typically not known at the start of B&B, but a good solution is. For GISP, finding optimal, or even good, solutions is not as easy, making the proving phase relatively shorter. We conclude that these benchmarks allow us to test our methodology on three very different settings that may arise in a real-life situation.

5.1.0.3 Data collection procedure

For each instance, we collect information at the root node: the bipartite graph representation \(X_G=(\boldsymbol{C}, \boldsymbol{V}, \boldsymbol{A})\) and the optimal root LP value \(z^{LP}\). We then proceed to solve the instance. For the first 100 processed nodes and as long as no incumbent exists, no samples are collected. This allows us to initialize statistics as the trend of open nodes \(\tau(t)\), and to ignore instances that are solved within 100 nodes which are therefore too easy. After 100 nodes have been processed and an incumbent exists, we collect samples with a probability of \(0.02\). At sampling time, we record the value of the dynamic features (see Section 4.2), as well as the incumbent value \(\bar{z}(t)\). Once the instance is solved, the collected samples are completed by appending the root node information \((X_G, z^{LP})\) as well as the optimal objective value \(z^*\), which will be used as a target.

a

Set covering

b

Combinatorial auctions

c

GISP

Figure 2: Phase analysis of three instance types. We divide the solution process into (1) Feasibility, in dark yellow, (2a) Improvement up to 5% to optimality, in light yellow, (2b) Improvement from 5% to optimal, in light purple, and (3) Proving, in dark purple. We also indicate when the first branching occurs. The data is averaged over 100 instances with 3 randomization seeds (i.e., 300 samples)..

5.1.0.4 Optimal objective value prediction (Q1)

We test the prediction accuracy of our GNN model on the four benchmarks. We train a model for each of the targets described in Section 4. We measure the error as \[\label{eq:error} e = 100 \times \frac{1}{N} \sum_{i=1}^N \frac{|z^*_i - \tilde{z}^*_i|}{|z^*_i|}\tag{8}\] where \(N\) is the number of samples, \(z^*_i\) is the optimal objective value of sample \(i\) and \(\tilde{z}^*_i\) is the predicted optimal objective value of sample \(i\). Notice that, independently of the learning target, we measure the error in the space of the original prediction we want to make.

5.1.0.5 Prediction of phase transition (Q2)

We make a prediction on whether we have transitioned to phase 3 (optimal solution has been found). We compare the performance of four predictors. The first two predictors are the ones proposed by [9], namely \(C^{\text{est}}\) (best-estimate, see Eq. 2 ) and \(C^{\text{rank-1}}\) (rank-1, see Eq. 3 ). The third predictor \(C^{GNN}_{\epsilon}\) is based on the GNN regression model, as described in Eq. 4 . We report the performance of this classifier with \(\epsilon = 0\) and with a tuned value \(\epsilon^*\) which was obtained by optimizing the accuracy with a small grid search over the range \([-0.02, 0.02]\) on the validation set. The fourth predictor \(C^D\) is based on the dynamic features, as described in Section 4.2.

5.2 Results↩︎

Tables 2 and 3 show the results of the optimal objective value prediction task. The GNN models tested in Table 2 were trained and tested on instances of the same type. On the contrary, the results of Table 3 correspond to one unique model that was trained in the mixed dataset, and then tested on different benchmarks. First, we observe that using targets that include LP information (\(\Theta_2\) and \(\Theta_3\)) is beneficial to performance, as opposed to directly trying to predict the optimal objective value (\(\Theta_1\)). There is no clear winner among targets \(\Theta_2\) and \(\Theta_3\). Second, we observe that the generalist model, the one trained on the mixed dataset, performs comparably to the specialized models, even outperforming them in some cases.
We now select one GNN model per benchmark to be used in the next prediction task: the phase transition prediction. We select the model in the following way: we use the specialized model that achieves the best result on the validation set. Figure 3 (a-c) shows the results for all classifiers on the pure benchmarks (see Table 4 for the same results in table form). Further, we include a column that shows the classification accuracy of a dummy model that always predicts the majority class. We observe that the classifiers of [9] (best-estimate and rank-1) tend to predict the phase transition too early. This is, they mostly output a positive prediction, which means they believe the incumbent to be optimal. This results in the misclassified samples being almost exclusively false positives. On the contrary, the GNN model \(C^{GNN}_{0}\) tends to be too pessimistic, which can be fixed with the right tuning of the \(\epsilon\) parameter. For all benchmarks, \(C^{GNN}_{\epsilon^*}\) performs better than the classifiers of [9]. At the same time, the inclusion of the dynamic features (\(C^D\)) further improves the performance, except for set covering where \(C^{GNN}_{\epsilon^*}\) and \(C^D\) are close to a tie.
It is important to notice that, depending on the application, false positives and false negatives could have very different consequences. As an example, if the phase transition prediction is used to change the behaviour of the primal heuristics (e.g. switch them off once the optimal is found) a false positive could excessively delay finding the optimal solution and therefore has a much bigger potential of harming performance than a false negative. The parameter \(\epsilon\) provides an easy way to navigate this tradeoff, where one could sacrifice some accuracy to keep the rate of false positives to a minimum.
Figure 3d shows the same experiment but on a mixed dataset. This is, the models were trained and tested on a benchmark comprised of instances of all three types (in equal proportion). We observe a similar behaviour compared to the specialized benchmarks. The GNN model \(C^{GNN}_{0}\) tends to be too pessimistic, while \(C^{GNN}_{\epsilon^*}\) achieves better accuracy and better false positive rate than the classifiers of [9]. Using dynamic features further improves the accuracy of the model.

Table 2: Average relative error (as defined in Eq. 8 ) of the GNN model. One model was trained per benchmark. The train and test instances in each case are of the same type.
Instances \(\Theta_1\) \(\Theta_2\) \(\Theta_3\)
Set covering \(1.48\) \(0.80\) \(\mathbf{0.54}\)
Combinatorial auctions \(3.20\) \(\mathbf{0.55}\) \(0.62\)
GISP \(3.32\) \(\mathbf{2.35}\) \(2.39\)
Table 3: Average relative error (as defined in Eq. 8 ) of the GNN mixed model. Only one model was trained on a dataset comprised of intances of all types. The test sets are comprised of instances of one type only, except for the mixed test set (last row).
Instances \(\Theta_1\) \(\Theta_2\) \(\Theta_3\)
Set covering 1.35 0.73 0.82
Combinatorial auctions 3.15 1.17 0.53
GISP 3.17 2.32 2.43
Mixed test set 1.70 0.97 0.75

a

Set covering

b

Combinatorial auctions

c

GISP

d

Mixed

Figure 3: Prediction accuracy of the different classifier models. We show the fraction of correctly classified samples (correct, in purple), the fraction of false positives (fp, dark yellow) and the fraction of false negatives (fn, light yellow)..

Finally, we analyze the importance of the dynamic features assigned by the \(C^D\) classifier (Figure 4). We see that the four learned models are in fact very different, with the GISP model mostly making decisions based on the gap and the other three considering all features more uniformly. This speaks in favour of learning on sets of instances of the same type.

Table 4: Prediction accuracy of the different classifier models. We show the fraction of correctly classified samples, the fraction of false positives and the fraction of false negatives.
Correct False positives False negatives
Majority 0.89 0.11 0.00
\(C^{\text{est}}\) 0.91 0.09 0.00
\(C^{\text{rank-1}}\) 0.92 0.08 0.00
\(C^{\text{GNN}}_0\) 0.52 0.05 0.43
\(C^{\text{GNN}}_{\epsilon^*}\) 0.93 0.04 0.03
\(C^{D}\) 0.90 0.05 0.05
Set covering
Correct False positives False negatives
Majority 0.64 0.36 0.00
\(C^{\text{est}}\) 0.67 0.33 0.00
\(C^{\text{rank-1}}\) 0.68 0.32 0.00
\(C^{\text{GNN}}_0\) 0.57 0.06 0.37
\(C^{\text{GNN}}_{\epsilon^*}\) 0.72 0.27 0.01
\(C^{D}\) 0.84 0.07 0.09
Combinatorial auctions
Correct False positives False negatives
Majority 0.59 0.00 0.41
\(C^{\text{est}}\) 0.39 0.61 0.00
\(C^{\text{rank-1}}\) 0.43 0.57 0.00
\(C^{\text{GNN}}_0\) 0.68 0.10 0.22
\(C^{\text{GNN}}_{\epsilon^*}\) 0.69 0.12 0.19
\(C^{D}\) 0.77 0.11 0.12
GISP
Correct False positives False negatives
Majority 0.64 0.36 0.00
\(C^{\text{est}}\) 0.65 0.35 0.00
\(C^{\text{rank-1}}\) 0.67 0.33 0.00
\(C^{\text{GNN}}_0\) 0.59 0.08 0.34
\(C^{\text{GNN}}_{\epsilon^*}\) 0.73 0.14 0.13
\(C^{D}\) 0.77 0.14 0.09
Mixed

Figure 4: Feature importance of the dynamic models trained to predict phase transition for each of the benchmarks.

6 Conclusions↩︎

In this paper, we presented our methodology for predicting the optimal objective value of MILPs. Compared to the literature on predicting optimal solutions, our learning task is easier, yet still offers a variety of possibilities for its application within MILP solvers. Our methods can be used to both predict the optimal objective value and to classify a feasible solution into optimal or sub-optimal. Our computational study shows that our proposed approach outperforms the existing approaches in the literature. Further, they provide more flexibility to tune the model into the desired behaviour. We show that there are benefits to learning a model that specializes to an instance type, yet our model is still able to generalize well and have superior performance to other methods on mixed instance sets.

These results open the door for many possible applications. In general terms, this prediction can be used to adapt the behaviour of the different solver components and rules depending on the solving phase. These applications, however, require further study and will be the subject of future work.

References↩︎

[1]
L. Scavuzzo, K. Aardal, A. Lodi, and N. Yorke-Smith. Machine learning augmented branch and bound for mixed integer linear programming. Mathematical Programming, pages 1–44, 2024.
[2]
J.-Y. Ding, C. Zhang, L. Shen, S. Li, B. Wang, Y. Xu, and L. Song. Accelerating primal solution findings for mixed integer programs based on solution prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 1452–1459, 2020.
[3]
V. Nair, S. Bartunov, F. Gimeno, I. von Glehn, P. Lichocki, I. Lobov, B. O’Donoghue, N. Sonnerat, C. Tjandraatmadja, P. Wang, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
[4]
N. Sonnerat, P. Wang, I. Ktena, S. Bartunov, and V. Nair. Learning a large neighborhood search algorithm for mixed integer programs. arXiv preprint arXiv:2107.10201, 2021.
[5]
M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[6]
A. Chmiela, E. Khalil, A. Gleixner, A. Lodi, and S. Pokutta. Learning to schedule heuristics in branch and bound. Advances in Neural Information Processing Systems, 34: 24235–24246, 2021.
[7]
M. B. Paulus, G. Zarpellon, A. Krause, L. Charlin, and C. Maddison. Learning to cut by looking ahead: Cutting plane selection via imitation learning. In International Conference on Machine Learning, pages 17584–17600. PMLR, 2022.
[8]
L. Scavuzzo. Code for the paper “Learning optimal objective values for MILP", 2024. https://github.com/lascavana/ObjValPrediction.
[9]
T. Berthold, G. Hendel, and T. Koch. From feasibility to improvement to proof: three phases of solving mixed-integer programs. Optimization Methods and Software, 33 (3): 499–517, 2018.
[10]
Y. Shen, Y. Sun, X. Li, A. C. Eberhard, and A. T. Ernst. Adaptive solution prediction for combinatorial optimization. European Joural of Operational Research, 309: 1392–1408, 2022. URL https://api.semanticscholar.org/CorpusID:256358882.
[11]
N. Efthymiou and N. Yorke-Smith. Predicting the optimal period for cyclic hoist scheduling problems. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR), volume 13884 of Lecture Notes in Computer Science, pages 238–253. Springer, 2023.
[12]
Q. Han, L. Yang, Q. Chen, X. Zhou, D. Zhang, A. Wang, R. Sun, and X. Luo. A GNN-guided predict-and-search framework for mixed-integer linear programming. In International Conference on Learning Representations, 2023. URL https://api.semanticscholar.org/CorpusID:256827203.
[13]
E. B. Khalil, C. Morris, and A. Lodi. : A data-driven framework for guiding combinatorial solvers. AAAI, 2022.
[14]
G. Hendel, D. Anderson, P. Le Bodic, and M. E. Pfetsch. Estimating the size of branch-and-bound trees. INFORMS Journal on Computing, 34 (2): 934–952, 2022.
[15]
M. Fischetti, A. Lodi, and G. Zarpellon. Learning MILP resolution outcomes before reaching time-limit. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR), volume 16, pages 275–291. Springer, 2019.
[16]
K. Bestuzheva, M. Besançon, W.-K. Chen, A. Chmiela, T. Donkiewicz, J. van Doornmalen, L. Eifler, O. Gaul, G. Gamrath, A. Gleixner, L. Gottwald, C. Graczyk, K. Halbig, A. Hoen, C. Hojny, R. van der Hulst, T. Koch, M. Lübbecke, S. J. Maher, F. Matter, E. Mühmer, B. Müller, M. E. Pfetsch, D. Rehfeldt, S. Schlein, F. Schlösser, F. Serrano, Y. Shinano, B. Sofranac, M. Turner, S. Vigerske, F. Wegscheider, P. Wellner, D. Weninger, and J. Witzig. . Technical report, Optimization Online, December 2021. URL http://www.optimization-online.org/DB_HTML/2021/12/8728.html.
[17]
P. Kilby, J. Slaney, S. Thiébaux, T. Walsh, et al. Estimating search tree size. In Proceedings of the AAAI Conference on Artificial Intelligence, 2006.
[18]
E. Balas and A. Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial Optimization, pages 37–60. Springer, 1980.
[19]
K. Leyton-Brown, M. Pearson, and Y. Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce, pages 66–76, 2000.
[20]
M. Colombi, R. Mansini, and M. Savelsbergh. The generalized independent set problem: Polyhedral analysis and solution approaches. European Journal of Operational Research, 260 (1): 41–55, 2017.

  1. Standard implementations of the B&B algorithm use single-variable disjunctions that partition the feasible set into two. Other approaches exist but are, to the best of our knowledge, not implemented in standard optimization software.↩︎

  2. We define the depth of a node as its distance to the root node. Therefore, by definition, the depth of the root node is zero.↩︎