Piecewise Linear Regression via a Difference of Convex Functions


Abstract

We present a new piecewise linear regression methodology based on fitting a difference of convex functions (DC functions) to the data. These are functions \(f\) that may be represented as the difference \(\phi_1 - \phi_2\) for a choice of convex functions \(\phi_1, \phi_2\). The method proceeds by estimating piecewise-linear convex functions, in a manner similar to max-affine regression, whose difference approximates the data. The choice of the function is regularised by a new seminorm over the class of DC functions that controls the \(\ell_\infty\) Lipschitz constant of the estimate. The resulting methodology can be efficiently implemented via quadratic programming even in high dimensions, and is shown to have close to minimax statistical risk. We empirically validate the method, showing it to be practically implementable, and to have comparable performance to existing regression/classification methods on real-world datasets.

1 Introduction↩︎

The multivariate nonparametric regression problem is a fundamental statistical primitive, with a vast history and many approaches. We adopt the following setup: given a dataset, \(\{(\boldsymbol{x}_i, y_i)\}_{i \in [n]}\), where \(\boldsymbol{x}_i \in \mathbb{R}^d\) are predictors, assumed drawn i.i.d. from a law \(P_X,\) and \(y_i\) are responses such that \(y_i = f(\boldsymbol{x}_i) + \varepsilon_i,\) for a bounded, centered, independent random noise \(\varepsilon_i\), and bounded \(f\), the goal is to recover an estimate \(\hat{f}\) of \(f\) such that on new data, the squared error \(\mathbb{E}[(y- \hat{f}(\boldsymbol{x}))^2]\) is small.

The statistical challenge of the problem lies in the fact that \(f\) is only weakly constrained - for instance, \(f\) may only be known to be differentiable, or Lipschitz. In addition, the problem is algorithmically challenging in high dimensions, and many approaches to the univariate problem do not scale well with the dimension \(d\). For instance, piecewise linear regression methods typically involve a prespecified grid, and thus the number of grid points, or knots, grows exponentially with the dimension. Similarly, methods like splines typically require both stronger smoothness guarantees and a number of parameters that grows exponentially with the dimension in order to avoid singularities in the estimate.

This paper is concerned with regression over the class of functions that are differences of convex functions, i.e., DC functions. These are functions \(f\) that can be represented as \(f = \phi_1 - \phi_2\) for a choice of two convex functions. DC functions constitute a very rich class - for instance, they are known to contain all \(\mathcal{C}^2\) functions. Such functions have been applied in a variety of contexts including non-convex optimization [1], [2], sparse signal recovery [3] and reinforcement learning [4].

The principal contribution of this paper is a method for piecewise linear regression over the class of DC functions. At the heart of the method is a representation of piecewise linear DC functions via a set of linear constraints, in a manner that generalises the representations used for max-affine regression. The choice of the function is regularised for smoothness by a new seminorm that controls the \(\ell_\infty\)-Lipschitz constant of the resulting function. The resulting estimate is thus a piecewise linear function, represented as the difference of two piecewise linear convex functions, that are smooth in the sense of having bounded gradients.

The method enjoys two main advantages:

  1. It is agnostic to any knowledge about the function, and requires minimal parameter tuning.

  2. It can be implemented efficiently, via quadratic programming, even in high dimensions. For \(n\) data points in \(\mathbb{R}^d\), the problem has \(2n(2d+1) + 1\) decision variables and \(n^2\) linear constraints, and can be solved in the worst case in \(O(d^2 n^5)\) time by interior-point methods. Furthermore, the algorithm does not need a prespecified partition for the piecewise linear parts, and avoids ad-hoc generalisations of splines or piecewise linear methods to multiple dimensions.

In addition, the method is shown to be statistically viable, in that it attains vanishing risk at a non-trivial rate as the sample size grows, under the condition that the ground truth has bounded DC seminorm. The risk further adapts to structure such as low dimensional supports. Lastly, the solution obtained is a piecewise-linear fit, and thus enjoys interpretability in that features contributing heavily to the value can be readily identified. Further, the fitting procedure naturally imposes \(\ell_1\) regularisation on the weights of the piecewise linear parts, thus enforcing a sparsity of local features, which further improves interpretability.

To establish practical viability, we implement the method on a number of regression and classification tasks. The settings explored are those of moderate data sizes - \(n \le 10^3\), and dimensions \(d \le 10^2\). We note that essentially all non-parametric models are only viable in these settings - typical computational costs grow with \(n\) and become infeasible for large datasets, while for much higher dimensions, the sample complexity - which grows exponentially with \(d\) - causes small datasets to be non-informative. More pragmatically, all nonparametric methods we compare against have been evaluated on such data. Within these constraints, the method is shown to have error performance and run-to-run fluctuation that compare favourably with popular methodologies such as multivariate adaptive regression splines, nearest neighbour methods, and two-layer perceptrons, evaluated on both synthetic and real-world datasets.

1.1 Connections to Existing Methodologies↩︎

Piecewise Linear Regression is popular since such regressors can model the local features of the data without affecting the global fit. In more than one dimension, piecewise linear functions are usually fit by choosing a partition of the space and fitting linear functions on each part. The principal difficulty thus lies in choosing these partitions. The approach is usually a rectangular grid - for instance, a variable rectangular partition of the space is studied in [5] and solved optimally. However, such rectangular partitioning becomes prohibitive in high dimensions, as the number of parts grows exponentially with the dimension. Other approaches include Bayesian methods such as [6], which rely on computing posterior means for the parameters to be fit via MCMC methods. Max-Affine Regression is a nonparametric approximation to convex regression, originating in [7], [8], that recovers the optimal piecewise linear approximant to a convex function of the form \(f = \max_{i\in [K]} \langle \boldsymbol{a}_i, \boldsymbol{x} \rangle + b_i\) for some \(K\). Smoothness of the estimate can be controlled by constraining the convex function to be Lipschitz. The problem is generic in that it is easily argued that piecewise linear convex functions can uniformly approximate any Lipschitz convex function on a bounded domain. Parametric approaches, i.e., with a fixed \(K\), are popular, but can be computationally intensive due to the induced combinatorics of which of the \(K\) planes is maximised at which data point, and various heuristics and partitioning techniques have to be applied [9]–[11]. The nonparametric case, where \(K\) grows with \(n\), has been extensively analysed in the works [12], [13].

On the other hand, if \(K = n,\) i.e. if the number of affine functions used is the same as the number of datapoints, then the problem becomes amenable to convex programming techniques - when estimating the parameters \(a_i, b_i,\) one can remove the nonlinearity induced by the max, and instead enforce the same via \(n\) linear constraints. This simple fact allows efficient algorithmic approaches to max-affine functions. The heart of our method for DC function regression is an extension of this trick to DC functions. Smoothing splines are an extremely popular regression methodology in low dimensions. The most popular of these are the \(L_2\) smoothing splines, which, in one dimension, involve fixing a ‘knot’ at each data point, and estimating the gradients of the function at each point, with regularisation of the form \(\int |\hat{f}''|^2\). Unfortunately, this \(L_2\) regularisation leads to singularities in \(d \ge 3\) dimensions, and methods such as thin plate splines generalising these to higher dimensions resort to regularising up to the \(d\)th derivative of the estimate, leading to an explosion in the number of parameters to be estimated [14].

Our method is closer in relation to \(L_1\) regularised splines, which in the univariate case regularise \(\int |\hat{f}''|\) - it is shown in Appx. 8 that in one dimension our method reduces to these. As a consequence, one may view this method as a new generalisation of the \(L_1\)-spline regressor.

A number of alternative methods for multivariate splines have been proposed, several of which, such as generalised additive models, model the data via low dimensional projections and additivity assumptions. The most relevant multivariate spline methods are the adaptive regression splines, originating in [15], which is a greedy procedure for recursively refining a partition of the data, and fitting new polynomials over the parts. Previous DC Modeling: Finally, let us mention that the final chapter of the doctoral thesis of [13] anticipates our study of DC function regression, but gives neither algorithms nor analyses. Subsequently, [16] introduces the same DC modeling as us in a broader context, where the loss function can also be a DC function. However, their problem ends up being non-convex. They focus on developing a majorization-minimization algorithm to find an approximate solution with desirable guarantees such as convergence to a directional stationary solution.

2 A brief introduction to DC functions↩︎

Differences of convex functions are functions which can be represented as the difference of two continuous convex functions over a domain \(\Omega \subset \mathbb{R}^d\), i.e., the class \[\begin{align} \mathcal{DC}(\Omega) \triangleq \{f: \Omega \to \mathbb{R} \,| \exists &\phi_1,\phi_2 \textrm{ convex, continuous s.t.}\\ &f = \phi_1 - \phi_2 \}. \end{align}\] Throughout the text we will set \(\Omega = \{ \boldsymbol{x} : \| \boldsymbol{x} \|_\infty \le R\}\). We will assume that the noise and the ground truth function are bounded so that \(|\varepsilon|, \sup_{\boldsymbol{x} \in \Omega} |f(\boldsymbol{x})| \le M.\) As a consequence, \(|y| \le 2M\).

One of the first characterizations of DC functions is due to [17]: a univariate function is a DC function if and only if its directional derivatives each have bounded total variation on all compact subsets of \(\Omega\). In higher dimensions, it is known that DC functions are a subclass of locally-Lipschitz functions and include all \(C^{2}\) functions. Since \(C^2\) functions are uniformly dense in the continuous functions on compact sets, any continuous function can be uniformly approximated by DC functions. For a recent review see [18].

In the following section we show that DC functions can fit any sample data. Thus, to allow a bias-variance tradeoff, we regularise the selection of DC functions via the DC seminorm \[\begin{align} \| f\| \triangleq \inf_{\phi_1,\phi_2}&\sup_{\boldsymbol{x}} \sup_{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})} \|\boldsymbol{v}_1\|_1 + \|\boldsymbol{v}_2\|_1 \\ & \textrm{s.t. } \phi_1, \phi_2 \text{ are convex, } \phi_1 - \phi_2 = f, \end{align}\]

where \(\partial_*\phi_i\) denotes the set of subgradients of \(\phi_i\). The above function is not a norm because every constant function satisfies \(\|c\| = 0\). Indeed, if we equate DC functions up to a constant, then the above seminorm turns into a norm. We leave a proof of the fact that the above is a seminorm to Appx. 9.1. Note that the DC seminorm offers strong control on the \(\ell_\infty\)-Lipschitz constant of the convex parts of at least one DC representation of the function (and in turn on the Lipschitz constant of the function).

We will principally be interested in DC functions with bounded DC seminorm, and thus define \[\mathcal{DC}_L \triangleq \{ f \in \mathcal{DC}: \|f\| \le L\}.\]

The bulk of this paper concentrates on piecewise linear DC functions. This is justified because piecewise linear functions are known to uniformly approximate bounded variation functions, and by structural results [19], [20] showing that every piecewise linear function may be represented as the difference of two convex piecewise linear functions - i.e., max-affine functions. Since the term is used very often, we will abbreviate “piecewise linear DC” as PLDC, and symbolically define \[\begin{align} \textrm{pl-}\mathcal{DC}&\triangleq \{ f \in \mathcal{DC}: f\textrm{ is piecewise linear} \},\\ \textrm{pl-}\mathcal{DC}_{L} &\triangleq \{ f \in \mathcal{DC}_{L}: f\textrm{ is piecewise linear} \}. \end{align}\] The following bound on the seminorm of PLDC functions is useful. The proof is straightforward, and thus omitted.

Proposition 1. Every \(f \in \textrm{pl-}\mathcal{DC}\) can be represented as a difference of two max-affine functions \[f(\boldsymbol{x}) = \max_{k \in [K]} \left( \langle \boldsymbol{a}_k, \boldsymbol{x} \rangle + c_k \right) - \max_{k \in [K]} \left( \langle \boldsymbol{b}_k, \boldsymbol{x} \rangle + c_k' \right)\] for some finite \(K\). For such an \(f\), \(\|f\| \le \max_{k} \|\boldsymbol{a}_k\|_1 + \max_{k} \|\boldsymbol{b}_k\|_1\).

2.1 Expressive Power of Piecewise linear DC functions↩︎

We begin by arguing that PLDC functions can interpolate any finite data. The principal characterisation for DC functions is as follows:

Proposition 2. For any finite data \(\{(\boldsymbol{x}_i, y_i)\}_{i \in [n]},\) there exists a DC function \(h: \mathbb{R}^d\rightarrow \mathbb{R}\), that takes values \(h(\boldsymbol{x}_i) = y_i\) if and only if there exist \(\boldsymbol{a}_i, \boldsymbol{b}_i \in \mathbb{R}^d, z_i \in \mathbb{R}, i\in [n]\) such that: \[\begin{align} \label{equ:lem1462} \begin{aligned} y_i - y_j + z_i - z_j&\geq \langle \boldsymbol{a}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle, \quad i,j \in [n]\\ z_i - z_j &\geq \langle \boldsymbol{b}_j, \boldsymbol{x}_i-\boldsymbol{x}_j \rangle , \quad i,j \in [n] \end{aligned} \end{align}\qquad{(1)}\]

Further, if there exists a DC function that interpolates a given data, then there exists a PLDC function that also interpolates the data.

Proof. Assuming \(h=\phi_1 - \phi_2\) for convex functions \(\phi_1\) and \(\phi_2\), take \(\boldsymbol{a}_j\) and \(\boldsymbol{b}_j\) to be subgradients of \(\phi_1\) and \(\phi_2\) respectively at \(\boldsymbol{x}_j\), and set \(z_j = \phi_2(\boldsymbol{x}_j)\); then the conditions of the proposition hold by convexity. Conversely, assuming these conditions hold, define \(h\) as \[\begin{align} \label{equ:max-affine32max-affine} h(\boldsymbol{x})&= \max_{i\in [n]} \left( \langle \boldsymbol{a}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + y_i + z_i \right) - \max_{i\in[n]} \left( \langle \boldsymbol{b}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + z_i \right) \end{align}\tag{1}\] \(h \in \mathcal{DC}\) since it is expressed as the difference of two max-affine functions. Further, it holds that \(h(\boldsymbol{x}_k) = y_k\) for any \(k \in [n]\). Indeed, the first condition implies that for any \(i \neq k,\) \[\langle \boldsymbol{a}_i, {\boldsymbol{x}_k - \boldsymbol{x}_i}\rangle + y_i + z_i \le y_k + z_k,\] with equality when \(i = k\). Thus, the first maximum simply takes the value \(y_k + z_k\) at the input \(\boldsymbol{x}_k\). Similarly, the second maximum takes the value \(z_k\) at this input, and thus \(h(\boldsymbol{x}_k) = (y_k + z_k) - z_k = y_k.\)

Notice that the interpolating function given in the above is actually piecewise-linear. Thus, if a DC function fits the given data, then extracting the vectors \(\boldsymbol{a}_i, \boldsymbol{b}_i\) and constants \(z_i\) as in the first part of the proof, and constructing the interpolant in the second part yields a PLDC function that fits the data. ◻

The principal utility of the conditions stated above is that they let us enforce the DC shape restriction efficiently when performing empirical risk minimisation. Indeed, suppose we wish to fit a DC function to some data \((\boldsymbol{x}_i, y_i)\). We may then introduce decision variables \(\hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i,\) where the \(\hat{y}_i\) represent the value of our fit at the various \(\boldsymbol{x}_i,\) and then enforce the linear constraints of the above proposition (with \(y_i\) replaced by \(\hat{y}_i\)) while minimising a loss of the form \(\sum (y_i - \hat{y}_i)^2\), as sketched below. Since these constraints are linear, the resulting program is convex, and can be efficiently implemented. This observation forms the core of our algorithmic proposal in §3.
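To make the recipe concrete, the following is a minimal sketch of this unregularised fitting program using the off-the-shelf solver cvxpy; the solver choice and all names (e.g. fit_dc_values) are illustrative assumptions of ours, not the paper's reference implementation.

```python
# Sketch: fit DC-feasible values yhat_i to data by minimising the squared loss
# subject to the linear constraints of Proposition 2 (assumes cvxpy is installed).
import cvxpy as cp

def fit_dc_values(X, y):
    """X: (n, d) array of predictors, y: (n,) array of responses."""
    n, d = X.shape
    yhat, z = cp.Variable(n), cp.Variable(n)          # fitted values and offsets z_i
    A, B = cp.Variable((n, d)), cp.Variable((n, d))   # candidate subgradients a_i, b_i
    cons = []
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            cons += [yhat[i] - yhat[j] + z[i] - z[j] >= A[j] @ diff,  # first condition of Prop. 2
                     z[i] - z[j] >= B[j] @ diff]                      # second condition of Prop. 2
    cp.Problem(cp.Minimize(cp.sum_squares(yhat - y)), cons).solve()
    return yhat.value, z.value, A.value, B.value
```

Note that without any further restriction this program can drive the loss to zero by interpolating the data (cf. Proposition 3 below), which is precisely why the seminorm regularisation of §3 is introduced.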

The above characterisation relies on existence of vectors that may serve as subgradients for the two convex functions. This condition can be removed, as in the following.

Proposition 3. Given any finite data \(\{(\boldsymbol{x}_i, y_i)\}_{i \in [n]},\) such that \(y_i \neq y_j \implies \boldsymbol{x}_i \neq \boldsymbol{x}_j\), there exists a PLDC function which interpolates this data.

Proof. The interpolating function is constructed by adding and subtracting a quadratic function to the data. Let \[C \triangleq \max_{i,j : \boldsymbol{x}_i \neq \boldsymbol{x}_j} \frac{|y_i-y_j|}{\|\boldsymbol{x}_i-\boldsymbol{x}_j\|_2^2}.\] Then the piecewise linear function \[\begin{align} \label{dlzbcmxs} h(\boldsymbol{x})&\triangleq \max_{i \in [n]} \left( \langle C \boldsymbol{x}_i ,\boldsymbol{x} -\boldsymbol{x}_i \rangle +\frac{1}{2}C \|\boldsymbol{x}_i\|^2 + \frac{1}{2}y_i \right) \nonumber \\ &- \max_{i \in [n]} \left( \langle C \boldsymbol{x}_i ,\boldsymbol{x} -\boldsymbol{x}_i \rangle +\frac{1}{2}C\|\boldsymbol{x}_i\|^2 - \frac{1}{2}y_i \right) \end{align}\tag{2}\] satisfies the requirements. Indeed, the function is DC, since it is the difference of two max-affine functions. The argument proceeds similarly to the previous case - at any \(\boldsymbol{x}_j,\) we have: \[\begin{align} &\max_{i \in [n]} \left( \langle C \boldsymbol{x}_i ,\boldsymbol{x}_j -\boldsymbol{x}_i \rangle +\frac{1}{2}C \|\boldsymbol{x}_i\|^2 + \frac{1}{2}y_i \right) \\ &= \max_{i \in [n]} \left( \frac{1}{2}C\|\boldsymbol{x}_j\|^2 -\frac{C}{2} \| \boldsymbol{x}_i - \boldsymbol{x}_j \|^2+ \frac{1}{2}y_i \right) \\ &\leq \max_{i \in [n]} \left( \frac{1}{2}C\|\boldsymbol{x}_j\|^2 - \frac{|y_i- y_j|}{2} + \frac{1}{2}y_i \right) \\ &\leq \frac{1}{2}C\|\boldsymbol{x}_j\|^2 + \frac{1}{2}y_j. \end{align}\] Moreover, the upper bound is attained by choosing \(i=j\) in the max operator. Similarly, the value of the second convex function at \(\boldsymbol{x}_j\) is equal to \(\frac{1}{2}C\|\boldsymbol{x}_j\|^2 -\frac{1}{2}y_j\), which together results in \(h(\boldsymbol{x}_j) = y_j\). Note that \(h(\boldsymbol{x})\) has \(\ell_\infty\)-Lipschitz constant at most \(2 C\max_{i,j} \|\boldsymbol{x}_i -\boldsymbol{x}_j\|_1,\) and \(\|h \| \le 2C \max_i \|\boldsymbol{x}_i\|_1.\) ◻
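As an illustration, the explicit interpolant (2 ) is simple to implement; the following small numerical sketch (the function name and check are ours, not the paper's code) constructs it and verifies interpolation on random data with distinct predictors.

```python
# Sketch of the interpolant (2): h(x) is a difference of two max-affine
# functions built from the single constant C; assumes the x_i are distinct.
import numpy as np

def pldc_interpolant(X, y):
    n, d = X.shape
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||_2^2
    mask = sq_dist > 0
    C = np.max(np.abs(y[:, None] - y[None, :])[mask] / sq_dist[mask])
    def h(x):
        # <C x_i, x - x_i> + C ||x_i||^2 / 2, for each i
        base = C * (X @ x) - 0.5 * C * (X * X).sum(1)
        return np.max(base + 0.5 * y) - np.max(base - 0.5 * y)
    return h

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=5)
h = pldc_interpolant(X, y)
print(np.allclose([h(x) for x in X], y))   # True: h interpolates the data
```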

Next, in order to contextualise the expressiveness of DC functions, we argue that the popular parametric class of ReLU neural networks can be represented by PLDC functions, and vice versa. This is also argued in [16] for a \(2\) layer network.

Proposition 4. A fully connected neural network \(f\), with ReLU activations, and \(D\) layers with weight matrices \(\boldsymbol{W}^1, \dots, \boldsymbol{W}^D\), i.e, \[\begin{align} f(\boldsymbol{x}) &= \sum_j w^{D+1}_j a^D_j \\ a^{l+1}_i &= \max (\sum_{j} w^{l+1}_{i,j} a^{l}_j, 0), \quad D > l\geq 1\\ a^1_i &= \max (\sum_{j}w^1_{i,j} x_j, 0), \end{align}\] is a PLDC function with the DC seminorm bounded as \[\begin{align} \|f\| \leq |\boldsymbol{w}^{D+1}|^T \Big( \prod_{l=1}^D |\boldsymbol{W}^l |\Big) \vec{\boldsymbol{1}}, \end{align}\]

where \(|\cdot|\) is the entry-wise absolute value. The above is proved in Appx. 9.2 via an induction over the layers using the relations \[\begin{align} &\max(\max(a,0)\!-\!\max(b,0),0) = \max(a,b,0)\! -\! \max(b,0)\\ &\max(a,b) + \max(c,d) = \max (a+c,a+d, b+c, b+d). \end{align}\]

Proposition 5. Every PLDC function with \(K\) hyper-planes can be represented by a ReLU net with \(2\lceil \log_2 K \rceil\) layers and maximum width of \(8K\).

The proof is constructed in Appx. 9.2 using the relations \[\begin{align} &\max(a,b,c,d) = \max \left(\max (a,b), \max(c,d)\right) \\ &\max(a,b) = \max(a\!-\!b,0) + \max(b,0) -\max(-b,0) . \end{align}\]
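These scalar identities are easy to check directly; the short numerical sanity check below (ours, purely for illustration) exercises the relations used in the proofs of Propositions 4 and 5.

```python
# Numerical sanity check of the max identities used in Propositions 4 and 5.
import random

for _ in range(1000):
    a, b, c, d = (random.uniform(-5, 5) for _ in range(4))
    # sum of two maxima as a single maximum over sums
    assert abs(max(a, b) + max(c, d) - max(a + c, a + d, b + c, b + d)) < 1e-9
    # composition of ReLUs as a difference of maxima
    assert abs(max(max(a, 0) - max(b, 0), 0) - (max(a, b, 0) - max(b, 0))) < 1e-9
    # pairwise max via three ReLUs
    assert abs(max(a, b) - (max(a - b, 0) + max(b, 0) - max(-b, 0))) < 1e-9
print("all identities verified")
```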

3 Algorithms↩︎

We begin by motivating the algorithmic approach we take. This is followed by separate sections developing the key portions of the algorithm.

Suppose we observe a data-set \(S_n = \{ (\boldsymbol{x}_i, y_i)\}_{i\in[n]}\) generated i.i.d. from some distribution over \(\mathcal{X}\times\mathcal{Y}\), along with a convex loss function \(\ell:\mathbb{R}\times\mathbb{R}\rightarrow \mathbb{R}_+\) bounded by \(c\). The goal is to minimize the expected risk \[\begin{align} \label{erm:dc2} &\min_{f\in\mathcal{DC}} \mathbb{E} [(f(\boldsymbol{x}) - y)^2] \end{align}\tag{3}\] by choosing an appropriate function \(f\) from the class of \(\mathcal{DC}\) functions. Note that instead of the squared error, the above could be generalised to any bounded, Lipschitz loss \(\ell(f(\boldsymbol{x}), y)\). Note also that the squared loss is bounded in our setting because of our assumption that both the ground truth and the noise are bounded.

There are two basic problems with the above - the distribution is unknown, so the objective above cannot be evaluated, and the class of DC functions is far too rich, so the problem is strongly underdetermined. In addition, directly optimising over all DC functions is an intractable problem.

To begin with, we reduce the problem by instead finding the values that a best fit DC function must take at the datapoints \(\boldsymbol{x}_i\), and then fitting to these values a PLDC function whose convex parts are max-affine functions over precisely \(n\) affine pieces. This has the significant advantage of reducing the optimisation problem to a set of convex constraints. Quantitatively, this step is justified in §4, which argues that the error induced by this approximation via PLDC functions is dominated by the intrinsic risk of the problem.

To handle the lack of knowledge of the distribution, we resort to uniform generalisation bounds from the literature. Our approach relies on the following result, which mildly generalises the bounds of [21], and uniformly controls the generalisation error of an empirical estimate of the expectation (specialised to our context):

Theorem 1. Let \(\{(\boldsymbol{x}_i, y_i)\}_{i \in [n]},\) be i.i.d. data, with \(n\) assumed to be even. Let the empirical maximum discrepancy of the class \(\mathcal{DC}_L\) be defined as \[\hat{D}_n( \mathcal{DC}_{L}) \triangleq \sup_{f \in \mathcal{DC}_{L}} \frac{2}{n}\left( \sum_{i = 1}^{n/2} f(\boldsymbol{x}_i) - \sum_{i = n/2 + 1}^n f(\boldsymbol{x}_i)\right).\] With probability \(\ge 1-\delta\) over the data, it holds uniformly over all \(f \in \mathcal{DC}\cap \{|f| \le M\}\) that \[\begin{align} \label{gen95bound95BM} &\left| \mathbb{E}[(f(\boldsymbol{x})- y)^2] - \frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i) - y_i)^2 \right| \notag \\\le&\,\, 12M\hat{D}_n (\mathcal{DC}_{2\|f\| + 2}) \notag \\ &\,\,\, + 45M^2\sqrt{\frac{C \max(2, \log_2 \| f\|) +\ln(1/\delta)}{n}}, \end{align}\tag{4}\] where \(C\) is a constant independent of \(f, M, R\).

The above statement essentially arises from a doubling trick over a Rademacher complexity based bound for a fixed \(\|f\|\). The broad idea is that since \(\mathcal{DC}_L \subset \mathcal{DC}_{L'}\) for \(L \le L',\) we can separately develop Rademacher complexity based bounds over \(L\) of the form \(2^j\), each having the more stringent high-probability requirement \(\delta_j = \delta 2^{-j}\). A union bound over these then extends these bounds to all of \(\mathcal{DC}= \bigcup_{j \ge 1} \mathcal{DC}_{2^j},\) and for a particular \(f\), the bound for \(j = \lceil \log_2 \|f\| \rceil\) can be used. See §10 for details.

Optimising the empirical upper bound on \(\mathbb{E}[(f(X) - Y)^2]\) implied by the above directly leads to a structural risk minimization over the choice of \(L\). The crucial ingredient in the practicality of this is that for DC functions, \(\hat{D}_n(\mathcal{DC}_L) = L \hat{D}_n(\mathcal{DC}_1)\), and further, \(\hat{D}_n(\mathcal{DC}_1)\) can be computed via linear programming. Thus, the term \(\hat{D}_n\) above serves as a natural, efficiently computable penalty function, and acts exactly as a regularisation on the DC seminorm.

3.1 Computing empirical maximum discrepancy.↩︎

Throughout this and the following sections, we use \(\hat{y}_i\) to denote \(\hat{f}(\boldsymbol{x}_i)\), where \(\hat{f}\) is the estimated function.

The principal construction relies on the characterisation of Proposition 2.

Theorem 2. Given data \(\{(\boldsymbol{x}_i, y_i)\},\) the following convex program computes \(\hat{D}_n(\mathcal{DC}_L)\) \[\begin{align} \label{program:d95hat95dc} &\qquad \max_{\hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i} \frac{2}{n} \left( \sum_{i = 1}^{n/2} \hat{y}_i - \sum_{i = n/2+1}^n \hat{y}_i \right) \\ &\textrm{s.t.} \begin{cases} \hat{y}_i - \hat{y}_j + z_i - z_j\geq \langle \boldsymbol{a}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i,j \in [n] \quad \mathrm{(i)}\\ z_i - z_j \geq \langle \boldsymbol{b}_j, \boldsymbol{x}_i-\boldsymbol{x}_j \rangle & i,j \in [n] \quad \mathrm{(ii)}\\ \|\boldsymbol{a}_i\|_1 + \|\boldsymbol{b}_i\|_1 \leq L & \quad i\in[n] \quad \mathrm{(iii)} \end{cases}\notag \end{align}\qquad{(2)}\] Further, \(\hat{D}_n(\mathcal{DC}_L) = L \hat{D}_n(\mathcal{DC}_1)\).

Proof. By Proposition 2, conditions \((\mathrm{i})\) and \((\mathrm{ii})\) are necessary and sufficient for the existence of a DC function that takes the values \(\hat{y}_i\) at \(\boldsymbol{x}_i\). Thus, these first two conditions allow exploration over all values a DC function can take. Further, by the second part of Proposition 2, if a DC function interpolates this data, then so does a PLDC function, with \(\boldsymbol{a}_i\) and \(\boldsymbol{b}_i\) serving as the gradients of the max-affine parts of the function. Thus, by Proposition 1, the condition \((\mathrm{iii})\) is necessary and sufficient for the DC function implied by the first two conditions to have seminorm bounded by \(L\). It follows that the conditions allow exploration of all values a \(\mathcal{DC}_L\) function may take at the given \(\{\boldsymbol{x}_i\}\), at which point the claim follows.

Now, notice that if we multiply each of the decision variables in the above program by \(L\), the value of the program is multiplied by a factor of \(L\), while the constraints \(\mathrm{(i), (ii)}\), being homogeneous in the decision variables, are unaffected. On the other hand, the constraint \(\mathrm{(iii)}\) is modified to \(\|\boldsymbol{a}_i \|_1 + \|\boldsymbol{b}_i \|_1 \le 1\). Thus, the resulting program is \(L\) times the program computing \(\hat{D}_n(\mathcal{DC}_1)\), ergo \(\hat{D}_n(\mathcal{DC}_L) = L\hat{D}_n(\mathcal{DC}_1)\). ◻
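A direct, if unoptimised, way to evaluate \(\hat{D}_n(\mathcal{DC}_1)\) is to hand the program of Theorem 2 to a generic convex solver. The sketch below uses cvxpy and illustrative names; it is not the paper's implementation. By the scaling argument above, its value may simply be multiplied by \(L\) to obtain \(\hat{D}_n(\mathcal{DC}_L)\).

```python
# Sketch: empirical maximum discrepancy of DC_1 via the program of Theorem 2.
import cvxpy as cp
import numpy as np

def empirical_discrepancy(X):
    """X: (n, d) array of predictors with n even; returns Dhat_n(DC_1)."""
    n, d = X.shape
    yhat, z = cp.Variable(n), cp.Variable(n)
    A, B = cp.Variable((n, d)), cp.Variable((n, d))
    cons = [cp.norm(A, 1, axis=1) + cp.norm(B, 1, axis=1) <= 1]   # constraint (iii), L = 1
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            cons += [yhat[i] - yhat[j] + z[i] - z[j] >= A[j] @ diff,   # (i)
                     z[i] - z[j] >= B[j] @ diff]                       # (ii)
    sgn = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])          # first half minus second half
    prob = cp.Problem(cp.Maximize((2 / n) * (sgn @ yhat)), cons)
    prob.solve()
    return prob.value
```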

3.2 Structural Risk Minimisation↩︎

To perform the SRM, we again utilize the structural results of Propositions 1 and 2 to determine the values that the optimal estimate takes at each of the \(\boldsymbol{x}_i\). The choice of the values is penalised by the seminorm as \(\lambda L,\) where \(\lambda = 24M\hat{D}_n(\mathcal{DC}_1),\) which may be computed using the program of Theorem 2. Note that the logarithmic term in the generalisation bound (4 ) is typically small, and is thus omitted in the following. This also has the benefit of rendering the objective function convex, as it avoids the \(\log L\) term that would instead enter. If desired, a convex upper bound may be obtained for the same, for instance by noting that \(\sqrt{ \max(1, \log_2 \|f\|)} \le 1 + \|f\|\). This has the effect of bumping up the value of \(\lambda\) required by \(O(M^2/\sqrt{n})\). However, theoretically this term is strongly dominated by the \(\hat{D}_n\) term (see §4), while practically even the value \(\lambda = \hat{D}_n(\mathcal{DC}_1)\) tends to produce oversmoothed solutions (see §5).

With the appropriate choice of \(\lambda,\) this yields the following convex optimisation problem, \[\begin{align} \label{program:SRM} &\qquad \min_{\hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i, L} \sum_{i = 1}^n ( \hat{y_i} - y_i)^2 + \lambda L\\ &\textrm{s.t.} \begin{cases} \hat{y}_i - \hat{y}_j + z_i - z_j\geq \langle \boldsymbol{a}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i,j \in [n] \\ z_i - z_j \geq \langle \boldsymbol{b}_j, \boldsymbol{x}_i-\boldsymbol{x}_j \rangle & i,j \in [n] \\ \|\boldsymbol{a}_i\|_1 + \|\boldsymbol{b}_i\|_1 \leq L & \quad i\in[n] \end{cases}\notag \end{align}\tag{5}\]

Once again, the first two constraints above are necessary and sufficient to ensure that a DC function taking the values \(\hat{y}_i\) at \(\boldsymbol{x}_i\) exists, with the vectors \(\boldsymbol{a}_i, \boldsymbol{b}_i\) serving as witnesses for the subgradients of the convex parts of such a function at the \(\boldsymbol{x}_i\). The third constraint enforces that the function has seminorm at most \(L\). Notice that this third condition effectively imposes \(\ell_1\)-regularisation on the weight vectors \(\boldsymbol{a}_i, \boldsymbol{b}_i\), encouraging these weights to be sparse. A sketch of the resulting program is given below.
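For illustration, the SRM program (5 ) can be passed directly to a generic solver; the following cvxpy sketch (ours, with illustrative names such as fit_pldc, and not the reference implementation) mirrors the objective and constraints above.

```python
# Sketch: structural risk minimisation (program (5)) with the squared loss.
import cvxpy as cp

def fit_pldc(X, y, lam):
    """X: (n, d) predictors, y: (n,) responses, lam: the penalty weight lambda."""
    n, d = X.shape
    yhat, z = cp.Variable(n), cp.Variable(n)
    A, B = cp.Variable((n, d)), cp.Variable((n, d))
    L = cp.Variable(nonneg=True)                                  # the DC seminorm bound
    cons = [cp.norm(A, 1, axis=1) + cp.norm(B, 1, axis=1) <= L]
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            cons += [yhat[i] - yhat[j] + z[i] - z[j] >= A[j] @ diff,
                     z[i] - z[j] >= B[j] @ diff]
    cp.Problem(cp.Minimize(cp.sum_squares(yhat - y) + lam * L), cons).solve()
    return yhat.value, z.value, A.value, B.value
```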

Finally, we may use the witnessing values \(\hat{y}_i, z_i\) and \(\boldsymbol{a}_i, \boldsymbol{b}_i,\) to construct, in the manner of Proposition 2, the following PLDC function, which interpolates the values \(\hat{y}_i\) over the entirety of the domain. Notice that since this function has the same empirical loss as any DC function that satisfies \(f(\boldsymbol{x}_i) = \hat{y}_i\), this construction enjoys the risk bounds constructed above. \[\label{eqn:pldc95estimate} \begin{align} \hat{f}(\boldsymbol{x}) \triangleq & \max_{i\in [n]} \left( \langle \boldsymbol{a}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + \hat{y}_i + z_i \right) \\ &\quad -\max_{i\in[n]} \left( \langle \boldsymbol{b}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + z_i \right) \end{align}\tag{6}\]

3.3 Computational Complexity↩︎

Training: First, we note that we may replace the constraints on the \(1\)-norms of the vectors \(\boldsymbol{a}_i, \boldsymbol{b}_i\) in the above by linear constraints, via the standard trick of writing the positive and negative parts of each of their components separately. Overall, this renders the discrepancy program of Theorem 2 as a linear programming problem over \(2n(2d + 1)\) variables, and with \(n^2\) non-trivial constraints. Note that in our setting, one typically requires that \(n \ge d\) - that one has more samples than dimensions. Via interior point methods, this problem may be solved in \(O(n^5)\) time. For the least squares loss \(\ell(y,\hat{y}) = (y - \hat{y})^2,\) the second program (5 ) is a convex quadratic program when the \(1\)-norm constraints are rewritten as above. The decision variables are the same as in the first problem, with the addition of the single \(L\) variable, and the constraints remain identical. Again, via interior point methods, these programs can be solved in \(O(d^2n^5)\) time (see Ch. 11 of [22]). The latter term dominates this complexity analysis. We note that in practice, these problems can be solved significantly faster than the above bounds suggest.

Speeding up training via a GPU implementation: An iterative solver for a modified version of the SRM program (5 ), based on the ADMM method [23], is given in Appx. 12. Each iteration of this algorithm can be distributed over \(O(n^2+nd)\) parallel processors. A Python implementation, compatible with GPUs, is given in our GitHub repository. We note that a similar algorithm for Lipschitz convex regression is provided in [13], [24]; however, not all of the ADMM blocks there are solved in closed form, and these require additional optimization in each iteration.

Prediction: By appending a \(1\) to the input, and the constants \(\hat{y}_i + z_i - \langle \boldsymbol{a}_i , \boldsymbol{x}_i \rangle\) and \(z_i - \langle \boldsymbol{b}_i, \boldsymbol{x}_i\rangle\) to \(\boldsymbol{a}_i\) and \(\boldsymbol{b}_i\) respectively, we can reduce the inference-time problem to solving two maximum inner product search problems over \(n\) vectors in \(\mathbb{R}^{d+1}\). This is a well developed and fast primitive; e.g., [25] provide a locality sensitive hashing based scheme that solves the problem in time that is sublinear in \(n\).
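Concretely, with parameters such as those returned by the fit_pldc sketch above (a hypothetical name of ours), prediction takes the following form; the augmented-vector construction is a sketch of the reduction just described.

```python
# Sketch: evaluate the estimate (6) as two maximum inner product searches
# over augmented weight vectors in R^{d+1}.
import numpy as np

def make_predictor(X, yhat, z, A, B):
    # The last coordinate carries the offsets yhat_i + z_i - <a_i, x_i> and z_i - <b_i, x_i>.
    A_aug = np.hstack([A, (yhat + z - np.sum(A * X, axis=1))[:, None]])
    B_aug = np.hstack([B, (z - np.sum(B * X, axis=1))[:, None]])
    def predict(x):
        x_aug = np.append(x, 1.0)                    # append a 1 to the input
        return np.max(A_aug @ x_aug) - np.max(B_aug @ x_aug)
    return predict
```

The brute-force argmax over the \(n\) rows already runs in \(O(nd)\) time per query; a scheme such as that of [25] would replace these two maxima when sublinear-time search is desired.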

4 Analysis↩︎

We note that this analysis makes extensive use of the work of [12], [13] on convex and max-affine regression, with emphasis on the latter thesis, which contains certain refinements of the former paper.

In this section, we assume that the true model \(y = f(\boldsymbol{x}) + \varepsilon_i\) holds for an \(f\) that is a DC function, and that we have explicit knowledge that \(\| f\| \le L.\) Also recall our assumption that the distribution is supported on a set that is contained in the \(\ell_\infty\) ball of radius \(R\). We begin with a few preliminaries.

4.0.0.1 A lower bound on risk

The minimax risk of estimation under the squared loss is \(\Omega(n^{-2/(d+2)}).\) This follows by setting \(p = 1\) (i.e., Lipschitzness) in Theorem 3.2 of [26], which can be done since the standard constructions of obstructions to estimators used in showing such lower bounds all have regularly varying derivatives, and thus finite DC seminorms.

4.0.0.2 PLDC solutions are not lossy

Lemma 5.2 of [13] argues that for every convex function \(\phi\) with Lipschitz constant \(\sup_x \|\nabla_* \phi\|_1\) bounded by \(L\), there exists a Lipschitz max-affine function \(\phi_{\mathrm{PL}}\), a maximum over \(n\) affine pieces, such that \(\| \phi - \phi_{\mathrm{PL}}\|_\infty \le 36LR n^{-2/d}\). Recall that PLDC functions can be represented as differences of max-affine functions, and DC functions as differences of convex functions. Since the DC seminorm controls the \(1\)-norm of the subgradients, which dominates the \(2\)-norm, it follows that for every \(\mathcal{DC}_L\) function \(f\), there exists a \(\textrm{pl-}\mathcal{DC}_{2L}\) function \(f_{\mathrm{PL}}\) with \(\| f - f_{\mathrm{PL}}\|_\infty\! =\! O(n^{-2/d}).\) Note that the resulting excess risk in the squared error due to using PLDC functions can thus be bounded as \(O(n^{-4/d}),\) which is \(o(n^{-2/(d+2)}),\) i.e., it is dominated by the minimax risk.

4.1 Statistical risk↩︎

The bound (4 ) provides an instance-specific control on the generalisation error of an estimate via the empirical maximum discrepancy \(\hat{D}_n\). This section is devoted to providing generic bounds on the same under the assumption of i.i.d. sampled data. We adapt the analysis of [13] for convex regression in order to do so. The principal result is as follows.

Theorem 3. For distributions supported on a compact domain \(\Omega \subset \{\|\boldsymbol{x}\|_\infty \le R\},\) with \(n\ge d\), it holds that \[\hat{D}_n(\mathcal{DC}_L) \le 60LR \left(\frac{d}{n}\right)^{2/(d+4)} \left( 1 + 2\frac{\log(n/d)}{d+4}\right).\] Further, if the ground truth \(f \in \mathcal{DC}_L\), then with probability at least \(1-\delta\) over the data, the estimator \(\hat{f}\) of \((\ref{eqn:pldc95estimate})\) satisfies \[\begin{align} &\mathbb{E}[ |Y - \hat{f}(\boldsymbol{x})|^2] \le \mathbb{E}[|Y- f(\boldsymbol{x})|^2]\\ &\quad + O((n/d)^{-2/(d+4)} \log(n/d) ) + O(\sqrt{\log(1/\delta)/n}). \end{align}\]

Proof. Assume \(f \in \mathcal{DC}_L\). Note that for any convex representation \(f = \psi_1 - \psi_2\), we may instead construct a representation \(f = \phi_1 - \phi_2 + c\) for a constant \(c\) so that the resulting convex functions \(\phi_1\) and \(\phi_2\) are uniformly bounded on the domain - indeed, this may be done by setting \(c = f(0)\), and \(\phi_k = \psi_k - \psi_k(0)\). The \(\phi\)s retain the bound on their Lipschitz constants, and thus are uniformly bounded by \(LR\) over the domain. Thus, we may represent the class of \(\mathcal{DC}_L\) functions as the sum of a constant and a \(\mathcal{DC}_L\) function whose convex parts are bounded. Call the latter class \(\mathcal{DC}_{L,0}.\) Importantly, since the constants cancel in the computation of the empirical maximum discrepancy, we observe that \(\hat{D}_n(\mathcal{DC}_L) = \hat{D}_n(\mathcal{DC}_{L,0}).\)

The principal advantage of the above exercise is that the empirical discrepancy for DC functions with bounded convex parts can be controlled via the metric entropy of bounded Lipschitz convex functions, which have been extensively analysed by [13]. This is summarised in the following pair of lemmata. The first argues that the discrepancy of \(\mathcal{DC}_{L,0}\) functions is controlled by that of bounded Lipschitz convex functions.

Lemma 1. Let \(\mathcal{C}_{L,LR}\) be the set of convex functions that are \(L\)-Lipschitz and bounded by \(LR\). Then \(\hat{D}_n(\mathcal{DC}_{L,0}) \le 2 \hat{D}_n(\mathcal{C}_{L,LR}).\)

The proof of the above is left to Appx. C. The second lemma, due to Dudley, is a generic method to control the discrepancy. We state this for \(\mathcal{C}_{L, LR}\).

Lemma 2. Let \(\mathcal{H}_\infty(\mathcal{C}_{L, LR}, \epsilon)\) be the metric entropy of the class \(\mathcal{C}_{L, LR}\) under the sup-metric \(d(f,g) = \|f-g\|_\infty\). Then the empirical maximum discrepancy is bounded as \[\hat{D}_n(\mathcal{C}_{L, LR}) \le \inf_{\epsilon > 0} \left( \epsilon + LR \sqrt{2\frac{\mathcal{H}_\infty(\mathcal{C}_{L, LR}, \epsilon)}{n}} \right).\]

Finally, we invoke the control on metric entropy of bounded Lipschitz convex functions provided by [12], [13]

Theorem 1 ([12], [13]). For \(\epsilon \in (0, 60LR],\) \[\mathcal{H}_\infty(\mathcal{C}_{L,LR}, \epsilon) \le 3d\left(\frac{40 L R}{\epsilon}\right)^{d/2}\log \left( \frac{60LR}{\epsilon} \right).\]

Using the above in the bound of Lemma 2, and choosing \(\epsilon = (60 LR) (d/n)^{2/(d+4)}\) yields the claim.

Control on the excess risk of the estimator follows readily. For any \(\lambda \ge 24M \hat{D}_n(\mathcal{DC}_1),\) we have, with high probability,\[\begin{align} &\mathbb{E}[ (\hat{f} - Y)^2 ] \\ \le\,\, &\hat{\mathbb{E}}[ (\hat{f} - Y)^2] + \lambda\|\hat{f}\| + 2\lambda + O(1/\sqrt{n})\\ \le\,\, &\hat{\mathbb{E}}[ (f - Y)^2] + \lambda \| f\| + 2\lambda + O(1/\sqrt{n})\\ \le\,\, &\mathbb{E}[ (f-Y)^2] + 2\lambda \|f\| + 2\lambda + O(1/\sqrt{n})\\ \le\,\, &\mathbb{E}[ (f-Y)^2] + 2\lambda (L+1) + O(1/\sqrt{n}) \end{align}\] where the first and third inequalities utilise (4 ), the second holds because \(\hat{f}\) is the SRM solution, and the last uses \(\|f\| \le L\). The claim follows on incorporating the bound on \(\hat{D}_n\) developed above, and since we choose \(\lambda\) proportional to the same. ◻

4.1.0.1 On adaptivity

Notice that the argument for showing the excess risk bound proceeds by controlling \(\hat{D}_n\). This allows the argument to adapt to the dimensionality of the data. Indeed, if the true law is supported on some low dimensional manifold, then the empirical discrepancy, which only depends on the observed data, grows only with this lower dimension. More formally, due to the empirical discrepancy being an empirical object, we can replace the metric entropy control over DC functions in \(\mathbb{R}^d\) by metric entropy of DC functions supported on the manifold in which the data lies, which in turn grows only with the (doubling) dimension of this manifold.

4.1.0.2 On suboptimality

Comparing to the lower bound, we find that the above rate is close to being minimax, although the (multiplicative) gap is about \(n^{4/d^2}\) and diverges polynomially with \(n\). In analogy with the observations of [13] for max-affine regression, we suspect that this statistical suboptimality can be ameliorated by restricting the PLDC estimate to have some (precisely chosen) \(K_n < n\) hyperplanes instead of the full \(n\). However, as discussed in §1.1, such restrictions lead to increase in the computational costs of training, and thus, we do not pursue them here.

5 Experiments↩︎

In this section we apply our method to both synthetic and real datasets for regression and multi-class classification. The datasets were chosen to fit in the regime of \(n\le 10^3, d \le 10^2\) as described in the introduction. All results are averaged over \(100\) runs and are reported with the \(95\%\) confidence interval.

For the DC function fitting procedure, we note that the theoretical value for the regularization weight tends to over-smooth the estimators. This behaviour is expected since the bound (4 ) is designed for the worst case. For all the subsequent experiments we make two modifications - since none of the values in the datasets observed are very large, we simply set \(12M = 1,\) and further, we choose the weight, i.e. \(\lambda\) in (5 ), by cross validation over the set \(2^{-j} \hat{D}_n(\mathcal{DC}_1)\) for \(j \in [-8:1]\). Fig. 1 presents both these estimates in a simple setting where one can visually observe the improved fit. Note that this tuning is still minimal - the empirical discrepancy of \(\mathcal{DC}_1\) fixes a rough upper bound on the \(\lambda\) necessary, and we explore only a few different scales.

Figure 1: Top: a two dimensional function along with the sampled points used for estimating it; Bottom: the learned DC function via \(L_1\) regression using \(\lambda= 2\hat{D}_n\) (left), and using cross validation over \(\lambda\) (right).

For the regression task we use the \(L_1\) empirical loss in our algorithm, instead of \(L_2\). That is, the objective in (5 ) is replaced by \(\sum |y_i - \hat{y}_i|\). This change allows us to implement the fitting program as a linear programming problem and significantly speeds up computation. However, in the following we will only report the \(L_2\) error of the solutions obtained this way. We compare our method to a multi-layer perceptron (neural network) with the number of hidden layers chosen from \(1:10\) by \(5\)-fold cross validation, Multivariate Adaptive Regression Splines (MARS), and \(K\)-nearest neighbours (\(K\)-NN), where the best value of \(K\) was chosen by \(5\)-fold cross validation from \(1:10\).

For the multi-class classification task we adopt the multi-class hinge loss to train our model, i.e. the loss \[\begin{align} \sum_i \sum_{j\neq y_i}^m\max( f_j(\boldsymbol{x}_i)-f_{y_i}(\boldsymbol{x}_i) +1, 0), \end{align}\] where \(m\) is the number of classes and the \(f_j\)’s are DC functions. We compare with the same MLP as above but trained with the cross-entropy loss, \(K\)-NN, and a one-versus-all SVM.

In both cases, we have used the MATLAB Statistics and Machine Learning library for its implementations of the MLP, \(K\)-NN and SVM. For MARS we used the open source implementation in the ARESLab toolbox, implemented in MATLAB. Our code, along with the other algorithms, is available in our GitHub repository. In each of the following tasks, we observe that our method performs competitively against all considered alternatives on almost every dataset, and often outperforms them, across the variation in dimensionality and dataset sizes.

5.0.0.1 Regression on Synthetic Data

We generated data from the function, \[\begin{align} y &= f(\boldsymbol{x}) + \varepsilon\\ f(\boldsymbol{x}) &= \sin\big(\frac{\pi}{\sqrt{d}}\sum_{j=1}^d x_{j}\big)+\big(\frac{1}{\sqrt{d}}\sum_{j=1}^d x_{j}\big)^2, \end{align}\] where \(\boldsymbol{x}\) is sampled from a standard Gaussian, and \(\varepsilon\) is centred Gaussian noise with standard deviation \(0.25\).

We generate \(50\) points for training. For testing, we estimate the Mean Squared Error on a test set of \(5000\) points without the added noise. We normalize the MSE by the variance of the test values and multiply by \(100\). Results are presented in Fig. 2. Observe that our algorithm consistently outperforms the competing methods, especially as we increase the dimension of the data. Furthermore, our algorithm has lower error variance across the runs.
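For reproducibility, a sketch of this data-generating process is given below; the seed handling and function names are ours, and the paper's exact pipeline may differ.

```python
# Sketch: synthetic regression data y = sin(pi * s) + s^2 + noise, with s = sum(x)/sqrt(d).
import numpy as np

def make_synthetic(n, d, noise_sd=0.25, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))                 # standard Gaussian predictors
    s = X.sum(axis=1) / np.sqrt(d)
    f = np.sin(np.pi * s) + s ** 2              # ground-truth regression function
    y = f + noise_sd * rng.normal(size=n)
    return X, y, f

X_tr, y_tr, _ = make_synthetic(50, 10, seed=1)        # 50 noisy training points
X_te, _, f_te = make_synthetic(5000, 10, seed=2)      # 5000 noiseless test targets
```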

Figure 2: Mean Squared Error in a regression task vs. the dimension of the data in the synthetic experiments. Note that both the value and the size of the error bars are consistently better than those of the competing methods.

5.0.0.2 Regression on Real Data

We apply the stated methods to various moderately sized regression datasets that are available in the MATLAB statistical machine learning library. The results are presented in Fig. 3.

In the plot, the datasets are arranged so that the dimension increases from left to right. We observe that we perform comparably to the other methods on some datasets and outperform them on others. See Appx. D for a description of each of the datasets studied.

Figure 3: Normalized Mean Squared Error in regression tasks.

5.0.0.3 Multi-class classification

We used popular UCI classification datasets for testing our classification algorithm. We repeated the experiments \(100\) times. We present the mean and \(95\%\) C.I.s on a 2-fold random cross validation set in Fig. 4. We observe that our method performs comparably to the other algorithms on some datasets and outperforms them on others.

Figure 4: Misclassification rates on UCI datasets.

6 Discussion↩︎

The paper proposes an algorithm to learn piecewise linear functions using differences of max-affine functions. Our model results in linear or convex quadratic programs which can be solved efficiently even in high dimensions. We have shown several theoretical results on the expressivity of our model, as well as its statistical properties, including good risk decay and adaptivity to low dimensional structure, and have demonstrated the practicality of the resulting model on regression and classification tasks.

6.0.0.1 Wider context

Non-parametric methods are most often utilised in settings with limited data in moderate dimensions. Within this context, along with strong accuracy, it is often desired that the method is fast, and is interpretable, especially in relatively large dimensions.

In settings with large datasets, accuracy is relatively easy to achieve, and speed at inference is more important. In low dimensional settings, this makes methods such as MARS attractive, due to their fast inference time, while in high dimensions MLPs are practically preferred. In settings with low dimensionality and small datasets, interpretability and speed take a backseat due to the small number of features, while accuracy becomes the critical requirement, promoting the use of kernelised or nearest neighbour methods.

On the other hand, for small datasets in moderate to high dimensions, interpretability gains an increased emphasis. Within this context, our method results in a piecewise linear fit for which it is easy to characterise the locally important features, and further does so with relatively sparse weights via the \(\ell_1\) regularisation on \(\boldsymbol{a}_i, \boldsymbol{b}_i\). Further, since the suboptimality of our statistical risk is controlled as \(n^{O(1/d^2)},\) as the dimension increases, our method gets closer to the optimal accuracy. This thus represents the natural niche where application of DC function regression is appropriate.

7 Acknowledgment↩︎

This research was supported by NSF CAREER Award 1559558, CCF-2007350 (VS), CCF-2022446 (VS), CCF-1955981 (VS) and the Data Science Faculty Fellowship from the Rafik B. Hariri Institute.

Appendix to ‘Piecewise Linear Regression via a Difference of Convex Functions’

8 Connection to \(L_1\)-regularised splines↩︎

The following proposition argues that in the univariate case fitting a DC function is equivalent to fitting an \(L_1\)-regularised spline. Note that in the bivariate case, \(L_1\) splines are regularised by the entrywise \(1\)-norm of the Hessian, while in the multivariate case they are similarly regularised by a higher derivative tensor. Thus, our method is an alternate generalisation of the \(L_1\)-spline from the univariate case.

Proposition 6. In the univariate setting, solving regression with the \(L_1\)-spline objective, \[\begin{align} \label{prog:spline1} \inf_{f\in C^2} \sum_{i=1}^n \ell(f( x_i), y_i) + \lambda \int_0^1 |f^{''}(s)| d s, \end{align}\qquad{(3)}\] is equivalent to regression with a difference of convex functions penalized by the DC seminorm, with a free linear term, \[\begin{align} \label{prog:dc1} \min_{\phi_1, \phi_2 \;\text{convex} , a} &\sum_{i=1}^n \ell(\phi_1(x_i) - \phi_2(x_i) + a x_i ,y_i) + 2\lambda \sup_{x\in [0, 1]} \left( | \phi_1'(x)| + | \phi_2' ( x)| \right). \end{align}\tag{7}\]

Proof. Suppose \(f\) is the solution to the \(L_1\)-spline objective and \(\phi_1 - \phi_2 + ax = f\), where \(\phi_1\) and \(\phi_2\) are convex. Hence for some \(a_1, a_2\in \mathbb{R}\): \[\begin{align} \phi_1'(x) &= a_1 + \int_{0}^{x} f''^+(s) + g(s) \,ds \\ \phi_2'(x) &= a_2 + \int_{0}^{x} f''^-(s) + g(s)\, ds \\ a &= a_2 - a_1 + f'(0), \end{align}\] where \(g(s) \geq 0\), since otherwise the second derivative of \(\phi_1\) or \(\phi_2\) would be negative, contradicting convexity. Since the error term in (7 ) is invariant to the choices of \(a_1, a_2\) and \(g\), the minimization in (7 ) seeks to minimize only the regularization term. Thus, \[\begin{align} &\min_{\substack{ \phi_1, \phi_2 \textrm{ convex} \\ \phi_1 - \phi_2 +ax= f}}\sup_x \bigg(|\phi_1'(x)| + |\phi_2'(x)|\bigg)\\ =\,\,&\min_{g\geq 0}\min_{a_1,a_2}\sup_x \bigg( \Big|a_1+\int_{0}^{x} f''^+(s)+g(s)ds\Big| + \Big|a_2+\int_{0}^{x} f''^-(s)+g(s)ds\Big|\bigg) \\ \overset{*}{=}\,\, &\min_{g\geq 0}\min_{a_1,a_2} \max\bigg(\Big|a_1\Big| + \Big|a_2\Big|,\;\Big|a_1+\int_{0}^{1} f''^+(s)+g(s)ds\Big| + \Big|a_2+\int_{0}^{1} f''^-(s)+g(s)ds\Big|\bigg)\\ \overset{**}{=}\,\,&\min_{g\geq 0}\bigg(\frac{1}{2}\Big|\int_{0}^{1} f''^+(s)+g(s)ds\Big| + \frac{1}{2}\Big|\int_{0}^{1} f''^-(s)+g(s)ds\Big| \bigg)\\ =\,\, &\min_{g\geq 0} \bigg(\frac{1}{2}\int_{0}^{1} |f''(s)| ds + \int_{0}^{1} g(s) ds\bigg)\\ =\,\,& \frac{1}{2}\int_{0}^{1} |f''(s)| ds. \end{align}\] Therefore the \(L_1\)-spline objective and (7 ) are equivalent, as they attain the same value.

\((*)\) holds since the expressions inside the absolute values are convex and monotonic in \(x\), which causes the inner maximization to be attained at an endpoint, i.e. \(x=0\) or \(x=1\). To show \((**)\), consider \[\begin{align} c_1 &= \int_{0}^{1} f''^+(s)+g(s)ds \\ c_2 &= \int_{0}^{1} f''^-(s)+g(s)ds. \end{align}\] It suffices to show \[\begin{align} \label{equ:CC} C = \min_{a_1, a_2} \max\Big (|a_1| + |a_2|, |a_1+c_1| + |a_2+c_2|\Big) = \frac{|c_1| + |c_2|}{2}. \end{align}\tag{8}\] By choosing \(a_1 = -\frac{c_1}{2}\) and \(a_2 = -\frac{c_2}{2}\), the value on the right hand side of (8 ) is achieved, so \(C \leq \frac{|c_1| + |c_2|}{2}\). To show this upper bound is tight, assume \(C < \frac{|c_1| + |c_2|}{2}\); then both arguments of the max must be smaller than this, in particular: \[\begin{align} |a_1| + |a_2| < \frac{|c_1| + |c_2|}{2}. \end{align}\] However, this in turn gives: \[\begin{align} C \geq |a_1+c_1| + |a_2+c_2| &\geq |c_1| - |a_1| + |c_2| - |a_2| \\ &> \frac{|c_1| + |c_2|}{2}, \end{align}\] which contradicts the assumption \(C < \frac{|c_1| + |c_2|}{2}\). ◻

9 Proofs Omitted from §2↩︎

9.1 Proof that the DC seminorm is indeed a seminorm↩︎

Proposition 7. The functional \[\|f\| := \min_{\substack{ \phi_1, \phi_2 \textrm{ convex} \\ \phi_1 - \phi_2 = f} } \sup_x \sup_{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})} \|\boldsymbol{v}_1 \|_1 + \| \boldsymbol{v}_2\|_1\] is a seminorm over the class \(\mathcal{DC}(\mathbb{R}^d)\). Further, if DC functions are identified up to a constant, it is a norm.

Proof. Homogeneity: Since non-negative scalings of convex functions are convex, if \(f = \phi_1 - \phi_2\), then for any real \(a\), \(af = a\phi_1 - a\phi_2\) is a DC representation if \(a > 0\), and \(af = (-a\phi_2) - (-a\phi_1)\) is a DC representation if \(a \le 0.\) Trivially, if \(\boldsymbol{v} \in \partial_* \phi_i\) then \(a\boldsymbol{v} \in \partial_* (a\phi_i)\) and since \(\| \cdot\|_1\) is a norm, it holds that \(\|a \boldsymbol{v}\|_1 = |a| \|\boldsymbol{v}\|_1\). Thus, \[\begin{align} \|a f\| &= \min_{\substack{ \phi_1, \phi_2 \textrm{ convex} \\ \phi_1 - \phi_2 = af} } \sup_x \sup_{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})} \|\boldsymbol{v}_1 \|_1 + \| \boldsymbol{v}_2\|_1 \\ &= \min_{\substack{ \phi_1, \phi_2 \textrm{ convex} \\ \phi_1 - \phi_2 = f} } \sup_x \sup_{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})} \|(\pm a) \boldsymbol{v}_1 \|_1 + \| (\pm a) \boldsymbol{v}_2\|_1 \\ &= \min_{\substack{ \phi_1, \phi_2 \textrm{ convex} \\ \phi_1 - \phi_2 = f} } \sup_x \sup_{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})} |a| ( \|\boldsymbol{v}_1 \|_1 + \| \boldsymbol{v}_2 \|_1 )\\ &= |a| \|f\|. \end{align}\]

Triangle Inequality: For a DC function \(f\), let \(\mathscr{C}_f\) be the set of pairs of continuous convex functions whose difference is \(f\), i.e. \[\mathscr{C}_f := \{ (\phi_1, \phi_2) \textrm{ cont., convex }: \phi_1 - \phi_2 = f \}.\]

We first note that for DC functions \(f,g\), the set \(\mathscr{C}_{f+g}\) contains all sums of representations, i.e., \[\begin{align} \mathscr{C}_{f+g} \supseteq \{ (\psi_1, \psi_2): \exists (\phi_1, \phi_2) \in \mathscr{C}_f, (\widetilde{\phi}_1, \widetilde{\phi}_2) \in \mathscr{C}_g \textrm{ s.t. } \psi_1 = \phi_1 + \widetilde{\phi}_1, \psi_2 = \phi_2 + \widetilde{\phi}_2 \}. \end{align}\]

Indeed, for any pair \((\phi_1, \phi_2) \in \mathscr{C}_f\) and \((\widetilde{\phi}_1, \widetilde{\phi}_2) \in \mathscr{C}_g,\) the sums \(\phi_1 + \widetilde{\phi}_1\) and \(\phi_2 + \widetilde{\phi}_2\) are continuous and convex, and \((\phi_1 + \widetilde{\phi}_1) - (\phi_2 + \widetilde{\phi}_2) = f + g,\) so \((\phi_1 + \widetilde{\phi}_1, \phi_2 + \widetilde{\phi}_2) \in \mathscr{C}_{f+g}.\)

Since minimising over the larger set \(\mathscr{C}_{f+g}\) can only decrease the value, \[\begin{align} &\,\,\,\,\|f + g\|\\ &=\min_{ (\psi_1, \psi_2) \in \mathscr{C}_{f+g} } \sup_x \sup_{\boldsymbol{u}_i \in \partial_* \psi_i(\boldsymbol{x})} \|\boldsymbol{u}_1 \|_1 + \| \boldsymbol{u}_2\|_1 \\ &\le\min_{ \substack{(\phi_1, \phi_2) \in \mathscr{C}_f \\ (\widetilde{\phi}_1, \widetilde{\phi}_2) \in \mathscr{C}_g } } \sup_x \sup_{\substack{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})\\ \widetilde{{\boldsymbol{v}}}_i \in \partial_* \tilde{\phi}_i(\boldsymbol{x})}}\| \boldsymbol{v}_1 + \widetilde{\boldsymbol{v}}_1\|_1 +\| \boldsymbol{v}_2 + \widetilde{\boldsymbol{v}}_2\|_1\\ &\le \min_{ \substack{(\phi_1, \phi_2) \in \mathscr{C}_f \\ (\widetilde{\phi}_1, \widetilde{\phi}_2) \in \mathscr{C}_g } } \sup_x \sup_{\substack{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})\\ \widetilde{{\boldsymbol{v}}}_i \in \partial_* \tilde{\phi}_i(\boldsymbol{x})}}\| \boldsymbol{v}_1\|_1 + \| \widetilde{\boldsymbol{v}}_1\|_1 +\| \boldsymbol{v}_2 \|_1 + \| \widetilde{\boldsymbol{v}}_2\|_1 \\ &\le \min_{(\phi_1, \phi_2) \in \mathscr{C}_f} \sup_x \sup_{\substack{\boldsymbol{v}_i \in \partial_* \phi_i(\boldsymbol{x})}}\| \boldsymbol{v}_1 \|_1 + \| \boldsymbol{v}_2\|_1 + \min_{(\widetilde{\phi}_1, \widetilde{\phi}_2) \in \mathscr{C}_g} \sup_x \sup_{\substack{ \widetilde{{\boldsymbol{v}}}_i \in \partial_* \tilde{\phi}_i(\boldsymbol{x})}} \|\widetilde{\boldsymbol{v}}_1\|_1 + \|\widetilde{\boldsymbol{v}}_2\|_1 \\ &= \|f\| + \|g\|. \end{align}\]

Positive definiteness up to constants: We note that if \(\|f\| = 0,\) then \(f\) is a constant function. Indeed, the norm being zero implies that there exists a DC representation \(f = \phi_1 - \phi_2\) such that for all \(\boldsymbol{x}\), the subgradients in \(\partial_* \phi_i(\boldsymbol{x})\) all have \(0\) norm, ergo \(\|\nabla \phi_1 (x) \|_1 = \|\nabla \phi_2(x)\|_1 = 0,\) and thus, \(\nabla f = 0\) everywhere. Conversely, clearly for every constant function, \(\|c\|= 0\). Thus, the seminorm is zero iff the function is a constant. Lastly, note that for any \(f\) and a constant \(c\), \(\|f\| = \|f + c\|.\) Thus, equating DC functions that differ only by a constant turns the above seminorm into a norm. ◻

9.2 ReLU networks and PLDC functions↩︎

9.2.1 Proof of Proposition 4↩︎

Proof. Recall that a fully connected neural network with ReLU activations, and \(D\) layers with weight matrices \(\boldsymbol{W}^1, \dots, \boldsymbol{W}^D\), is a function, here \(f\), taking the form \[\begin{align} f(\boldsymbol{x}) &= \sum_j w^{D+1}_j a^D_j \\ a^{l+1}_i &= \max (\sum_{j} w^{l+1}_{i,j} a^{l}_j, 0), \quad D > l\geq 1\\ a^1_i &= \max (\sum_{j}w^1_{i,j} x_j, 0), \end{align}\]

Our proof of the statement is constructed by induction over the layers of the network, using the following relations: \[\begin{align} \max(a,b) + \max(c,d) &= \max (a+c,a+d, b+c, b+d). \tag{9} \\ \max(\max(a,0)-\max(b,0),0) &= \max(a,b,0) - \max(b,0) \nonumber\\ &=\max_{z\in\{0,1\}, t\in\{0,1\}}(tz a +t(1\!-\!z)b) - \max_{t\in\{0,1\}} tb. \tag{10} \end{align}\] We will prove that each node \(a^{l}_i\) is a DC function of the form: \[\begin{align} a_i^l &= \max_{k\in[K_l]} \langle \boldsymbol{\alpha}^l_{i,k}, \boldsymbol{x} \rangle - \max_{k\in[K_l]} \langle \boldsymbol{\beta}^l_{i,k}, \boldsymbol{x} \rangle \end{align}\] such that \[\begin{align} \max_k(\|\boldsymbol{\alpha}^{l+1}_{i,k}\| , \|\boldsymbol{\beta}^{l+1}_{i,k}\|) \leq \sum_{j} |w^{l+1}_{i,j}| \max_k(\|\boldsymbol{\alpha}^l_{j,k}\| , \|\boldsymbol{\beta}^l_{j,k}\|). \end{align}\] The base case of the induction is immediate from the definition of \(a_i^1\), which is a max of linear terms and therefore convex. Now, assuming the claim holds for \(a_j^{l}\) for all \(j\), we derive the result for \(a_i^{l+1}\). Assume layer \(l\) has \(h\) hidden units. For clarity we drop the index \(l\) from \(w^{l+1}_j\), \(a^l_j\) and \(K_l\). We further define \(s \triangleq a_i^{l+1}\). \[\begin{align} s &= \max\bigg(\sum_{j\in[h]} w_{j} a_j, 0\bigg)\\ &= \max \bigg(\sum_{j\in[h]} w_{j}^+ a_j - \sum_{j\in[h]} -w_{j} ^- a_j, 0 \bigg)\\ &= \max \bigg(\sum_{j\in[h]} \max_{k\in[K]} \Big \langle w_{j}^+ \boldsymbol{\alpha}_{j,k}, \boldsymbol{x} \Big \rangle + \max_{k\in[K]} \Big \langle -w_{j}^-\boldsymbol{\beta}_{j,k}, \boldsymbol{x} \Big \rangle - \sum_{j\in[h]} \max_{k\in[K]}\Big \langle -w_{j}^- \boldsymbol{\alpha}_{j,k}, \boldsymbol{x}\Big \rangle - \max_{k\in[K]} \Big \langle w_{j}^+ \boldsymbol{\beta}_{j,k}, \boldsymbol{x} \Big \rangle , 0 \bigg)\\ &\overset{(9)}{=} \max \bigg( \max_{\boldsymbol{k1},\boldsymbol{k2}\in[K]^h} \Big \langle \sum_{j\in[h]} w_j^+\boldsymbol{\alpha}_{j,\boldsymbol{k1} (j)}-w_{j}^- \boldsymbol{\beta}_{j,\boldsymbol{k2} (j)}, \boldsymbol{x} \Big \rangle -\max_{\boldsymbol{k1},\boldsymbol{k2}\in[K]^h} \Big \langle \sum_{j\in[h]} -w_j^-\boldsymbol{\alpha}_{j,\boldsymbol{k1}(j)}+w_{j}^+ \boldsymbol{\beta}_{j,\boldsymbol{k2}(j)}, \boldsymbol{x} \Big \rangle ,0\bigg)\\ &\overset{(10)}{=} \max_{\boldsymbol{k1},\boldsymbol{k2}\in[K]^h,z\in\{0,1\},t\in\{0,1\}} \Big \langle t \sum_{j\in[h]} z(w_j^+ \boldsymbol{\alpha}_{j,\boldsymbol{k1}(j)}- w_{j}^- \boldsymbol{\beta}_{j,\boldsymbol{k2}(j)}) + (1-z)(-w_j^-\boldsymbol{\alpha}_{j,\boldsymbol{k1}(j)}+w_{j}^+\boldsymbol{\beta}_{j,\boldsymbol{k2}(j)}), \boldsymbol{x} \Big \rangle \\ &\qquad -\max_{\boldsymbol{k1},\boldsymbol{k2}\in[K]^h,t\in\{0,1\}} \Big \langle t\sum_{j\in[h]} -w_j^-\boldsymbol{\alpha}_{j,\boldsymbol{k1} (j)}+w_{j}^+ \boldsymbol{\beta}_{j,\boldsymbol{k2} (j)}, \boldsymbol{x} \Big \rangle \end{align}\] Now consider \(z=1\) and note that for each term of the max function: \[\begin{align} \|\sum_{j\in[h]} w_j^+ \boldsymbol{\alpha}_{j,\boldsymbol{k1}(j)}- w_{j}^- \boldsymbol{\beta}_{j,\boldsymbol{k2}(j)}\| &\leq \sum_{j\in[h]} |w_j| \max (\|\boldsymbol{\alpha}_{j,\boldsymbol{k1}(j)}\|, \|\boldsymbol{\beta}_{j,\boldsymbol{k2}(j)}\|)\\ &\leq \sum_{j\in[h]} |w_j| \max_{k} (\|\boldsymbol{\alpha}_{j,k}\|, \|\boldsymbol{\beta}_{j,k}\|) \end{align}\] The proof is analogous for \(z=0\). ◻

9.2.2 Proof of Proposition 5↩︎

Proof. For convenience, we extend the input dimension by \(1\) and append a constant \(1\) to the end, so that the effective inputs are \((\boldsymbol{x}, 1).\) This allows us to avoid carrying bias terms in what follows.

Let the PLDC function in question be \[f(\boldsymbol{x}) = \max_{k \in [K]} \langle \boldsymbol{a}_k, \boldsymbol{x} \rangle - \max_{k \in [K]} \langle \boldsymbol{b}_k, \boldsymbol{x}\rangle .\]

We give a construction below to compute \(\max_{k \in [K]} \langle \boldsymbol{a}_k , \boldsymbol{x}\rangle\) via a ReLU net that has at most \(2\lceil \log_2 K\rceil\) hidden layers, and width at most \(2K\) when \(K\) is a power of \(2\). Repeating this construction in parallel with the \(\boldsymbol{a}\)s replaced by \(\boldsymbol{b}\)s then yields the required network.

By adding dummy values of \(\boldsymbol{a}_k, \boldsymbol{b}_k\) (i.e., equal to \(0\)), we may increase \(K\) to a power of two without affecting the depth in the claim. The resulting width is then bounded as \(2 \cdot 2^{\lceil \log_2 K \rceil} \le 4K,\) which yields the claimed bound. This avoids having to introduce further dummy nodes later, and simplifies the argument.

To begin with, we set the first hidden layer to consist of \(2K\) nodes, labelled \((k,+)\) and \((k,-)\) for \(k \in [K]\), with the outputs \[\begin{align} h^1_{k,+} &= \max(\langle \boldsymbol{a}_k ,\boldsymbol{x}\rangle, 0) \\ h^1_{k,-} &= \max(\langle -\boldsymbol{a}_k ,\boldsymbol{x}\rangle, 0) \end{align}\]

Thus, the outputs of the first layer encode the inner products \(\langle \boldsymbol{a}_k, \boldsymbol{x}\rangle\) in pairs. In the next layer, we collate each two \(\pm\) pairs of nodes into \(3\) nodes, indexed as \((j,0), (j,1), (j,2)\) for \(j \in [K/2]\), such that the outputs are \[\begin{align} h^2_{j,0} &= \max( h^1_{2j-1, +} - h^{1}_{2j-1, -} -h^1_{2j, +} + h^1_{2j, -} , 0) = \max( \langle \boldsymbol{a}_{2j-1} - \boldsymbol{a}_{2j}, \boldsymbol{x} \rangle, 0)\\ h^2_{j,1} &= \max( h^1_{2j, +} - h^1_{2j, -},0) = \max(\langle \boldsymbol{a}_{2j} , \boldsymbol{x}\rangle,0) \\ h^2_{j,2} &= \max( -h^1_{2j, +} + h^1_{2j, -},0) = \max(\langle -\boldsymbol{a}_{2j} , \boldsymbol{x}\rangle,0) \end{align}\]

In the next hidden layer, we may merge these three nodes for each \(j\) into two, indexed \((\ell, +)\) and \((\ell, -)\) for \(\ell \in [K/2]\), giving a total width of \(\le K + 1\). This utilises the simple relation \[\max(a,b) = \max(a-b,0) + \max(b,0) - \max(-b,0),\] easily shown by casework on the sign of \(b\). We define the outputs \[\begin{align} h^3_{\ell, +} &= \max(h^2_{\ell, 0} + h^2_{\ell, 1} - h^2_{\ell, 2}, 0) = \max( \max(\langle \boldsymbol{a}_{2\ell - 1}, \boldsymbol{x} \rangle, \langle \boldsymbol{a}_{2\ell}, \boldsymbol{x} \rangle), 0) \\ h^3_{\ell, -} &= \max(-h^2_{\ell, 0} - h^2_{\ell, 1} + h^2_{\ell, 2}, 0) = \max(- \max( \langle \boldsymbol{a}_{2\ell - 1}, \boldsymbol{x} \rangle, \langle \boldsymbol{a}_{2\ell}, \boldsymbol{x} \rangle), 0) \end{align}\]

But now note that the outputs of the third hidden layer are analogous to those of the first hidden layer, but with some of the maximisations merged, and thus the circuit above can be iterated. This exploits the identity \(\max(a,b,c,d) = \max( \max(a,b), \max(c,d))\). We may therefore apply this construction \(\log_2 K\) times in total, halving the number of \(\pm\) pairs (and hence the width of the odd-numbered hidden layers) at each step, and using two hidden layers per step. At the final step, by the iterative property of \(\max\), we obtain a 2-node hidden layer with outputs \(\max( \max_{k \in [K]} \langle \boldsymbol{a}_k , \boldsymbol{x} \rangle , 0 )\) and \(\max( -\max_{k \in [K]} \langle \boldsymbol{a}_k , \boldsymbol{x} \rangle , 0 ),\) and the final layer may simply report the difference of these, with no nonlinearity.

We note that more efficient constructions are possible, but are harder to explain. In particular, the above construction can be condensed into one with \(\le 3K\) width and \(\lceil \log_2 K \rceil\) depth. ◻
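To make the construction concrete, here is a minimal NumPy sketch (with hypothetical helper names, not code from the paper) that evaluates \(\max_{k \in [K]} \langle \boldsymbol{a}_k, \boldsymbol{x} \rangle\) using only ReLUs of linear combinations of the previous layer's outputs, following the pairing scheme above; each pass through the loop corresponds to the two hidden layers of one merging step.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def max_affine_relu_net(A, x):
    """Evaluate max_k <A[k], x> using only ReLUs of linear combinations of
    the previous layer's outputs, mimicking the pairing construction above.
    A is (K, d) with K a power of two (pad A with zero rows otherwise)."""
    K = A.shape[0]
    assert K >= 1 and K & (K - 1) == 0, "pad A with zero rows so K is a power of 2"
    v = A @ x
    p, m = relu(v), relu(-v)          # first hidden layer: p_k - m_k = <A[k], x>
    while len(p) > 1:
        # one merging step = two hidden layers, halving the number of +/- pairs
        h0 = relu(p[0::2] - m[0::2] - p[1::2] + m[1::2])  # relu(v_odd - v_even)
        h1 = relu(p[1::2] - m[1::2])                      # relu(v_even)
        h2 = relu(-p[1::2] + m[1::2])                     # relu(-v_even)
        # max(a, b) = max(a - b, 0) + max(b, 0) - max(-b, 0)
        p = relu(h0 + h1 - h2)
        m = relu(-h0 - h1 + h2)
    return p[0] - m[0]                # linear output layer, no nonlinearity

# sanity check against direct maximisation
rng = np.random.default_rng(0)
A, x = rng.normal(size=(8, 5)), rng.normal(size=5)
assert np.isclose(max_affine_relu_net(A, x), np.max(A @ x))
```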

10 Proof of Theorem 1↩︎

For succinctness, we let \(\mathcal{DC}_L^M := \mathcal{DC}_L \cap \{ f : \sup_{ \boldsymbol{x} \in \Omega} |f(\boldsymbol{x})| \le M\}\). We note that this section makes extensive use of the work of [21], and assumes some familiarity with the arguments presented there. We begin with a restricted version of Theorem 1.

Lemma 3. Given \(n\) i.i.d. samples \(S = \{ (\boldsymbol{x}_i, y_i)\}\) from data \(y_i = f(\boldsymbol{x}_i) + \varepsilon_i\), where both \(f\) and \(\varepsilon\) are bounded by \(M\), with probability at least \(1-\delta,\) it holds uniformly over all \(f \in \mathcal{DC}_L^M\) that \[|\mathbb{E}[ (f(\boldsymbol{x})- y)^2] - \hat{\mathbb{E}}_S[ (f(\boldsymbol{x})- y)^2]| \le 12M \hat{D}_n(\mathcal{DC}_L) + 45M^2 \sqrt{\frac{4 + 8 \log(2/\delta)}{n}}.\]

We prove this lemma after the main argument, and first show how it is utilised to produce the bound of Theorem 1. The key observation is that the classes \(\mathcal{DC}_L^M\) are nested, in that \(\mathcal{DC}_L^M \subset \mathcal{DC}_{L'}^M\) for any \(L \le L'\). This allows a doubling trick over the \(L\)s to bootstrap the above bound to all of \(\mathcal{DC}^M = \bigcup_{L \ge 1} \mathcal{DC}_L^M\). The technique is standard; see, e.g., Thm. 26.14 of [27].

For naturals \(j \ge 1,\) let \(L_j = 2^j\) and \(\delta_j = \delta 2^{-j} = \delta/L_j\). With probability at least \(1-\delta_j,\) it holds for all \(f \in \mathcal{DC}_{L_j}^M\setminus \mathcal{DC}_{L_{j-1}}^M\) simultaneously that \[\begin{align} |\mathbb{E}[ (f(\boldsymbol{x})- y)^2] - \hat{\mathbb{E}}_S[ (f(\boldsymbol{x})- y)^2]| \le 12M \hat{D}_n(\mathcal{DC}_{L_j}) + 45M^2 \sqrt{\frac{4 + 8\log L_j + 8 \log(2/\delta)}{n}} \end{align}\]

Since \(\sum_j \delta_j = \delta,\) the union bound extends this bound to all DC functions, whilst allowing us to use the appropriate value of \(j\) for each \(f\). Taking \(j_f = \max(1,\lceil \log_2 \| f\| \rceil)\), we find that for any \(f \in \mathcal{DC}^M,\) \[\begin{align} &| \mathbb{E}[ (f(\boldsymbol{x})- y)^2] - \hat{\mathbb{E}}_S[ (f(\boldsymbol{x})- y)^2] | \\ &\le 12M \hat{D}_n(\mathcal{DC}_{2^{j_f}}) + 45M^2 \sqrt{\frac{4 + 8\log 2^{j_f} + 8 \log(2/\delta)}{n}} \\ &\le 12M \hat{D}_n(\mathcal{DC}_{2\|f\| + 2}) + 45M^2 \sqrt{\frac{4 + 8 \ln(2) \max( 1, \lceil \log_2 \|f\| \rceil) + 8 \log(2/\delta)}{n}}. \end{align}\]

Proof of Lemma 3. We first recall Theorem 8 of [21], which says that if \(\ell\) is a loss function that is uniformly bounded by \(B\), then for any class of functions \(\mathcal{F}\) and an i.i.d. sample \(S = \{ (\boldsymbol{x}_i, y_i)\}\) of size \(n\), with probability at least \(1-\delta/2\), it holds uniformly over all \(f \in \mathcal{F}\) that \[\mathbb{E}[ \ell( f(\boldsymbol{x}), y)] \le \hat{\mathbb{E}}_S[ \ell(f(\boldsymbol{x}), y)] + \Re_n( \ell \circ \mathcal{F}) + B \sqrt{\frac{8 \log(4/\delta)}{n}},\] where \(\Re_n(\ell \circ \mathcal{F})\) is the Rademacher complexity of the class \(\{ \ell(f(\boldsymbol{x}) , y) - \ell (0,y)\}\).

In our case we are interested in \(\mathcal{F} = \mathcal{DC}_L^M\) and \(\ell = (\cdot - \cdot)^2\). Since \(|f| \le M\) and \(|y| \le 2M,\) we have \(|f(\boldsymbol{x}) - y| \le 3M\), so we may take \(B = 9M^2\).

Next, we use Theorem 12, part 4 of [21], which says that for an \(L\)-Lipschitz loss \(\ell\) with \(\ell(0,0) = 0,\) we have \(\Re_n( \ell \circ \mathcal{F}) \le 2L \Re_n(\mathcal{F})\). Since our assumptions ensure that the argument of the squared loss is bounded by \(3M,\) the loss is \(6M\)-Lipschitz on this class. Thus, we conclude that with \(n\) samples, uniformly over all \(f \in \mathcal{DC}_L^M\), \[\mathbb{E}[ (f(\boldsymbol{x})- y)^2] \le \hat{\mathbb{E}}_S[ (f(\boldsymbol{x})- y)^2] + 12M \Re_n( \mathcal{DC}_L^M) + 9M^2 \sqrt{\frac{8 \log(4/\delta)}{n}}.\]

Next, we utilise Lemma 3 of [21], together with the inclusion \(\mathcal{DC}_L^M \subset \mathcal{DC}_L\), to conclude that with probability at least \(1-\delta/2\), \[\Re_n(\mathcal{DC}_L^M) \le \hat{D}_n(\mathcal{DC}_L^M) + 3M \sqrt{\frac{2}{n}} \le \hat{D}_n(\mathcal{DC}_L) + 3M \sqrt{\frac{ 4 + 8 \log(4/\delta)}{n}}.\]

The claim then follows by a union bound over the two events above. ◻

11 Proof of Lemma 1↩︎

Proof. For any function \(f\), let \[\Delta(f) \triangleq \sum_{i= 1}^{n/2} f(\boldsymbol{x}_i) - \sum_{i = n/2+1}^{n} f(\boldsymbol{x}_i).\] Note that \(\Delta(f+g) = \Delta(f) + \Delta(g)\). Then \[\begin{align} \hat{D}_n(\mathcal{DC}_{L,0}) &= \sup_{f \in \mathcal{DC}_{L,0}} \Delta(f) \\ &= \sup_{\substack{f \in \mathcal{DC}_{L,0} \\ \phi_1,\phi_2 \in \mathcal{C}_{L,LR} \\ \phi_1 - \phi_2 = f}} \Delta(\phi_1) - \Delta(\phi_2) \\ &\le \sup_{\phi_1,\phi_2 \in \mathcal{C}_{L,LR}} \Delta(\phi_1) - \Delta(\phi_2) \\ &\le \sup_{\phi_1 \in \mathcal{C}_{L,LR}} \Delta(\phi_1) + \sup_{\phi_2 \in \mathcal{C}_{L,LR}} \Delta(\phi_2) \\ &= 2\hat{D}_n(\mathcal{C}_{L,LR}) \end{align}\] where the first inequality holds because of the representation of \(\mathcal{DC}_{L,0}\) as a difference of \(\mathcal{C}_{L,LR}\) functions, as discussed in the main text, and the second inequality holds since the supremum is non-negative, as \(0 \in \mathcal{C}_{L,LR}\). ◻

12 Algorithms via ADMM↩︎

This section provides a parallel ADMM optimizer for the piecewise linear regression problem; the updates below are derived in Section 12.1.

Input: data \(\{(\boldsymbol{x}_i, y_i)\,|\, \boldsymbol{x}_i \in \mathbb{R}^D, y_i \in \mathbb{R}\}_{i\in[n]}\), ADMM penalty \(\rho = 0.01\), regularisation weight \(\lambda\), and number of iterations \(T\).

Initialisation: \[\begin{align} \hat{y}_i &= z_i = s_{i,j} = t_{i,j} = \alpha_{i,j} = \beta_{i,j} = 0 & i,j\in[n]\\ \boldsymbol{L} &=\boldsymbol{a}_i =\boldsymbol{b}_i = \boldsymbol{p}_i= \boldsymbol{q}_i = \boldsymbol{u}_{i} = \boldsymbol{\gamma}_{i} = \boldsymbol{\eta}_i = \boldsymbol{\zeta}_i = \boldsymbol{0}_{D\times 1}, & i\in[n]\\ \Lambda_i &= \Big(n\boldsymbol{x}_i\boldsymbol{x}_i^T - \boldsymbol{x}_i \Big(\sum_j \boldsymbol{x}_j\Big)^T - \Big(\sum_j \boldsymbol{x}_j\Big)\boldsymbol{x}_i^T + \sum_j\boldsymbol{x}_j\boldsymbol{x}_j^T + I \Big)^{-1} & i\in[n] \end{align}\]

Repeat the following updates \(T\) times, where each line is computed in parallel over the stated indices: \[\begin{align} A_i &=\sum_j\alpha_{j,i} - \alpha_{i,j} + s_{j,i} - s_{i,j} + \langle \boldsymbol{a}_i + \boldsymbol{a}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle + 2y_j & i \in [n]\\ B_i &= \sum_j\beta_{j,i} - \beta_{i,j}+ t_{j,i} - t_{i,j} + \langle \boldsymbol{b}_i + \boldsymbol{b}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i\in [n]\\ \hat{y}_i & = \frac{2}{2+n\rho}y_i + \frac{\rho}{2(2+n\rho)} A_i - \frac{\rho}{2(2+n\rho)} B_i & i\in [n]\\ z_i & = -\frac{1}{2+n\rho}y_i + \frac{1}{2n(2+n\rho)} A_i + \frac{1+ n\rho}{2n(2+n\rho)} B_i & i\in [n]\\ \boldsymbol{a}_i &= \Lambda_i( \boldsymbol{p}_{i} - \boldsymbol{\eta}_{i}+\sum_j (\alpha_{i,j} + s_{i,j} + \hat{y}_i - \hat{y}_j + z_i - z_j)(\boldsymbol{x}_i - \boldsymbol{x}_j) ) & i\in [n] \\ \boldsymbol{b}_i &=\Lambda_i( \boldsymbol{q}_{i} - \boldsymbol{\zeta}_{i}+\sum_j (\beta_{i,j} + t_{i,j} + z_i - z_j)(\boldsymbol{x}_i - \boldsymbol{x}_j) ) & i\in [n] \\ L_d&= \frac{1}{n}\Big(-\lambda_d/\rho + \sum_i\big(\gamma_{i,d} + |p_{i,d}|+ |q_{i,d}|+u_{i,d}\big)\Big) & d\in[D] \\ p_{i,d}&=\frac{1}{2}\mathop{\mathrm{sign}}(\eta_{i,d} + a_{i,d})(|\eta_{i,d} + a_{i,d}|+L_d - u_{i,d} - |q_{i,d}| - \gamma_{i,d} )^+ & i\in[n],d\in[D] \\ q_{i,d} &=\frac{1}{2}\mathop{\mathrm{sign}}(\zeta_{i,d} + b_{i,d})(|\zeta_{i,d} + b_{i,d}|+L_d - u_{i,d} - |p_{i,d}| - \gamma_{i,d} )^+ & i\in[n],d\in[D] \\ u_{i,d} & = (-\gamma_{i,d}- |p_{i,d}| - |q_{i,d}| + L_d )^+ & i\in[n],d\in[D]\\ s_{i,j} &= (-\alpha_{i,j}- \hat{y}_i + \hat{y}_j - z_i + z_j + \langle \boldsymbol{a}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle )^+ & i,j \in [n] \\ t_{i,j} &= (-\beta_{i,j}- z_i + z_j + \langle \boldsymbol{b}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle )^+ & i,j \in [n] \\ \alpha_{i,j} &= \alpha_{i,j} + s_{i,j} + \hat{y}_i - \hat{y}_j + z_i - z_j - \langle \boldsymbol{a}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i,j \in [n] \\ \beta_{i,j} &= \beta_{i,j} + t_{i,j} + z_i - z_j - \langle \boldsymbol{b}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i,j \in [n] \\ \gamma_{i,d} &= \gamma_{i,d} + u_{i,d} + |p_{i,d}| + |q_{i,d}| - L_d & i\in[n],d\in[D] \\ \eta_{i,d} &= \eta_{i,d} + a_{i,d} - p_{i,d} & i\in[n],d\in[D] \\ \zeta_{i,d} &= \zeta_{i,d} + b_{i,d} - q_{i,d} & i\in[n],d\in[D] \end{align}\]

Output: the fitted function \[\hat{f}(\boldsymbol{x}) \triangleq \max_{i\in [n]} \big(\langle \boldsymbol{a}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + \hat{y}_i + z_i\big) -\max_{i\in[n]} \big(\langle \boldsymbol{b}_i, \boldsymbol{x}-\boldsymbol{x}_i \rangle + z_i\big).\]
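To illustrate how the output of this routine is used downstream, the following minimal NumPy sketch (with hypothetical helper and argument names, not the authors' released code) evaluates the fitted PLDC function \(\hat{f}\) at a new point, given the arrays produced by the updates above.

```python
import numpy as np

def predict_pldc(x, X, a, b, y_hat, z):
    """Evaluate the fitted difference of two max-affine functions at x.

    X: (n, D) training inputs; a, b: (n, D) fitted slopes;
    y_hat, z: (n,) fitted values from the ADMM routine above."""
    # <a_i, x - x_i> + y_hat_i + z_i, maximised over the anchor points x_i
    phi1 = np.max(a @ x - np.einsum("ij,ij->i", a, X) + y_hat + z)
    # <b_i, x - x_i> + z_i, maximised over the anchor points x_i
    phi2 = np.max(b @ x - np.einsum("ij,ij->i", b, X) + z)
    return phi1 - phi2
```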

12.1 Algorithm Derivation↩︎

For the derivation of the ADMM algorithm, we start with a modified version of the empirical risk minimization in (5), where we use a component-wise regularisation: \[\begin{align} \| f\| \triangleq \inf_{\phi_1,\phi_2}&\sum_{d=1}^D\sup_{\boldsymbol{x}} \sup_{ v_{i,d} \in \frac{\partial_*}{\partial {x_d}} \phi_i(\boldsymbol{x})} | v_{1,d}| + |v_{2,d}| \\ & \textrm{s.t. } \phi_1, \phi_2 \text{ are convex, } \phi_1 - \phi_2 = f. \end{align}\] Hence the new empirical risk minimization becomes: \[\begin{align} &\qquad \min_{\hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i, L} \sum_{i = 1}^n ( \hat{y}_i - y_i)^2 + \lambda \sum_{d=1}^D L_d\\ &\textrm{s.t.} \begin{cases} \hat{y}_i - \hat{y}_j + z_i - z_j\geq \langle \boldsymbol{a}_j, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle & i,j \in [n] \\ z_i - z_j \geq \langle \boldsymbol{b}_j, \boldsymbol{x}_i-\boldsymbol{x}_j \rangle & i,j \in [n] \\ |a_{i,d}| + |b_{i,d}| \leq L_d & \quad i\in[n] ,d\in[D] \end{cases}\notag \end{align}\] First, we convert the inequality constraints into equality constraints, and introduce auxiliary variables to simplify the subsequent ADMM blocks. \[\begin{align} \label{program:SRM95Admm951} &\qquad \min_{\substack{\hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i, L\\ \boldsymbol{p}_i, \boldsymbol{q}_i, s_{i,j}, t_{i,j}, u_{i,d}}} \sum_{i = 1}^n (\hat{y}_i - y_i)^2 + \lambda \sum_{d=1}^D L_d\\ &\textrm{s.t.} \begin{cases} s_{i,j} + \hat{y}_i - \hat{y}_j + z_i - z_j - \langle \boldsymbol{a}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle = 0 & i,j \in [n] \\ t_{i,j} + z_i - z_j - \langle \boldsymbol{b}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle = 0 & i,j \in [n] \\ |p_{i,d}| + |q_{i,d}| + u_{i,d} - L_d = 0 & i\in[n],d\in[D] \\ a_{i,d} = p_{i,d} & i\in[n],d\in[D]\\ b_{i,d} = q_{i,d} & i\in[n],d\in[D]\\ s_{i,j}, t_{i,j}, u_{i,d} \geq 0 & i,j \in [n],d\in[D] \\ \end{cases}\notag \end{align}\tag{11}\] Then we write the augmented Lagrangian of the ADMM as below. \[\begin{align} \label{program:SRM95Admm952} &\qquad \min \ell( \hat{y}_i, z_i, \boldsymbol{a}_i, \boldsymbol{b}_i ,\boldsymbol{p}_i, \boldsymbol{q}_i, L, s_{i,j}, t_{i,j}, \boldsymbol{u}_i, \alpha_{i,j}, \beta_{i,j}, \boldsymbol{\gamma}_i, \boldsymbol{\eta}_i, \boldsymbol{\zeta}_i)\notag\\ & =\quad \sum_{i = 1}^n (\hat{y}_i - y_i)^2 +\lambda \sum_{d=1}^D L_d \notag\\ & + \quad \sum_{i}\sum_{j} \alpha_{i,j}\Big(s_{i,j} + \hat{y}_i - \hat{y}_j + z_i - z_j - \langle \boldsymbol{a}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle\Big) + \frac{\rho}{2}\Big(s_{i,j} + \hat{y}_i - \hat{y}_j + z_i - z_j - \langle \boldsymbol{a}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle\Big)^2 \notag\\ &+ \quad \sum_{i}\sum_{j} \beta_{i,j}\Big(t_{i,j} + z_i - z_j - \langle \boldsymbol{b}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle\Big) + \frac{\rho}{2}\Big(t_{i,j} + z_i - z_j - \langle \boldsymbol{b}_i, \boldsymbol{x}_i-\boldsymbol{x}_j\rangle\Big)^2 \notag\\ &+ \quad \sum_i\sum_d\gamma_{i,d}\Big (u_{i,d} - L_d + |p_{i,d}| + |q_{i,d}| \Big) + \frac{\rho}{2} \Big(u_{i,d} - L_d + |p_{i,d}| + |q_{i,d}| \Big)^2 \notag\\ &+ \quad\sum_{i}\sum_{d} \eta_{i,d}\Big(a_{i,d} - p_{i,d}\Big) + \frac{\rho}{2}\Big(a_{i,d} - p_{i,d}\Big)^2\notag\\ &+ \quad \sum_{i}\sum_{d} \zeta_{i,d}\Big(b_{i,d} - q_{i,d}\Big) + \frac{\rho}{2}\Big(b_{i,d} - q_{i,d}\Big)^2\notag \end{align}\tag{12}\] Taking gradients of the augmented Lagrangian with respect to each block of variables and solving for their roots yields the parallel updates listed at the start of Section 12.
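As an alternative to the ADMM updates, the inequality-constrained program above can also be handed to a generic QP solver. The sketch below (a hypothetical `fit_pldc_qp` helper using cvxpy, not the authors' implementation) encodes the same objective and constraint families; since it instantiates \(O(n^2)\) constraints explicitly, it is only practical for small \(n\), which is precisely what motivates the parallel ADMM scheme.

```python
import cvxpy as cp
import numpy as np

def fit_pldc_qp(X, y, lam):
    """Minimal cvxpy sketch of the component-wise regularised ERM above."""
    n, D = X.shape
    y_hat, z = cp.Variable(n), cp.Variable(n)
    a, b = cp.Variable((n, D)), cp.Variable((n, D))
    L = cp.Variable(D)
    cons = []
    for i in range(n):
        for j in range(n):
            dx = X[i] - X[j]
            # subgradient (DC) constraints of the program above
            cons.append(y_hat[i] - y_hat[j] + z[i] - z[j] >= a[j] @ dx)
            cons.append(z[i] - z[j] >= b[j] @ dx)
        # component-wise bound |a_{i,d}| + |b_{i,d}| <= L_d
        cons.append(cp.abs(a[i]) + cp.abs(b[i]) <= L)
    obj = cp.Minimize(cp.sum_squares(y_hat - y) + lam * cp.sum(L))
    cp.Problem(obj, cons).solve()
    return y_hat.value, z.value, a.value, b.value, L.value
```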

13 Datasets↩︎

Table 1: Datasets used for the regression task. The entries in the first column are linked to repository copies of the datasets. The final column indicates whether the DC-function-based method outperforms all competing methods.
Data set | Labels | Features | Did DC do Better?
Rock Samples | 16 | 3 |
Acetylene | 16 | 3 |
Moore | 20 | 5 |
Reaction | 13 | 3 |
Car small | 100 | 6 |
Cereal | 77 | 12 |
Boston Housing | 506 | 13 |
Forest Fires | 517 | 12 |


Table 2: Datasets used for the multi-class classification task. The entries in the first column are linked to the UCI Machine Learning Repository copies of the datasets. The final column indicates whether the DC-function-based method outperforms all competing methods.
Data set | Labels | Features | Did DC do Better?
BPD | 223 | 1 |
Iris | 150 | 4 |
Balance Scale | 625 | 4 |
Ecoli | 337 | 7 |
Wine | 178 | 13 |
Heart Disease | 313 | 13 |
Ionosphere | 351 | 34 |


References↩︎

[1]
Yuille, A. L. and Rangarajan, A. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, pp. 1033–1040, 2002.
[2]
Horst, R. and Thoai, N. V. DC programming: overview. Journal of Optimization Theory and Applications, 103 (1): 1–43, 1999.
[3]
Gasso, G., Rakotomamonjy, A., and Canu, S. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Transactions on Signal Processing, 57 (12): 4686–4698, 2009.
[4]
Piot, B., Geist, M., and Pietquin, O. Difference of convex functions programming for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2519–2527, 2014.
[5]
Toriello, A. and Vielma, J. P. Fitting piecewise linear continuous functions. European Journal of Operational Research, 219 (1): 86–95, 2012.
[6]
Holmes, C. and Mallick, B. Bayesian regression with multivariate linear splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63 (1): 3–17, 2001.
[7]
Hildreth, C. Point estimates of ordinates of concave functions. Journal of the American Statistical Association, 49 (267): 598–619, 1954.
[8]
Holloway, C. A. On the estimation of convex functions. Operations Research, 27 (2): 401–407, 1979.
[9]
Magnani, A. and Boyd, S. P. Convex piecewise-linear fitting. Optimization and Engineering, 10 (1): 1–17, 2009.
[10]
Hannah, L. A. and Dunson, D. B. Multivariate convex regression with adaptive partitioning. The Journal of Machine Learning Research, 14 (1): 3261–3294, 2013.
[11]
Ghosh, A., Pananjady, A., Guntuboyina, A., and Ramchandran, K. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019.
[12]
Balázs, G., György, A., and Szepesvári, C. Near-optimal max-affine estimators for convex regression. In AISTATS, 2015.
[13]
Balázs, G. Convex Regression: Theory, Practice, and Applications. PhD thesis, University of Alberta, 2016.
[14]
Wahba, G. Spline models for observational data, volume 59. SIAM, 1990.
[15]
Friedman, J. H. et al. Multivariate adaptive regression splines. The Annals of Statistics, 19 (1): 1–67, 1991.
[16]
Cui, Y., Pang, J.-S., and Sen, B. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28 (4): 3344–3374, 2018.
[17]
Hartman, P. et al. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9 (3): 707–713, 1959.
[18]
Bačák, M. and Borwein, J. M. On difference convexity of locally Lipschitz functions. Optimization, 60 (8-9): 961–978, 2011.
[19]
Kripfganz, A. and Schulze, R. Piecewise affine functions as a difference of two convex functions. Optimization, 18 (1): 23–29, 1987.
[20]
Ovchinnikov, S. Max-min representation of piecewise linear functions. Contributions to Algebra and Geometry, 43 (1): 297–302, 2002.
[21]
Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3 (Nov): 463–482, 2002.
[22]
Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
[23]
Boyd, S., Parikh, N., and Chu, E. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011.
[24]
Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114 (525): 318–331, 2019.
[25]
Shrivastava, A. and Li, P. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2321–2329. Curran Associates, Inc., 2014.
[26]
Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
[27]
Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

  1. [13] presents an argument with the \(2\)-norm of subgradients bounded, but this can be easily modified to the case of bounded \(1\)-norm under bounds on \(\|\boldsymbol{x}\|_\infty\).↩︎

  2. https://github.com/Siahkamari/Piecewise-linear-regression↩︎