July 28, 2022
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit’s update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning. Experimental results validate the effectiveness of the proposed method compared to the baselines.
,
,
adaptive systems, model fitting, estimation theory, learning theory, overparameterized models, neural networks technology
While large machine learning models have been successful in numerous domains, their training is computationally demanding and requires iterating over the entire dataset multiple times. This hinders their deployment in many real-world settings such as robot learning, autonomy, and online decision making, where new datapoints are collected over time or become available sequentially. In such settings, storing all the datapoints and retraining the model at every step on all the data is extremely costly and often not feasible. In addition, in certain applications, storing the data may be prohibited for privacy reasons.
Thus, it is very desirable to come up with algorithms that can learn incrementally or in an online fashion, rather than by repeatedly iterating over the entire data many times. However, it is well-known that deep neural networks are prone to significantly forgetting past information while learning new data, which is an issue referred to as catastrophic forgetting [3]. This begs the question:
Can we learn streaming data efficiently without forgetting or retraining on previous data?
This is a setting often referred to as one-pass learning. More specifically, one-pass learning concerns the setting where () the algorithm makes an update using the current datapoint without direct access to previous data; () the new updates do not significantly affect the predictions on the previous data; and () the computational and memory costs of each update must not grow with the iteration count.
One-pass learning and its variants have received increasing attention, and several studies have attempted to address them in various contexts [4]–[8]. For instance, [9] studied learning the ImageNet dataset in a single pass by revisiting some “important” previous datapoints at each learning step, and [10] investigated learning incremental “batches” on a large scale by correcting the classifier’s bias towards new data. However, both methods train on previous data and are not adequate for one-pass learning. On the other hand, [11] proposed an effective one-pass learning method for support vector machines. However, their method is tailored to the specific setting of support vector machines. Further, [12] proposed a one-pass deep learning algorithm, but it relies on a specific network architecture and is vulnerable to forgetting the previous data unless the data is consistently arriving from the same distribution.
One-pass learning is also closely related to a classical problem studied in the context of control and estimation theory. More specifically, a classical algorithm known as recursive least-squares (RLS) (see, e.g., [13]) tackles one-pass learning for linear models (as elaborated in Section 2.2). However, there are two limitations of the standard RLS: () it suffers from high computational and memory costs, and () it is not well-suited for the overparameterized setting where zero training loss is desired. The main focus of this work is to develop a method that overcomes these limitations. Related to the overparameterized setting, a few works have recently discussed utilizing RLS to train popular deep neural networks such as FNN, CNN, RNN, and LSTM [14], [15]. However, these works are empirical in nature and consider multi-pass learning with mini-batches. Moreover, the theoretical properties of RLS for one-pass learning are not studied in those works.
Our main contributions can be summarized as follows.
We develop Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning in the overparameterized setting, which fits new data on the fly while updating the parameters in a direction that causes the least change to the predictions on previous data. Our algorithm uses memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA) to extract the essential information for the update. We further generalize our algorithm to learn from batches of data, which can be viewed as a continual learning method. (§3)
Through the proposed method, we establish an interesting connection between two different algorithms from adaptive filtering and machine learning, namely, the recursive least-squares (RLS) algorithm and the orthogonal gradient descent (OGD). Our method updates the parameters in a direction orthogonal to past gradients, thereby ensuring minimal disruption of previous predictions, similar to OGD. We show that this update leads to an operation similar to the RLS algorithm, but with significantly improved memory and computational cost. (§4)
We characterize the behavior of the proposed method in the overparameterized feature-based linear setting. We prove that ORFit, in one pass, finds the same parameter vector that the standard multi-pass SGD would asymptotically converge to. Additionally, we show that using the principal components to reduce memory usage is minimax optimal in terms of forgetting. (§4)
We demonstrate the practicality of our approach and corroborate our theoretical findings through various experiments. (§5)
We discuss extensions of our results to overparameterized nonlinear models, relevant for deep learning. (§6)
One-pass learning shares some similarities with other settings that deal with streaming data such as online learning and incremental learning. While these terms are often inconsistently defined in the literature, one may distinguish them from one-pass learning based on whether we learn a single datapoint or a batch of data at a time [16]. Additionally, although both of these approaches learn from streaming data, they often make assumptions about the distribution of the data. Compared to these settings, one-pass learning requires explicit efforts to not alter the predictions on the previous data while learning new data. Thus, it aims to preserve the predictions even when the new data comes from a completely different distribution. Another related setting is continual learning where batches of data from different tasks are sequentially learned [17]. Note that a one-pass learning algorithm is applicable to this setting to learn multiple tasks without the knowledge of the boundaries between the tasks.
Let \(f(x;w)\in\mathbb{R}^c\) be a model that the agent is trying to fit, where \(x\in\mathcal{X}\subset\mathbb{R}^d\) is the input and \(w\in\mathbb{R}^p\) is the parameter (weight) vector. Consider a sequentially arriving stream of data \(\left((x_{k}, y_{k})\right)_{k=1}^{K}\) where \(x_{k}\in\mathcal{X}\) and \(y_{k}\in\mathcal{Y}\subset\mathbb{R}^c\). In overparameterized models, we have \(p\geq K\) (and often \(p\gg K\)). For a loss function \(\ell(\cdot,\cdot)\), let \(f_{k}(w)\vcentcolon=f(x_{k};w)\) and \(\ell_{k}(w)\vcentcolon=\ell(y_{k}, f_{k}(w))\). Then, one-pass learning considers the setting where, given an initial parameter \(w_0\in\mathbb{R}^p\), it updates the parameter \(w_i\in\mathbb{R}^p\) after the new data \((x_{i}, y_{i})\) arrived without revisiting previous datapoints \(\{(x_{k}, y_{k})\}_{k=1}^{i-1}\).
For concreteness, we first focus on an overparameterized feature-based linear model \(f(x;w)=\Phi(x)^\top w=[\phi_1(x) \cdots \phi_c(x)]^\top w\) with feature maps \(\phi_j:\mathcal{X}\rightarrow\mathbb{R}^{p}\) for \(j\in[c]:=\{1,\dots,c\}\). We then discuss the extension of the results to nonlinear models (e.g., overparameterized architectures in deep learning) in Section 6.
Before introducing our algorithm and the results, we briefly review two related algorithms capable of one-pass learning, proposed in two different literatures.
First, we briefly review recursive least-squares (RLS) from the control/estimation theory literature (refer to [13] for details). In its standard form, RLS considers a linear model \(f(x;w) = w^\top x\in\mathbb{R}\) for a stream of data \(\left((x_k, y_k)\right)_{k=1}^K\) with \(x\in\mathbb{R}^p\), \(y_k\in\mathbb{R}\). At every step \(i\), it aims to find a parameter vector that solves the following regularized least-squares problem: \[\label{eq:ls} w^{(RLS)}_i = \arg\min_w \sum_{k=1}^{i} (y_k - w^\top x_k)^2 + \lVert w-w_0 \rVert^2_\Pi,\tag{1}\] where \(\lVert x \rVert_\Pi\vcentcolon=\sqrt{x^\top\Pi x}\) for a \(p\times p\) positive-definite matrix \(\Pi\) and \(w_0\) is an initial parameter estimate. Note that the system is underdetermined/overparameterized, and the regularization term in 1 is necessary for the solution to be uniquely defined. While there is a closed-form solution for 1 given by \(w^{(RLS)}_i = (\Pi + X_i^\top X_i)^{-1}(X_i^\top Y_i+\Pi w_0)\) with \(X_i = [x_1\;x_2 \dots \;x_i]^\top\) and \(Y_i = [y_1\;y_2 \dots \;y_i]^\top\), computing the solution directly requires storing all the previous data as well as recomputing the inverse of the covariance matrix for every new datapoint. RLS bypasses this issue by computing the new solution \(w^{(RLS)}_i\) of 1 recursively from \(w^{(RLS)}_{i-1}\) and \((x_i,y_i)\).
We elaborate the algorithm more formally for a general version of RLS called exponentially weighted recursive least-squares (EW-RLS) [13]. Consider the following problem: \[\label{eq:weighted95ls} w^{(RLS)}_i = \arg\min_w \sum_{k=1}^{i} \lambda^{i-k}(y_k - w^\top x_k)^2 + \lambda^i \lVert w-w_0 \rVert^2_\Pi,\tag{2}\] with a forgetting factor \(0<\lambda\leq 1\). Note that this reduces to the problem of the vanilla RLS 1 when \(\lambda = 1\). The exact solution of it is recursively updated as follows: \[\begin{cases} w^{(RLS)}_i = w^{(RLS)}_{i-1} + \dfrac{P_{i-1} x_i}{\lambda^i + x_i^\top P_{i-1} x_i} (y_i - x_i^\top w^{(RLS)}_{i-1})\,,\\ P_i = P_{i-1} - \dfrac{P_{i-1} x_i x_i^\top P_{i-1}}{\lambda^i + x_i^\top P_{i-1} x_i}, \end{cases} \label{eq:rls95update}\tag{3}\] with \(w^{(RLS)}_0=w_0\) and \(P_0 = \Pi^{-1}\). Here, \(P_i\) can be alternatively written as \(P_i = [\Pi + X_i^\top \Lambda_i X_i]^{-1}\) where \(\Lambda_i = \mathrm{diag}(\lambda^{-1}, \lambda^{-2}, \dots, \lambda^{-i})\). Note that the update rule 3 of EW-RLS can be highly inefficient as its memory and computational complexities are \(O(p^2)\), which is prohibitive for one-pass learning in overparameterized models, where we often have \(p\gg K\). In contrast, we develop an algorithm with linear complexities (see Section 4.1 for more details).
An algorithm called orthogonal gradient descent (OGD) [18] has been proposed in the context of machine learning for a different but related problem. More specifically, [18] considers a continual learning setting in which tasks \(\{T_1, T_2, \dots \}\) arrive sequentially, and each task consists of a set of datapoints. Continual learning can be understood as a “batch” version of one-pass learning. At a high level, when the \(i\)-th task \(T_i\) arrives, OGD updates the parameter using new samples from \(T_i\) in a way that causes minimal changes to the predictions for previous tasks \(\{T_b\}_{b=1}^{i-1}\). The gradient of the model \(f(x;w)\in\mathbb{R}\) on a datapoint \(x_j\) with respect to the parameter, \(\nabla_w f(x_j;w)\), is the direction in the parameter space that causes the most change to the prediction on that datapoint. Thus, moving orthogonal to this direction, locally, keeps the prediction unchanged, which is the main idea behind OGD. More formally, the update direction is computed via projecting the current gradient2 \(g\) onto the subspace orthogonal to \(\mathcal{G}\vcentcolon=\text{span}\{\bigcup_{b=1}^{i-1}\{\nabla_w f(x;w_b)\}_{(x,y)\in T_b}\}\): \[\tilde{g} = g - \sum_{v\in S} \mathrm{proj}_{v} (g), \label{eq:ogd95projection}\tag{4}\] where \(S\) is an orthogonal basis for \(\mathcal{G}\) and \(\mathrm{proj}_{v}(g) \vcentcolon= (g^\top v/\lVert v \rVert^2) v = (vv^\top/\lVert v \rVert^2) g\). The orthogonal basis \(S\) is incrementally updated through the Gram-Schmidt procedure.
In this section, we propose a one-pass learning algorithm called orthogonal recursive fitting (ORFit). The algorithm consists of three main components: () orthogonal update of the parameter motivated by OGD; () interpolation (perfect fitting) of new data in a single step; and () efficient use of memory via incremental summary. We describe the details of each component in what follows, starting from (i) and (ii). See Fig. 1 for an illustration.
We start by considering OGD directly applied to the one-pass learning setting with a scalar-output model \(f(x;w)\in\mathbb{R}\), i.e., \(c=1\). By treating each task \(T_k\) in the continual learning setting as consisting of a single datapoint, OGD will run multiple gradient descent steps on the single datapoint. While perfect fitting/interpolation is often desired in highly overparameterized models [19], [20], it will take many iterations for OGD to perfectly fit the datapoint. Instead, inspired by the trends in meta-learning, we consider a one-step learning scheme that only runs a single gradient step to interpolate the new datapoint. We first begin with the following result that serves as a building block for our algorithm design; see Appendix 9 for a proof.
Lemma 1. Consider a feature-based linear model \(f(x;w)=\phi(x)^\top w\in\mathbb{R}\), and let \(\tilde{g}\) be the projection defined by 4 of any vector \(g\in\mathbb{R}^p\). Then, for any step size \(\eta\in\mathbb{R}\), the new parameter \(w' = w - \eta \tilde{g}\) preserves the predictions on the previous datapoints, i.e., \(f(x;w') = f(x;w)\) for all \((x,y)\in\bigcup_{k=1}^{i-1}T_k\).
The main takeaway of Lemma 1 is that the predictions for the previous datapoints do not change when we update the model along the direction \(\tilde{g}\). Hence, we may choose \(\eta\) so that the updated parameter \(w'\) can perfectly fit the new datapoint, say \((x', y')\), as \(y' = \phi(x')^\top w' = \phi(x')^\top (w - \eta \tilde{g})\). Following this principle, a straightforward calculation yields the following update rule: \[\label{eq:orfit195infmem} { \begin{cases} \tilde{g}_{i-1} = \nabla f_{i}(w_{i-1}) - \sum_{v\in S_{i-1}} \mathrm{proj}_{v} (\nabla f_{i}(w_{i-1}))\,,\\ w_i = w_{i-1} - \eta_{i-1} \tilde{g}_{i-1}\,,\\ S_i = S_{i-1} \bigcup \{\tilde{g}_{i-1}\}, \end{cases}}\tag{5}\] for \(i\geq 1\) where \(S_0\) is the empty set, \(w_0\) is the initial weight vector, and the optimal step size is chosen as \[\label{eq:stepsize} \eta_{i-1} = \frac{1}{\nabla f_i(w_{i-1})^\top \tilde{g}_{i-1}} (f_i(w_{i-1}) - y_i),\tag{6}\] assuming \(\tilde{g}_{i-1}\neq 0\). See Algorithm 2 for the detailed procedure. Compared to OGD, the basis is kept orthonormal for more efficient projection. For intuition, we note that the optimal step size 6 is typically small for highly overparameterized models; there are many parameter vectors in the vicinity of the current solution that perfectly fit the new datapoint [21]–[23].
Remark 1 (Computational overhead). It is important to note that all the quantities appearing in 6 are typically available in the gradient-based optimization setting, and hence, there is no computational overhead for computing the stepsize 6 .
Remark 2 (Nonlinear models). Although the update rule 5 is derived based on linear models, it can also be applied to highly overparameterized nonlinear models such as deep neural networks, as we will discuss in Section 6.
Another distinction between the update rule 5 and OGD lies in the update of the orthogonal basis \(S_i\): OGD utilizes “fresher” gradient at the updated parameter \(w_i\) by \(S_i = S_{i-1} \bigcup \{\nabla f_{i}(w_{i}) - \sum_{v\in S_{i-1}} \mathrm{proj}_{v} (\nabla f_{i}(w_{i}))\}\). Although the two bases actually span the same subspace for linear models, it turns out that ORFit in 5 leads to a natural generalization for nonlinear models; see Section 6 for details.
Now, we extend the update rule 5 to learn a general vector-output model \(f(x;w)\in\mathbb{R}^c\), which subsumes the scalar-output case (\(c=1\)).
Remark 3 (Other forms of vector-output model). Note that we could also consider a vector-output model \(f(x;w_1,\dots,w_c)=[\phi_1(x)^\top w_1 \cdots \phi_c(x)^\top w_c]^\top\) which has a separate parameter vector \(w_j\in\mathbb{R}^p\) for each output dimension \(j\in[c]\). Especially, with a shared feature vector \(\phi(x)=\phi_1(x)=\cdots=\phi_c(x)\in\mathbb{R}^p\), the model could be \(f(x;w_1,\dots,w_c)=[w_1 \cdots w_c]^\top \phi(x)\). An example is learning with a pre-trained neural network model by fine-tuning only the last fully connected layer while freezing the earlier layers. In this case, the feature vector \(\phi(x)\) corresponds to the output of the penultimate layer. However, such models could be dealt with as multiple separate scalar-output models, one for each output dimension, with their own parameters. Then, ORFit for \(c=1\) is enough to update them independently.
As in the update rule 5 , we learn the vector-output model \(f(x;w)=\Phi(x)^\top w\in\mathbb{R}^c\) in one pass by making orthogonal updates of the parameter and interpolating new datapoint in a single step. At each iteration \(i\), we first choose an update direction by projecting the model gradients onto the subspace orthogonal to the previous gradients. Unlike in the scalar-output case, we have multiple gradient directions, one for each output, in the form of a Jacobian matrix. Thus, we project each gradient direction onto the orthogonal subspace and find an update direction as a linear combination of the projected gradients. This update direction is still orthogonal to the previous gradients to preserve the predictions on previous data.
Then, we choose the coefficients of the linear combination to interpolate the new datapoint, similarly to how we have computed the optimal step size in 6 . Assuming that new model gradients are not spanned by the previous gradients, the extended update rule is
\[\label{eq:orfit295infmem} \begin{cases} G_{i-1} = \left(\frac{\partial f_{i}(w_{i-1})}{\partial w}\right)^\top,\\ \tilde{G}_{i-1} = G_{i-1} - \sum_{v\in S_{i-1}} \mathrm{proj}_{v} (G_{i-1})\,,\\ w_i = w_{i-1} - \tilde{G}_{i-1} (G_{i-1}^\top \tilde{G}_{i-1})^{-1} (f_i(w_{i-1})-y_i)\,,\\ S_i = S_{i-1} \bigcup \{\mathrm{orth}(\mathrm{col}(\tilde{G}_{i-1}))\}, \end{cases}\tag{7}\] where \(\mathrm{proj}_{v}(G)\) is the column-wise projection of \(G\) onto the direction of \(v\), \(\mathrm{col}(G)\) is the set of the columns of \(G\), and \(\mathrm{orth}(S)\) is an orthonormalization of \(S\). This update rule reduces to 5 when \(c=1\). The linearity of the model with respect to the parameters ensures that this update step fits the new datapoint. We theoretically analyze this update in detail in Section 4.2.
Although the update rule 7 does not access previous datapoints, they still require storing the orthogonal basis \(S_i\), whose size grows linearly in the number of visited datapoints. This is not desirable in practice when one needs to train the model on a large dataset. We address this issue next.
In this section, we overcome the aforementioned memory issue by utilizing the structure of the streaming dataset. The main idea is to approximate the orthogonal basis \(S\) in a lower dimension using an incremental principal component analysis (IPCA) algorithm, known as the sequential Karhunen–Loeve (SKL) algorithm proposed in [24]. IPCA is a memory-efficient variant of PCA that enables sequential update for streaming/large datasets. Let us formally describe how we utilize IPCA to incrementally approximate the orthogonal basis.
Consider the past model gradients \(\cup_{k=1}^i \mathrm{col}(G_i)\) spanned by the orthogonal basis \(S\) as in 7 . Let the singular value decomposition (SVD) of \(A=[G_1\; G_2\,\dots G_i]\) be \(A=U\Sigma V^\top\). Here, the crucial information of the SVD is the left-singular vectors \(\mathrm{col}(U)\) that form an orthonormal basis for \(\mathrm{span}(S)\). This information can be used to come up with a rank-\(m\) approximation of \(S\) by using the principal components corresponding to the top \(m\) singular values. As will be shown, this choice of approximation is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. We will formally establish this in Section 4.3.
Now suppose that the orthogonal basis \(S\) is augmented with the new gradients \(G\) as in 7 . We first want to efficiently update \(U\) for the new basis instead of recomputing the SVD from scratch. The orthogonal components of \(G\) not spanned by \(S\) can be expressed as \(\tilde{G}=G-U(U^\top G)\). Letting \(G_\text{orth}\) denote the column-wise orthonormalization of \(\tilde{G}\), the new basis matrix can be represented as: \[\begin{align} \begin{bmatrix} A & G \end{bmatrix} &= \begin{bmatrix} U\Sigma V^\top & (UU^\top + G_\mathrm{orth} G_\mathrm{orth}^\top) G \end{bmatrix}\\ &= \begin{bmatrix} U & G_\mathrm{orth} \end{bmatrix} \begin{bmatrix} \Sigma & U^\top G\\ 0 & G_\mathrm{orth}^\top G \end{bmatrix} \begin{bmatrix} V^\top & 0\\ 0 & I \end{bmatrix} \\ &= \left( \begin{bmatrix} U & G_\mathrm{orth} \end{bmatrix} \tilde{U} \right) \tilde{\Sigma} \left( \tilde{V}^\top \begin{bmatrix} V^\top & 0\\ 0 & I \end{bmatrix} \right), \label{eq:svd95update} \end{align}\tag{8}\] where \(\tilde{U}\tilde{\Sigma}\tilde{V}^\top\) is the SVD of \(\begin{bmatrix}\Sigma & U^\top G\\ 0 & G_\mathrm{orth}^\top G \end{bmatrix}\). Then, 8 is the SVD of the new basis matrix. Hence, to update \(U\), one can directly use the information from the previous iteration, namely \(U\) and \(\Sigma\). The important aspect here is that the update can be made without storing \(V\) and without having to recompute the SVD of the new gradient matrix. Finally, one can store only the top \(m\) singular values in \(\Sigma\) and their corresponding components in \(U\). By repeatedly applying this IPCA algorithm in addition to 7 , we obtain orthogonal recursive fitting (ORFit). See Algorithm 3 for the detailed procedure.
The main advantage of ORFit is its computational/memory efficiency. By only storing the top \(m\) components, we can reduce the memory size at each iteration \(i\) from \(O(icp)\) to \(O(mp)\). Moreover, the additional computational overhead to perform IPCA as well as the total time complexity of ORFit at each step is \(O((m+c)^2p)\), while \(O(icp)\) is required without IPCA. Hence, ORFit can reduce both the computation and the memory complexity, especially for overparameterized models with large \(p\), by appropriately choosing \(m\).
In addition, we further extend the algorithm to learn batches, i.e., chunks, of data, which can be considered as a continual learning method. As we will see, it turns out that batch learning can be accomplished in the same manner as the vector-output model case. Let \(f(x;w)\in\mathbb{R}^c\) be a model that we want to fit, where \(x\in\mathcal{X}\subset\mathbb{R}^d\) is the input and \(w\in\mathbb{R}^p\) is the parameter (weight) vector. Consider a sequentially arriving stream of batched data \((\{(x_{b,k}, y_{b,k})\}_{k=1}^{n_b})_{b=1}^{B}\) where \(x_{b,k}\in\mathcal{X}\) and \(y_{b,k}\in\mathcal{Y}\subset\mathbb{R}^c\). In overparameterized models, we have \(p\geq \sum_{b=1}^B n_b\) (and often \(p\gg \sum_{b=1}^B n_b\)). For a loss function \(\ell(\cdot,\cdot)\), let \(f_{b,k}(w)\vcentcolon=f(x_{b,k};w)\) and \(\ell_{b,k}(w)\vcentcolon=\ell(y_{b,k}, f_{b,k}(w))\). Then, the batch/continual learning problem considers the setting where, given an initial parameter \(w_0\in\mathbb{R}^p\), we update the parameter \(w_i\in\mathbb{R}^p\) after the new data batch \(\{(x_{i,k}, y_{i,k})\}_{k=1}^{n_i}\) arrived without revisiting previous datapoints \(\left\{\{(x_{b,k}, y_{b,k})\}_{k=1}^{n_b}\right\}_{b=1}^{i-1}\).
The new data batch can be learned as if we are fitting an augmented single datapoint: \[\underbrace{[y_{i,1}; \cdots; y_{i,n_i}]}_{=:Y_i} = \underbrace{[f_{i,1}(w); \cdots; f_{i,n_i}(w)]}_{=:F_i(w)},\] where \([(\cdot) ; (\cdot)]\) denotes row-wise concatenation. Note that the batch size \(n_i\) affects the effective output dimension of \(Y_i\) and \(F_i\). Then, we can achieve this interpolation through the following update rule that generalizes 7 : \[\label{eq:borfit95infmem} \begin{cases} G_{i-1} = \Big[ \left(\frac{\partial f_{i,1}(w_{i-1})}{\partial w}\right)^\top \cdots \left(\frac{\partial f_{i,n_i}(w_{i-1})}{\partial w}\right)^\top \Big],\\ \tilde{G}_{i-1} = G_{i-1} - \sum_{v\in S_{i-1}} \mathrm{proj}_{v} (G_{i-1})\,,\\ w_i = w_{i-1} - \tilde{G}_{i-1} (G_{i-1}^\top \tilde{G}_{i-1})^{-1} (F_i(w_{i-1}) - Y_i)\,,\\ S_i = S_{i-1} \bigcup \{\mathrm{orth}(\mathrm{col}(\tilde{G}_{i-1}))\}, \end{cases}\tag{9}\] as presented in Algorithm 4. Since this method also faces the growing memory issue, we can utilize IPCA to accommodate memory constraints. See Algorithm 5 for the detailed procedure.
In this section, we provide the theoretical properties of the proposed method. We begin by discussing a formal connection between the proposed method and RLS.
It turns out ORFit applied to a linear model \(f(x;w) = x^\top w \in\mathbb{R}\) corresponds to an extreme case of the well-known RLS method, as formally described in the following result; see Appendix 10 for a proof.
Proposition 1. Consider a linear overparameterized (\(p\geq K\)) model \(f(x;w) = x^\top w \in\mathbb{R}\) and a data sequence \(\left((x_k, y_k)\right)_{k=1}^{K}\). Let \(w_0\) be the initialization and \(m\) be the memory limit for ORFit. Then, at each iteration \(i\leq m\), the update rule of ORFit results in the same parameter vector as the EW-RLS update rule 3 does with \(\lambda=0\), \(\Pi=I\), and initialization \(w_0\). In this setting, \(P_i\) in 3 is the projection matrix onto the subspace orthogonal to \(\mathrm{span}\{\nabla f_k(w_{k-1})\}_{k=1}^{i}\).
Remark 4. The optimization problem of EW-RLS 2 is not well-defined for \(\lambda=0\). That said, the update rule 3 can be still computed for \(\lambda=0\), and ORFit finds the same solution as the limiting case of EW-RLS.
Remark 5. ORFit in 5 has \(O(ip)\) time and memory complexities, compared to those of \(O(p^2)\) for the EW-RLS update rule, where typically \(i\ll p\) in the overparameterized setting.
One notable aspect of ORFit is that it bridges the two seemingly distinct algorithms OGD and RLS through Proposition 1. The connection provides us new insights into understanding the behavior of our proposed method, as we discuss next.
Before presenting our main result, we first provide some intuitions. To understand the behavior of ORFit, let us first recall that the EW-RLS update rule 3 is the solution to the optimization problem 2 . In light of Proposition 1, one might be tempted to claim that ORFit in 5 solves 2 with \(\lambda=0\) and \(\Pi=I\). However, 2 is not well-defined for \(\lambda=0\).
Nevertheless, intuitively one can regard ORFit as solving 2 in the limit of \(\lambda \to 0^+\). Then, for sufficiently small \(\lambda>0\), the first term in the objective of 2 outweighs the second term, which suggests that, in the overparameterized case, the solution should enforce \(y_k \approx w^\top x_k\) for all \(k=1,2,\dots, i\), while minimizing \(\lVert w-w_0 \rVert^2\). In the following theorem, we formalize this intuition and characterize the solution of ORFit; see Appendix 11 for a proof.
Theorem 2. Consider a linear overparameterized (\(p\geq K\)) model \(f(x;w) = \Phi(x)^\top w \in\mathbb{R}^c\) and a data sequence \(\left((x_{k}, y_{k})\right)_{k=1}^{K}\). Let \(w_0\) be the initialization and \(m\) be the memory limit. Then, at each iteration \(i\leq m\), the parameter vector obtained by ORFit is the solution of the following optimization problem: \[\begin{align} w_i = \mathop{\mathrm{arg\,min}}_w&& & \lVert w-w_o \rVert_2\\[-1ex] \mathrm{s.t.}&& & {y_{k} = \Phi(x_{k})^\top w \quad(k\in [i]).} \end{align} \label{eq:implicit95bias}\tag{10}\]
It is known that, for a linear overparameterized model \(f(x;w) = x^\top w\), in the standard multi-pass learning setting over the dataset \(\{(x_k,y_k)\}_{k=1}^{i}\), as the number of iterations goes to infinity, the iterates of stochastic gradient descent (SGD) initialized at \(w_0\) with a sufficiently small step size converge to the solution of problem 10 (see, e.g., Proposition 1 in [25]). Thus, ORFit with just an epoch of training finds the solution that SGD in the limit of infinite number of iterations converges to.
Corollary 1. Consider a linear overparameterized (\(p\geq K\)) model \(f(x;w) = x^\top w \in\mathbb{R}\) and a data sequence \(\left((x_k, y_k)\right)_{k=1}^{K}\). Let \(w_0\) be the initialization and \(m\) be the memory limit. The parameter vector obtained by ORFit at each iteration \(i\leq m\) is equal to what SGD would converge to by iterating over the dataset \(\{(x_k, y_k)\}_{k=1}^{i}\) with a sufficiently small step size in the limit of infinite number of iterations.
While Theorem 2 characterizes ORFit as solving a global optimization problem, we can also interpret each update step as solving a local optimization problem. This perspective provides an alternative approach to proving Theorem 2 by showing that iteratively solving these local optimization problems ultimately leads to solving the global optimization problem 10 . This approach is elaborated in detail in [2]. By examining the update rule 7 , we can recursively characterize each update step as below; see Appendix 12 for a proof.
Proposition 3. Consider a linear overparameterized (\(p\geq K\)) model \(f(x;w) = \Phi(x)^\top w \in\mathbb{R}^c\) and a data sequence \(((x_{k}, y_k))_{k=1}^{K}\). Let \(w_0\) be the initialization and \(m\) be the memory limit. Then, at each iteration \(i\leq m\), the parameter vector obtained by ORFit is the solution of the following optimization problem: \[\label{eq:local95opt} \begin{align} w_i = \mathop{\mathrm{arg\,min}}_{w}&& & \lVert w-w_{i-1} \rVert_2\\[-1ex] \mathrm{s.t.}&& & y_{k} = \Phi(x_k)^\top w \quad(k\in [i]). \end{align}\qquad{(1)}\]
ORFit utilizes IPCA to summarize the previous gradients with the top principal components to which the new update is orthogonal. We show that this choice is minimax optimal, minimizing the worst-case forgetting for unknown future updates as below. See Appendix 13 for a proof.
Proposition 4. Consider a linear overparameterized (\(p\geq K\)) model \(f(x;w) = \Phi(x)^\top w\in\mathbb{R}^c\) and a data sequence \(\left((x_k, y_k)\right)_{k=1}^{K}\). Let \(m\) be the memory limit for ORFit and \(\mathrm{PCA}_{m}\left(\cdot\right)\) denote the span of the top \(m\) left singular vectors. Then, at each iteration \(i\geq m\), for the accumulated gradients \(G_{1:i}:=\begin{bmatrix} \left(\frac{\partial f_{k}(w_{k-1})}{\partial w}\right)^\top\end{bmatrix}_{k=1}^{i}\), \(\mathrm{PCA}_{m}\left(G_{1:i}\right)\) is the summary of the memory that minimizes the worst-case forgetting of ORFit in the sense that \[\begin{align} \mathrm{PCA}_{m}\left(G_{1:i}\right) = \underset{S\in \mathcal{V}_{m}}{\mathrm{arg\, min}} \, \max\limits_{\left\|\Delta w\right\|=1} & \sum_{k=1}^i (f_k(w+\Delta w)-f_k(w))^2\\ \text{s.t.} \;\;\; & \Delta w \perp S \end{align}, \label{eq:pca95opt}\qquad{(2)}\] where \(\mathcal{V}_{m}\) denotes the set of all \(m\)-dimensional subspaces of \(\mathbb{R}^p\).
With the limited memory size \(m\), ORFit stores an approximation \(S\) of the accumulated gradients that spans an \(m\)-dimensional subspace of \(\mathbb{R}^p\). Then, the unknown future update \(\Delta w\) is constrained to be orthogonal to \(S\). Thus, as shown in ?? , the summary through the top \(m\) left singular vectors \(\mathrm{PCA}_{m}\left(G_{1:i}\right)\) minimizes the worst-case forgetting (i.e., the maximum change of function values among all feasible future update directions \(\Delta w\)) for the previous datapoints \(\left\{(x_k,y_k)\right\}_{k=1}^i\).
In this section, we demonstrate the effectiveness of our proposed methods in the one-pass learning setting and corroborate the theoretical results presented in Section 4. We performed experiments for linear models in the Rotated MNIST setup described in [26]. In this setup, the inputs are rotated MNIST images for digit ‘\(2\)’, whose size is \(28\times 28\). Our goal is to estimate the rotated angles in \([0, \pi]\). For the training dataset, the angles are uniformly sampled from \([0, \pi]\), and to introduce distribution shift, we order the dataset so that an image rotated with a smaller angle arrives earlier.
Figure 6: Results for the memory-restricted setting (§5.1). () shows the evolution of the test errors measured after learning each datapoint, while () shows the evolution of the prediction errors for a particular sample (the \(16\)-th example) after each iteration. The red dashed line indicates the step on which the sample is trained. The shades indicate the standard deviations over \(10\) independent runs. a — Test Error, b — Sample Prediction Error
In this experiment, we demonstrate the effectiveness of our proposed method in the memory-restricted setting, in which 100 datapoints are sequentially learned while we are only allowed to store up to \(10\) basis vectors. For comparison, we consider the following baselines: (1) Greedy scheme: outputs only the label of the most recently learned datapoint, regardless of the input, as an extreme case of forgetting. (2) One-Step SGD: employs the one-step learning scheme with the step size in 6 but with empty orthogonal basis. (3) ORFit-random: ORFit that keeps \(10\) randomly chosen basis vectors after each iteration instead of performing IPCA. (4) ORFit-latest: ORFit that keeps the latest \(10\) basis vectors after each iteration instead of performing IPCA.
We first compared the test errors of the proposed method and the baselines. The test errors are measured with the test dataset that consists of \(1032\) images of digit ‘\(2\)’ in the MNIST test set of which each is rotated with a random angle in \([0, \pi]\). As shown in Fig. 6 (a), ORFit outperforms other baselines after a sufficient number of training steps. Notably, ORFit results in lower variances over the \(10\) independent runs with different initialization.
Next, we compared the degrees of forgetting for the proposed method and the baselines by keeping track of the prediction errors of a training datapoint throughout the training. As shown in Fig. 6 (b), ORFit successfully keeps the prediction error low throughout the training. This is in stark contrast with other methods for which the prediction error quickly increases as other datapoints are learned.
In this experiment, we follow the setup in Section 5.1 except that this time, we do not impose any memory restrictions. For comparison, we run vanilla SGD with a fixed step size \(10^{-5}\). Vanilla SGD makes multiple passes over the entire dataset for \(1000\) epochs.
Figure 7: Results for the setting without memory restriction (§5.2). () shows the evolution of the test and train errors measured after each training step, while () shows the evolution of the prediction errors for a particular sample (the \(11\)-th example) after each iteration. The red dashed line indicates the step on which the sample is trained. The shades indicate the standard deviations over \(10\) independent runs.. a — Training and Test Errors, b — Sample Prediction Error
Reported in Fig. 7 (a) are the training and test errors of vanilla SGD, One-Step SGD, and ORFit. Note that both SGD and ORFit learn the training dataset perfectly with almost zero training error. Moreover, after the training is finished, SGD and ORFit achieve similar test errors, corroborating Theorem 2.
In addition, Fig. 7 (b) shows the prediction error of the \(11\)-th training datapoint throughout the training (analogous to Fig. 6 (b)). Note that ORFit perfectly preserves the prediction error of the sample, while One-Step SGD quickly deteriorates it. This result demonstrates the effectiveness of the orthogonal update in ORFit for one-pass learning.
In this section, we extend our discussion to nonlinear models under the neural tangent kernel (NTK) regime [27], [28]. The main idea behind the NTK regime is that when the width of the neural network is chosen large enough, the model is well-approximated by its first-order approximation around the initialization: \[f_k(w) \approx f_k(w_0) + \nabla f_k(w_0)^\top (w - w_0).\] In particular, Lee et al. [28] discussed sufficient conditions (in terms of the width of the network) for which this approximation is valid; see their Theorem 2.1 for details. Under the linearized regime, we consider expanding the model around the parameter learned at the previous step as \[\label{eq:NTK} f_k(w) \approx f_k(w_{k-1}) + \nabla f_k(w_{k-1})^\top (w - w_{k-1})=\vcentcolon f_{k|k-1}(w).\tag{11}\] Throughout, we use \(f_{k|k-1}(w)\) to denote the RHS of 11 .
A notable feature of ORFit is that it is applicable to any differentiable nonlinear model \(f\). This is in contrast to the EW-RLS algorithm 3 , which is only applicable to linear models \(f(x;w)=w^\top x\). Based on this observation, we employ the step size calculated in 6 to fit the new data for a nonlinear model under the NTK regime 11 . More specifically, we consider an analog of the EW-RLS 2 for nonlinear models: \[w_i = \arg\min_w \sum_{k=1}^{i} \lambda^{i-k}(y_k - f_{k|k-1}(w))^2 + \lambda^i \lVert w-w_0 \rVert^2_\Pi , \label{eq:weighted95ls95ntk}\tag{12}\] given a forgetting factor \(0<\lambda\leq 1\) and \(\Pi\succ 0\). Then similarly to 3 , one can express the solution of 12 in a recursive manner, while treating \((\nabla f_k(w_{k-1}), y_k-f_k(w_{k-1})+\nabla f_k(w_{k-1})^\top w_{k-1})\) as the streamed data at the \(k\)-th step: \[\begin{cases} w_i = w_{i-1} + \dfrac{P_{i-1} \nabla f_{i}(w_{i-1}) (y_i - f_{i}(w_{i-1}))}{\lambda^i + \nabla f_{i}(w_{i-1})^\top P_{i-1} \nabla f_{i}(w_{i-1})}\,, \\ P_i = P_{i-1} - \dfrac{P_{i-1} \nabla f_{i}(w_{i-1}) \nabla f_{i}(w_{i-1})^\top P_{i-1}}{\lambda^i + \nabla f_{i}(w_{i-1})^\top P_{i-1} \nabla f_{i}(w_{i-1})}\,, \end{cases} \label{eq:rls95update95general}\tag{13}\] with initialization \(w_0\) and \(P_0 = \Pi^{-1}\). We call this generalized update rule NTK-RLS. Following a similar argument as in Section 4, we obtain the following result; see Appendix 14 for a proof.
Theorem 5. Consider a (nonlinear) overparameterized model \(f(x;w)\in\mathbb{R}\) with \(p\geq K\) and a data sequence \(\left((x_k, y_k)\right)_{k=1}^{K}\). Let \(w_0\) be the initialization and \(m\) be the memory limit for ORFit. Then, at each iteration \(i\leq m\), the update rule of ORFit results in the same parameter vector as the NTK-RLS update rule 13 does with \(\lambda=0\), \(\Pi=I\), and initialization \(w_0\). Moreover, at each iteration \(i \leq m\), the parameter vector obtained by ORFit is the solution of the following optimization problem: \[\begin{align} w_i = \arg\min_w \;\; & \lVert w-w_o \rVert\\ \mathrm{s.t.}\;\;\; & y_k = f_{k|k-1}(w) \;\; k=1,2,\dots, i. \end{align} \label{eq:implicit95bias95ntk}\tag{14}\]
Note that, under the NTK regime, for an overparameterized model, we have \(f_{k|k-1}(w)\approx f(x_k;w)\), and the solution obtained by ORFit in one pass is the same as that of SGD in the standard multi-pass setting (as characterized in, e.g., [25]). Compared to NTK-RLS 13 , ORFit in 5 greatly reduces both time and memory complexities from \(O(p^2)\) to \(O(ip)\). This is particularly important for \(p\gg i\), common to the overparameterized settings (for instance, \(p\approx 11\mathrm{M}\) in ResNet-18, commonly used for the CIFAR-10 dataset, consisting of 50K samples).
In this paper, we proposed an algorithm called Orthogonal Recursive Fitting (ORFit) to tackle one-pass learning. We discussed the connection between the proposed method and orthogonal gradient descent (OGD), a practical algorithm in continual learning, as well as the recursive least-squares (RLS), a well-known method from adaptive filtering. Through this connection, we explained the advantages of the proposed method and theoretically characterized its behavior. Our theoretical findings reveal that ORFit attains the same solution as SGD, despite being a single-pass algorithm. We validated our method and its theoretical properties through several experiments and discussed its extensions to nonlinear settings, relevant for deep learning.
We conclude with several interesting future directions. First, although ORFit exhibits outstanding performance in the memory-limited setting, some forgetting is still happening. This is caused by the information loss from summarizing the orthogonal basis via IPCA. Theoretically characterizing how much ORFit forgets would be of great value. Moreover, one can also come up with other methods to summarize the memory such as matrix sketching[29] and compare them with ORFit. Next, we remark that RLS is a special case of the Kalman filter applied on a static system. Based on this connection, another interesting avenue would be to build on ORFit and potentially devise efficient learning/estimation methods for dynamic systems. Lastly, exploring the practicality of ORFit for deep neural networks based on a comprehensive set of experiments would be of great interest.
This work was supported in part by the MIT-IBM Watson AI Lab, MathWorks, the MIT-Amazon Science Hub, and the MIT-Google Program for Computing Innovation. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing computing resources that have contributed to the results reported within this paper. The authors also thank Kwangjun Ahn for his valuable input in the early stages of this work.
ORFit also has an interesting connection to the extended Kalman filter (EKF). For a parameterized model \(f(\cdot \;;w)\in\mathbb{R}\) with a sequence of data \(((x_k,y_k))_{k=1}^K\), we can regard it as a static system with state vector \(w\) and formulate a state-space model as \[\begin{align} \Dot{w} & = 0,\\ y_k & = f(x_k; w) + v_k\\ & = f_k(w) + v_k \;\;\;\;\; v_k \sim \mathcal{N}(0,R), \end{align}\] where each data point is considered as a measurement of the state \(w\) and \(v_k\) is a measurement noise model. Then, for the streamed datapoints, estimation over \(w\) can be updated sequentially using EKF from the initial estimation \(w_0\) as \[\begin{align} \hat{w_k} & = \hat{w_{k-1}} + L_k (y_k - f_k(\hat{w_{k-1}})),\\ L_k & = \dfrac{Q_{k-1} \nabla f_k(\hat{w_{k-1}})} {f_k(\hat{w_{k-1}})^\top Q_{k-1} \nabla f_k(\hat{w_{k-1}}) + R},\\ Q_k & = [I - L_k f_k(\hat{w_{k-1}})^\top] Q_{k-1}. \end{align}\] When the measurement noise is zero, i.e., \(R=0\), the update rule computes the same solution as ORFit.
Consider \((x,y)\in T_k\) for any \(1\leq k \leq i-1\). \[f(x;w') = \phi(x)^\top w' = \phi(x)^\top (w-\eta \tilde{g}) = f(x;w) - \eta \phi(x)^\top \tilde{g}.\] Since the orthogonal basis \(S\) spans \(\nabla_w f(x;w_k)= \phi(x)\), \(\phi(x)\) can be represented as \(\phi(x) = \sum_{u\in S} \mathrm{proj}_{u} \big(\phi(x)\big)\). Then, \[\begin{align} \phi(x)^\top \tilde{g} &= \Big(\sum_{u\in S} \mathrm{proj}_{u}\big(\phi(x)\big)\Big)^\top \Big(g- \sum_{v\in S} \mathrm{proj}_{v}(g)\Big)\\ &= \Big(\sum_{u\in S} \dfrac{uu^\top}{\lVert u \rVert^2}\phi(x)\Big)^\top \Big(I - \sum_{v\in S} \dfrac{vv^\top}{\lVert v \rVert^2}\Big)g\\ &= \phi(x)^\top \Big(\sum_{u\in S} \dfrac{uu^\top}{\lVert u \rVert^2} - \sum_{u\in S}\sum_{v\in S} \dfrac{uu^\top vv^\top}{\lVert u \rVert^2\lVert v \rVert^2}\Big) g\\ &= \phi(x)^\top \Big(\sum_{u\in S} \dfrac{uu^\top}{\lVert u \rVert^2} - \sum_{u\in S} \dfrac{uu^\top}{\lVert u \rVert^2}\Big) g = 0 \end{align}\] as \(u^\top v = 0\) if \(u\neq v\). Thus, \(f(x;w')=f(x;w)\). 0◻
In the update rule of ORFit 5 , the new basis vector can be represented as \[\begin{align} v' &= \nabla f_{i}(w_{i-1}) - \sum_{v\in S_{i-1}} \mathrm{proj}_{v} (\nabla f_{i}(w_{i-1}))\\ &= (I-\sum_{v\in S_{i-1}}\frac{v v^\top }{\lVert v \rVert^2}) \nabla f_{i}(w_{i-1}) = Q_{i-1} x_i, \end{align}\] where \(Q_{i-1} \vcentcolon= I-\sum_{v\in S_{i-1}} v v^\top/\lVert v \rVert^2\). Similarly, \[\label{eq:direction} \tilde{g}_{i-1} = Q_{i-1} \nabla \ell_{i}(w_{i-1}) = Q_{i-1} \ell'(y_i, f_i(w_{i-1})) x_i,\tag{15}\] where \(\ell'(\cdot,\cdot)\) denotes the derivative of \(\ell(\cdot,\cdot)\) with respect to its second argument.
Putting 15 into the parameter update in 5 , \[\label{eq:osogd95proj1} w_{i} = w_{i-1} + \dfrac{Q_{i-1} x_i}{x_i^\top Q_{i-1} x_i} (y_i - x_i^\top w_{i-1}).\tag{16}\] Also, note that \(Q_{i-1}\) is symmetric and satisfies \(Q_{i-1}^2 = Q_{i-1}\), and it corresponds to the projection matrix onto the orthogonal complement of the subspace \(\mathrm{span}\{S_{i-1}\}=\mathrm{span}\{\nabla f_k(w_{k-1})\}_{k=1}^{i-1}\). With these properties, \(Q_i\) can be expressed in a recursive form as \[\label{eq:osogd95proj2} Q_i = Q_{i-1} - \frac{v' v'^\top }{\lVert v' \rVert^2} = Q_{i-1} - \frac{Q_{i-1} x_i x_i^\top Q_{i-1} }{x_i^\top Q_{i-1} x_i}.\tag{17}\] Then, 16 and 17 are equivalent to the EW-RLS update rule 3 with \(\lambda=0\), \(\Pi=I\), and initialization \(w_0\), while \(Q_0=I=P_0\). 0◻
We prove 10 using KKT conditions. First, we transform the RHS into an equivalent convex problem:
\[\label{eq:convex} \arg\min_w \frac{1}{2}\lVert w-w_o \rVert^2 \; \mathrm{s.t.}\; y_k = \Phi(x_k)^\top w \quad(k\in [i]).\tag{18}\] Let \(Y_{1:i}\vcentcolon=[y_1;\cdots;{y_{i}}]\in\mathbb{R}^{ic}\) and \(\Phi_{1:i}\vcentcolon=[\Phi(x_1)\;\cdots\;\Phi(x_i)]\in\mathbb{R}^{p\times ic}\). Then, the Lagrangian of 18 is then represented as \[\label{eq:lagrangian} \mathcal{L}(w,\lambda)=\frac{1}{2}\lVert w-w_o \rVert^2 + \lambda^\top (Y_{1:i}-\Phi_{1:i}^\top w).\tag{19}\] Since 18 is a convex problem, any primal and dual variables of 19 satisfying KKT conditions are primal and dual optimal. The KKT conditions for a pair of primal and dual variables \((w^*,\lambda^*)\) are \[\label{eq:KKT} \begin{cases} w^* - w_0 - \Phi_{1:i}\lambda^* = 0\\ Y_{1:i} - \Phi_{1:i}^\top w^* = 0. \end{cases}\tag{20}\]
Now, we show that \(w_i\) from the ORFit update rule 7 satisfies 20 with some \(\lambda_i\) so that \(w_i\) is the primal optimal solution of 18 which shows 10 . From 7 , \[\label{eq:stacked95update} w_i = w_0 - \sum_{k=1}^{i} \tilde{G}_{k-1}\eta_{k-1},\tag{21}\] where \(\eta_{k-1}=\left(G_{k-1}^\top \tilde{G}_{k-1}\right)^{-1} (f_k(w_{k-1})-y_k)\in\mathbb{R}^{c}\). For each \(k\in[i]\), \[\tilde{G}_{k-1} = \Phi(x_k) - \sum_{v\in S_{k-1}} \mathrm{proj}_v (\Phi(x_k)).\] Since each projection vector \(v\) is spanned by the columns of \(\Phi_{1:k-1}\), each column of \(\tilde{G}_{k-1}\) is spanned by the columns of \(\Phi_{1:k}\). Then, the term \(\tilde{G}_{k-1}\eta_{k-1}\) is also spanned by the columns of \(\Phi_{1:k}\). Thus, the summation term in 21 is in total spanned by the columns of \(\Phi_{1:i}\), i.e., there exists some \(d_i\in\mathbb{R}^{ic}\) such that \[\begin{align} w_i = w_0 - \sum_{k=1}^{i} \tilde{G}_{k-1}\eta_{k-1} = w_0 - \Phi_{1:i} d_i. \end{align}\]
By letting \(\lambda_i\vcentcolon=d_i\), \((w_i, \lambda_i)\) satisfies the first KKT condition. Also, with the update rule 7 , \(w_i\) fits all the data \(\{(x_k, y_k)\}_{k=1}^{i}\), which implies the second KKT condition. Thus, \(w_i\) is the solution of 18 so that 10 holds. 0◻
At each iteration \(i\leq m\), ORFit follows the update rule 7 . With the orthonormal basis matrix \(U\) of \(S_{i-1}\), \[\tilde{G}_{i-1} = (I-UU^\top)G_{i-1} = Q \Phi(x_i),\] where \(Q:=I-UU^\top\) is the projection matrix onto the orthogonal complement of the previous feature space \(\mathrm{span}\{S_{i-1}\}=\mathrm{span}\{\cup_{k=1}^{i-1}\mathrm{col}(\Phi(x_k))\}\). Then, we can rewrite the update step as \[\Delta w_i := w_i-w_{i-1} = (\Phi(x_i)^\top Q)^+ (y_i-f_i(w_{i-1}))\] with the Moore–Penrose inverse operation \((\cdot)^+\). Also, \(\Delta w_i = Q\Delta w_i\) as the projection matrix \(Q\) is idempotent, i.e. \(Q^2=Q\). Since the Moore–Penrose inverse provides the minimum \(\ell^2\)-norm solutions for under-determined linear systems, \[\label{eq:opt95scalar} \begin{align} \Delta w_i &= \mathop{\mathrm{arg\,min}}_{\Delta w} \|\Delta w\|_2 \mathrm{ s.t. }& y_i-f_i(w_{i-1}) = \Phi(x_i)^\top Q \Delta w\\ &= \mathop{\mathrm{arg\,min}}_{\Delta w} \|\Delta w\|_2 \mathrm{ s.t. }& y_i-f_i(w_{i-1}) = \Phi(x_i)^\top Q \Delta w,\\ & & \Delta w = Q \Delta w\\ &= \mathop{\mathrm{arg\,min}}_{\Delta w} \|Q \Delta w\|_2 \mathrm{ s.t. }& y_i = \Phi(x_i)^\top (w_{i-1}+Q \Delta w),\\ & & \Delta w = Q \Delta w\\ &= \mathop{\mathrm{arg\,min}}_{\Delta w} \|\Delta w\|_2 \mathrm{ s.t. }& y_i = \Phi(x_i)^\top (w_{i-1}+\Delta w),\\ & & \Delta w \perp \mathrm{span}\{S_{i-1}\} \end{align}.\tag{22}\] Since the orthogonality condition \(\Delta w_i \perp \mathrm{span}\{S_{i-1}\}\) is satisfied if and only if the previous predictions are preserved, i.e., \(f_k(w_i)=f_k(w_{i-1})\) for all \(k\in[i-1]\), ?? holds by induction. 0◻
We provide an interpretation of ORFit’s update direction by establishing the minimax optimality of the principal components in terms of forgetting through the Courant–Fischer–Weyl min-max theorem, which is a standard result in linear algebra related to the Rayleigh quotient and the intersection of subspaces; See, for example, [30]–[33] for the proof.
Lemma 2 (Courant-Fischer-Weyl [30]). Let \(A = A^\top \in \mathbb{R}^{n \times n}\) is a matrix whose eigenvalues are \(\lambda_{1} \leq \lambda_{2}\leq\cdots\leq\lambda_{n}\) and associated orthonormal eigenvectors are \(u_{1}, u_{2}, \cdots, u_{n}\). Also, let \(\mathcal{V}_{k}\) denotes the set of \(k\)-dimensional subspaces of \(\mathbb{R}^{n}\). Then, for each \(1\leq k\leq n\), \[\label{eq:CFW95minmax} \begin{align} \lambda_{k} &= \min\limits_{W\in \mathcal{V}_{k}} ~ \max\limits_{x\in W, \left\|x\right\|=1} x^\top Ax\\ &= \max\limits_{W\in \mathcal{V}_{n-k+1}} ~ \min\limits_{x\in W, \left\|x\right\|=1} x^\top Ax \end{align}\tag{23}\] In addition, \(W = \mathrm{span}\left\{u_{1}, \cdots, u_{k}\right\}\) achieves the outer minimum for the first expression of 23 , and \(W = \mathrm{span}\left\{u_{1}, \cdots, u_{k-1}\right\}^{\perp} = \mathrm{span}\left\{u_{k}, \cdots, u_{n}\right\}\) achieves the outer maximum for the second expression of 23 .
Utilizing this lemma, we can characterize the meaning of approximating the previous gradients with the top principal directions as minimizing the worst-case forgetting for unknown future updates as below.
Consider a linear model \(f\left(x;w\right) = \Phi(x)^\top w\). Then, \(G_{1:i}:=\begin{bmatrix} \Phi(x_{1}) & \cdots & \Phi(x_{i})\end{bmatrix}\in\mathbb{R}^{p\times ic}\). Let \(\Delta w\) denote the parameter update step. The change in the function value at a datapoint \(x_{i}\) due to parameter update \(\Delta w\) can be written as \[\label{eq:Delta95f95i} \Delta f_{i} := f_{i}\left(w+\Delta w\right)-f_{i}\left( w\right)= \Phi(x_{i})^\top\Delta w.\tag{24}\] Concatenating the changes for all datapoints into a vector gives \[\label{eq:Delta95f} \Delta F := \begin{bmatrix} \Delta f_{1} ; \cdots ; \Delta f_{i} \end{bmatrix} = G_{1:i}^\top\Delta w.\tag{25}\] Then, we can represent the forgetting as \[\sum_{k=1}^i (f_k(w+\Delta w)-f_k(w))^2 = \left\| \Delta F \right\|_{2}^{2} = \Delta w^\top G_{1:i} G_{1:i}^\top \Delta w.\]
Meanwhile, Lemma 2 states that the outer minimum in \[\label{eq:minmax95Delta95f95norm} \begin{align} \lambda_{p-m} &= \min\limits_{W\in \mathcal{V}_{p-m}} ~ \max\limits_{\Delta w \in W, \left\|\Delta w\right\|=1} \left\| \Delta F \right\|_{2}^{2} \\ &= \min\limits_{W\in \mathcal{V}_{p-m}} ~ \max\limits_{\Delta w \in W, \left\| \Delta w \right\|=1} \Delta w^\top G_{1:i} G_{1:i}^\top \Delta w \end{align}\tag{26}\] is achieved with \(W^{*} = \mathrm{span}\left\{u_{1}, \cdots, u_{p-m}\right\}\) where \(u_{k}\) represents the orthonormal eigenvector of \(G_{1:i} G_{1:i}^\top\) associated with the \(k\)-th least eigenvalue. Since \[\mathrm{PCA}_{m}\left(G_{1:i}\right) = \mathrm{span}\left\{u_{p-m+1},\cdots, u_p\right\} = {W^{*}}^{\perp}\] where \(\perp\) denotes the orthogonal complement of a subspace, an update step \(\Delta w \in W^{*}\) is orthogonal to the subspace given by \(\mathrm{PCA}_{m}\left(G_{1:i}\right)\). Therefore, \[\begin{align} \mathrm{PCA}_{m}\left(G_{1:i}\right) = \underset{S\in \mathcal{V}_{m}}{\mathrm{arg\, min}} \, \max\limits_{\left\|\Delta w\right\|=1} & \sum_{k=1}^i (f_k(w+\Delta w)-f_k(w))^2\\ \text{s.t.} \;\;\; & \Delta w \perp S \end{align}.\] Thus, the update direction satisfying \(\Delta w \perp \mathrm{PCA}_{m}\left(G_{1:i}\right)\) minimizes the maximum of the \(\ell_{2}\)-norm of the function value change spanning all observed datapoints when \(\Delta w\) is confined to the \(\left(p-m\right)\)-dimensional orthogonal subspace of \(\mathbb{R}^{p}\). 0◻
We first prove the connection between ORFit and NTK-RLS. As in the proof of Prop. 1, the new basis vector and the projected gradient of the loss in the update rule of ORFit 5 can be represented as \[\begin{align} v' &= Q_{i-1} \nabla f_{i}(w_{i-1}), \tag{27}\\ \tilde{g}_{i-1} &= Q_{i-1} \ell'(y_i, f_i(w_{i-1})) \nabla f_{i}(w_{i-1}), \tag{28} \end{align}\] where \(Q_{i-1} \vcentcolon= I-\sum_{v\in S_{i-1}} v v^\top/\lVert v \rVert^2\). Putting 28 into the parameter update in 5 , \[\label{eq:osogd95proj195ntk} w_{i} = w_{i-1} + \dfrac{Q_{i-1} \nabla f_{i}(w_{i-1})(y_i - f_i(w_{i-1}))}{\nabla f_{i}(w_{i-1})^\top Q_{i-1} \nabla f_{i}(w_{i-1})}.\tag{29}\] Since \(Q_{i-1}\) is symmetric and satisfies \(Q_{i-1}^2 = Q_{i-1}\), with 27 , \(Q_i\) can be expressed in a recursive form as \[\label{eq:osogd95proj295ntk} Q_i = Q_{i-1} - \frac{Q_{i-1} \nabla f_{i}(w_{i-1}) \nabla f_{i}(w_{i-1})^\top Q_{i-1} }{\nabla f_{i}(w_{i-1})^\top Q_{i-1} \nabla f_{i}(w_{i-1})}.\tag{30}\] Then, 29 and 30 are equivalent to the NTK-RLS update rule 13 with \(\lambda=0\), \(\Pi=I\), and initialization \(w_0\), while \(Q_0=I=P_0\).
We then prove 14 using KKT conditions as in Thm. 2. First, we transform the RHS into an equivalent convex problem: \[\label{eq:convex95ntk} \arg\min_w \frac{1}{2}\lVert w-w_o \rVert^2 \; \mathrm{s.t.}\; y_k = f_{k|k-1}(w) \;\; k=1,\dots,i.\tag{31}\] Let \(\tilde{y}_k\vcentcolon=y_k-f_k(w_{k-1})+\nabla f_k(w_{k-1})^\top w_{k-1}\), \(\tilde{y}\vcentcolon=[\tilde{y}_1\;\dots\;\tilde{y}_i]^\top\in\mathbb{R}^i\), and \(\tilde{X}\vcentcolon=[\nabla f_1(w_{0})\;\dots\;\nabla f_i(w_{i-1})]^\top\in\mathbb{R}^{i\times d}\). Then, the Lagrangian of 31 is then represented as \[\label{eq:lagrangian95ntk} \mathcal{L}(w,\lambda)=\frac{1}{2}\lVert w-w_o \rVert^2 + \lambda^\top (\tilde{y}-\tilde{X}w).\tag{32}\] Since 31 is a convex problem, any primal and dual variables of 32 satisfying KKT conditions are primal and dual optimal. The KKT conditions for a pair of primal and dual variables \((w^*,\lambda^*)\) are \[\label{eq:KKT95ntk} \begin{cases} w^* - w_0 - \tilde{X}^\top\lambda^* = 0\\ \tilde{y} - \tilde{X}w^* = 0\,. \end{cases}\tag{33}\]
Now, we show that \(w_i\) from the ORFit update rule 5 satisfies 33 with some \(\lambda_i\) so that \(w_i\) is the primal optimal solution of 31 which in turn shows 14 . From 5 , \(w_i = w_0 - \sum_{k=0}^{i-1} \eta_{k} \tilde{g}_{k}\). Then for each \(k\), \[\begin{align} \tilde{g}_k &= \nabla \ell_{k+1}(w_{k}) - \sum_{v\in S_{k}} \mathrm{proj}_{v} (\nabla \ell_{k+1}(w_{k}))\\ &= \ell'(y_{k+1}, f_{k+1}(w_{k})) \nabla f_{k+1}(w_{k}) - \sum_{j=1}^{k} c_{k,j} \nabla f_j(w_{j-1})\\ & = \tilde{X}^\top d_k, \end{align}\] where \(d_k\vcentcolon=[-c_{k,1},\dots,-c_{k,k},\ell'(y_{k+1}, f_{k+1}(w_{k})),0,\dots,0]^\top \in\mathbb{R}^i\). Such \(c_{k,j}\in\mathbb{R}\) exists since \(\mathrm{span}\{S_{k}\} =\mathrm{span}\{\nabla f_j(w_{j-1})\}_{j=1}^{k}\). Then, \[w_i = w_0 - \sum_{k=0}^{i-1} \eta_{k} \tilde{X}^\top d_k = w_0 - \tilde{X}^\top \sum_{k=0}^{i-1} \eta_{k} d_k.\] By letting \(\lambda_i\vcentcolon=- \sum_{k=0}^{i-1} \eta_{k} d_k\), \((w_i, \lambda_i)\) satisfies the first KKT condition.
For the second KKT condition, we observe that for \(k\leq i\), \[\begin{align} f_{k|k-1}(w_i) &= f_k(w_{k-1}) + \nabla f_k(w_{k-1})^\top (w_i - w_{k-1})\\ &= f_k(w_{k-1}) - \nabla f_k(w_{k-1})^\top \sum_{j=k-1}^{i-1} \eta_{j} \tilde{g}_{j}\\ &= f_k(w_{k-1}) - \eta_{k-1} \nabla f_k(w_{k-1})^\top \tilde{g}_{k-1}\tag{34}\\ &= f_k(w_{k-1}) - (f_k(w_{k-1}) - y_k) = y_k, \tag{35} \end{align}\] where 34 is satisfied as \(\nabla f_k(w_{k-1}) \perp \tilde{g}_{j}\) for \(j\geq k\), and the step-size from 6 results in 35 . Then, \(w_i\) satisfies the second KKT condition and hence is the solution of 31 . 0◻
This paper was presented in part at the IEEE Conference on Decision and Control, Singapore, December 2022 [1] and at the American Control Conference, Canada, July 2024 [2].↩︎
To be precise, the update rule in OGD uses \(g\) to denote the gradient of the loss, rather than the gradient of the model; however, as long as the former is nonzero, their direction is the same.↩︎