Efficient Online Unlearning via Hessian-Free Recollection of Individual Data Statistics


Abstract

Machine unlearning strives to uphold data owners’ right to be forgotten by enabling models to selectively forget specific data. Recent methods suggest that data forgetting can be achieved by precomputing and storing statistics carrying second-order information, improving computational and memory efficiency. However, they rely on restrictive assumptions, and their computation/storage suffer from the curse of model parameter dimensionality, making them challenging to apply to most deep models. In this work, we propose a Hessian-free online unlearning method. We maintain a statistical vector for each data sample, computed through an affine stochastic recursion approximation of the difference between the retrained and learned models. Our proposed algorithm achieves near-instantaneous online unlearning, as it only requires a vector addition operation. Based on the strategy of recollecting statistics for the forgetting data, the proposed method significantly reduces the unlearning time. Experiments demonstrate that the proposed scheme surpasses existing methods by orders of magnitude in terms of time/memory costs, while also enhancing accuracy.

1 Introduction

[ht]
\centering
\setlength\tabcolsep{5.5pt}
  \caption{\textbf{Our main contributions.} Here \(d\) is the dimensionality of the model parameters and \(n\) is the total size of the dataset.}
  \label{table:contribution}
  \begin{tabular}{ccccc}
    \toprule
    Algorithm & Pre-computation time & Storage & Unlearning time (per sample) & Description \\
    \midrule
    \cite{DBLP:conf/nips/SekhariAKS21} & $\mathcal{O}(nd^2)$ & $\mathcal{O}(d^2)$ & $\mathcal{O}(d^3)$ & Hessian-based method \\
    \cite{DBLP:conf/nips/SuriyakumarW22} & $\mathcal{O}(d^3+nd^2)$ & $\mathcal{O}(d^2)$ & $\mathcal{O}(d^2)$ & Hessian-based method \\
    Proposed method & $\mathcal{O}(nd)$ & $\mathcal{O}(nd)$ & $\mathcal{O}(d)$ & Hessian-free method \\
    \bottomrule
  \end{tabular}

Recent data protection regulations, e.g., the General Data Protection Regulation, aim to safeguard individual privacy and include provisions such as the Right to be Forgotten. This provision mandates that the data controller erase personal data without undue delay when the data providers request its deletion. Within the context of the right to be forgotten, data owners can request the removal of their personal data’s contribution from trained models. This can be achieved through machine unlearning, where the model provider erases the specific data pertaining to the data owner’s personal information.

A straightforward unlearning algorithm is to retrain the model from scratch, which is however impractical due to unaffordable monetary and time costs. Retraining is especially challenging in time-sensitive applications such as fraud detection or online learning, where neglecting timeliness may harm system performance and responsiveness. To this end, prior efforts [1], [2] have designed and evaluated unlearning methods by metrics including efficiency, measured by runtime, and efficacy, measured by the similarity to the retrained model. However, requiring an unlearning method to attain a model entirely identical to the retrained one is a notably demanding criterion. Hence, inspired by differential privacy, [3] introduced a relaxed definition that only requires the output distribution of the unlearning algorithm to be indistinguishable from that of retraining. Building upon this framework, diverse unlearning methodologies, e.g., [4][6], have been devised.

Online unlearning algorithms deal with deletion requests that arrive in an online manner [7], [8]. The key idea is to proactively pre-compute and store certain data statistics prior to the deletion requests, achieving computational and memory efficiency while maintaining reasonable performance. These studies reveal that, for models that approximately minimize convex and sufficiently smooth empirical objectives, unlearning can be efficiently executed through a Newton-style step facilitated by the pre-stored Hessian. While these works have demonstrated the effectiveness of second-order methods in approximating the retrained model, they have a few fundamental limitations. First, the existing unlearning algorithms require explicit pre-storage of the Hessian matrix (or its inverse) and are therefore expensive to implement; this forms a stark contrast to the many training algorithms that leverage Hessian-free solvers. Second, they are not applicable to most deep neural networks due to the need for convexity in the objective function. More concretely, we ask: how can we efficiently pre-compute unlearning statistics under more lenient assumptions?

In this paper, we address these fundamental questions and introduce an efficient and private unlearning approach that departs significantly from prior works. First, we analyze the difference between the learned and retrained models after forgetting each sample by recollecting the trajectory discrepancy. Subsequently, we demonstrate that these differences can be approximated and characterized through an affine stochastic recursion; we refer to this approximation technique as the approximator. Moreover, we employ the Hessian Vector Product (HVP) technique to compute these approximators efficiently, thereby reducing the running time, and we employ gradient clipping to bound the errors introduced by the approximation. Finally, building upon the aforementioned analysis and techniques, we propose an efficient algorithm that pre-computes and pre-stores these statistics. When deletion requests arrive, machine unlearning is achieved rapidly by adding the corresponding approximators. Our algorithm requires minimal modifications to the training process, enabling fast unlearning with distinctive computational advantages. Equally importantly, we ensure forgetting by deleting the approximator of the corresponding sample post-unlearning, rather than retaining such statistics.

We summarize our contributions below and in Table [table:contribution].

  • More Lenient Assumptions, Less Cost. We propose a recollection approximator by analyzing the impact of each sample on the training trajectory. Our proposed approximator allows us to leverage the HVP technique, which is not feasible with existing algorithms. It thus achieves a lower-complexity pre-computation of individual data statistics. (Section 3).

  • Computationally Efficient Algorithm. We then propose a Hessian-free algorithm for online unlearning which is near-instantaneous, as it only requires a vector addition operation. Additionally, this algorithm provides a theoretical efficacy guarantee, showing that our approach outperforms previous methods. (Section 4).

  • Exceptional Performance. We experimentally validate our algorithm with a wide range of metrics. Our experiments elucidate the shortcomings of prior algorithms in non-convex settings. In contrast, our results show that the proposed method surpasses prior works, especially in unlearning time: our algorithm needs only milliseconds to forget each sample. (Section 5).

1.1 Related Work

Machine unlearning can be traced back to [9], whose work defines the notion of Exact Unlearning, demanding that the models obtained through retraining and unlearning be completely consistent. Similar methods, such as [10][15], train submodels based on different dataset partitions and aggregation strategies. Approximate Unlearning, e.g., [2], [16][25], relaxes this definition and reduces the impact of the forgetting data to an acceptable level in exchange for computational efficiency, reduced storage costs, and flexibility. To this end, [3] introduced a relaxed definition that aims to make the unlearned model indistinguishable from the retrained model.

Under this approximate definition, [7], [8], [26] align with our approach of pre-computing data statistics that extract second-order information from the Hessian and using these ‘recollections’ to forget data. Specifically, (1) Newton Step (NS). [8] proposes, through the offline precomputation and storage of the Hessian, to achieve forgetting upon the arrival of a request using a Newton step. However, this method necessitates strong assumptions, e.g., a strongly convex objective function. Its unlearning time of \(\mathcal{O}(d^3)\) is nearly impractical for over-parameterized deep neural networks. (2) Infinitesimal Jackknife (IJ). [7], [26] extend NS, attaining strong convexity through a non-smooth regularizer by leveraging the IJ. Moreover, by spending offline computation time to invert the Hessian, the unlearning time is reduced to \(\mathcal{O}(d^2)\). However, the basic issues remain, i.e., the overly strong convexity assumptions and the difficulty of applying to over-parameterized deep models. On the contrary, as indicated in Table [table:contribution], our approach evades the curse of model parameter dimensionality. Moreover, previous works necessitate maintaining the Hessian information to address ongoing deletion requests, which risks violating the right to be forgotten and compromises user privacy. Instead, our method goes beyond ‘recollecting to forget’ by also ‘forgetting the recollections’.

2 Problem Formulation

2.1 Learning

Let \(\ell(\mathbf{w}; z)\) be the loss function for a given parameter \(\mathbf{w} \in \mathbb{R}^d\) on an instance \(z\) over an instance space \(\mathcal{Z}\). Learning can be expressed as the population risk minimization problem: \[\min_{\mathbf{w}}\quad F(\mathbf{w}):=\mathbb{E}_{z \sim \mathcal{D}}[\ell(\mathbf{w}; z)]. \label{population_risk}\tag{1}\] Addressing this problem directly is challenging, as the probability distribution \(\mathcal{D}\) is usually unknown. For a finite dataset \(\mathcal{S}= \{z_i\}_{i=1}^n\) with \(n\) samples, one often solves the following empirical risk minimization problem instead: \[\min_{\mathbf{w}}\quad F_{\mathcal{S}}(\mathbf{w}):=\frac{1}{n}\sum_{i=1}^n \ell(\mathbf{w}; z_i). \label{2-1}\tag{2}\] Stochastic gradient descent (SGD) is widely adopted to solve problem (2 ). Specifically, starting from an initial model, we iteratively move in the direction of the negative gradient with stepsize \(\eta\); the one-step update rule is: \[\mathbf{w}_{e, b+1} \leftarrow \mathbf{w}_{e, b}-\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b}} \nabla \ell \left(\mathbf{w}_{e, b};z_i\right), \label{2-2}\tag{3}\] for epoch index \(e \in \{0, ..., E\}\) and batch index \(b \in \{0, ..., B\}\). Here, \(\mathcal{B}_{e,b}\) denotes the set of data sampled in the \(e\)-th epoch’s \(b\)-th update, \(|\mathcal{B}_{e,b}|\) is its size, and \(|\mathcal{B}|\) is the maximum batch size.
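To make the notation concrete, here is a minimal PyTorch sketch of one update of 3 ; the linear model, random data, and stepsize are illustrative placeholders rather than the paper’s experimental setup.

```python
import torch

# Minimal sketch of the SGD recursion (3): one mini-batch update
#   w <- w - (eta/|B|) * sum_{i in B} grad l(w; z_i).
# The linear model and random data are illustrative placeholders.
model = torch.nn.Linear(784, 10)
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))

def sgd_step(model, x, y, stepsize):
    # Mean cross-entropy equals (1/|B|) * sum_i l(w; z_i).
    loss = torch.nn.functional.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, tuple(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(stepsize * g)

sgd_step(model, x, y, stepsize=0.05)
```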

2.2 Unlearning

Consider a scenario where a user requests to remove a forgetting dataset \(U={\{u_j\}}_{j=1}^m \subseteq \mathcal{S}\). To unlearn \(U\), retraining from scratch, i.e., minimizing \(F_{\mathcal{S} \backslash U}(\mathbf{w})\), is considered the naive unlearning method. Specifically, since each data sample takes part in only one (mini-)batch per epoch, we define the data point \(u_j \in U\) to be sampled in \(\mathcal{B}_{e,b(u_j)}\) during the \(e\)-th epoch’s \(b(u_j)\)-th batch update of the learning process, where \(b(u_j)\) denotes the batch update index in which \(u_j\) is sampled. Due to the removal of sample \(u_j\), the batch \(\mathcal{B}_{e,b(u_j)}\) shrinks to \(\mathcal{B}_{e,b(u_j)} \backslash \{u_j\}\). In accordance with the linear scaling rule discussed in [27], i.e., the stepsize should be adjusted in tandem with the batch size, the retraining stepsize is concurrently scaled to \(\eta_{e,b(u_j)} \frac{| \mathcal{B}_{e,b(u_j)} \backslash \{u_j\}|}{|\mathcal{B}_{e,b(u_j)}|}\). Thus, the retraining update rule in the \(e\)-th epoch is given by \[\mathbf{w}_{e,b+1}^{-U} \leftarrow \mathbf{w}_{e,b}^{-U}-\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b} } \nabla \ell (\mathbf{w}_{e,b}^{-U};z_i). \label{2-3}\tag{4}\] Specifically, for the data point \(u_j\) sampled into \(\mathcal{B}_{e,b(u_j)}\) during the \(e\)-th epoch’s \(b(u_j)\)-th update, we have the following update: \[\mathbf{w}_{e,b(u_j)+1}^{-U} \leftarrow \mathbf{w}_{e,b(u_j)}^{-U} -\frac{\eta_{e,b(u_j)}}{|\mathcal{B}_{e,b(u_j)}|} \sum_{i \in \mathcal{B}_{e,b(u_j)} \backslash \{u_j\} } \nabla \ell (\mathbf{w}_{e,b(u_j)}^{-U};z_i). \label{2-4}\tag{5}\]

To avoid the unaffordable costs incurred by retraining, approximate unlearning algorithms aim to approximate the retrained model through limited updates to the learned model \(\mathbf{w}_{E, B}\). Our goal is to ensure the indistinguishability between the outputs of the unlearning and retraining algorithms. Therefore, we consider the notion of unlearning defined in [8].

Definition 1 (\((\epsilon,\delta)\)-unlearning). Let \(\mathcal{S}\) be a training set and \(\Omega: \mathcal{S} \rightarrow \mathcal{W}\) be an algorithm that trains on \(\mathcal{S}\) and outputs a model \(\mathbf{w}\), and let \(\mathcal{T}(\mathcal{S})\) represent the additional data statistics that need to be stored (typically not the entire dataset). Given the output \(\Omega(\mathcal{S}) \in \mathcal{W}\) of the learning algorithm and a set of data deletion requests \(U\), we obtain the result of the unlearning algorithm \(\bar{\Omega}(\Omega(\mathcal{S}), \mathcal{T}(\mathcal{S}))\in \mathcal{W}\) and that of the retraining algorithm \(\bar{\Omega}(\Omega(\mathcal{S}\backslash U), \emptyset)\) through a removal mechanism \(\bar{\Omega}\). For \(0 <\epsilon \leq 1\) and \(\forall \; \mathcal{W} \subseteq \mathbb{R}^d, \mathcal{S} \subseteq \mathcal{Z}\), we say that the removal mechanism \(\bar{\Omega}\) satisfies \((\epsilon, \delta) \text{-unlearning}\) if \[\begin{align} & \operatorname{P}\big(\bar{\Omega}(\Omega(\mathcal{S}), \mathcal{T}(\mathcal{S})) \in \mathcal{W}\big) \leq e^{\epsilon} \operatorname{P}\big(\bar{\Omega}(\Omega(\mathcal{S}\backslash U), \emptyset) \in \mathcal{W}\big)+\delta,\;\text{and} \\ & \operatorname{P}\big(\bar{\Omega}(\Omega(\mathcal{S}\backslash U), \emptyset) \in \mathcal{W}\big) \leq e^{\epsilon} \operatorname{P}\big(\bar{\Omega}(\Omega(\mathcal{S}), \mathcal{T}(\mathcal{S})) \in \mathcal{W}\big) +\delta. \end{align}\]

3 Main Theoretical Results

Before detailing our analysis, let us informally motivate our main method. Specifically, we study the discrepancy between the learned and retrained models by analyzing the impact of each sample on the update trajectory. Our main observation is that the difference between the retrained and learned models, analyzed through the SGD recursion, can be described by an affine stochastic recursion.

Unlearning a single data sample. We start by designing an approximator for a single sample (i.e., \(U= \{u\}\)) based on the training updates (the learning update and the retraining update), considering the difference between the learning model in 3 and the retraining model obtained without the knowledge of \(u\) in 4 . In particular, we consider the following two cases of model updates in the \(e\)-th epoch.

Case 1 (\(u\) not in \(\mathcal{B}_{e,b}\)): The difference between the retraining and learning updates is \[\begin{align} \mathbf{w}_{e,b+1}^{-u} &-\mathbf{w}_{e,b+1} = \mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b} \\ &-\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|}\sum_{i \in \mathcal{B}_{e,b} } \Big(\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i) -\nabla \ell (\mathbf{w}_{e,b};z_i) \Big). \end{align} \label{3-1}\tag{6}\] From the Taylor expansion of \(\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i)\) around \(\mathbf{w}_{e,b}\), we have \(\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i) = \nabla \ell (\mathbf{w}_{e,b};z_i)+ \nabla^2 \ell (\mathbf{w}_{e,b};z_i) (\mathbf{w}_{e,b}^{-u} -\mathbf{w}_{e,b} )\) \(+ { o(\mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b} )}.\) Letting \(\mathbf{H}_{e,b} = \sum_{i \in \mathcal{B}_{e,b} } \nabla^2 \ell (\mathbf{w}_{e,b};z_i)\) denote the Hessian during the \(e\)-th epoch’s \(b\)-th update, we can then approximate the SGD recursion 6 as \[\mathbf{w}_{e,b+1}^{-u} -\mathbf{w}_{e,b+1} \approx \Big(\mathbf{I}- \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|}\mathbf{H}_{e,b}\Big)(\mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b}). \label{3-3}\tag{7}\] Case 2 (\(u\) in \(\mathcal{B}_{e, b(u)}\)): The difference in this case is \[\begin{align} & \mathbf{w}_{e,b(u)+1}^{-u}- \mathbf{w}_{e,b(u)+1}\approx \underbrace{ \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \nabla \ell (\mathbf{w}_{e,b(u)};u)}_{\text{Approximate impact of } u \text{ in epoch } e} \\ & + \underbrace{\Big(\mathbf{I} - \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \mathbf{H}_{e,b(u)}\Big)(\mathbf{w}_{e,b(u)}^{-u} - \mathbf{w}_{e,b(u)})}_{\text{Approximate impact of } u \text{ in previous epochs}}, \label{3-2} \end{align}\tag{8}\] where \(\mathbf{I}\) denotes the identity matrix of appropriate size. Combining the observations in Cases 1 and 2, we now define the recollection matrix \(\mathbf{M}_{e,b(u)}\), which backtracks the overall difference between the retrained and learned models: \[\mathbf{M}_{e,b(u)}= \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \prod^E_{k=e}\prod_{b=b(u)+1}^{B-1}\Big(\mathbf{I} - \frac{\eta_{k,b}}{|\mathcal{B}_{k,b}|} \mathbf{H}_{k,b}\Big), \label{Definition:M}\tag{9}\] and the affine stochastic recursion approximator \(\mathbf{a}_{E,B}^{-u}\) as \[\begin{align} \mathbf{a}_{E,B}^{-u}&:=\sum_{e=0}^E\mathbf{M}_{e,b(u)} \nabla \ell (\mathbf{w}_{e,b(u)};u ). \label{ASR} \end{align}\tag{10}\] We will see that \(\mathbf{a}_{E,B}^{-u}\) approximates the impact of sample \(u\) on the training trajectory: \[\begin{align} \mathbf{w}_{E,B}^{-u} &-\mathbf{w}_{E,B} \approx \mathbf{a}_{E,B}^{-u}. \end{align}\] That is, a simple vector addition to the model \(\mathbf{w}_{E,B}\) yields an approximate retrained model that forgets the sample \(u\). We define the approximation error \(\Delta_{E,B}^{-u}\) as \[\begin{align} \Delta^{-u}_{E,B}:= \mathbf{w}_{E,B}^{-u} -\mathbf{w}_{E,B} -\mathbf{a}_{E,B}^{-u}. \label{3-4} \end{align}\tag{11}\]

Our designed approximator in 10 enables us to exploit the Hessian vector product (HVP) technique [29] without explicitly computing the Hessian. To compute the product of the Hessian matrix and an arbitrary vector for a deep neural network, the HVP technique first obtains the gradients in one backpropagation pass, multiplies the gradients with the vector, and then performs backpropagation again. This results in a time complexity of \(\mathcal{O}(d)\). We note that the existing second-order methods require proactively storing a Hessian matrix [7], [8], [26]; HVP therefore cannot be deployed in their settings.
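As an illustration of this double-backpropagation trick, here is a minimal PyTorch sketch of an HVP; the `hvp` helper, model, and data are hypothetical stand-ins rather than the paper’s implementation.

```python
import torch

# Hessian-vector product via double backpropagation: backpropagate once to
# get gradients, take their inner product with v, then backpropagate again.
# The Hessian is never materialized, so each product costs O(d).
def hvp(loss, params, v):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    inner = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(inner, params)

model = torch.nn.Linear(784, 10)
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = torch.nn.functional.cross_entropy(model(x), y)
v = [torch.randn_like(p) for p in model.parameters()]
Hv = hvp(loss, tuple(model.parameters()), v)  # tensors shaped like the params
```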

In Section 4, we will analyze the approximation error based on the spectral radius \(\rho\) of the recollection matrix \(\mathbf{M}_{e,b(u)}\), which governs the difference \(\mathbf{w}_{E,B}^{-U}-\mathbf{w}_{E,B}\) between the retrained and the original learned model.

Unlearning a set of data samples. Furthermore, when it comes to forgetting a set of data samples, we have that

Theorem 1 (Additivity). When \(m\) sequential (streaming) deletion requests arrive, the sum of the \(m\) approximators is equivalent to performing the batch deletion simultaneously, i.e., for a sequence \(u_1, \ldots, u_m\) of continuously arriving deletion requests, we have \(\mathbf{a}_{E,B}^{-\sum_{j=1}^m u_j}=\sum_{j=1}^m \mathbf{a}_{E,B}^{- u_j}\).

This elucidates that by storing the impact of updates for each data point, nearly instantaneous data removal can be achieved through a series of simple vector additions.
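A minimal sketch of this deletion step, assuming the per-sample approximators have already been pre-computed and stored in a dictionary keyed by sample id (the flattened model `w` and all names are illustrative placeholders):

```python
import torch

# Sketch of Theorem 1 in action: streaming deletion requests are served by
# summing the pre-stored per-sample approximators, after which the
# corresponding statistics are themselves discarded.
def unlearn(w, approximators, forget_ids):
    for uid in forget_ids:
        w = w + approximators[uid]  # one vector addition per request
        del approximators[uid]      # also "forget the recollection" itself
    return w

d = 1000
w = torch.randn(d)
approximators = {uid: 0.01 * torch.randn(d) for uid in range(10)}
w_unlearned = unlearn(w, approximators, forget_ids=[3, 7])
```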

4 Proposed Method

In this section, we analyze the properties of the approximator \(\mathbf{a}_{E,B}^{-u}\) and the approximation error \(\Delta_{E,B}^{-u}\), followed by the introduction of our proposed online unlearning algorithm. We finally present a detailed convergence and complexity analysis of the proposed method. The detailed proofs can be found in Appendix 7.

4.1 Approximation Error Analysis

Gradient clipping is a technique widely adopted to achieve favorable performance in deep network training by ensuring that gradient norms are not excessively large [30]. We further show that gradient clipping with geometrically decaying stepsizes also allows us to derive an upper bound on the approximation error:

Lemma 1. In the learning stage, we threshold the stochastic gradient norm at \(C\), i.e., \(\Vert \nabla \ell (\mathbf{w};z) \Vert \leq C\). To simplify the exposition, we index the SGD updates by \(t \in \{0, \ldots, T\}\), where \(t= eB+b\) for \(\mathbf{w}_{e, b}\). Consider geometrically decaying stepsizes satisfying \(\eta_{t+1} = q\eta_{t}\), where \(0<q <1\) and \(\eta=\eta_0\). Then, after \(t\) steps, for any forgetting set \(U\), we have: \[\Vert \mathbf{w}_{e,b}^{-U}-\mathbf{w}_{e,b} \Vert \leq 2\eta C \frac{1-q^t }{1-q}.\]
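For concreteness, a minimal PyTorch sketch of this clipped-SGD setting, enforcing the threshold \(C\) with `torch.nn.utils.clip_grad_norm_` and decaying the stepsize geometrically; the model, data, and constants are illustrative placeholders:

```python
import torch

# Clipped SGD with geometrically decaying stepsizes (Lemma 1's setting):
# the stochastic gradient norm is thresholded at C, and eta_{t+1} = q * eta_t.
model = torch.nn.Linear(784, 10)
C, q, eta = 1.0, 0.99, 0.05

for t in range(100):
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    model.zero_grad()
    loss.backward()
    # Rescale gradients so their total norm is at most C.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=C)
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(eta * p.grad)
    eta *= q  # after t steps, eta_t = q^t * eta_0
```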

To further analyze the approximation error in 11 , we first introduce the following lemma:

Lemma 2. For all \(e,b\) and any vector \(\mathbf{v} \in \mathbb{R}^d\), assume there exist \(\lambda\) and \(M\) such that \(\lambda \mathbf{I} \preceq \nabla^2 \ell(\mathbf{w}_{e,b};z_i) \preceq M \mathbf{I}\), and let \(\rho=\max\{|1-\eta_{e,b}\lambda |,|1-\eta_{e,b}M| \}\) be the spectral radius of \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\). Then \(\Vert (\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}) \mathbf{v}\Vert \leq \rho \Vert \mathbf{v}\Vert.\)

Therefore, combining 10 , Lemma 1 and Lemma 2, we have the following approximation error bound:

Theorem 2. For any subset of samples \(U=\{u_j\}_{j=1}^m\) to be forgotten, the unlearned model based on the approximator \(\mathbf{a}_{E,B}^{-U}\) in 10 leads to an approximation error bounded by \[\begin{align} \Vert \Delta_{E,B}^{-U} \Vert = \Vert \mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B} - \mathbf{a}_{E,B}^{-U} \Vert \leq \frac{2 \eta^2 C }{1-q} \zeta_T^{-U} , \end{align}\] where \(T\) is the total number of update steps and we define \(\zeta_T^{-U} = \mathcal{O}(\frac{\rho^T- q^T}{\rho - q } +\frac{\rho^T- q^T}{\rho^B- q^B } - \frac{\rho^{T}- q^{2T}}{\rho - q^2 } )\).

Intuitively, Theorem 2 shows that, through the stochastic matrix recursions 7 and 8 , the approximation error terms \(o(\mathbf{w}_{e,b}^{-u} - \mathbf{w}_{e,b})\) of previous rounds are scaled by the subsequent matrices \(\mathbf{M}_{e,b}\), and each round generates a new remainder term, yielding a polynomial multiplied by a geometric progression. Specifically, if the stepsize is small enough that \(\rho <1\), the series converges, and as \(T \rightarrow \infty\), the approximation error approaches 0.
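As a quick numerical sanity check of this claim, the snippet below evaluates the dominant terms of \(\zeta_T^{-U}\) from Theorem 2 for illustrative constants \(\rho, q < 1\) and shows the bound decaying as \(T\) grows:

```python
# Illustrative constants only: rho, q < 1 and batch count B are placeholders.
rho, q, B = 0.95, 0.90, 10

for T in (10, 100, 1000):
    zeta = ((rho**T - q**T) / (rho - q)
            + (rho**T - q**T) / (rho**B - q**B)
            - (rho**T - q**(2 * T)) / (rho - q**2))
    print(T, zeta)  # decays toward 0 as T grows, matching Theorem 2
```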

4.2 Online Unlearning Algorithm Design

Based on the above methods and the analysis of Theorem 2, we propose a memory- and computation-efficient unlearning scheme in Algorithm 1. To the best of our knowledge, our approach is currently among the most efficient, as it only requires a single vector addition using offline-computed unlearning statistics. From a theoretical perspective, our designed method also has strong interpretability and theoretical performance guarantees.

We present our approach in Algorithm 1. In Stage 1, we bound the approximation error through gradient clipping, which also benefits the optimization process [31]. In Stage 2, we compute the unlearning statistics vector \(\mathbf{a}_{E,B}^{-u}\) for each sample \(u\), exploiting the HVP technique to reduce the computational complexity. A deletion request triggers Stage 3, in which a single vector addition with introduced noise ensures the indistinguishability of the \((\epsilon,\delta)\)-unlearning guarantee in Definition 1. Moreover, unlike previous works that require retaining privacy-sensitive statistics to handle subsequent forgetting requests, our approach eliminates the need to retain statistics post-unlearning. Finally, while our approach does not require accessing any dataset in Stage 3, we can enhance the robustness of our algorithm through a simple fine-tuning strategy if the remaining dataset is still accessible. Specifically, this involves training for a small number of epochs on the remaining dataset and adding the approximators computed during this period to the original approximators. The detailed repair strategy is deferred to Appendix 9.2.

Figure 1: Online Unlearning (OU) Algorithm
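To make Stages 1-2 concrete, here is a condensed, self-contained sketch of how the per-sample statistics could be maintained during training via the recursions 7 and 8 ; it tracks a single sample, omits clipping and the Stage-3 noise, and all names and hyperparameters are illustrative rather than the paper’s actual implementation.

```python
import torch

# Condensed sketch of Stages 1-2: maintain the statistic a^{-u} for a single
# tracked sample u (index 0) during SGD training, with one HVP per batch.
torch.manual_seed(0)
model = torch.nn.Linear(784, 10)
params = tuple(model.parameters())
X, Y = torch.randn(1024, 784), torch.randint(0, 10, (1024,))
eta, bsz = 0.05, 64
a_u = [torch.zeros_like(p) for p in params]  # approximator a^{-u}

for epoch in range(5):
    for s in range(0, len(X), bsz):
        xb, yb = X[s:s + bsz], Y[s:s + bsz]
        loss = torch.nn.functional.cross_entropy(model(xb), yb)  # mean loss
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Recursion 7: a_u <- (I - eta/|B| H_{e,b}) a_u, via one HVP;
        # the Hessian of the mean loss is H_{e,b}/|B|, never materialized.
        inner = sum((g * v).sum() for g, v in zip(grads, a_u))
        Hv = torch.autograd.grad(inner, params)
        a_u = [a - eta * hv for a, hv in zip(a_u, Hv)]
        if s == 0:  # recursion 8: u's own batch adds (eta/|B|) grad l(w; u)
            lu = torch.nn.functional.cross_entropy(model(X[:1]), Y[:1])
            gu = torch.autograd.grad(lu, params)
            a_u = [a + (eta / bsz) * g for a, g in zip(a_u, gu)]
        with torch.no_grad():  # Stage-1 SGD update, as in 3
            for p, g in zip(params, grads):
                p.sub_(eta * g)
```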

4.3 Convergence Analysis

To discuss the performance guarantee of our proposed method, we further analyze its generalization capability. We first give a convergence bound on the excess risk with respect to the population risk defined in 1 , and then provide an asymptotic analysis of our method.

We use the following assumptions, similar to [7], to establish the convergence bound:

Assumption 1. For any \(z\) and \(\mathbf{w}\), the loss \(\ell(\mathbf{w}, z)\) is \(\lambda\)-strongly convex, \(L\)-Lipschitz, and \(M\)-smooth.

Theorem 3. Let Assumption 1 hold and choose \(\eta \leq \frac{2}{M+\lambda}\); then \(\rho < 1\). Let \({\mathbf{w}}^*\) be the minimizer of the population risk in 1 . Considering the forgetting dataset \(U\) and the unlearned model \(\tilde{\mathbf{w}}_{E, B}^{-U}\), we have the following excess risk bound: \[\begin{align} &\mathbb{E}[F(\tilde{\mathbf{w}}_{E, B}^{-U} )] - F\left(\mathbf{w}^*\right) = \\ & \mathcal{O} \Big( \rho^T \frac{2ML}{\lambda} + \frac{4 L^2}{\lambda (n-m)} + \frac{\sqrt{\ln{(1/\delta)}}}{\epsilon} \frac{ 2 \sqrt{d}L \eta^2 C }{1-q}\zeta_T^{-U} \Big) . \end{align} \label{Equation:Convergence_analysis}\qquad{(1)}\]

Here, we perform an asymptotic analysis of Theorem 3. Specifically, the excess risk of our proposed method converges to \(\mathcal{O}(\frac{4L^2}{\lambda(n-m)})\) as \(T\) approaches infinity. It is noteworthy that, under these conditions and for \(n \gg m\), our proposed approximator outperforms previous methods [7], [8], whose bounds exceed \(\mathcal{O}( \frac{4L^2}{\lambda(n-m)})\). Furthermore, to measure the maximum number of samples that can be forgotten while still ensuring good excess risk guarantees, we analyze the deletion capacity of our method, a notion proposed by [8]. To save space, we defer this analysis to Appendix 9.1.

4.4 Comparison to Existing Approaches

We give a brief overview of previous methods that employ pre-storing-statistics strategies related to ours.
Newton Step (NS). [8] proposed an efficient forgetting method that pre-stores the Hessian \(\sum_{i=1}^n \nabla^2 \ell (\mathbf{w}_{E,B}; z_i)\). When a request arrives to forget \(m\) samples \(U\), an unlearned model \(\mathbf{w}^{-U}_{E,B}\) is obtained by adding the approximator \(\mathbf{a}^{-U}_{\text{NS}}\) to the learned model \(\mathbf{w}_{E,B}\): \[\begin{align} \mathbf{a}^{-U}_{\text{NS}} =& \frac{1}{n-m} \mathbf{H}_{\text{NS}}^{-1}\sum_{i \in U} \nabla \ell (\mathbf{w}_{E,B}; z_i),\\ \mathbf{H}_{\text{NS}} =& \frac{1}{n-m} \Big( \sum_{i=1}^n \nabla^2 \ell (\mathbf{w}_{E,B}; z_i)-\sum_{i \in U} \nabla^2 \ell (\mathbf{w}_{E,B}; z_i) \Big). \end{align}\] Infinitesimal Jackknife (IJ). [7] extends NS by employing the IJ; the main difference lies in pre-computing and storing the inverse Hessian \((\sum_{i=1}^n \nabla^2 \ell (\mathbf{w}_{E,B}; z_i))^{-1}\), thereby trading pre-computation time for reduced unlearning time. When a sample \(u\) requests to be forgotten, unlearning is achieved by adding the approximator \(\mathbf{a}^{-u}_{\text{IJ}}\) to the learned model \(\mathbf{w}_{E,B}\): \[\begin{align} \mathbf{a}^{-u}_{\text{IJ}} & = \frac{1}{n} \mathbf{H}_{\text{IJ}}^{-1} \nabla \ell (\mathbf{w}_{E,B}; u), \\ \mathbf{H}_{\text{IJ}} & =\frac{1}{n} \sum_{i=1}^n \nabla^2 \ell (\mathbf{w}_{E,B}; z_i). \end{align}\] Comparison of Pre-Computation. First, for a single data point \(u\), Algorithm 1 entails computing the HVP of the gradient vectors \(\nabla \ell (\mathbf{w}_{e,b};u)\) \(T\) times, which requires \(\mathcal{O}(Td)\) running time. To compute the approximators \(\mathbf{a}^{-u}_{\text{Ours}}\) for all \(n\) samples, these computation steps are performed \(n\) times, resulting in a total running time of \(\mathcal{O}(nTd)\). Our method is highly efficient when dealing with overparameterized deep models, especially when \(d \gg n\): computing the full Hessian \(\sum_{i=1}^n \nabla^2 \ell (\mathbf{w}_{E,B}; z_i)\) for NS requires \(\mathcal{O}(nd^2)\) running time, and computing and inverting the full Hessian \(\mathbf{H}^{-1}_{\text{IJ}}\) for IJ requires \(\mathcal{O}(d^3+nd^2)\) running time.
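For contrast with our HVP-based pre-computation, the following toy sketch forms the IJ approximator with an explicitly materialized Hessian, making the \(\mathcal{O}(nd^2)\) construction and \(\mathcal{O}(d^3)\) inversion costs visible; the quadratic loss and all dimensions are illustrative assumptions, not the baselines’ actual code.

```python
import torch
from torch.autograd.functional import hessian

# Toy IJ baseline: a_IJ = (1/n) H_IJ^{-1} grad l(w; u) with an explicit
# d x d Hessian. Forming H costs O(nd^2) and inverting it O(d^3), which is
# exactly what the Hessian-free method avoids. Loss and data are toys.
d, n = 50, 200
w = torch.randn(d, requires_grad=True)
Z = torch.randn(n, d)

def empirical_risk(w):
    # (1/n) sum_i 0.5 (z_i^T w)^2 plus a tiny L2 term for strong convexity
    return 0.5 * ((Z @ w) ** 2).mean() + 1e-6 * w.dot(w)

H = hessian(empirical_risk, w)                 # H_IJ, O(nd^2) to form
H_inv = torch.linalg.inv(H)                    # O(d^3) to invert
grad_u = torch.autograd.grad(0.5 * (Z[0] @ w) ** 2, w)[0]
a_ij = H_inv @ grad_u / n                      # IJ unlearning statistic
```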

Comparison of Storage. Algorithm 1 involves storing one vector of model-parameter size for each of the \(n\) data points, requiring \(\mathcal{O}(nd)\) storage. This is significantly lower than the overhead introduced by storing matrices for NS and IJ, which is \(\mathcal{O}(d^2)\), whenever the dataset size \(n\) is smaller than the model dimension \(d\).

Comparison of Unlearning Computation. When a forgetting request with \(m\) data points arrives, Algorithm 1 utilizes the precomputed and prestored approximators to achieve forgetting, requiring simple vector additions that take \(\mathcal{O}(md)\) time. NS requires computing and inverting a different Hessian that depends on the user requesting deletion, which incurs a substantial computational overhead of \(\mathcal{O}(d^3+md^2+md)\) each time a deletion request arrives. Although IJ reduces this to \(\mathcal{O}(md^2+md)\), it is still not as efficient as our method.

5 Experiments

We conduct experiments to answer these questions:

  • RQ1: Approximation Errors. How accurately does the proposed approximator in 10 (Section 3) capture the disparity between the retrained and learned models under both convex and non-convex settings?

  • RQ2: Model Similarity. Does the unlearned model in Algorithm 1 (Section 4) perform similarly to the retrained model?

  • RQ3: Practical Applications. How do these methods perform across different models and datasets?

To answer the three research questions, we devise two experimental components: verification and application. Verification experiments center on evaluating the disparity between the unlearned and retrained models across both convex and non-convex settings. Application experiments evaluate the performance of different unlearning algorithms, with a specific focus on their effectiveness, convergence, pre-computation, and unlearning runtime. Due to the space limit, we defer additional results to Appendix 8 and focus here on comparative experiments.

5.1 Verification Experiments

Our primary objective is to validate the difference between the proposed approximator \(\mathbf{a}^{-U}_{\text{Ours}}\) in 10 and the actual \(\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B}\). We approach the evaluation from two perspectives: distance and correlation. Specifically, we use the \(L_2\) norm \(\|\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B} - \mathbf{a}^{-U}_{\text{Ours}}\|\) to measure the approximation error; a smaller distance indicates a closer approximation of the unlearned model to the retrained model. Additionally, we utilize the Pearson [33] and Spearman [34] coefficients to assess the correlation between \(\mathbf{a}^{-U}_{\text{Ours}}\) and \(\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B}\) by mapping them from the high-dimensional parameter space to scalar loss values, i.e., the approximate loss change \(\ell (\mathbf{w}_{E, B} + \mathbf{a}^{-U}_{\text{Ours}}; U) - \ell (\mathbf{w}_{E, B} ; U)\) and the actual loss change \(\ell (\mathbf{w}^{-U}_{E, B} ; U) - \ell (\mathbf{w}_{E, B} ; U)\) on the forgetting dataset \(U\). Intuitively, the resulting correlation scores, ranging from -1 to 1, indicate how closely our method’s unlearned model matches the retrained model on the forgetting dataset \(U\).
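A short sketch of how these two metrics could be computed with NumPy/SciPy; the arrays stand in for actual model parameters and per-seed loss changes and are illustrative only:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Distance metric: L2 norm of the approximation error between the shifted
# learned model and the retrained model (placeholder vectors of size d).
w_retrain, w_learn, a_ours = np.random.randn(3, 7850) * 0.1
distance = np.linalg.norm(w_retrain - w_learn - a_ours)  # smaller is better

# Correlation metrics: approximate vs. actual loss change on the forgetting
# set U, one scalar per random seed (placeholder values).
approx_change = np.random.randn(7)                         # l(w + a; U) - l(w; U)
actual_change = approx_change + 0.05 * np.random.randn(7)  # l(w^{-U}; U) - l(w; U)
print(distance, pearsonr(approx_change, actual_change)[0],
      spearmanr(approx_change, actual_change)[0])
```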

Figure 2: Verification experiments. (a) and (b) represent evaluations on LR and CNN, respectively. The left of (a)/(b) shows the distance to the retrained model, i.e., \(\|\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B} - \mathbf{a}^{-U}\|\), and the right of (a)/(b) shows the correlation between the approximate loss change and the actual change on the forgetting data. The proposed method is better aligned and more strongly correlated with the actual values than the previous methods.

Configurations: We conduct experiments in both convex and non-convex scenarios. Specifically, we train a multinomial logistic regression (LR) model with \(d=7850\) total parameters and a simple convolutional neural network (CNN) with \(d=21840\) total parameters on the MNIST dataset with 1,000 samples for handwritten digit classification. We apply a cross-entropy loss with an \(L_2\) regularization coefficient of \(10^{-6}\) to ensure that the loss function of LR is strongly convex. LR is trained for 100 epochs with a stepsize of 0.05; the CNN is trained for 20 epochs with a stepsize of 0.05 and a batch size of 64. Given these configurations, we separately assess the distance and correlation for the approximators \(\mathbf{a}^{-U}_{\text{Ours}}\), \(\mathbf{a}^{-U}_{\text{NS}}\), and \(\mathbf{a}^{-U}_{\text{IJ}}\) at deletion rates in {1%, 5%, 10%, 15%, 20%, 25%, 30%}. Following the suggestion in [35], a damping factor of 0.01 is added to the Hessian matrix to ensure its invertibility when implementing NS and IJ. In addition, for clear visualization, we use the minimum and maximum values as error bars.

Figure 3: Verification experiments. In the non-convex setting, comparison between the norm of the approximate parameter change \(\|\mathbf{a}^{-U}\|\) and the norm of the exact parameter change \(\|\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B}\|\) across different random seeds. The NS and IJ methods are contingent on the selection of the forgetting data, whereas our approach consistently approximates the actual values.

[ht]
\centering
\caption{\textbf{Application experiments.} An online scenario with 20\% of the data to be forgotten, where each unlearning request forgets a single data point. The proposed algorithm outperforms the benchmarks on all metrics.}
\label{table:application}
\begin{tabular}{cccccccc}
\toprule
Algorithm & Dataset & Model & \multicolumn{2}{c}{Unlearning} & Pre-Computation & Storage & Test Accuracy (\%) \\
 & & & Runtime (Sec) & Speedup & Runtime (Sec) & (GB) & Unlearned model \\
\midrule
NS & MNIST & Logistic & $5.14\times 10^{2}$ & $2.39\times$ & $2.57\times 10^{3}$ & 0.23 & 87.50 (-0.75) \\
 & & CNN & $5.82\times 10^{3}$ & $0.05\times$ & $2.91\times 10^{4}$ & 1.78 & 83.50 (-10.25) \\
\cmidrule{2-3}
 & FMNIST & CNN & $2.31\times 10^{4}$ & $0.54\times$ & $1.16\times 10^{5}$ & 1.78 & 57.69 (-22.25) \\
 & & LeNet & $8.53\times 10^{4}$ & $0.15\times$ & $4.27\times 10^{5}$ & 14.18 & 62.00 (-18.50) \\
\midrule
IJ & MNIST & Logistic & $1.84\times 10^{0}$ & $665\times$ & $2.57\times 10^{3}$ & 0.25 & 87.50 (-0.75) \\
 & & CNN & $6.81\times 10^{0}$ & $45\times$ & $2.91\times 10^{4}$ & 1.78 & 82.75 (-11.00) \\
\cmidrule{2-3}
 & FMNIST & CNN & $2.95\times 10^{1}$ & $424\times$ & $1.16\times 10^{5}$ & 1.78 & 75.69 (-4.25) \\
 & & LeNet & $3.24\times 10^{1}$ & $402\times$ & $4.27\times 10^{5}$ & 14.18 & 76.75 (-3.75) \\
\midrule
Proposed & MNIST & Logistic & 2.74 & 448,060$\times$ & 1.95 & 0.03 & 87.75 (-0.50) \\
 & & CNN & 3.23 & 9,940$\times$ & 5.34 & 0.08 & 91.50 (-2.25) \\
\cmidrule{2-3}
 & FMNIST & CNN & 1.27 & 98,468$\times$ & 3.66 & 0.32 & 77.85 (-2.09) \\
 & & LeNet & 1.63 & 79,828$\times$ & 4.43 & 0.48 & 78.63 (-1.87) \\
\bottomrule
\end{tabular}

In the convex setting, as shown in Fig. 2 (a), our approach surpasses previous works. First, by the distance evaluation, the approximation error \(\|\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B} - \mathbf{a}^{-U}\|\) of the proposed method is lower than that of the previous NS and IJ works. At a deletion rate of 30%, our approximation error is 0.171638, slightly lower than those of IJ and NS, which have errors of 0.178244 and 0.178246, respectively. Although the results are close in the convex scenario, our method does not require the strong assumptions employed by IJ and NS; i.e., when a model has not fully converged to the optimum, the Hessian computed by IJ and NS may have negative eigenvalues and thus be non-invertible. In contrast, our approach involves fewer assumptions and incurs less approximation error in the convex non-convergent setting. Second, in assessing the loss change on the forgotten dataset \(U\), our proposed method more accurately captures the actual change \(\ell (\mathbf{w}^{-U}_{E, B} ; U)-\ell (\mathbf{w}_{E, B} ; U)\). In particular, when removing 30% of the samples, the proposed method maintains a high correlation, with Pearson and Spearman coefficients of 0.96 and 0.95, respectively. We conducted all experiments with 7 random seeds and report averaged results.

In the non-convex setting, as illustrated in Fig. 2 (b), our approximator demonstrates superior performance, exhibiting less dependency on random variations than NS and IJ. The outcomes for NS and IJ are consistent with prior research [35], [36]: these methods display substantially random behavior, resulting in a non-significant correlation with retraining across different deletion rates. When 30% of the samples are removed, our proposed method maintains a lower approximation error of 0.96, surpassing the performance of IJ and NS. In contrast to IJ and NS, which are far from the actual values in non-convex settings, our method achieves Spearman and Pearson correlation coefficients of 0.80 and 0.74, respectively, accurately capturing the actual loss changes. Additionally, in Fig. 3, we provide intuition for the failure of prior works in non-convex settings. Specifically, we investigate the norm of the approximate parameter change \(\|\mathbf{a}^{-U}\|\) and the norm of the exact parameter change \(\|\mathbf{w}_{E,B}^{-U} - \mathbf{w}_{E,B}\|\) across different selections of the forgetting data. We observe that NS and IJ depend on the selection of the forgetting data: they approximate the actual changes well for some data points while generating huge approximation errors for others, as illustrated in Fig. 3. This dependency on the data selection leads to the failures in Fig. 2. In contrast, our approach is not impacted by the selection of the forgetting data and effectively approximates the actual norm of the parameter changes.

5.2 Application Experiments

Furthermore, we evaluate the performance in real-world applications. Our simulations are conducted from two perspectives: runtime and utility. Specifically, runtime focuses on the time spent precomputing unlearning statistics and the speedup of the unlearning algorithm relative to retraining. We evaluate utility by the test accuracy of the unlearned model to ensure that generalization performance is not compromised; the values in parentheses represent the difference in test accuracy from the retrained model.

Configurations: We evaluate our approach on different datasets with 20% of the data to be forgotten: (1) we train an LR model and a simple CNN on MNIST with setups identical to the aforementioned experiments. We further run simulations on FMNIST with 4,000 samples using the CNN and LeNet (61,706 parameters in total), trained for 30 epochs with a stepsize of 0.5 and a batch size of 256. (2) Moreover, we assess our method on a non-pretrained ResNet18 [37] with 1,000 samples from the CelebA dataset [38] for gender prediction, trained for 10 epochs with a stepsize of 0.05 and a batch size of 32. In this scenario, we do not evaluate NS and IJ due to out-of-memory errors when storing the Hessian.

The application experiment results show that our proposed method is computationally efficient and performs well in realistic environments. Table 1 demonstrates that, when each forgetting request contains only one data point and a total of 20% of the data is forgotten, the proposed algorithm outperforms the other baselines by a significant margin. Specifically, our approach achieves small computational and storage overheads while maintaining high performance in both convex and non-convex settings. In Fig. 4, we explore batch-wise processing of 200 data points for forgetting, achieving highly efficient millisecond-level forgetting compared to retraining ResNet18 on CelebA. It is important to highlight the challenges associated with evaluating the prior studies NS and IJ, primarily due to their high complexity requirements and more restrictive assumptions. Hence, to further evaluate our method on larger models and datasets, we instead use baselines lacking theoretical indistinguishability guarantees, described and set up similarly to [25]; we defer this evaluation to Appendix 8.2.

Figure 4: Application experiments. ResNet-18 on the CelebA dataset, featuring a total of 11,226,690 parameters. Our approach exhibits great potential for over-parameterized models, with millisecond-level unlearning and minimal performance degradation.

6 Conclusion

In this work, we proposed a novel online machine unlearning method that achieves low-complexity and near-instantaneous unlearning. Specifically, we introduced an approximator based on affine stochastic recursion to characterize the trajectory discrepancy between the learned and retrained models for each forgotten sample. Based on this approximator, we proposed a near-instantaneous Hessian-free algorithm, employing the HVP and gradient clipping techniques to reduce the computation time and bound the approximation errors, respectively. We theoretically showed that our proposed method reduces the pre-computation time, storage, and unlearning time (per sample) to \(\mathcal{O}(nd)\), \(\mathcal{O}(nd)\), and \(\mathcal{O}(d)\), respectively. Experimental results validate that our proposed method benefits machine unlearning with reduced computation and storage costs and improved accuracy.

Impact Statements

This paper introduces a Hessian-free online unlearning method that maintains a statistical vector for each data sample, computed through an affine stochastic recursion approximation of the difference between the retrained and learned models. We have shown that our approach outperforms existing results in terms of time/memory costs, while also enhancing accuracy. While unlearning algorithms were initially proposed to assist organizations in complying with data protection regulations (e.g., GDPR) and to help build trust in AI systems by demonstrating the ability to correct mistakes or biases, it is important to be mindful of potential negative consequences: unlearning algorithms can introduce unforeseen biases or inaccuracies into subsequent predictions or decisions, and adversarial actors could potentially exploit machine unlearning processes to manipulate or subvert models.

7 Proofs

7.1 Detailed Proof of Eq. 10

Here, we provide a more detailed proof and explanation of 10 . Specifically, we first consider the impact of sample \(u\) on the \(e\)-th epoch and then extend the analysis to all epochs.

Proof of 10 . Since each sample participates in updates only once per epoch, without loss of generality, for the \(e\)-th epoch we define the sample \(u\) to be involved in the \(b(u)\)-th update. Therefore, for the learning model in 3 and the retraining model obtained without the knowledge of \(u\) in 4 , we have \[\begin{align} \mathbf{w}_{e,b+1}^{-u} &\leftarrow \mathbf{w}_{e,b}^{-u}-\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b} } \nabla \ell \left(\mathbf{w}_{e,b}^{-u};z_i\right)\\ \mathbf{w}_{e, b+1} &\leftarrow \mathbf{w}_{e, b}-\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b}} \nabla \ell \left(\mathbf{w}_{e, b};z_i\right)\\ \mathbf{w}_{e,b+1}^{-u} -\mathbf{w}_{e, b+1} &\leftarrow \mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e, b}- \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b}} \left( \nabla \ell (\mathbf{w}_{e,b}^{-u};z_i) - \nabla \ell \left(\mathbf{w}_{e, b};z_i\right) \right) \end{align} \label{Equation:proof_theorem-1}\tag{12}\] In particular, when the update includes sample \(u\) in the \(e\)-th epoch, consider the linear scaling rule proposed in [27], i.e., the retraining stepsize \(\hat{\eta}_{e,b(u)} = \eta_{e,b(u)} \frac{| \mathcal{B}_{e,b(u)} \backslash \{u\}|}{|\mathcal{B}_{e,b(u)}|}\); we then have \[\begin{align} \mathbf{w}_{e,b(u)+1}^{-u} &\leftarrow \mathbf{w}_{e,b(u)}^{-u}-\frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \sum_{i \in \mathcal{B}_{e,b(u)} \backslash \{u\} } \nabla \ell (\mathbf{w}_{e,b(u)}^{-u};z_i) \\ \mathbf{w}_{e,b(u)+1} &\leftarrow \mathbf{w}_{e,b(u)}-\frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \sum_{i \in \mathcal{B}_{e,b(u)} } \nabla \ell \left(\mathbf{w}_{e,b(u)};z_i\right)\\ \mathbf{w}_{e,b(u)+1}^{-u} -\mathbf{w}_{e, b(u)+1} &\leftarrow \mathbf{w}_{e,b(u)}^{-u}-\mathbf{w}_{e, b(u)}- \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \sum_{i \in \mathcal{B}_{e,b(u)} \backslash \{u\}} \left( \nabla \ell (\mathbf{w}_{e,b(u)}^{-u};z_i) - \nabla \ell (\mathbf{w}_{e, b(u)};z_i) \right) + \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \nabla \ell (\mathbf{w}_{e, b(u)};u) \end{align} \label{Equation:proof_theorem-2}\tag{13}\] Consider the Taylor expansion of \(\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i)\) around \(\mathbf{w}_{e,b}\): \[\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i) = \nabla \ell (\mathbf{w}_{e,b};z_i)+ \nabla^2 \ell (\mathbf{w}_{e,b};z_i) (\mathbf{w}_{e,b}^{-u} - \mathbf{w}_{e,b} ) + { o(\mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b} )}.\] Therefore, for 12 and 13 , we can approximate the SGD recursion as \[\begin{align} \mathbf{w}_{e,b+1}^{-u} -\mathbf{w}_{e,b+1} &\approx (\mathbf{I}- \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|}\mathbf{H}_{e,b})(\mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b}),\\ \mathbf{w}_{e,b(u)+1}^{-u} - \mathbf{w}_{e,b(u)+1} &\approx \; \underbrace{(\mathbf{I} - \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \mathbf{H}_{e,b(u)})(\mathbf{w}_{e,b(u)}^{-u} - \mathbf{w}_{e,b(u)})}_{\text{Impact of } u \text{ in previous epochs}} + \underbrace{ \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \nabla \ell (\mathbf{w}_{e,b(u)};u).}_{\text{Impact of } u \text{ in the } e\text{-th epoch}} \end{align}\] Therefore, in the \(E\)-th epoch, the difference between the retraining and the learning model is approximately \[\mathbf{w}_{E,B}^{-u} -\mathbf{w}_{E,B} \approx \prod_{b=0}^{B-1}(\mathbf{I}- \frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b})(\mathbf{w}_{E,0}^{-u}-\mathbf{w}_{E,0})+ \prod_{b=b(u)+1}^{B-1}(\mathbf{I}- \frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b}) \frac{\eta_{E,b(u)}}{|\mathcal{B}_{E,b(u)}|} \nabla \ell (\mathbf{w}_{E,b(u)};u).\] 
We have thus obtained the difference between the retrained and learned models in the \(E\)-th epoch. It is worth noting that the second term represents the influence of sample \(u\) on the training trajectory in the \(E\)-th epoch and can be computed entirely. In the first term, \(\mathbf{w}_{E,0}^{-u}-\mathbf{w}_{E,0}=\mathbf{w}_{E-1,B}^{-u}-\mathbf{w}_{E-1,B}\) is the result of the \((E-1)\)-th epoch, for which we have a similar expression. Applying this recursively completes the proof, and we ultimately obtain the approximator for \(u\): \[\mathbf{w}_{E,B}^{-u} -\mathbf{w}_{E,B} \approx \sum_{e=0}^E\mathbf{M}_{e,b(u)} \nabla \ell (\mathbf{w}_{e,b(u)};u) := \mathbf{a}_{E,B}^{-u},\] where \(\mathbf{M}_{e,b(u)} = \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \prod^E_{k=e}\prod_{b=b(u)+1}^{B-1}(\mathbf{I} - \frac{\eta_{k,b}}{|\mathcal{B}_{k,b}|} \mathbf{H}_{k,b})\). ◻

7.2 Detailed Proof of Theorem 1

Now, we demonstrate that in the case of continuously arriving deletion requests, our algorithm is capable of removing samples in a streaming fashion. Intuitively, this property follows mainly from Taylor’s linear expansion. For ease of exposition, we restrict the proof to a scenario with only two samples, \(u_1\) and \(u_2\). That is, our goal is to prove that when \(u_1\) initiates a deletion request and the algorithm is executed, the arrival of a deletion request from \(u_2\) after \(u_1\) (streaming deletion) is fully equivalent to the simultaneous execution of the deletion requests from both \(u_1\) and \(u_2\) (batch deletion). Without loss of generality, we assume that \(b(u_1) < b(u_2)\). Based on 10 , the approximators for \(u_1\) and \(u_2\) are denoted as \(\mathbf{a}_{E,B}^{-u_1}\) and \(\mathbf{a}_{E,B}^{-u_2}\), respectively: \[\begin{align} \mathbf{a}_{E,B}^{-u_1} &= \sum_{e=0}^E\mathbf{M}_{e,b(u_1)} \nabla \ell (\mathbf{w}_{e,b(u_1)};u_1)\\ \mathbf{a}_{E,B}^{-u_2} &= \sum_{e=0}^E\mathbf{M}_{e,b(u_2)} \nabla \ell (\mathbf{w}_{e,b(u_2)};u_2) \label{Equation:Proof_of_Corollary-1} \end{align}\tag{14}\] When we simultaneously delete \(u_1\) and \(u_2\), the difference between the retrained and learned models evolves as \[\begin{align} \mathbf{w}_{e,b+1}^{-u_2-u_1} -\mathbf{w}_{e, b+1} &\leftarrow \mathbf{w}_{e,b}^{-u_2-u_1}-\mathbf{w}_{e, b}- \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b}} \left( \nabla \ell \left(\mathbf{w}_{e,b}^{-u_2-u_1};z_i\right) - \nabla \ell (\mathbf{w}_{e, b};z_i) \right)\\ \mathbf{w}_{e,b(u_2)+1}^{-u_2-u_1} -\mathbf{w}_{e, b(u_2)+1} &\leftarrow \mathbf{w}_{e,b(u_2)}^{-u_2-u_1}-\mathbf{w}_{e, b(u_2)}- \frac{\eta_{e,b(u_2)}}{|\mathcal{B}_{e,b(u_2)}|} \sum_{i \in \mathcal{B}_{e,b(u_2)}} \left( \nabla \ell (\mathbf{w}_{e,b(u_2)}^{-u_2-u_1};z_i) - \nabla \ell (\mathbf{w}_{e, b(u_2)};z_i) \right)\\ \mathbf{w}_{e,b(u_1)+1}^{-u_2-u_1} -\mathbf{w}_{e, b(u_1)+1} &\leftarrow \mathbf{w}_{e,b(u_1)}^{-u_2-u_1}-\mathbf{w}_{e, b(u_1)}- \frac{\eta_{e,b(u_1)}}{|\mathcal{B}_{e,b(u_1)}|} \sum_{i \in \mathcal{B}_{e,b(u_1)}} \left( \nabla \ell (\mathbf{w}_{e,b(u_1)}^{-u_2-u_1};z_i) - \nabla \ell (\mathbf{w}_{e, b(u_1)};z_i) \right). \end{align}\] Consider the Taylor expansion. For the \(E\)-th epoch, we have \[\begin{align} \mathbf{w}_{E,b+1}^{-u_2-u_1} -\mathbf{w}_{E,b+1} &\approx (\mathbf{I}- \frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b})(\mathbf{w}_{E,b}^{-u_2-u_1}-\mathbf{w}_{E,b})\\ \mathbf{w}_{E,b(u_2)+1}^{-u_2-u_1} - \mathbf{w}_{E,b(u_2)+1} &\approx \; (\mathbf{I} - \frac{\eta_{E,b(u_2)}}{|\mathcal{B}_{E,b(u_2)}|} \mathbf{H}_{E,b(u_2)})(\mathbf{w}_{E,b(u_2)}^{-u_2-u_1} - \mathbf{w}_{E,b(u_2)}) + \underbrace{\frac{\eta_{E,b(u_2)}}{|\mathcal{B}_{E,b(u_2)}|} \nabla \ell (\mathbf{w}_{E,b(u_2)};u_2)}_{\text{Impact of } u_2 \text{ in the } E\text{-th epoch}}\\ \mathbf{w}_{E,b(u_1)+1}^{-u_2-u_1} - \mathbf{w}_{E,b(u_1)+1} &\approx \; (\mathbf{I} - \frac{\eta_{E,b(u_1)}}{|\mathcal{B}_{E,b(u_1)}|} \mathbf{H}_{E,b(u_1)})(\mathbf{w}_{E,b(u_1)}^{-u_2-u_1} - \mathbf{w}_{E,b(u_1)}) + \underbrace{\frac{\eta_{E,b(u_1)}}{|\mathcal{B}_{E,b(u_1)}|} \nabla \ell (\mathbf{w}_{E,b(u_1)};u_1)}_{\text{Impact of } u_1 \text{ in the } E\text{-th epoch}}. \end{align}\] Therefore, in the \(E\)-th epoch, we have the following affine stochastic recursion: \[\begin{align} \mathbf{w}_{E,B}^{-u_2-u_1}& -\mathbf{w}_{E,B} \approx \prod_{b=0}^{B-1}(\mathbf{I}- \frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b})(\mathbf{w}_{E,0}^{-u_2-u_1}-\mathbf{w}_{E,0})\\&+ \prod_{b=b(u_1)+1}^{B-1}(\mathbf{I}-\frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b}) \frac{\eta_{E,b(u_1)}}{|\mathcal{B}_{E,b(u_1)}|} \nabla \ell (\mathbf{w}_{E,b(u_1)};u_1)+ \prod_{b=b(u_2)+1}^{B-1}(\mathbf{I}- \frac{\eta_{E,b}}{|\mathcal{B}_{E,b}|}\mathbf{H}_{E,b}) \frac{\eta_{E,b(u_2)}}{|\mathcal{B}_{E,b(u_2)}|} \nabla \ell (\mathbf{w}_{E,b(u_2)};u_2). \end{align}\] Applying this recursively, we can simultaneously compute the impact of \(u_1\) and \(u_2\) on training for all epochs. Since the initial models are identical, the recursion stops at \(e=0, b=b(u_1)\), and we eventually obtain \[\mathbf{w}_{E,B}^{-u_2-u_1} -\mathbf{w}_{E,B} \approx \sum_{e=0}^E\mathbf{M}_{e,b(u_1)} \nabla \ell (\mathbf{w}_{e,b(u_1)};u_1)+\sum_{e=0}^E\mathbf{M}_{e,b(u_2)} \nabla \ell (\mathbf{w}_{e,b(u_2)};u_2) := \mathbf{a}_{E,B}^{-u_2-u_1}. \label{Equation:Proof_of_Corollary-2}\tag{15}\] According to 14 and 15 , we have completed the proof that \(\mathbf{a}_{E,B}^{-u_2-u_1}= \mathbf{a}_{E,B}^{-u_1}+ \mathbf{a}_{E,B}^{-u_2}\).

7.3 Detailed Proof of Lemma 1

Proof of Lemma 1. We now show that the difference between the retrained and learned models in the \(e\)-th epoch’s \(b\)-th round is constrained by gradient clipping. Specifically, recalling 4 and 5 , we have \[\begin{align} \mathbf{w}_{e,b+1}^{-u}& -\mathbf{w}_{e,b+1} = \mathbf{w}_{e,b}^{-u}-\mathbf{w}_{e,b} -\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \sum_{i \in \mathcal{B}_{e,b} } \left(\nabla \ell (\mathbf{w}_{e,b}^{-u};z_i)-\nabla \ell (\mathbf{w}_{e,b};z_i) \right), \\ \mathbf{w}_{e,b(u)+1}^{-u} & - \mathbf{w}_{e, b(u)+1} \leftarrow \mathbf{w}_{e,b(u)}^{-u} -\mathbf{w}_{e, b(u)} - \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \sum_{i \in \mathcal{B}_{e,b(u)} \backslash \{u\} } \left( \nabla \ell (\mathbf{w}_{e,b(u)}^{-u};z_i) - \nabla \ell (\mathbf{w}_{e, b(u)};z_i) \right) + \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|}\nabla \ell (\mathbf{w}_{e,b(u)};u) . \end{align}\] In model training, we threshold the stochastic gradient norm \(\Vert \nabla \ell (\mathbf{w}_{e,b};u) \Vert\) at \(C\) when clipping is applied, so that \[\begin{align} \Vert \mathbf{w}_{e,b}^{-u} -\mathbf{w}_{e,b} \Vert & \leq \Vert \mathbf{w}_{e,b-1}^{-u} -\mathbf{w}_{e,b-1} \Vert + 2 \eta_{e,b-1} C. \end{align}\] Consider a simple stepsize decay strategy \(\eta_{e,b}=q\eta_{e,b-1}\), where \(0<q <1\) is the decay rate, and let \(\eta\) be the initial stepsize. Recalling that \(t=eB+b\), we then have \[\Vert \mathbf{w}_{e,b}^{-u} -\mathbf{w}_{e,b} \Vert \leq \Vert \mathbf{w}_{e,b-1}^{-u} -\mathbf{w}_{e,b-1} \Vert + 2\eta C q^{t-1}.\] Noting that the initial retrained and learned models are identical, the above recursion yields the geometric series \[\begin{align} \Vert \mathbf{w}_{e,b}^{-u} -\mathbf{w}_{e,b} \Vert &\leq 2\eta C q^{t-1}+ 2\eta C q^{t-2}+...+ 2\eta C q^{0} \\ &=2\eta C \sum_{k=0}^{t-1} q^k\\ &=2\eta C \frac{1-q^{t}}{1-q}. \end{align}\] ◻

7.4 Detailed Proof of Theorem 2

Proof of Theorem 2. Let us recall the approximator 10 in Section 3: \[\mathbf{a}_{E,B}^{-u} = \sum_{e=0}^E\mathbf{M}_{e,b(u)} \nabla \ell (\mathbf{w}_{e,b(u)};u), \; \text{where}\; \mathbf{M}_{e,b(u)}= \frac{\eta_{e,b(u)}}{|\mathcal{B}_{e,b(u)}|} \prod^E_{k=e}\prod_{b=b(u)+1}^{B-1}\Big(\mathbf{I} - \frac{\eta_{k,b}}{|\mathcal{B}_{k,b}|} \mathbf{H}_{k,b}\Big).\] For the \(E\)-th epoch’s \(B\)-th update, we have the following Taylor approximation: \[\mathbf{w}_{E,B}^{-u}-\mathbf{w}_{E,B}-\mathbf{a}_{E,B}^{-u} = (\mathbf{I} - \frac{\eta_{E,B-1}}{|\mathcal{B}_{E,B-1}|} \mathbf{H}_{E,B-1})( \mathbf{w}_{E,B-1}^{-u}-\mathbf{w}_{E,B-1}) +\eta_{E,B-1}o(\mathbf{w}_{E,B-1}^{-u}-\mathbf{w}_{E,B-1} ) -\mathbf{a}_{E,B}^{-u}.\] For convenience, and with a slight abuse of the notation in 11 , we here let \({\Delta^{-u}_{E,B}} = \mathbf{w}_{E,B}^{-u} - \mathbf{w}_{E,B}\) denote the model difference only, and define \(\hat{\mathbf{H}}_{E,B-1} =\mathbf{I} - \frac{\eta_{E,B-1}}{|\mathcal{B}_{E,B-1}|} \mathbf{H}_{E,B-1}\). We begin by analyzing the error generated by each update: \[\begin{align} \Delta^{-u}_{E,B}-\mathbf{a}_{E,B}^{-u} & = \hat{\mathbf{H}}_{E,B-1} \Delta^{-u}_{E,B-1}+\eta_{E,B-1}o(\Delta^{-u}_{E,B-1} ) -\mathbf{a}_{E,B}^{-u} \end{align} \label{Equation:proof_of_theorem-3-1}\tag{16}\] Furthermore, we observe that in the \(E\)-th epoch, 16 can be unrolled through the affine stochastic recursion as follows: \[\begin{align} \Vert \Delta^{-u}_{E,B}-\mathbf{a}_{E,B}^{-u} \Vert \leq & \Vert \prod_{b=0}^{B-1} \hat{\mathbf{H}}_{E,b} \Delta^{-u}_{E,0} -\mathbf{a}_{E-1,B}^{-u}\Vert + \frac{\eta q^{T-B+b(u)}}{|\mathcal{B}_{E,b(u)}|} \Vert \prod_{b=b(u)+1}^{B-1} \hat{\mathbf{H}}_{E,b} ( \nabla \ell (\mathbf{w}^{-u}_{E,b(u)};u)-\nabla \ell (\mathbf{w}_{E,b(u)};u)) \Vert \\ & + \underbrace{\eta q^{T-1} \Vert \Delta^{-u}_{E,B-1} \Vert + \eta q^{T-2} \Vert \hat{\mathbf{H}}_{E,B-1}\Delta^{-u}_{E,B-2}\Vert +...+\eta q^{T-B} \Vert \prod_{b=0}^{B-1} \hat{\mathbf{H}}_{E,b} \Delta^{-u}_{E,0} \Vert .}_\text{Approximation error in the E-th epoch} \end{align} \label{Equation:proof_of_theorem-3-2}\tag{17}\] Through the above process, we obtain the approximation for the \(E\)-th epoch. We now discuss the last two terms, i.e., the approximation error of the \(E\)-th epoch. For clarity, we define \(T\) as the total number of SGD update steps performed when obtaining \(\mathbf{a}_{E,B}\). From Lemma 1, we obtain \[\begin{align} q^{T-1} \Vert \Delta^{-u}_{E,B-1} \Vert + & q^{T-2} \Vert \hat{\mathbf{H}}_{E,B-1}\Delta^{-u}_{E,B-2}\Vert +...+ q^{T-B} \Vert \prod_{b=0}^{B-1} \hat{\mathbf{H}}_{E,b} \Delta^{-u}_{E,0} \Vert \\ &\leq 2\eta C \frac{q^{T-1}-q^{2(T-1)} }{1-q} + 2\eta C \frac{q^{T-2}-q^{2(T-2)} }{1-q} \rho +...+ 2\eta C \frac{q^{T-B}-q^{2(T-B)} }{1-q} \rho ^{B-1}\\ &=\frac{2\eta C }{1-q} \sum_{k=T-1}^{T-B} \rho ^{k-T-1} (q^{k}-q^{2k}).\\ \frac{\eta q^{T-B+b(u)}}{|\mathcal{B}_{E,b(u)}|} & \Vert \prod_{b=b(u)+1}^{B-1} \hat{\mathbf{H}}_{E,b} ( \nabla \ell (\mathbf{w}^{-u}_{E,b(u)};u)-\nabla \ell (\mathbf{w}_{E,b(u)};u)) \Vert \leq \frac{2C\eta }{|\mathcal{B}|} q^{T-B+b(u)} \rho^{B-b(u)-1} . \end{align} \label{Equation:proof_of_theorem-3-3}\tag{18}\] Similarly, each epoch contributes an approximation error of the above form. Moreover, when \(e=0\), the initial models of the retraining and learning processes are the same.
Therefore, we have \[\begin{align} \Vert \mathbf{w}_{E,B}^{-u}-\mathbf{w}_{E,B}-\mathbf{a}_{E,B}^{-u} \Vert \leq & \eta \bigg( q^{T-1} \Vert \Delta^{-u}_{E,B-1} \Vert + q^{T-2} \Vert \hat{\mathbf{H}}_{E,B-1}\Delta^{-u}_{E,B-2}\Vert+...+ q^{0} \Vert \prod^E_{k=0}\prod_{b=0}^{B-1} \hat{\mathbf{H}}_{k,b}\Delta^{-u}_{0,0} \Vert \bigg) \\ &\;\; +\frac{\eta q^{T-B+b(u)}}{|\mathcal{B}_{E,b(u)}|} \Vert \prod_{b=b(u)+1}^{B-1} \hat{\mathbf{H}}_{E,b} ( \nabla \ell (\mathbf{w}^{-u}_{E,b(u)};u)-\nabla \ell (\mathbf{w}_{E,b(u)};u)) \Vert + ... \\ &\;\;+ \frac{\eta q^{b(u)}}{|\mathcal{B}_{0,b(u)}|} \Vert \prod _{e=0}^E \prod_{b=b(u)+1}^{B-1} \hat{\mathbf{H}}_{e,b} ( \nabla \ell (\mathbf{w}^{-u}_{0,b(u)};u)-\nabla \ell (\mathbf{w}_{0,b(u)};u)) \Vert \\ \leq & 2\eta^2 C \frac{q^{T-1}-q^{2(T-1)}}{1-q} + 2\eta^2 C \frac{q^{T-2}-q^{2(T-2)} }{1-q} \rho +...+ 2\eta^2C \frac{q^{0}-q^{2\cdot 0} }{1-q} \rho ^{T-1}\\ &\;\;+ \frac{2C\eta }{|\mathcal{B}|} q^{T-B+b(u)} \rho^{B-b(u)-1} +...+\frac{2C\eta }{|\mathcal{B}|} q^{b(u)} \rho^{T-b(u)-1} \\ =& \frac{2\eta^2 C }{1-q} \sum_{k=0}^{T-1} \rho ^{T-k-1} (q^{k}-q^{2k}) + \frac{2C\eta }{|\mathcal{B}|} \sum_{e=0}^{E} \rho^{T-eB-b(u)-1} q^{eB+b(u)} . \end{align}\] Summing these geometric progressions, we have, \[\begin{align} \frac{2\eta^2 C }{1-q} \sum_{k=0}^{T-1} &\rho ^{T-k-1} (q^{k}-q^{2k})+ \frac{2C\eta }{|\mathcal{B}|} \sum_{e=0}^{E} \rho^{T-eB-b(u)-1} q^{eB+b(u)} \\ &=\frac{2\eta^2 C }{1-q}\rho ^{T-1} \sum_{k=0}^{T-1} (\frac{1}{\rho })^{k} (q^{k}-q^{2k}) + \frac{2C\eta }{|\mathcal{B}|} {\rho^{T-b(u)-1}}{q^{b(u)}} \sum_{e=0}^{E} (\frac{q}{\rho })^{eB} \\ &=\frac{2\eta^2 C }{1-q}\rho ^{T-1} \sum_{k=0}^{T-1} \left((\frac{q}{\rho })^{k}-(\frac{q^{2}}{\rho })^{k}\right)+ \frac{2C\eta }{|\mathcal{B}|} \frac{\rho ^T-q^T}{\rho^B -q^B} \rho^{B-b(u)-1}q^{b(u)} \\ &=\frac{2\eta^2 C }{1-q} \left(\frac{\rho ^T-q^T}{\rho -q }-\frac{\rho ^T-q^{2T}}{\rho -q^2 }\right)+\frac{2C\eta }{|\mathcal{B}|} \frac{\rho ^T-q^T}{\rho^B -q^B} \rho^{B-b(u)-1}q^{b(u)}. \end{align}\] For convenience, we define \(\zeta_T^{-u} = \frac{\rho ^T-q^T}{\rho -q }-\frac{\rho ^T-q^{2T}}{\rho -q^2}+\frac{2C\eta }{|\mathcal{B}|} \frac{\rho ^T-q^T}{\rho^B -q^B} \rho^{B-b(u)-1}q^{b(u)}\). At this point, we have completed the proof of Theorem 2 as follows, \[\Vert \mathbf{w}_{E,B}^{-u}-\mathbf{w}_{E,B}-\mathbf{a}_{E,B}^{-u} \Vert \leq \frac{2\eta^2 C }{1-q}\zeta_T^{-u}.\] When \(\rho <1\), the series converges, and as \(T \rightarrow \infty\), the approximation error tends to 0. ◻
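To make the final bound concrete, the following sketch evaluates \(\frac{2\eta^2 C}{1-q}\zeta_T^{-u}\) exactly as stated above, for illustrative (assumed) constant values, and confirms that the bound decays as \(T\) grows when \(\rho, q < 1\):

```python
def theorem2_bound(T, rho=0.9, q=0.995, eta=0.05, C=10.0, B=10, b_u=3, batch=64):
    """Evaluate the Theorem 2 upper bound (2*eta^2*C/(1-q)) * zeta_T^{-u}.

    Symbols follow the proof above; the constant values here are
    illustrative assumptions, not the paper's experimental settings.
    """
    zeta = ((rho ** T - q ** T) / (rho - q)
            - (rho ** T - q ** (2 * T)) / (rho - q ** 2)
            + 2 * C * eta / batch
              * (rho ** T - q ** T) / (rho ** B - q ** B)
              * rho ** (B - b_u - 1) * q ** b_u)
    return 2 * eta ** 2 * C / (1 - q) * zeta

# The bound shrinks toward 0 as T grows, since rho, q < 1:
for T in (100, 1000, 10000):
    print(T, theorem2_bound(T))
```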

7.5 Additional Notation↩︎

We provide the definitions of \(\lambda\)-strong convexity, \(L\)-Lipschitz continuity, and \(M\)-smoothness.

Definition 2 (\(\lambda\)-Strongly convex). The loss function \(\ell(\mathbf{w}, z)\) is \(\lambda\)-strongly convex if, for any \(z \in \mathcal{Z}\) and \(\mathbf{w}_1,\;\mathbf{w}_2\), \[\ell \left(\mathbf{w}_1, z\right) \geq \ell \left(\mathbf{w}_2, z\right)+\left\langle\nabla \ell \left(\mathbf{w}_2, z\right), \mathbf{w}_1-\mathbf{w}_2\right\rangle+\frac{\lambda}{2}\left\|\mathbf{w}_1-\mathbf{w}_2\right\|^2.\]

Definition 3 (\(L\)-Lipschitz). The loss function \(\ell(\mathbf{w}, z)\) is \(L\)-Lipschitz if, for any \(z \in \mathcal{Z}\) and \(\mathbf{w}_1,\;\mathbf{w}_2\), \[\left|\ell \left(\mathbf{w}_1, z\right)-\ell \left(\mathbf{w}_2, z\right)\right| \leq L\left\|\mathbf{w}_1-\mathbf{w}_2\right\|.\]

Definition 4 (\(M\)-Smooth). The loss function \(\ell(\mathbf{w}, z)\) is \(M\)-smooth if, for any \(z \in \mathcal{Z}\) and \(\mathbf{w}_1,\;\mathbf{w}_2\), \[\left\| \nabla \ell \left(\mathbf{w}_1, z\right)- \nabla\ell \left(\mathbf{w}_2, z\right)\right\| \leq M\left\|\mathbf{w}_1-\mathbf{w}_2\right\|.\]

7.6 Detailed Proof of Theorem 3↩︎

Before we commence with our proof, we present some necessary lemmas.

Lemma 3 ([39]). Let \(\mathbf{w}^*\) denote a minimizer of the population risk in 1 and \(\hat{\mathbf{w}}^{-U}\) denote a minimizer of the empirical risk without the knowledge of \(U\) in 2 , where \(|U|=m\). For a \(\lambda\)-strongly convex and \(L\)-Lipschitz objective function, we have \[\mathbb{E}\left[F \left(\hat{\mathbf{w}}^{-U}\right)-F\left(\mathbf{w}^*\right)\right] \leq \frac{4 L^2}{\lambda (n-m)}.\]

Proof of Lemma 3. A comprehensive proof of Lemma 3 is available in [39]. ◻

Lemma 4. For any \(z \in \mathcal{Z}\), let the loss function \(\ell(\mathbf{w}, z)\) be \(\lambda\)-strongly convex and \(M\)-smooth, and let \(\hat{\mathbf{w}}^{-U}=\operatorname{argmin}_{\mathbf{w}} F_{\mathcal{S}\backslash U}(\mathbf{w})\) without the knowledge of \(U\) in problem 2 . For all \(e,b\), the spectral radius of \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\) is defined as \(\rho\), the largest absolute eigenvalue of these matrices. Then, after \(T\) steps of SGD with initial stepsize \(\eta \leq \frac{2 }{\lambda+M}\), \[\Vert \mathbf{w}_T -\hat{\mathbf{w}}^{-U}\Vert \leq \rho^T \frac{2M}{\lambda},\;\text{where } \rho=\max \left\{\left|1-\eta \lambda \right|,\left|1-\eta M\right|\right\} < 1.\]

Proof of Lemma 4. Recall that, by strong convexity and smoothness with \(\lambda , M > 0\), we have \(\lambda \mathbf{I} \preceq \nabla^2 \ell(\mathbf{w}, z) \preceq M \mathbf{I}\). Therefore, \[( 1- \eta_T M) \mathbf{I} \preceq \mathbf{I}- \frac{\eta_T}{|\mathcal{B}_{T}|} \sum_{i \in \mathcal{B}_{T}} \nabla^2 \ell(\mathbf{w}, z_i) \preceq (1-\eta_T \lambda )\mathbf{I}.\] Let \(\eta \leq \frac{2 }{\lambda+M}\). Then \(\eta_T \leq \frac{2 }{\lambda+M}\), and we have \[| 1- \eta_T\lambda | < 1,\;| 1- \eta_TM | < 1.\] It follows from the fundamental theorem of calculus that \[\nabla \ell \left(\mathbf{w}_{T}\right)=\nabla \ell \left(\mathbf{w}_{T}\right)-\underbrace{\nabla \ell \left(\hat{\mathbf{w}}^{-U}\right)}_{=0}=\left(\int_0^1 \nabla^2 \ell \left(\mathbf{w}_\tau\right) \mathrm{d} \tau\right)\left(\mathbf{w}_{T}-\hat{\mathbf{w}}^{-U}\right),\] where \(\mathbf{w}_\tau:=\mathbf{w}_{T}+\tau\left(\hat{\mathbf{w}}^{-U}-\mathbf{w}_{T}\right)\); that is, \(\left\{\mathbf{w}_\tau\right\}_{0 \leq \tau \leq 1}\) forms a line segment between \(\mathbf{w}_{T}\) and \(\hat{\mathbf{w}}^{-U}\). Therefore, according to the SGD update rule and based on Lemma 2, we have \[\begin{align} \Vert \mathbf{w}_{T+1} -\hat{\mathbf{w}}^{-U} \Vert & =\left\|\left(\boldsymbol{I}-\eta_T \int_0^1 \nabla^2 \ell \left(\mathbf{w}_\tau\right) \mathrm{d} \tau\right)\left(\mathbf{w}_{T}-\hat{\mathbf{w}}^{-U}\right)\right\| \\ & \leq \sup _{0 \leq \tau \leq 1}\left\|\boldsymbol{I}-\eta_T \nabla^2 \ell \left(\mathbf{w}_\tau\right)\right\|\left\|\mathbf{w}_{T}-\hat{\mathbf{w}}^{-U}\right\| \leq \rho \Vert \mathbf{w}_T -\hat{\mathbf{w}}^{-U} \Vert . \end{align}\] Applying this recursively, \[\Vert \mathbf{w}_T -\hat{\mathbf{w}}^{-U} \Vert \leq \rho^T \Vert \mathbf{w}_0 -\hat{\mathbf{w}}^{-U} \Vert,\;\text{where } \rho=\max \left\{\left|1-\eta \lambda \right|,\left|1-\eta M\right|\right\} < 1.\] The assumptions of \(\lambda\)-strong convexity and \(M\)-smoothness imply that \[\frac{\lambda}{2}\left\|\mathbf{w}_0-\hat{\mathbf{w}}^{-U} \right\|^2 \leq F(\mathbf{w}_0)-F\left(\hat{\mathbf{w}}^{-U} \right) \leq \frac{2 M^2}{\lambda},\] and hence \(\Vert \mathbf{w}_0 - \hat{\mathbf{w}}^{-U} \Vert \leq \frac{2M}{\lambda}\). Therefore, we have completed the proof as follows, \[\Vert \mathbf{w}_T -\hat{\mathbf{w}}^{-U} \Vert \leq \rho^T \frac{2M}{\lambda},\;\text{where } \rho=\max \left\{\left|1-\eta \lambda \right|,\left|1-\eta M\right|\right\} < 1.\] ◻
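The contraction in Lemma 4 can be illustrated on a toy quadratic whose Hessian has extreme eigenvalues \(\lambda\) and \(M\). The following sketch (an illustrative example of our own, not part of the paper's experiments) verifies the \(\rho^t\) decay numerically:

```python
import numpy as np

lam, M = 0.5, 4.0              # strong-convexity / smoothness constants
eta = 2 / (lam + M)            # stepsize satisfying eta <= 2/(lam + M)
H = np.diag([lam, M])          # Hessian with extreme eigenvalues lam and M
w_star = np.zeros(2)           # minimizer of 0.5*(w - w*)^T H (w - w*)
w0 = np.array([1.0, -1.0])
rho = max(abs(1 - eta * lam), abs(1 - eta * M))

w = w0.copy()
for t in range(1, 11):
    w = w - eta * H @ (w - w_star)  # full-gradient step on the quadratic
    assert np.linalg.norm(w - w_star) <= rho ** t * np.linalg.norm(w0 - w_star) + 1e-12
```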

Based on Lemma 3 and Lemma 4, we begin the proof of convergence in Theorem 3.

Proof of Theorem 3. Let \(\hat{\mathbf{w}}\) be the optimal model on the entire dataset \(S\) and \(\hat{\mathbf{w}}^{-U}\) be the optimal model on the remaining dataset without the knowledge of \(U\), where \(|U|=m\). For any \(z \in \mathcal{Z}\), the loss function \(\ell(\mathbf{w}, z)\) is \(\lambda\)-strongly convex, \(M\)-smooth, and \(L\)-Lipschitz. For a minimizer \(\mathbf{w}^*\) of the population risk in 1 and \(\tilde{\mathbf{w}}_{E, B}^{-U} = \mathbf{w}_{E, B}+\mathbf{a}_{E,B}^{-U}+\sigma\), we have \[\mathbb{E}\left[F\left(\tilde{\mathbf{w}}_{E, B}^{-U} \right)-F\left(\mathbf{w}^*\right)\right] = \mathbb{E}\left[F\left(\tilde{\mathbf{w}}_{E, B}^{-U} \right)-F\left(\hat{\mathbf{w}}^{-U}\right)\right] + \mathbb{E}\left[F\left(\hat{\mathbf{w}}^{-U}\right) - F\left(\mathbf{w}^*\right)\right].\] Based on \(L\)-Lipschitz continuity and Lemma 3, we obtain \[\mathbb{E}\left[F\left(\tilde{\mathbf{w}}_{E, B}^{-U} \right)-F\left(\mathbf{w}^*\right)\right] \leq \mathbb{E}[ L \Vert \tilde{\mathbf{w}}_{E, B}^{-U} -\hat{\mathbf{w}}^{-U} \Vert ]+ \frac{4 L^2}{\lambda (n-m)}.\] Now we bound the first term. Specifically, by the triangle inequality, \[\Vert \tilde{\mathbf{w}}_{E, B}^{-U} -{\mathbf{w}}_{E, B}^{-U} +{\mathbf{w}}_{E, B}^{-U}-\hat{\mathbf{w}}^{-U} \Vert \leq \Vert \tilde{\mathbf{w}}_{E, B}^{-U} -{\mathbf{w}}_{E, B}^{-U} \Vert+\Vert{\mathbf{w}}_{E, B}^{-U}-\hat{\mathbf{w}}^{-U} \Vert.\] Using Lemma 4, we have \[\begin{align} & \Vert{\mathbf{w}}_{E, B}^{-U}-\hat{\mathbf{w}}^{-U} \Vert \leq \rho^T \frac{2M}{\lambda} ,\\ \text{where } \rho &=\max \left\{\left|1-\eta \lambda \right|,\left|1-\eta M\right|\right\} < 1 . \label{Proof32of32Theorem324464-1} \end{align}\tag{19}\] Using Theorem 2, we can bound the first term, \[\begin{align} \mathbb{E} [ \Vert \tilde{\mathbf{w}}_{E, B}^{-U} - {\mathbf{w}}_{E, B}^{-U} \Vert ]& \leq \mathbb{E} [\Vert \mathbf{w}_{E,B}+\mathbf{a}_{E,B}^{-U}-\mathbf{w}_{E,B}^{-U} \Vert] +\mathbb{E} [ \Vert \sigma \Vert ]\\ &\leq \frac{2 \eta^2 C }{1-q} \zeta_T^{-U} +\sqrt{d}c,\\ \text{where}\;\zeta_T^{-U} = \frac{\rho ^T- q^T}{\rho - q } - \frac{\rho ^T-q^{2T}}{\rho -q^2}& + \sum_{u_j =1 }^m \frac{2C\eta }{|\mathcal{B}|} \frac{\rho ^T-q^T}{\rho^B -q^B} \rho^{B-b(u_j)-1}q^{b(u_j)}. \end{align} \label{Proof32of32Theorem324464-2}\tag{20}\] Since the loss function \(\ell(\mathbf{w}, z)\) is \(M\)-smooth and \(\lambda\)-strongly convex, we have \(0< \rho <1\). Plugging in 19 and 20 , we have \[\begin{align} \mathbb{E}\left[F\left(\tilde{\mathbf{w}}_{E, B}^{-U} \right) -F\left(\mathbf{w}^*\right)\right] \leq \rho^T \frac{2ML}{\lambda} +(\frac{\sqrt{d} \sqrt{2\ln{(1.25/\delta)}}}{\epsilon} + 1) \frac{ 2 L\eta^2 C }{1-q}\zeta_T^{-U} + \frac{4 L^2}{\lambda (n-m)}. \label{eq4659} \end{align}\tag{21}\] Therefore, we have completed the proof. ◻

8 Additional Experiments↩︎

The experiments were conducted on an NVIDIA GeForce RTX 4090 GPU. The code was implemented in PyTorch 2.0.0 and leverages CUDA Toolkit version 11.8. Our comprehensive tests were conducted on an AMD EPYC 7763 CPU @ 1.50GHz with 64 cores under Ubuntu 20.04.6 LTS.
Source Code: https://github.com/Anonymous202401/If-Recollecting-were-Forgetting

8.1 Detailed Hyperparameters Settings↩︎

In this subsection, we provide detailed hyperparameter configurations.

  • In the Verification Experiments, we utilized the MNIST dataset, consisting of 1,000 data points, for method evaluation. For Logistic Regression, training was conducted for 50 epochs with a stepsize of 0.05 and full-batch updates, and the stochastic gradient norm was clipped at 10 during training. For the CNN, training was carried out for 20 epochs with a stepsize of 0.05 and a batch size of 64; as with LR, a threshold of 10 was employed for the stochastic gradient norm during training. In both the convex and non-convex settings, the stepsize decay is set to 0.995.

  • In the Application Experiments, our hyperparameter configurations for both Logistic Regression and the CNN on the MNIST dataset are consistent with the verification experiments. Additionally, we evaluated a CNN and LeNet on Fashion MNIST, which consists of 4,000 data points, and assessed ResNet-18 on CelebA, consisting of 1,000 data points. Specifically, for the CNN and LeNet, training was conducted for 30 epochs with a step size of 0.5 and a batch size of 256, and a threshold of 0.5 was used for the stochastic gradient norm during training. For ResNet-18, training was conducted for 10 epochs with a step size of 0.05 and a batch size of 32, and a threshold of 10 was used for the gradient norm during training. For all the aforementioned scenarios, we set the noise scale to 0.1, and the step size decay was set to 0.995. A sketch of a training loop with these ingredients follows this list.
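For concreteness, the following minimal PyTorch sketch (our own illustration; `loader` and `loss_fn` are assumed inputs) combines the three ingredients used across these configurations: plain SGD, per-step geometric stepsize decay, and gradient-norm clipping:

```python
import torch

def train_with_clipping(model, loader, loss_fn, epochs, eta=0.05,
                        decay=0.995, clip=10.0):
    """Hypothetical training loop matching the reported settings: SGD with
    per-step geometric stepsize decay and gradient-norm clipping at C."""
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            lr = eta * decay ** step  # eta_{e,b} = q * eta_{e,b-1}
            model.zero_grad()
            loss_fn(model(x), y).backward()
            # Threshold the stochastic gradient norm at C = clip.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            with torch.no_grad():
                for p in model.parameters():
                    p -= lr * p.grad
            step += 1
```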

Multinomial Logistic Regression for the MNIST dataset. The model comprises a single layer that performs a linear transformation on the input data and has a total of 7,850 parameters. Specifically, the input tensor x is flattened to a one-dimensional shape and then undergoes a linear transformation, i.e., a weighted sum of the input features that produces the final output of the model. Reshaping the input serves to facilitate this linear transformation.
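A sketch of this model might look as follows (784 inputs and 10 classes give \(784 \times 10 + 10 = 7{,}850\) parameters; the class name is our own):

```python
import torch.nn as nn

class LogisticRegression(nn.Module):
    """Single linear layer over the flattened 28x28 MNIST input."""
    def __init__(self, input_dim=28 * 28, num_classes=10):
        super().__init__()
        self.linear = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        # Flatten the image, then apply the linear transformation.
        return self.linear(x.view(x.size(0), -1))
```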

A simple CNN for the MNIST/FMNIST dataset comprises a total of 21,840 parameters and consists of two convolutional layers for image classification. The first layer processes input data with a specified number of channels, using 10 filters of size 5x5, followed by a ReLU activation and 2x2 max pooling. The subsequent layer employs 20 filters of size 5x5, again with a ReLU activation and 2x2 max pooling, plus dropout regularization to mitigate overfitting. The flattened output is then connected to a fully connected layer with 320 input features, followed by another fully connected layer with 50 units and a ReLU activation. The final layer produces the model's output with the number of classes specified by the task, and a log-softmax is applied for classification.
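A sketch matching this description (and its 21,840-parameter count for single-channel 28x28 inputs) could be written as follows; the layer names are our assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Two 5x5 conv layers (10 and 20 filters), dropout, and two FC layers;
    260 + 5,020 + 16,050 + 510 = 21,840 parameters."""
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)  # 20 channels * 4 * 4 spatial = 320
        self.fc2 = nn.Linear(50, num_classes)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))                    # 28 -> 24 -> 12
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))   # 12 -> 8 -> 4
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```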

8.2 Additional Experiments on Larger Models and Datasets↩︎

We further evaluate on a larger-scale model (ResNet18) and a more sophisticated dataset (CIFAR10 [40]) with more data points (50,000). In addition to the distance between the unlearned model and the retrained model, we also consider metrics in terms of accuracy on the forgetting dataset and accuracy on the remaining dataset. Since the previous works NS and IJ are difficult to compute in this scenario, we instead use the following baselines, which lack theoretical indistinguishability guarantees, as described and similarly set up in [25]:

  • FineTune: The original learned model is fine-tuned on the remaining dataset.

  • NegGrad: The learned model is fine-tuned via gradient ascent, i.e., using the negative of the model's gradients on the forgetting dataset.

Configurations: We evaluated ResNet18 on CIFAR-10, featuring a total of 11,689,512 parameters, with 50,000 data samples. The learning stage was conducted for 40 epochs with a stepsize of 0.001 and a batch size of 256. For FineTune, 10 epochs of training are done with a learning rate of 0.001. For NegGrad, we run gradient ascent for 2 epochs with a learning rate of 0.0001 and a batch size of 1 (a sketch of this loop is given below). We define \(D_f\) as the forgetting dataset, \(D_r\) as the remaining dataset, and \(D_t\) as the test dataset. We randomly unlearned 50 data points. Table 2 presents our evaluation results.
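Under these settings, the NegGrad baseline reduces to gradient ascent on the forgetting set; a minimal sketch (our own illustration, with `forget_loader` assumed to iterate over \(D_f\) at batch size 1). The FineTune baseline is the same loop with the ordinary, non-negated loss over the remaining dataset \(D_r\):

```python
import torch

def neggrad(model, forget_loader, loss_fn, epochs=2, lr=1e-4):
    """Sketch of the NegGrad baseline: gradient ascent on D_f,
    implemented as descent on the negated loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in forget_loader:  # batch size 1, per the configuration
            opt.zero_grad()
            (-loss_fn(model(x), y)).backward()
            opt.step()
    return model
```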

The values in parentheses represent the change in accuracy compared to the original learned model. For example, for the Retrain method, 78.00 (-2) means that the accuracy on \(D_f\) decreased by 2% compared to the original learned model.

Table 2: Verification experiments on CIFAR-10. We evaluated ResNet18, featuring a total of 11,689,512 parameters, with 50,000 data samples. First, our results achieve indistinguishability from the retrained model (distance of 0.09) in near-instantaneous unlearning time (0.0006 s), which demonstrates the potential of our method for over-parameterized models. On the other hand, the FineTune results show that accuracy similar to the retrained model's does not necessarily imply model indistinguishability.

| Method | Accuracy on \(D_f\) (%) | Accuracy on \(D_r\) (%) | Accuracy on \(D_t\) (%) | Distance | Unlearning Time (s) |
|----------|------------|---------------|---------------|------|--------|
| FineTune | 78.00 (-2) | 84.99 (+0.57) | 79.42 (-0.24) | 2.21 | 106.76 |
| NegGrad | 64.00 (-16) | 73.60 (-10.82) | 70.57 (-9.09) | 0.41 | 1.15 |
| Retrain | 78.00 (-2) | 84.38 (-0.04) | 79.49 (-0.17) | N/A | 468.65 |
| Ours | 78.00 (-2) | 84.41 (-0.01) | 79.63 (-0.03) | 0.09 | 0.0006 |

Performance of Our Scheme: We still maintain model indistinguishability and accuracy indistinguishability from the retrained model on this more complex task and larger model, where the distance is only 0.09. Moreover, compared to the other baselines, our method achieves forgetting in only 0.6 ms, demonstrating its potential on larger models.

Model Indistinguishability vs. Accuracy: For FineTune, we observe that even though the accuracies are similar to the retrained model's, the distance is very large at 2.21. For NegGrad, we used a very small learning rate to avoid gradient explosion; we observe that the accuracy on the forgetting dataset \(D_f\) and the accuracy on the remaining dataset \(D_r\) rapidly decrease together, yet even in this case the distance is still lower than that of FineTune. This demonstrates that a small difference in accuracy does not imply model indistinguishability. Furthermore, although distance can serve as a reliable metric for evaluating the model indistinguishability of a theoretical algorithm, it is impractical in real-world scenarios because it requires retraining a model. Exploring more effective metrics for model indistinguishability is therefore an interesting and important topic for future work.
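For reference, the distance metric above is, as we read it, the Euclidean norm between the two models' flattened parameter vectors; a minimal sketch:

```python
import torch

def model_distance(model_a, model_b):
    """L2 distance between two models' flattened parameter vectors
    (our reading of the 'Distance' column in Table 2)."""
    va = torch.cat([p.detach().flatten() for p in model_a.parameters()])
    vb = torch.cat([p.detach().flatten() for p in model_b.parameters()])
    return torch.linalg.vector_norm(va - vb).item()
```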

8.3 Ablation Studies↩︎

Here, we investigate the sensitivity of the proposed approach to hyperparameters. Specifically, we evaluate the effects of different step sizes, epochs, stepsize decay rates, and dataset sizes on the proposed approach. We conducted simulations in both convex and non-convex scenarios, using Logistic Regression for the convex case and a CNN for the non-convex case. Additionally, we performed these ablation studies on the MNIST dataset, using the distance between the unlearned model and the retrained model to evaluate similarity and the accuracy metric to evaluate performance.

Configurations: Building upon the hyperparameter settings outlined in the paper, we systematically varied these parameters to investigate the impact of hyperparameters on our method. We discuss the results of these hyperparameter settings under low deletion rates (1%, 5%) and high deletion rates (20%, 30%).

Takeaway. We first provide a brief summary of the main results of the ablation studies.

  • Impact of Step Sizes. First, a smaller learning rate enables our unlearned model to be closer to the retrained model. This is because the remainder term \(o(\mathbf{w}_{e,b}^{-u} - \mathbf{w}_{e,b})\), which is the source of the approximation error, is scaled by the subsequent \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\). Specifically, the analysis in Theorems 2 and 3 indicates that the spectral radius \(\rho\) of \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\) determines the error of our proposed method: if \(\rho > 1\), the approximation error increases exponentially. A smaller stepsize therefore constrains the maximum eigenvalue of the matrix \(\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\), finally reducing the error by decreasing the value of \(\rho\). Moreover, with a small step size, increasing epochs does not lead to significant approximation errors; instead, it exhibits better performance under the distance metric compared to a large step size.

  • Impact of Epochs. Second, similar to Theorem 2 and the preceding analysis of the stepsize, when the step size is sufficiently small the approximation error is insensitive to, and grows slowly with, the number of epochs; i.e., excessive iterations in this scenario do not lead to significant errors. It is important to note, however, that the computational cost is related to the total number of iterations \(T\), so choosing a smaller step size may result in increased computational overhead. This requires a tradeoff based on the practical considerations of the specific scenario.

  • Impact of the Decay Rate of the Step Size. Third, our results indicate that as the decay rate \(q\) decreases, our approximation error diminishes. Specifically, the step size decay strategy introduces an upper bound on our error, preventing a continuous increase in error as epochs progress. That said, it is crucial to avoid excessively small decay rates, which would cause the step size to decay rapidly to 0 and may hinder the convergence of the learned model.

  • Impact of Dataset Sizes. Finally, our numerical findings show that the approximation error decreases with larger datasets. Intuitively, as the dataset size increases, the impact of individual data points on model updates diminishes during training. This is typically reflected in the training process: with a larger dataset, each batch update is less likely to frequently sample a specific data point. Consequently, the impact of any individual data point on the updates between the retrained and original models diminishes, leading to a more accurate approximation in our method.

8.3.1 Impact of Step Sizes↩︎

We investigate the impact of varying step sizes {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} on both distance and accuracy metrics.

As shown in the tables on the impact of step sizes in the convex and non-convex settings, a smaller learning rate enables our unlearned model to be closer to the retrained model. Specifically, Theorem 2 shows that, in this process, the error arises from the scaled Peano remainder term, i.e., the term \(o(\mathbf{w}_{e,b}^{-u} - \mathbf{w}_{e,b})\) is scaled by the subsequent \(\mathbf{M}_{e,b(u)}\), where \(\mathbf{M}_{e,b(u)} = \frac{\eta_{E,b(u)}}{|\mathcal{B}_{E,b(u)}|} \prod^E_{k=e} \prod_{b=b(u)+1}^{B-1}(\mathbf{I} - \frac{\eta_{k,b}}{|\mathcal{B}_{k,b}|} \mathbf{H}_{k,b})\). Therefore, a smaller step size constrains the maximum eigenvalue of \(\frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\), thereby reducing the spectral radius \(\rho\) of the matrix \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\) and finally the approximation error.

Table: Impact of Step Sizes on Non-Convex Setting. Stepsize \(\eta = 0.05\) represents the hyperparameter choice of the previous experiments in our paper. It can be observed that as the step size decreases, our method's unlearned model becomes closer to the retrained model. (Seed 42)


Figure 5: Impact of Step Size. (a)(b) correspond to LR and CNN with 20% of the data to be forgotten, respectively. In the left bar chart of (a)(b), the red bar on the left represents the distance metric when using a larger learning rate (0.05), while the blue bar in the middle and the light blue bar on the right represent the distance metric with a small stepsize at model convergence and at model overfitting, respectively. The three line charts on the right of (a)(b) depict the accuracy metric, with the last round representing the accuracy after unlearning. It can be observed that the proposed unlearned model is closer to the retrained model and exhibits no apparent performance degradation when using a smaller step size, which shows insensitivity to the growth of epochs. (Seed 42)

Furthermore, if the step size is extremely large, the contribution of the maximum eigenvalue \(\lambda_1\) of the average Hessian is amplified, leading to exponential growth in the error, as shown in the step-size tables for the initial stepsize \(\eta =1\). In this scenario where \(\eta \geq 1\), our method leads to a complete breakdown of the model, rendering it unusable. However, it is noteworthy that theoretical predictions [41] and empirical validation [42] suggest that, across all datasets and models, successful training occurs only when optimization enters a stable region of parameter space where \(\lambda_1 \cdot \eta \leq 2\). Therefore, achieving good optimization results typically does not require aggressive step sizes.

In fact, we can gain insights from these works. Specifically, these studies explore the sharpness of the loss Hessian \(\frac{1}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\), where 'sharpness' refers to the maximum eigenvalue of the loss Hessian, denoted \(\lambda_1\) in studies such as [42]. Previous works [43], [44] have pointed out that flat minima are advantageous for generalization, and a flat loss landscape is widely recognized as beneficial for gradient descent optimization, leading to improved convergence properties. Importantly, [42] demonstrates the central role that \(\lambda_1\) plays in neural network optimization, emphasizing that maintaining a sufficiently small \(\lambda_1\) during optimization is a necessary condition for successful training (without divergence) at large step sizes: models which train successfully enter a region where \(\lambda_1 \cdot \eta \leq 2\) mid-training (or fluctuate just above this bound). In our work, we leverage the second-order information from the loss Hessian at each iteration during training to achieve forgetting, and Theorem 2 indicates that a small \(\lambda_1\) of the loss Hessian is beneficial for reducing the approximation error. In other words, when training requires a large step size \(\eta\), as suggested in [42], strategies like stepsize warm-up or architecture-specific initialization can be employed to stabilize learning and potentially reduce the unlearning approximation error by decreasing \(\lambda_1\). We will explore such strategies in the future to implement our unlearning method at aggressive stepsizes.
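Concretely, \(\lambda_1\) can be estimated without forming the Hessian by power iteration on Hessian-vector products [29]. A minimal PyTorch sketch, assuming a closure `loss_fn` that recomputes the mini-batch loss on the current parameters (a sketch under these assumptions, not the paper's implementation):

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, n_iters=50, tol=1e-4):
    """Estimate the sharpness lambda_1 (largest Hessian eigenvalue)
    via power iteration on Hessian-vector products."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iters):
        # Normalize the current eigenvector estimate.
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate <grad, v> w.r.t. params.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        new_eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        if abs(new_eig - eig) <= tol * max(abs(eig), 1e-12):
            break
        eig, v = new_eig, [h.detach() for h in hv]
    return eig
```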

Finally, with a very small step size during model training, we observe a low approximation error. To investigate the impact of epochs under small step sizes, LR was trained for 750 epochs (converged) and 2,000 epochs (overfitting) with a step size of 0.001 (originally 0.05 with 50 epochs). Additionally, when the step size is very small (0.001) for LR, the step size decay strategy causes premature decay to zero before model convergence; we thus increased the step size decay rate from 0.995 to 1. Subsequent experiments will demonstrate that this leads to an increase in approximation error. For the CNN, training for 50 epochs (converged) and 500 epochs (overfitting) with a step size of 0.025 (originally 0.05 with 20 epochs) was performed. As shown in Fig. 5, even as the model approaches overfitting, the approximation error remains lower than in the case with a step size of 0.05. This suggests that a smaller step size reduces the approximation error, bringing our algorithm closer to the retrained model. It is crucial to note that a smaller step size requires more iterations to converge, resulting in increased computational cost; choosing an appropriate step size is therefore essential for different scenarios.

8.3.2 Impact of Epochs↩︎

Our previous experiments indicate that the increase in epochs is not a major factor in the approximation error when small step sizes are applied; i.e., with LR at a step size of 0.001 and the CNN at a step size of 0.025, the growth of epochs does not significantly impact the approximation error. Indeed, with small step sizes, even as epochs extend to the point where the model approaches overfitting, the error remains lower than in scenarios with larger step sizes.

Furthermore, we explore the impact of the number of epochs at a fixed, larger step size, such as the \(\eta=0.05\) selected in our study. In this scenario, we evaluate the impact of the number of epochs on both the distance and accuracy metrics. Specifically, we investigated the distance for the epoch set {10, 25, 50, 75, 100} and provide the accuracy at epoch 100 after forgetting 20% of the data. The selection of other hyperparameters remains consistent with the ones used previously.


Figure 6: Impact of Epoch. (a)(b) illustrate the distance metric for LR and CNN when varying the number of epochs from 10 to 100. The right plots of (a)(b) depict the accuracy over 100 epochs, with the last round representing the accuracy after forgetting 20% of the data. As shown in the left plots of (a)(b), the error accumulates as epochs increase. However, as depicted in the right plots, with an appropriate choice of step size such growth is entirely acceptable. (Seed 42)

In Fig. 6, we observe that the distance slowly increases with the number of epochs. This is attributed to the fact that each iteration involves an approximation, and multiple rounds of iterations lead to the accumulation of approximation error. However, as the right plots of Fig. 6 (a)(b) show, the accumulated error is sufficiently small and does not lead to significant performance degradation. Additionally, we find that the distance gradually stabilizes as epochs continue to increase (i.e., the error generated at each iteration becomes progressively smaller). We attribute this behavior to the step size decay strategy and provide detailed results in the subsequent experiments.

8.3.3 Impact of Decay Rates of Stepsize↩︎

These experiments study the impact of the stepsize decay rate. We consider two scenarios: one with stepsize decay (\(q=0.995\)) and one without (\(q=1\)). We keep the other hyperparameters constant and only increase the number of epochs.

From Fig. 7 (a)(b), we observe that the Euclidean distance under the stepsize decay strategy is generally smaller than without it. This aligns with our earlier analysis of the step size in Appendix 8.3.1: a smaller step size leads to a smaller approximation error, and an appropriate decay rate ensures a continuously decreasing step size, preventing unbounded growth of the approximation error.


Figure 7: Impact of the stepsize decay rate on the Euclidean distance for (a) logistic regression and (b) the CNN, varying epochs from 1 to 200. The green line represents learning with a 0.995 stepsize decay rate, while the blue line represents learning without stepsize decay. It can be observed that stepsize decay effectively reduces the approximation error. (Seed 42)

Additionally, as shown in Fig. 7 (b), we find that the Euclidean distance for the CNN tends to stabilize even without step size decay. This result can be explained by the conclusions in [45], [46], which analyze the spectral distribution of the Hessian eigenvalues during CNN training. Initially, the eigenvalues of the loss Hessian are symmetrically distributed around 0; as training progresses, they converge towards 0, indicating Hessian degeneracy. Towards the end of training, only a small number of eigenvalues remain negative, and a large portion of the Hessian eigenvalues approach 0 as training nears convergence. Recall our analysis of the stepsize in Appendix 8.3.1 and Theorem 2, where we demonstrate that the approximation error is primarily determined by the spectral radius \(\rho\) of \(\mathbf{I} - \frac{\eta_{e,b}}{|\mathcal{B}_{e,b}|} \mathbf{H}_{e,b}\), and \(\rho\) is determined by \(\eta_{e,b}\) and \(\lambda_1\), the maximum eigenvalue of the average Hessian matrix \(\frac{1}{|\mathcal{B}_{e,b}|}\mathbf{H}_{e,b}\). As the number of iterations increases, the eigenvalues of the loss Hessian \(\mathbf{H}_{e,b}\) tend to 0. This implies a decrease in \(\lambda_1\), leading to a slower increase in error. Therefore, in Fig. 7 (b), the distance gradually stabilizes even without a decay strategy due to the decrease in \(\lambda_1\) during training, while it still increases slowly owing to a few larger outlier eigenvalues.

8.3.4 Impact of Dataset Sizes↩︎

Finally, we assess the impact of different dataset sizes. To maintain the total number of iterations, we adapt the number of epochs to the dataset size and keep the other hyperparameters constant.

As shown in the tables on the impact of dataset sizes in the convex and non-convex settings, we observe that a larger dataset generally decreases the approximation error, making the unlearned model more closely resemble the retrained model. A reasonable explanation is that enlarging the dataset implies less frequent sampling of any specific data point for updating the model, consequently reducing the impact of that data point on the model updates in Eq. 3 . This, in turn, decreases the approximation error.

Table: Impact of Dataset Sizes on Non-Convex Setting. Datasize \(n=1{,}000\) represents the hyperparameter choice of the previous experiments in our paper. (Seed 42)

9 Discussion↩︎

9.1 Deletion Capacity↩︎

Subsequently, we analyze the deletion capacity of our method based on [8], which formalizes how many samples can be deleted from the model parameterized by the original empirical risk minimizer while maintaining reasonable guarantees on the test loss.

Theorem 4 (Definition of Deletion Capacity [8]). Let \(\varepsilon, \delta \geq 0\). Let \(S\) be a dataset of size \(n\) drawn i.i.d. from \(\mathcal{D}\), and let \(\ell(w; z)\) be a loss function. For a pair of learning and unlearning algorithms \({\Omega}, \bar{\Omega}\) that are \((\varepsilon, \delta)\)-unlearning, the deletion capacity \(m_{\varepsilon, \delta}^{\Omega, \bar{\Omega}}(d, n)\) is defined as the maximum number of samples \(U\) that can be forgotten while still ensuring an excess population risk of at most \(\gamma\). Specifically, \[\begin{align} m_{\varepsilon, \delta,\gamma}^{\Omega, \bar{\Omega}}(d, n)&:=\max \left\{m \mid \mathbb{E}\left[\max _{U \subseteq \mathcal{S}:|U| \leq m} F(\bar{\Omega}(\Omega(\mathcal{S}); T(\mathcal{S})))-F(\mathbf{w}^*)\right] \leq \gamma\right\}, \end{align}\] where the expectation is with respect to \(S \sim \mathcal{D}^n\) and the outputs of the algorithms \(\Omega\) and \(\bar{\Omega}\).

Similar to [8], we provide upper and lower bounds for our algorithm. Intuitively, according to Theorem 3, in the optimal situation we bound the excess loss of our algorithm and demonstrate its superiority over previous approaches, indicating a potentially improved deletion capacity. We first give the upper bound on deletion capacity.

Theorem 5 (Upper Bound of Deletion Capacity [8]). Let \(\delta \leq 0.005\) and \(\epsilon=1\). There exist a 4-Lipschitz and 1-strongly convex loss function \(f\) and a distribution \(\mathcal{D}\) such that for any learning algorithm \(A\) and removal mechanism \(M\) that satisfies \((\epsilon, \delta)\)-unlearning and has access to all remaining samples \(\mathcal{S} \backslash U\), the deletion capacity is \[\begin{align} m_{\epsilon, \delta, \gamma}^{A, M}(d, n) \leq c n, \end{align}\] where \(c\) depends on the properties of the function \(f\) and is strictly less than 1.

A comprehensive proof of Theorem 5 is available in [8]. Furthermore, based on Theorem 3, we present the lower bound on deletion capacity below:

Theorem 6 (Lower Bound of Deletion Capacity [8]). There exist a learning algorithm \(\Omega\) and an unlearning algorithm \(\bar{\Omega}\) such that, for any \(\lambda\)-strongly convex (hence convex), \(L\)-Lipschitz, and \(M\)-Hessian-Lipschitz loss \(f\) and distribution \(\mathcal{D}\), choosing step size \(\eta \leq \frac{2}{M+\lambda}\) and letting \(T\) approach infinity, we have \[\begin{align} m_{\varepsilon, \delta,\gamma}^{\Omega, \bar{\Omega}}(d, n) \geq n - \frac{4L^2}{\gamma \lambda}. \end{align}\]

Proof of Theorem 6. Based on Theorem 3, when \(T\) approaches infinity, we have \[\begin{align} \mathbb{E}\left[F(\tilde{\mathbf{w}}_{E, B}^{-U} ) - F\left(\mathbf{w}^*\right)\right] = \mathcal{O}\left( \frac{4 L^2}{\lambda (n-m)} \right), \end{align} \label{jmprtueb}\tag{22}\] where \(\tilde{\mathbf{w}}_{E, B}^{-U}\) denotes the output \(\bar{\Omega}(\Omega(\mathcal{S}); T(\mathcal{S}))\) and \(\mathbf{w}^*\) is the minimizer of the population risk in 1 . Requiring this excess risk to be at most \(\gamma\), i.e., \(\frac{4L^2}{\lambda(n-m)} \leq \gamma\), gives \(n-m \geq \frac{4L^2}{\gamma\lambda}\). The above upper bound on the excess risk thus implies that we can delete at least \[m = n - \frac{4L^2}{\gamma \lambda}\] samples. ◻

9.2 Online Unlearning-Repairing Strategy↩︎

Unlike most unlearning methods, our proposed approach does not require access to the remaining dataset \(\mathcal{S}\backslash U\) or the forgetting data \(U\) in Stage III. However, if we do have access to the remaining dataset, we can adopt a simple fine-tuning strategy to mitigate the performance degradation caused by excessive deletions and make our algorithm more robust.

Specifically, when deletion requests arrive continuously, we can rapidly implement certified unlearning through vector addition. This involves augmenting the current model with the approximators; errors accumulate during this process and thus cause performance degradation. Suppose that, upon the arrival of a new round of deletion requests for a particular model and dataset, the model's accuracy drops significantly due to the removal of a substantial amount of data. In such a scenario, we can address this degradation by fine-tuning the current model for \(E_r\) epochs (usually just 2) on the remaining dataset, referred to as the 'repairing' process. Additionally, by computing the approximators for the remaining dataset during the repairing process and adding them to the approximators pre-stored during the learning process, we obtain the new set of approximators for the remaining data.

Through the aforementioned strategy, we repair the model's accuracy and update the approximators corresponding to the remaining dataset. When subsequent forgetting requests arrive, we can still utilize the new approximators to implement forgetting. We conducted a simple validation on ResNet-18. Specifically, the hyperparameter configuration remains consistent with the previous setup in the application experiment, i.e., we use a relatively large learning rate to train the model, which can cause our method to generate larger errors, as described in Section 8.3.1. Each time we forget 20% of the data, we apply the repairing strategy outlined in Algorithm 9, conducting \(E_r=2\) epochs on the remaining dataset to repair performance. As shown in Figure 8, our method performs a fine-tuning operation on the remaining dataset every time 20% of the data is forgotten, repairing the model performance to a satisfactory level. Through the Online Unlearning-Repairing Strategy, our approach can handle more forgetting requests on ResNet-18 without significant performance losses.
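For illustration, the unlearning step that precedes any repairing is a vector addition plus a Gaussian perturbation, \(\tilde{\mathbf{w}} = \mathbf{w} + \mathbf{a}^{-u} + \sigma\). A minimal sketch follows, where the `approximators` mapping from sample ids to pre-stored flattened statistical vectors is our assumed storage layout, not the paper's exact interface:

```python
import torch

def unlearn(model, approximators, forget_ids, noise_scale=0.1):
    """Sketch of the online unlearning step: w_tilde = w + a^{-u} + sigma.
    `approximators[u]` is assumed to hold the pre-computed statistical
    vector for sample u, flattened to match the parameter vector."""
    with torch.no_grad():
        flat = torch.cat([p.flatten() for p in model.parameters()])
        for u in forget_ids:
            flat += approximators[u]                  # vector addition
        flat += noise_scale * torch.randn_like(flat)  # Gaussian noise sigma
        offset = 0
        for p in model.parameters():                  # write back in place
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n
    return model
```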

Figure 8: Unlearning-Repairing. The blue, red, and green lines represent the learning, unlearning, and repairing processes, respectively.

Figure 9: Online Unlearning-Repairing (OUR) Strategy

References↩︎

[1]
Becker, A. and Liebig, T. Evaluating machine unlearning via epistemic uncertainty. CoRR, abs/2208.10836, 2022.
[2]
Warnecke, A., Pirch, L., Wressnegger, C., and Rieck, K. Machine unlearning of features and labels. In 30th Annual Network and Distributed System Security Symposium, NDSS 2023, San Diego, California, USA, February 27 - March 3, 2023. The Internet Society, 2023.
[3]
Guo, C., Goldstein, T., Hannun, A. Y., and van der Maaten, L. Certified data removal from machine learning models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 3832–3842. PMLR, 2020.
[4]
Neel, S., Roth, A., and Sharifi-Malvajerdi, S. Descent-to-delete: Gradient-based methods for machine unlearning. In Feldman, V., Ligett, K., and Sabato, S. (eds.), Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pp. 931–962. PMLR, 2021.
[5]
Gupta, V., Jung, C., Neel, S., Roth, A., Sharifi-Malvajerdi, S., and Waites, C. Adaptive machine unlearning. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 16319–16330, 2021.
[6]
Chien, E., Pan, C., and Milenkovic, O. Efficient model updates for approximate unlearning of graph-structured data. In The Eleventh International Conference on Learning Representations, 2022.
[7]
Suriyakumar, V. M., Wilson, A. C., et al. Algorithms that approximate data removal: New results and limitations. In Advances in Neural Information Processing Systems 35, NeurIPS, 2022.
[8]
Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine unlearning. In Advances in Neural Information Processing Systems 34, NeurIPS, pp. 18075–18086, 2021.
[9]
Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pp. 463–480. IEEE Computer Society, 2015.
[10]
Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pp. 141–159. IEEE, 2021.
[11]
Brophy, J. and Lowd, D. Machine unlearning for random forests. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 1092–1104. PMLR, 2021.
[12]
Schelter, S., Grafberger, S., and Dunning, T. Hedgecut: Maintaining randomised trees for low-latency machine unlearning. In Li, G., Li, Z., Idreos, S., and Srivastava, D. (eds.), SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, pp. 1545–1557. ACM, 2021.
[13]
Yan, H., Li, X., Guo, Z., Li, H., Li, F., and Lin, X. ARCANE: An efficient architecture for exact machine unlearning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pp. 4006–4013, 2022.
[14]
Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. Graph unlearning. In Yin, H., Stavrou, A., Cremers, C., and Shi, E. (eds.), Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS 2022, Los Angeles, CA, USA, November 7-11, 2022, pp. 499–513. ACM, 2022.
[15]
Chen, C., Sun, F., Zhang, M., and Ding, B. Recommendation unlearning. In Laforest, F., Troncy, R., Simperl, E., Agarwal, D., Gionis, A., Herman, I., and Médini, L. (eds.), WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, pp. 2768–2777. ACM, 2022.
[16]
Wu, Y., Dobriban, E., and Davidson, S. B. Deltagrad: Rapid retraining of machine learning models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 10355–10366. PMLR, 2020.
[17]
Nguyen, Q. P., Low, B. K. H., and Jaillet, P. Variational bayesian unlearning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[18]
Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9301–9309. Computer Vision Foundation / IEEE, 2020.
[19]
Izzo, Z., Anne Smart, M., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pp. 2008–2016, 2021.
[20]
Mehta, R., Pal, S., Singh, V., and Ravi, S. N. Deep unlearning via randomized conditionally independent hessians. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10412–10421. IEEE, 2022.
[21]
Wu, G., Hashemi, M., and Srinivasa, C. PUMA: Performance unchanged model augmentation for training data removal. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pp. 8675–8682. AAAI Press, 2022.
[22]
Tanno, R., Pradier, M. F., Nori, A. V., and Li, Y. Repairing neural networks by leaving the right past behind. In NeurIPS, 2022.
[23]
Becker, A. and Liebig, T. Certified data removal in sum-product networks. In Li, P., Yu, K., Chawla, N. V., Feldman, R., Li, Q., and Wu, X. (eds.), IEEE International Conference on Knowledge Graph, ICKG 2022, Orlando, FL, USA, November 30 - Dec. 1, 2022, pp. 14–21. IEEE, 2022.
[24]
Jagielski, M., Thakkar, O., Tramèr, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A. G., Papernot, N., and Zhang, C. Measuring forgetting of memorized training examples. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[25]
Tarun, A. K., Chundawat, V. S., Mandal, M., and Kankanhalli, M. S. Deep regression unlearning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 33921–33939. PMLR, 2023.
[26]
Liu, J., Lou, J., Qin, Z., and Ren, K. Certified minimax unlearning with generalization rates and deletion capacity. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023., 2023.
[27]
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
[28]
Gürbüzbalaban, M., Simsekli, U., and Zhu, L. The heavy-tail phenomenon in SGD. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 3964–3975. PMLR, 2021.
[29]
Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural Comput., 6 (1): 147–160, 1994.
[30]
Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[31]
Qian, J., Wu, Y., Zhuang, B., Wang, S., and Xiao, J. Understanding gradient clipping in incremental gradient methods. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 1504–1512. PMLR, 2021.
[32]
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020.
[33]
Wright, S. Correlation and causation. Journal of agricultural research, 20 (7): 557–585, 1921.
[34]
Spearman, C. The proof and measurement of association between two things. 1961.
[35]
Basu, S., Pope, P., and Feizi, S. Influence functions in deep learning are fragile. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[36]
Basu, S., You, X., and Feizi, S. On second-order group influence functions for black-box predictions. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 715–724. PMLR, 2020.
[37]
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016.
[38]
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[39]
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.
[40]
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
[41]
Wu, L., Ma, C., and E, W. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8289–8298, 2018.
[42]
Gilmer, J., Ghorbani, B., Garg, A., Kudugunta, S., Neyshabur, B., Cardoze, D., Dahl, G. E., Nado, Z., and Firat, O. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022.
[43]
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[44]
Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 4954–4964, 2018.
[45]
Sagun, L., Evci, U., Güney, V. U., Dauphin, Y. N., and Bottou, L. Empirical analysis of the hessian of over-parametrized neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018.
[46]
Sagun, L., Bottou, L., and LeCun, Y. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

  1. Similar analysis has been conducted in other lines of work investigating the behavior of SGD; e.g., reference [28] focused on the properties of the matrix \(\mathbf{I} - \frac{\eta}{|\mathcal{B}|} \mathbf{H}\), which is termed multiplicative noise, serving as the main source of heavy tails in SGD and governing the update behavior of \(\mathbf{w}_{e,b}\).↩︎

  2. It’s noteworthy that in the realm of deep neural networks for most computer vision tasks, dataset size \(n\) usually falls short of the model parameter dimension \(d\). Even for neural language models, the empirical data-to-parameter scaling law is expected to remain at \(n \sim d^{0.74}\), as suggested in [32].↩︎