Modified Loss of Momentum Gradient Descent:
Fine-Grained Analysis
September 10, 2025
We analyze gradient descent with polyak1964some? heavy-ball momentum (HB) whose fixed momentum parameter \(\beta \in (0, 1)\) provides exponential decay of memory. Building on kovachki2021continuous?, we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size \(h\) is small enough. Although the modified loss does not admit a closed-form expression, we describe it with arbitrary precision and prove global (finite “time” horizon) approximation bounds \(O(h^{\ord})\) for any finite order \(\ord \geq 2\). We then conduct a fine-grained analysis of the combinatorics underlying the memoryless approximations of HB, in particular uncovering a rich family of polynomials in \(\beta\) hidden inside, which contains the Eulerian and Narayana polynomials. We derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the HB dynamics, generalizing rosca2023on?. Approximation theorems cover both full-batch and mini-batch HB. Our theoretical results shed new light on the main features of gradient descent with heavy-ball momentum, and outline a road-map for similar analysis of other optimization algorithms.
Gradient descent with polyak1964some? heavy-ball momentum (HB) is a well-known optimization algorithm used in practice. Given a loss function \(L\colon \mathbb{R}^d \to \mathbb{R}\), and initial conditions \(\btheta^{(0)} \in \mathbb{R}^d\) and \(\bv^{(0)} \in \mathbb{R}^d\), its full-batch iteration is \[\label{eq:hb} \begin{align} &\begin{pmatrix} \btheta^{(n + 1)}\\ \bv^{(n + 1)} \end{pmatrix} = \begin{pmatrix} \btheta^{(n)} - h \nabla L(\btheta^{(n)}) + h \beta \bv^{(n)}\\ \beta \bv^{(n)} - \nabla L(\btheta^{(n)}) \end{pmatrix},\quad n \in \mathbb{Z}_{\geq 0}, \end{align}\tag{1}\] or, more compactly, \[\label{eq:hb-iter-compact} \btheta^{(n + 1)} = \btheta^{(n)} - h \sum_{k = 0}^n \beta^k \nabla L(\btheta^{(n - k)}) + h \beta^{n + 1} \bv^{(0)},\tag{2}\] where \(\beta \in (0, 1)\) is a momentum (tuning) parameter, which we treat as a fixed constant throughout this article. The next iterate \(\btheta^{(n + 1)}\) depends on the whole history \(\crl{\btheta^{(s)}}_{s = 0}^n\), rather than just the current iterate \(\btheta^{(n)}\), which we interpret as having “memory”. This algorithm and its variants, often referred to as gradient descent (GD) with momentum, are widely used in modern machine learning goodfellow2016deep?, prince2023understanding?, krizhevsky2012imagenetclassificationdeep?, sutskever2013importanceinitializationmomentum?, he2016deep?, xie2017aggregated?, he2019bag?. A mini-batch version of this algorithm is often used when training large-scale deep learning models.
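To make the equivalence of 1 and 2 concrete, here is a minimal numerical sketch (a toy example of our own with an assumed quadratic loss, not part of the formal development) that runs both forms and confirms they produce the same iterates.

```python
import numpy as np

# Minimal check (not from the paper) that the pair form (1) and the compact form (2)
# of heavy-ball momentum produce the same iterates, on an assumed quadratic loss
# L(theta) = 0.5 * theta^T A theta.
rng = np.random.default_rng(0)
d, h, beta, n_steps = 3, 0.1, 0.9, 50
A = np.diag([1.0, 2.0, 5.0])
grad = lambda th: A @ th

theta0 = rng.normal(size=d)
v0 = rng.normal(size=d)

# Pair form (1): evolve (theta, v) jointly; the tuple assignment updates them simultaneously.
theta, v = theta0.copy(), v0.copy()
pair_iterates = [theta.copy()]
for _ in range(n_steps):
    theta, v = theta - h * grad(theta) + h * beta * v, beta * v - grad(theta)
    pair_iterates.append(theta.copy())

# Compact form (2): theta^{n+1} = theta^n - h * sum_k beta^k grad L(theta^{n-k}) + h * beta^{n+1} * v0.
theta_c = theta0.copy()
history = [theta_c.copy()]
for n in range(n_steps):
    acc = sum(beta**k * grad(history[n - k]) for k in range(n + 1))
    theta_c = theta_c - h * acc + h * beta ** (n + 1) * v0
    history.append(theta_c.copy())

print(max(np.linalg.norm(a - b) for a, b in zip(pair_iterates, history)))  # ~1e-15
```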
This article studies the theoretical properties of HB, considering both full-batch and mini-batch implementations, and establishes three main results. First, we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size \(h\) is small enough. Second, we describe the modified loss with arbitrary precision and prove global (finite “time” horizon) approximation bounds \(O(h^{\ord})\) for any finite order \(\ord \geq 2\). Finally, we derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the optimization dynamics of the algorithm. Our theoretical results not only shed new light on the main properties of HB, but also outline a road-map for similar analysis of other popular optimization algorithms.
kovachki2021continuous? proved that there exists a function \(\bg_h(\btheta)\) such that \(\crl{(\btheta, \bv): \bv = - (1 - \beta)^{-1} \nabla L(\btheta) + h \boldsymbol{g}_h(\btheta)}\) is an exponentially attractive invariant manifold. This perspective draws on the theory of attractive invariant manifolds fenichel1971persistence?, hirsch1970invariant?, wiggins2013normally?. By definition, on this manifold \[\btheta^{(n + 1)} = \btheta^{(n)} - h \nabla L(\btheta^{(n)}) + h \beta \prn[\bigg]{- \frac{\nabla L(\btheta^{(n)})}{1 - \beta} + h \boldsymbol{g}_h(\btheta^{(n)})},\] and therefore we obtain an algorithm representation without memory \[\label{eq:qcUUpn} \btheta^{(n + 1)} = \btheta^{(n)} - h \frac{1}{1 - \beta} \nabla L(\btheta^{(n)}) + h^2 \beta \boldsymbol{g}_h(\btheta^{(n)}).\tag{3}\]
Leveraging this insight, we further show that \(\boldsymbol{g}_h(\cdot)\) has a particular structure that makes 3 plain gradient descent.
1 establishes that \(\boldsymbol{g}_h\) is a gradient, given by \(\boldsymbol{g}_h := \nabla G_h\) with \(G_h\) implicitly defined as (the anti-derivative of) the solution of a fixed point equation. This implies that on an exponentially attractive invariant manifold, HB is plain gradient descent: \[\btheta^{(n + 1)} = \btheta^{(n)} - \frac{h}{1 - \beta} \nabla \crl[\big]{\underbrace{L - h \beta (1 - \beta) G_h}_{\text{modified loss}}}(\btheta^{(n)}).\]
We then conduct a fine-grained analysis of how the loss is modified by memory. From the definition of \(\boldsymbol{g}_h\), it is possible to write a fixed point equation that gives a formal power series expansion for \(\boldsymbol{g}_h\) in \(h\) (4.2). Furthermore, there are two goals we would like to achieve. First, we want to have precise approximation guarantees rather than just a formal series expansion. Second, it is practically useful to cover mini-batch HB, where the loss function can be different at each iteration, rather than just full-batch. To achieve both, we use a different approach.
Consider the mini-batch version of 2 with the typical setting \(\bv^{(0)} = \boldsymbol{0}\): \[\label{eq:hb-iter-minibatch} \btheta^{(n + 1)} = \btheta^{(n)} - h \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}(\btheta^{(n - k)}),\tag{4}\] where \(\crl{L^{(s)}}_{s \in \mathbb{Z}_{\geq 0}}\) are mini-batch losses. In practice, \(L^{(s)}\) is the loss function obtained by taking the \(s\)th mini-batch of samples. (For now, we are agnostic to how exactly the samples are batched.) Mini-batch training usually achieves higher test accuracies keskar2017on?, masters2018revisiting?, smith2020generalizationbenefitnoise?, and is therefore widely used in practice bottou2018optimization?, prince2023understanding?. cattaneo2025howmemoryoptimization? introduced a technique for converting a numerical optimization method with decaying memory into a memoryless one up to \(O(h^2)\) error, that is, a second-order approximation. This paper generalizes their technique and proves an approximation bound with any desired order.
For any approximation order \(\ordDef \in \mathbb{Z}_{\geq 2}\), 2 establishes a memoryless approximation \[\label{eq:memoryless-approx-contrib} \tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} + \sum_{j = 1}^{\ord} h^j \nomemcoef{j}{n}(\tilde{\btheta}^{(n)})\tag{5}\] with the global guarantee \[\label{eq:global-guarantee-contrib} \sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm{\btheta^{(n)} - \tilde{\btheta}^{(n)}} = O(h^{\ord}),\tag{6}\] where \(T\) is any “time” horizon. Furthermore, 3 establishes a tractable form of the memoryless iteration coefficients \(\nomemcoef{j}{n}(\btheta)\) involving a sum over unlabeled rooted trees with \(j\) vertices.
This contribution provides a higher-order, more detailed understanding of HB and its memoryless representation, which we leverage to obtain new insights on the implicit regularization and dynamics of the algorithm. For example, \[\nomemcoef{1}{n}(\btheta) = - \frac{1}{1 - \beta} \nabla L^{(n)}(\btheta), \quad \nomemcoef{2}{n}(\btheta) = - \beta \sum_{b = 0}^{n - 1} \beta^{b} \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla^2 L^{(n - 1 - b)}(\btheta) \nabla L^{(n - l' - b')}(\btheta).\] In the mini-batch case, such an expression can be insightful after averaging over permutations of samples smith2021on?, beneventano2024trajectoriessgdwithout?, cattaneo2025howmemoryoptimization?. We will illustrate it as follows. Assume that there are \(n + 1\) batches in an epoch, with each batch consisting of \(\batchsizeDef\) samples, and the \(k\)th mini-batch loss is given by \[L^{(k)}(\btheta) = \frac{1}{\batchsize} \sum_{r = k \batchsize + 1}^{k \batchsize + \batchsize} \ell_{\pi(r)}(\btheta),\quad k \in \range{0}{n},\] where \(\crl{\ell_s}_{s = 1}^{(n + 1) \batchsize}\) are per-sample losses and \(\pi\) is a random permutation of \(\range{1}{(n + 1) \batchsize}\), distributed uniformly over all \(((n + 1) \batchsize)!\) such permutations. This corresponds to sampling without replacement as common in practice. Denote \(L(\btheta) = [(n + 1) \batchsize]^{-1} \sum_{s = 1}^{(n + 1) \batchsize} \ell_s(\btheta)\) and let \[\label{eq:empirical-covariance} \Sigma_{i j} := \frac{1}{(n + 1) \batchsize} \sum_{p = 1}^{(n + 1) \batchsize} \partial_i (\ell_p - L) \partial_j (\ell_p - L)\tag{7}\] be the elements of the empirical covariance matrix \(\mSigma(\btheta)\) of per-sample gradients. Then \[\begin{align} &\Eletpi \prn[\big]{\nomemcoef{1}{n}(\btheta) + h \nomemcoef{2}{n}(\btheta)}\\ &\quad = - \frac{1}{1 - \beta} \nabla \prn[\bigg]{ L(\btheta) + \underbrace{h \frac{\beta + o_n(1)}{2 (1 - \beta)^2} \norm{\nabla L(\btheta)}^2}_{\text{regularization by memory}} + \underbrace{h \frac{\beta + o_n(1)}{2 (1 - \beta) (1 + \beta)} \frac{\trace \mSigma(\btheta)}{\batchsize}}_{\text{regularization by stochasticity}} }, \end{align}\] where \(\Eletpi\) denotes the expectation over \(\pi\), \(o_n(1)\) are terms that go to zero as \(n \to \infty\) (for fixed \(\beta\)) regardless of \(\batchsize \in \range{1}{n + 1}\) (16).
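To make the two penalty terms concrete, the following toy sketch (our own illustration with assumed least-squares per-sample losses and an arbitrary batch size; the \(o_n(1)\) corrections are dropped) evaluates the empirical covariance 7 and the two regularizers at a given \(\btheta\).

```python
import numpy as np

# Hedged sketch (toy setup, not from the paper): evaluate the empirical per-sample
# gradient covariance (7) and the two penalties "regularization by memory" and
# "regularization by stochasticity", for assumed per-sample losses
# ell_p(theta) = 0.5 * (x_p . theta - y_p)^2; the o_n(1) corrections are dropped.
rng = np.random.default_rng(1)
d, n_samples, B, h, beta = 4, 60, 10, 0.05, 0.9
X, y = rng.normal(size=(n_samples, d)), rng.normal(size=n_samples)
theta = rng.normal(size=d)

per_sample_grads = (X @ theta - y)[:, None] * X        # rows: gradient of ell_p
full_grad = per_sample_grads.mean(axis=0)              # gradient of L
centered = per_sample_grads - full_grad
Sigma = centered.T @ centered / n_samples              # matrix with entries (7)

memory_penalty = h * beta / (2 * (1 - beta) ** 2) * np.dot(full_grad, full_grad)
stochastic_penalty = h * beta / (2 * (1 - beta) * (1 + beta)) * np.trace(Sigma) / B
print(memory_penalty, stochastic_penalty)
```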
In the full-batch case \(L^{(s)} \equiv L\), for example, \[\begin{align} \nomemcoef{1}{n}(\btheta) = - \frac{1}{1 - \beta} \nabla L(\btheta), \quad \nomemcoef{2}{n}(\btheta) = - \frac{\beta + o_n(1)}{(1 - \beta)^3} \nabla^2 L(\btheta) \nabla L(\btheta),\\\nomemcoef{3}{n}(\btheta) = &- \frac{\beta (1 + \beta) + o_n(1)}{(1 - \beta)^5} \nabla^2 L(\btheta) \nabla^2 L(\btheta) \nabla L(\btheta)\\ &- \frac{\beta (1 + \beta) + o_n(1)}{2 (1 - \beta)^5} \nabla^3 L(\btheta) \brk[\big]{\nabla L(\btheta), \nabla L(\btheta)},\end{align}\] so the memoryless approximation of order \(\ord = 3\) is approximately \[\tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} - \frac{h}{1 - \beta} \nabla \underbrace{\prn[\bigg]{ L + \frac{h \beta}{2 (1 - \beta)^2} \norm{\nabla L}^2 + \frac{h^2 \beta (1 + \beta)}{4 (1 - \beta)^4} \prn[\big]{\nabla \norm{\nabla L}^2 \cdot \nabla L}}}_{\text{modified loss up to O(h^3)}}(\tilde{\btheta}^{(n)}),\] where the dot denotes the scalar product. We see two “implicit regularization” terms added to the loss by memory: the rescaled squared gradient norm and the rescaled directional derivative of \(\norm{\nabla L}^2\) along \(\nabla L\).
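The following numerical sketch (a toy one-dimensional example of our own, not used in any proof) starts HB on an \(O(h^3)\)-accurate approximation of the invariant manifold and compares it with the order-\(3\) memoryless iteration above; the gap should shrink roughly by a factor of \(8\) each time \(h\) is halved, consistent with the \(O(h^{3})\) guarantee.

```python
import numpy as np

# Hedged sketch (assumed 1-D loss L(t) = t^4/4 - t^2/2, our own choice): compare HB started
# on an approximate invariant manifold with the order-3 memoryless iteration from the
# truncated modified loss. The printed gap should decay roughly like h^3.
beta, T, theta0 = 0.5, 2.0, 1.2
L1 = lambda t: t**3 - t       # L'
L2 = lambda t: 3 * t**2 - 1   # L''
L3 = lambda t: 6 * t          # L'''

def g_approx(t, h):           # truncation of g_h(theta) up to O(h^2)
    return (-L2(t) * L1(t) / (1 - beta) ** 3
            - h * (1 + beta) / (2 * (1 - beta) ** 5) * L3(t) * L1(t) ** 2
            - h * (1 + beta) / (1 - beta) ** 5 * L2(t) ** 2 * L1(t))

def hb_on_manifold(h, n_steps):
    theta = theta0
    v = -L1(theta) / (1 - beta) + h * g_approx(theta, h)
    traj = [theta]
    for _ in range(n_steps):
        theta, v = theta - h * L1(theta) + h * beta * v, beta * v - L1(theta)
        traj.append(theta)
    return np.array(traj)

def memoryless_order3(h, n_steps):
    theta, traj = theta0, [theta0]
    for _ in range(n_steps):
        theta = theta - h / (1 - beta) * (
            L1(theta)
            + h * beta / (1 - beta) ** 2 * L2(theta) * L1(theta)
            + h**2 * beta * (1 + beta) / (2 * (1 - beta) ** 4) * L3(theta) * L1(theta) ** 2
            + h**2 * beta * (1 + beta) / (1 - beta) ** 4 * L2(theta) ** 2 * L1(theta))
        traj.append(theta)
    return np.array(traj)

for h in (0.02, 0.01, 0.005):
    n = int(T / h)
    print(h, np.max(np.abs(hb_on_manifold(h, n) - memoryless_order3(h, n))))
```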
In both full-batch and mini-batch cases, analyzing the combinatorics of \(\nomemcoef{j}{n}(\btheta)\) leads to interesting findings. First, in the limit \(n \to \infty\) and after multiplying by a power of \(1 - \beta\), we investigate the form of the coefficients accompanying the high-order loss derivatives in \(\nomemcoef{j}{n}(\btheta)\) and uncover a rich family of polynomials in \(\beta\): in particular, we prove that it contains Eulerian and Narayana polynomials (5.3). Second, using the natural heuristic that \(h\) multiplied by a high-order derivative (higher than two) of the loss is small, but \(h\) times the Hessian (\(h \nabla^2 L\)) does not have to be small mid-training cohen2021gradient?, andreyev2025edgestochasticstability?, we can ask what the “principal” part of 5 looks like, that is, after neglecting the derivatives of the loss of order higher than two (always multiplied by some positive power of \(h\)). This leads to what we can call the principal iteration \[\tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} - h \sigma_{\beta}\prn[\big]{h \nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}} \nabla L\prn[\big]{\tilde{\btheta}^{(n)}} + \np,\] after taking \(\ord\) formally to infinity, where \(\sigma_{\beta}(\cdot)\) is a power series expansion (in powers of \(z\)) of \[\sigma_{\beta}(z) = \frac{2}{1 - \beta + z + \sqrt{(1 - \beta - z)^2 - 4 \beta z}}\] (2), and \(\np\) means “non-principal terms” (with \(\nabla^3 L\) etc.). The term “principal iteration” comes from the analogous term principal flow coined in rosca2023on? for continuous modified equations of plain GD. Third, we combine our framework with continuous approximations (discussed next) to derive principal flow for HB (4). See rosca2023on? for a discussion of the importance of such (complex) flows: they capture oscillatory behavior and divergence, in contrast to standard continuous approximations. We provide an illustration in 5.4.
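As a quick sanity check (ours, using sympy), one can verify that the first coefficients of the power series of \(\sigma_{\beta}\) match the principal parts of the full-batch coefficients above, namely \((1 - \beta)^{-1}\), \(\beta (1 - \beta)^{-3}\), and \(\beta (1 + \beta) (1 - \beta)^{-5}\).

```python
import sympy as sp

# Hedged check (our own): the first coefficients of the power series of
# sigma_beta(z) = 2 / (1 - beta + z + sqrt((1 - beta - z)^2 - 4*beta*z))
# should reproduce 1/(1-beta), beta/(1-beta)^3, beta*(1+beta)/(1-beta)^5.
a, z = sp.symbols("a z", positive=True)   # a stands for 1 - beta
b = 1 - a                                 # beta
sigma = 2 / (a + z + sp.sqrt((a - z) ** 2 - 4 * b * z))
series = sp.series(sigma, z, 0, 3).removeO()
expected = 1 / a + b * z / a**3 + b * (1 + b) * z**2 / a**5
print(sp.simplify(series - expected))     # expect 0
```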
In the case of a quadratic loss, the principal iteration and principal flow are both exact (if HB is initialized on the manifold introduced above; see 5.4). To illustrate the implicit regularization effect further, consider the example of least-squares regression. Let \(\bX \in \mathbb{R}^{N \times d}\), \(\by \in \mathbb{R}^N\) and \[L(\btheta) = \frac{1}{2 N} \norm{\bX \btheta - \by}^2.\] Additionally, let \(\btheta^{\star}\) satisfy the normal equations \(\bX\trans \bX \btheta^{\star} = \bX\trans \by\); in particular, \(L(\btheta) = \frac{1}{2 N} (\btheta - \btheta^{\star})\trans \bX\trans \bX (\btheta - \btheta^{\star}) + C\) (where \(C\) does not depend on \(\btheta\)). Then, the memoryless iteration is the gradient descent \[\tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} - h \nabla \tilde{L}\prn[\big]{\tilde{\btheta}^{(n)}}\quad\text{with}\quad\tilde{L}(\btheta) = \frac{1}{2 N} (\btheta - \btheta^{\star})\trans \sigma_{\beta}\prn[\bigg]{\frac{h}{N} \bX\trans \bX} \bX\trans \bX (\btheta - \btheta^{\star}).\] Introducing \(\sigma_{\beta}\prn[\big]{\frac{h}{N} \bX\trans \bX}\) rescales the learning rate (by \((1 - \beta)^{-1}\)) and applies a spectral filter to the Hessian.
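The sketch below (a toy example of our own; it assumes \(h\) is small enough that the square root in \(\sigma_{\beta}\) stays real for all Hessian eigenvalues, and takes \(\by = \boldsymbol{0}\) so that \(\btheta^{\star} = \boldsymbol{0}\)) builds the spectral filter by applying the scalar \(\sigma_{\beta}\) to the eigenvalues of \(\frac{h}{N} \bX\trans \bX\), and checks per eigen-direction that the filtered-GD contraction factor coincides with the “slow” eigenvalue of the corresponding \(2 \times 2\) HB matrix, reflecting the exactness for quadratic losses mentioned above.

```python
import numpy as np

# Hedged sketch (our own toy example, y = 0 so theta* = 0): build sigma_beta(h/N * X^T X)
# via the eigendecomposition of the Hessian, take one step of the filtered (memoryless) GD,
# and compare the per-mode contraction factor 1 - h*lambda*sigma_beta(h*lambda) with the
# larger ("slow") eigenvalue of the 2x2 HB matrix for a quadratic loss.
rng = np.random.default_rng(0)
N, d, h, beta = 20, 3, 0.01, 0.5
X = rng.normal(size=(N, d))
A = X.T @ X / N                                     # Hessian of L

def sigma(z):
    return 2.0 / (1 - beta + z + np.sqrt((1 - beta - z) ** 2 - 4 * beta * z))

lam, Q = np.linalg.eigh(A)
filter_matrix = Q @ np.diag(sigma(h * lam)) @ Q.T   # sigma_beta(h * A)

theta = rng.normal(size=d)
theta_next = theta - h * filter_matrix @ A @ theta  # one step of the filtered GD

for ev in lam:
    hb_matrix = np.array([[1 - h * ev, h * beta], [-ev, beta]])
    slow = max(np.linalg.eigvals(hb_matrix), key=abs)
    print(ev, 1 - h * ev * sigma(h * ev), slow.real)  # the last two numbers should agree
```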
The discrete iteration without memory 5 has the following advantage over HB: it only has one evolving variable \(\btheta^{(n)} \in \mathbb{R}^d\), and it can be approximated by a continuous ODE solution without sacrificing anything in terms of the approximation guarantee. (Enlarging the phase space by treating HB as an evolution of \((\btheta^{(n)}, \bv^{(n)}) \in \mathbb{R}^{2 d}\) or, equivalently, considering higher-order ODEs does not lead to an improvement in the approximation guarantee kovachki2021continuous?.) Finding continuous trajectories that closely track the discrete numerical iterates then makes it possible to analyze qualitatively the finite-\(h\) behavior of the iteration. This approach, widely used in numerical analysis and machine learning, is called the method of modified equations, backward error analysis (BEA), or high-resolution ODE approximation: it has been employed and studied for many decades, e.g., in wilkinson1960error?, warming1974modified?, griffiths1986scope?, feng1991formal?, beyn1991numerical?, eirola1993aspects?, Hairer1994Backward?, calvo1994modified?, reich1999backward?, chartier2007numerical?, shi2022understanding?, farazmand2020multiscale?, kovachki2021continuous? and many others; textbook references include stuart1998dynamical?, hairer2006?. Recently, this method was used in machine learning for finding implicit biases of gradient-based algorithms barrett2021implicit?, smith2021on?, miyagawa2022toward?, ghosh2023implicit?, pmlr-v235-cattaneo24a?, rosca2023on?. For us, it is just a corollary of the approximation 5 .
For any approximation order \(\ord \in \mathbb{Z}_{\geq 2}\), 1 finds a set of functions \(\crl{\beacoef{j}{n}(\btheta): j \in \range{1}{\ord}, n \in \mathbb{Z}_{\geq 0}}\) such that the (unique) continuous solution to the piecewise ODE \[\dot{\btheta}(t) = \sum_{j = 1}^{\ord} h^{j - 1} \beacoef{j}{n}(\btheta(t)), \quad t \in [n h, (n + 1)h],\] with an appropriate initial condition, has a global approximation guarantee \[\label{eq:ode-approx-contrib} \sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm{\btheta^{(n)} - \btheta(n h)} = O(h^{\ord}),\tag{8}\] where \(T\) is any time horizon.
The ODE is defined piecewise because of the dependence of \(\nomemcoef{j}{n}(\cdot)\) on \(n\), and the solution will be continuous but not necessarily smooth. Nonetheless, in the full-batch case, the ODEs corresponding to neighboring pieces are the same up to an exponentially decaying error: we do not lose any information compared to having an ODE only on the attractive manifold but globally defined, as done for \(\ord = 2\) in kovachki2021continuous?. The idea of using piecewise ODEs in this setting is due to ghosh2023implicit?, who derived the modified equations for \(\ord = 2\).
In contrast to the discrete case, the right-hand side of the modified equation does not become a gradient for large \(n\), even for full-batch HB. Therefore, HB does not approximately follow a smooth gradient flow trajectory with a modified loss (with approximation order higher than \(h^2\)), as we make explicit in [rem:not-gf].
The paper continues as follows. In an attempt to make the paper self-contained, we introduce the relevant mathematical concepts and notation (e. g., labeled and unlabeled rooted trees, their symmetry coefficients, marking vertices to split the tree into a forest) in 1.5. 2 states the theorem and briefly outlines the argument corresponding to [contrib:manifold]. 3 is devoted to the main approximation theorem corresponding to [contrib:approx], removing memory from HB and controlling the error while doing so. In particular, we introduce a somewhat complicated-looking recursive definition for the memoryless iteration coefficients \(\nomemcoef{j}{n}(\btheta)\). In 4, we analyze the form of these coefficients, and prove that a much simpler characterization holds which involves the sum over rooted trees. We use the notation introduced to solve a fixed-point equation for \(\bg_h\) as a series over rooted trees. 5 reports corollaries and implications of our main theorems, including an approximation theorem for the modified equation corresponding to [contrib:modified-eq], a study of the polynomials arising in the form of \(\nomemcoef{j}{n}(\btheta)\), a formalization of the principal iteration, and a discussion of principal flow. All proof strategies are explained in the main text, and omitted technical details are provided in the Appendix.
We start with well-known definitions of labeled and unlabeled rooted trees. See, for example, faris2021rootedtreegraphs?, Kerber1999? for details. For a fixed non-empty finite set \(V\) and a point \(r \in V\), a labeled rooted tree \(\tau\) with root \(r\) is a pair \((G, r)\), where \(G = (V, E)\) is a connected acyclic graph with \(V\) as the vertex set (where \(E\) is the edge set; \(\abs{E} = \abs{V} - 1\)). Taking a different point of view, \(\tau\) is a function \(\tau\colon V \setminus \crl{r} \to V\) (mapping each non-root vertex to its parent) with no non-empty invariant subset. Let \(\mathcal{A}[V]\) denote all labeled rooted trees with vertex set \(V\). We will write \(\abs{\tau} := \abs{V}\) for the number of vertices. It will also be convenient to use the notation \(\mathcal{A} := \bigcup_{m = 1}^{\infty} \mathcal{A}\range{1}{m}\) and \(\mathcal{A}_{\varnothing} := \crl{\varnothing} \cup \mathcal{A}\).
Two labeled rooted trees \(\tau\colon V \setminus \crl{r} \to V\) and \(\tau'\colon V' \setminus \crl{r'} \to V'\) are called isomorphic if there is a bijection \(\mathcal{b}\colon V \to V'\) such that \(\mathcal{b}(r) = r'\) and \(\tau' = \mathcal{b} \circ \tau \circ \left.\mathcal{b}^{-1}\right|_{V' \setminus \crl{r'}}\). An unlabeled rooted tree with \(m\) vertices is an object that corresponds to a class of isomorphic labeled rooted trees with \(m\) vertices. Specifically, we can fix a canonical set of \(m\) elements \(\range{1}{m}\) and see an unlabeled rooted tree as an orbit of the permutation group \(S\range{1}{m}\) acting on \(\mathcal{A}\range{1}{m}\): the action of \(\pi \in S\range{1}{m}\) on \(\tau \in \mathcal{A}\range{1}{m}\) is \(\pi \tau = \pi \circ \tau \circ \left.\pi^{-1}\right|_{\range{1}{m} \setminus \crl{\pi(r)}}\). The set of unlabeled rooted trees with \(m\) vertices will be denoted \(\tilde{\mathcal{A}}[m]\), and the set of unlabeled rooted trees with any number of vertices by \(\tilde{\mathcal{A}} := \bigcup_{m = 1}^{\infty} \tilde{\mathcal{A}}[m]\); in addition, put \(\tilde{\mathcal{A}}_{\varnothing} := \crl{\varnothing} \cup \tilde{\mathcal{A}}\). For example, the following orbit (consisting of three labeled rooted trees) is the unlabeled rooted tree \(\rootedtree[[][]]\): \[\label{eq:kljqh} \bigrootedtree[1[2][3]]\quad \bigrootedtree[2[1][3]]\quad \bigrootedtree[3[1][2]]\tag{9}\] For any such orbit, the order of the stabilizer group of each element is the same. (The stabilizer group is the subgroup of permutations leaving a labeled rooted tree intact. For example, the first tree in 9 has stabilizer group \(\crl{\text{id}, 2 \leftrightarrow 3}\).) Hence, for an unlabeled rooted tree \(\tau\) we can define the symmetry coefficient \(\sigma(\tau)\) as the order of the stabilizer group of each element. By the orbit-stabilizer theorem, the length of the orbit is the order of the group, \(m!\), divided by the order of each stabilizer group, \(\sigma(\tau)\); so, there are \(m! / \sigma(\tau)\) labeled rooted trees (on a fixed vertex set of size \(m\)) corresponding to an unlabeled rooted tree \(\tau\).
Whenever there is no chance of confusion, we will use the same terms and symbols when working with labeled and unlabeled rooted trees, for example, we will write \(\sigma(\tau)\) regardless of whether \(\tau \in \mathcal{A}\) or \(\tau \in \tilde{\mathcal{A}}\) (because each labeled rooted tree has a corresponding unlabeled rooted tree); we will say an unlabeled rooted tree \(\tau \in \tilde{\mathcal{A}}[m]\) “has \(\abs{\tau} = m\) vertices” even though there is no such thing as a vertex of an unlabeled rooted tree, etc.
Since the ordering of subtrees does not matter, there is a one-to-one correspondence between \(\tau \in \tilde{\mathcal{A}}[m]\) and a multiset \([\tau_1, \ldots, \tau_{\ell}]\) of unlabeled rooted trees whose sum of vertices is \(m - 1\) (the subtrees rooted at the children of \(\tau\)’s root). We will sometimes write simply \(\tau = [\tau_1, \ldots, \tau_{\ell}]\) to reflect this fact. Fixing some canonical ordering in each set \(\tilde{\mathcal{A}}[s]\), we will denote by \(\crl[\big]{\mu_1^s(\tau), \ldots, \mu_{\abs{\tilde{\mathcal{A}}[s]}}^s(\tau)}_{s = 1}^{m - 1}\) the multiplicities of such subtrees: this means that the first unlabeled subtree with \(s\) vertices appears \(\mu_1^s(\tau)\) times, the second one appears \(\mu_2^s(\tau)\) times, and so on. In particular, \(\sum_{s = 1}^{m - 1} \prn[\big]{\mu_1^s(\tau) + \ldots + \mu_{\abs{\tilde{\mathcal{A}}[s]}}^s(\tau)} = \ell\). It is a standard fact that \[\label{eq:recursive-formula-for-sym-coef} \sigma(\tau) = \sigma(\tau_1) \ldots \sigma(\tau_{\ell}) \prod_{s = 1}^{m - 1} \prod_{t = 1}^{\abs{\tilde{\mathcal{A}}[s]}} \mu_t^s(\tau)!.\tag{10}\]
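As an illustration (our own encoding, not from the paper), the recursion 10 is easy to implement with unlabeled rooted trees represented as sorted tuples of child subtrees; for the tree in 9 it returns \(\sigma(\tau) = 2\), and hence an orbit of \(3! / 2 = 3\) labeled rooted trees.

```python
from math import factorial
from collections import Counter

# Hedged sketch (our own encoding): an unlabeled rooted tree is a sorted tuple of its
# child subtrees, e.g. leaf = (), cherry = ((), ()). sigma() implements the recursion (10):
# the product of the children's symmetry coefficients times the factorials of the
# multiplicities of repeated child subtrees.
def canon(children):
    return tuple(sorted(children))

def sigma(tree):
    result = 1
    for child, mult in Counter(tree).items():
        result *= sigma(child) ** mult * factorial(mult)
    return result

leaf = canon([])
cherry = canon([leaf, leaf])             # the tree in (9): root with two leaves
chain3 = canon([canon([leaf])])          # root - child - grandchild

print(sigma(cherry), factorial(3) // sigma(cherry))  # 2 and 3: the orbit (9) has 3 labeled trees
print(sigma(chain3))                                  # 1
```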
In addition, it will be convenient to define the chain \(\mathcal{c}_m \in \tilde{\mathcal{A}}[m]\) with \(m\) vertices: for \(m = 1\) it is the single-vertex tree, and for \(m \geq 2\) it is the (unique) unlabeled rooted tree corresponding to any element of \(\mathcal{A}\range{1}{m}\) whose root has degree 1 and whose other vertices have degree at most 2.
For a labeled rooted tree \(\tau \in \mathcal{A}[V]\), define a marking \(\mrk\) of that tree as a subset of non-root vertices of \(\tau\) (interpreted as marked) such that if \(v \in V\) is marked, then no other vertices in the subtree of \(v\) are marked. In other words, the marked vertices are the upper (farther from the root) vertices of an admissible cut (e.g., Bai2024PostGroups?), or the roots of the forest obtained after removing a subtree with the same root faris2021rootedtreegraphs?, chartier2010algebraic?. Hence, marking \(\abs{\mrk}\) vertices is the same as selecting \(\abs{\mrk}\) disjoint subtrees \(\tau_1^{\mrk}, \ldots, \tau^{\mrk}_{\abs{\mrk}}\) (rooted at those vertices), and it splits the tree \(\tau\) into \(\abs{\mrk} + 1\) labeled rooted trees \(\tau_0^{\mrk} = \tau \setminus (\tau_1^{\mrk} \cup \ldots \cup \tau^{\mrk}_{\abs{\mrk}})\), \(\tau_1^{\mrk}, \ldots, \tau^{\mrk}_{\abs{\mrk}}\). So, we will sometimes write \(\mrk = (\tau_0^{\mrk}, \crl{\tau_1^{\mrk}, \ldots, \tau^{\mrk}_{\abs{\mrk}}})\) to reflect this fact. Denote by \(\mrkset_{\tau, i}\) the set of all markings of \(\tau\) with \(\abs{\mrk} = i\) marked vertices, and by \(\mrkset_{\tau} := \bigcup_{i = 0}^{\abs{V} - 1} \mrkset_{\tau, i}\) the set of all markings of \(\tau\).
We denote by \(L(\btheta)\) the full-batch loss and by \(L^{(s)}(\btheta)\) the \(s\)th mini-batch loss, where \(\btheta \in \mathbb{R}^d\) is the evolving parameter of fixed dimension \(d\). We denote by \(\nabla^k\) the \(k\)th derivative tensor, for example, \(\nabla^3 L(\btheta) \brk[\big]{\nabla L(\btheta), \nabla L(\btheta)}\) is a vector whose \(j\)th component is \(\sum_{i, l = 1}^d \partial_{j i l} L(\btheta) \partial_i L(\btheta) \partial_l L(\btheta)\). The norm \(\norm{\cdot}\) without indices always means the Euclidean (operator) norm. The set of integers no smaller than some \(k\) will be denoted by \(\mathbb{Z}_{\geq k} := [k, + \infty) \cap \mathbb{Z}\), and the set of integers between \(a\) and \(b\) inclusive by \(\range{a}{b} := [a, b] \cap \mathbb{Z}\). A sum over an empty set is by definition \(0\), and a product is \(1\). 3 defines the set \(\mathcal{K}_{i, l}\), the memoryless iteration coefficients \(\nomemcoef{j}{n}(\btheta)\), the history coefficients \(\histcoef{m}{n}{a}(\btheta)\); 5.1 defines the backward error analysis coefficients \(\beacoef{j}{n}(\btheta)\). For any operator \(\ba_{n, h}\) such that \(\norm{\ba_{n, h}}\) is defined, we write \(\ba_{n, h} = O(g)\) if \(\sup_{n, h} \norm{\ba_{n, h}} \leq C g\) with some constant \(C\), where the supremum is over the set of admissible \(n\) and \(h\) which is clear from context.
kovachki2021continuous? proved that for full-batch HB 1 there exists a function \(\boldsymbol{g}_h(\btheta)\) such that \(\crl{(\btheta, \bv): \bv = - (1 - \beta)^{-1} \nabla L(\btheta) + h \boldsymbol{g}_h(\btheta)}\) is an invariant manifold, which means \[\bv^{(n)} = - \frac{\nabla L(\btheta^{(n)})}{1 - \beta} + h \boldsymbol{g}_h(\btheta^{(n)}) \quad \Rightarrow \quad \bv^{(n + 1)} = - \frac{\nabla L(\btheta^{(n + 1)})}{1 - \beta} + h \boldsymbol{g}_h(\btheta^{(n + 1)}).\]
On this invariant manifold, on the one hand, \[\begin{align} \bv^{(n + 1)} &= - (1 - \beta)^{-1} \nabla L(\btheta^{(n + 1)}) + h \boldsymbol{g}_h(\btheta^{(n + 1)})\\ &= - \frac{1}{1 - \beta} \nabla L \prn[\bigg]{\btheta^{(n)} - \frac{h}{1 - \beta} \nabla L(\btheta^{(n)}) + h^2 \beta \boldsymbol{g}_h(\btheta^{(n)})}\\ &\quad + h \boldsymbol{g}_h \prn[\bigg]{\btheta^{(n)} - \frac{h}{1 - \beta} \nabla L(\btheta^{(n)}) + h^2 \beta \boldsymbol{g}_h(\btheta^{(n)})}, \end{align}\] while, on the other hand, \[\begin{align} \bv^{(n + 1)} &= \beta \bv^{(n)} - \nabla L(\btheta^{(n)}) = - \frac{\beta}{1 - \beta} \nabla L(\btheta^{(n)}) + h \beta \boldsymbol{g}_h(\btheta^{(n)}) - \nabla L(\btheta^{(n)})\\ &= - \frac{1}{1 - \beta} \nabla L(\btheta^{(n)}) + h \beta \boldsymbol{g}_h(\btheta^{(n)}). \end{align}\] This must hold for any \(\btheta^{(n)}\), so \(\boldsymbol{g}_h(\cdot)\) must satisfy \[\label{eq:uruw} \begin{align} &- \frac{1}{1 - \beta} \nabla L \prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}_h(\btheta)} + h \boldsymbol{g}_h \prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}_h(\btheta)}\\ &\quad = - \frac{1}{1 - \beta} \nabla L(\btheta) + h \beta \boldsymbol{g}_h(\btheta),\quad \btheta \in \mathbb{R}^d. \end{align}\tag{11}\]
To prove that such a function exists, they define a mapping \(T\colon \Gamma \to \Gamma\), where \(\Gamma\) is an appropriately chosen closed subset of \(C(\mathbb{R}^d, \mathbb{R}^d)\) with the usual sup-norm, by \[\label{eq:def-of-t} T \boldsymbol{g}(\bzeta) = \frac{1}{h (1 - \beta)} \crl[\big]{\nabla L(\bzeta) - \nabla L(\btheta)} + \beta \boldsymbol{g}(\btheta),\tag{12}\] where \(\btheta \leftrightarrow \bzeta = \btheta - h (1 - \beta)^{-1} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}(\btheta)\) is a bijection between \(\mathbb{R}^d\) and itself for any \(\boldsymbol{g} \in \Gamma\) (this fact is 4). A fixed point of such a mapping \(T\) would satisfy 11 . They prove that \(T\) is a contraction on \(\Gamma\), allowing the contraction mapping principle to be applied.
We follow the same conceptual approach, but apply some technical tweaks (different metric, different \(\Gamma\)), to prove that the additional requirement \(\bg_h \in C^1(\mathbb{R}^d, \mathbb{R}^d)\) with symmetric \(\nabla \bg_h\) is satisfiable as well, which implies that such \(\bg_h\) is a gradient. Specifically, define the set \[\Gamma := \crl[\big]{\boldsymbol{g} \in C^1(\mathbb{R}^d, \mathbb{R}^d): \sup_{\btheta \in \mathbb{R}^d} \norm{\boldsymbol{g}} \leq \gamma, \sup_{\btheta \in \mathbb{R}^d} \norm{\nabla \boldsymbol{g}} \leq \delta, \nabla \boldsymbol{g}\text{ is symmetric and \lambda-Lipschitz}},\] with the norm \[\label{eq:gamma-norm} \| \boldsymbol{g} \|_{\Gamma} := \sup_{\btheta \in \mathbb{R}^d} \| \boldsymbol{g}(\btheta) \| + \sup_{\btheta \in \mathbb{R}^d} \| \nabla \boldsymbol{g}(\btheta) \|,\tag{13}\] where both the vector norm and the matrix norm in the right-hand side are Euclidean. The space \(\crl[\big]{\boldsymbol{g} \in C^1(\mathbb{R}^d, \mathbb{R}^d): \| \boldsymbol{g} \|_{\Gamma} < \infty}\) with norm 13 is a Banach space, and \(\Gamma\) is a closed subset of it; therefore \(\Gamma\) can be seen as a complete metric space with the metric induced by the norm \(\| \cdot \|_{\Gamma}\). The constants \(\gamma\), \(\delta\), \(\lambda\) are chosen in 7.
The assumptions of the theorem are essentially the same as in kovachki2021continuous?, except we (naturally) need one more derivative of the loss to be Lipschitz.
Theorem 1. Assume \(L(\cdot) \in C^3(\mathbb{R}^d, \mathbb{R})\) with constants \(D_1, D_2, D_3, D_4\) such that \[\sup_{\btheta} \norm{\nabla^j L(\btheta)} \leq D_j\text{ for j \in \range{1}{3}}, \quad \sup_{\btheta_1 \neq \btheta_2} \frac{\norm{\nabla^3 L(\btheta_1) - \nabla^3 L(\btheta_2)}}{\norm{\btheta_1 - \btheta_2}} \leq D_4,\] where the norms are Euclidean operator norms. Then, if \(h\) is small enough, there exists a unique \(\bg_h(\cdot) \in \Gamma\) satisfying 11 , and the following exponential attractivity property holds: \[\norm[\bigg]{\bv^{(n)} + \frac{1}{1 - \beta} \nabla L(\btheta^{(n)}) - h \bg_h(\btheta^{(n)})} \leq (\beta + h^2 \beta \delta)^n \norm[\bigg]{\bv^{(0)} + \frac{1}{1 - \beta} \nabla L(\btheta^{(0)}) - h \bg_h(\btheta^{(0)})}.\]
The proof is a bounding argument very heavy on long equations, so it is moved to 7. 5 proves that indeed \(T\) maps \(\Gamma\) to itself, and 6 proves that \(T\) is a contraction on \(\Gamma\). Since \(\Gamma\) is a complete metric space with the metric \(\norm{\bg_1 - \bg_2}_{\Gamma}\), the contraction mapping principle implies that there is a unique fixed point \(\bg_h \in \Gamma\) of the operator \(T\), i.e., \(T \bg_h = \bg_h\). Exponential attractivity is tackled in 7.
cattaneo2025howmemoryoptimization? introduced a method for removing memory in a class of numerical optimization algorithms with decaying memory, which can be applied to HB up to \(O(h^2)\). Following their idea, we prove the approximation of HB by a memoryless iteration with the global error bound \(O(h^{\ord})\) where \(\ord \in \mathbb{Z}_{\geq 2}\). Considering higher-order approximations requires additional notation and technical work. Denoting \[\mathcal{K}_{i, l} = \crl[\big]{(k_0, \ldots, k_l) \in \mathbb{Z}_{\geq 0}^{l + 1}: k_0 + \ldots + k_l = i, k_1 + \ldots + k_l = l},\] define the memoryless iteration coefficients \[\label{eq:nomem-coef-def} \begin{align} \nomemcoef{1}{n}(\btheta) &= - \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}(\btheta),\\ \nomemcoef{m}{n}(\btheta) &= - \sum_{k = 1}^n \beta^k \sum_{\substack{i, l \geq 0\\i + l = m - 1}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!}\\ &\qquad \times \nabla^{i + 1} L^{(n - k)}(\btheta) \prn[\bigg]{\underbrace{\histcoef{1}{n}{k}(\btheta)}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{k}(\btheta)}_{\text{i_l times}}},\quad m \geq 2, \end{align}\tag{14}\] where the history terms satisfy the iteration \[\label{eq:jfjl} \histcoef{m}{n}{a}(\btheta) = - \sum_{s = 1}^a \sum_{\substack{j \geq 1, i, l \geq 0\\i + j + l = m}} \sum_{(k_0, \ldots, k_l) \in \mathcal{K}_{i, l}} \frac{1}{k_0! \ldots k_l!} \nabla^i \nomemcoef{j}{n - s}(\btheta) \prn[\bigg]{\underbrace{\histcoef{1}{n}{s}(\btheta)}_{\text{k_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{s}(\btheta)}_{\text{k_l times}}},\tag{15}\] with \(a \in \range{1}{n}\), and \(m \in \mathbb{Z}_{\geq 1}\).
Although this recursion may look complex, we will use it as the main definition for \(\crl{\nomemcoef{m}{n}(\btheta)}\) because it is convenient for proving 2. It involves taking high-order derivatives of recursively defined quantities, specifically \(\nabla^i \nomemcoef{j}{n - s}\) in 15 , and for this reason it is hard to analyze. We provide a different, more natural, form in 4, where only the loss function is differentiated.
Theorem 2 (Approximation by a memoryless iteration). Let a family of loss functions \(\crl{L^{(n)}}_{n \geq 0}\) be defined on an open convex domain \(\mathcal{D}\) in \(\mathbb{R}^d\), and assume each loss function \(L^{(n)}\colon \mathcal{D} \to \mathbb{R}\) is \(2 \ord\)-times continuously differentiable with derivatives up to order \(2 \ord\) bounded uniformly in \(n\), where \(\ord \in \mathbb{Z}_{\geq 2}\). Let \(\crl[\big]{\btheta^{(n)}}_{n = 0}^{\infty} \subset \mathcal{D}\) be the HB iteration 4 with initial condition \(\btheta^{(0)} \in \mathcal{D}\), and \(\crl[\big]{\tilde{\btheta}^{(n)}}_{n = 0}^{\infty} \subset \mathcal{D}\) the iteration 5 with the same initial condition \(\tilde{\btheta}^{(0)} = \btheta^{(0)}\). Let \(T > 0\) be a fixed “time” horizon. Then, for each \(h \in (0, 1 / 2)\), \[\label{eq:onestep-error-bound} \sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm[\bigg]{\tilde{\btheta}^{(n + 1)} - \tilde{\btheta}^{(n)} + h \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}} \leq \oC{clocal} h^{\ord + 1}\qquad{(1)}\] and, as a consequence, \[\label{eq:global-error-bound} \sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm[\big]{\btheta^{(n)} - \tilde{\btheta}^{(n)}} \leq \oC{cglobal} h^{\ord},\qquad{(2)}\] where \(\nC{clocal}\) and \(\nC{cglobal}\) are some constants depending on \(T\).
To build intuition, we offer a sketch of our proof next; the full argument can be found in 8.
By definition, ?? will follow from the following bound on the error introduced by removing memory: \[\label{eq:kwPSbq} - \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}} \overset{?}{=} \sum_{m = 0}^{\ord - 1} h^{m} \nomemcoef{m + 1}{n}\prn[\big]{\tilde{\btheta}^{(n)}} + O(h^{\ord}),\tag{16}\] that is, our task is to rewrite the left-hand side in such a way that instead of a function of all previous \(\tilde{\btheta}^{(n - k)}\) it becomes just a function of \(\tilde{\btheta}^{(n)}\).
It is shown by induction (8) that \(\nomemcoef{j}{n}(\btheta) = O(1)\), which implies \(\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)} = O(k h)\) (9). This allows us to conclude that the remainder satisfies \(\text{Rem}^{(n - k)} = O(k^{\ord} h^{\ord})\) in the Taylor expansion \[\label{eq:lqyfdV} \begin{align} \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}} = &\sum_{i = 0}^{\ord - 1} \frac{1}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}, \ldots, \tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}_{\text{i times}}}\\ &+ \text{Rem}^{(n - k)}, \end{align}\tag{17}\] where \[\begin{gather} \text{Rem}^{(n - k)} = \frac{1}{(\ord - 1) !} \int_0^1 (1 - t)^{\ord - 1}\\ \times \nabla^{\ord + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)} + t \prn[\big]{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}, \ldots, \tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}_{\text{\ord times}}}\,\mathrm{d} t. \end{gather}\] This is partial progress: we rewrote \(\nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}\) in such a way that it depends only multi-linearly on (copies of) \(\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}\), while all loss derivatives are evaluated at the current iterate \(\tilde{\btheta}^{(n)}\). Now we need to express \(\tilde{\btheta}^{(n - k)}\) through \(\tilde{\btheta}^{(n)}\). This is done in the following lemma, which clarifies the meaning of the history terms \(\histcoef{m}{n}{k}(\btheta)\):
Lemma 1. We have for \(r \in \range{1}{\ord}\), \(k \in \range{1}{n}\) \[\label{eq:lkjru} \tilde{\btheta}^{(n - k)} = \tilde{\btheta}^{(n)} + \sum_{m = 1}^{r - 1} h^m \histcoef{m}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}} + O(k^r h^r).\qquad{(3)}\]
The proof is by induction and given in 8. Inserting this history expansion ?? into the main term in the right-hand side of 17 and carefully keeping track of the error gives \[\begin{align} &\nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}\\ &\quad= \sum_{m = 0}^{\ord - 1} h^m \sum_{\substack{i, l \geq 0\\i + l = m}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}}\\ &\qquad + O(k^{\ord} h^{\ord}). \end{align}\]
It remains to sum this over \(k\) with the exponentially decaying weight \(\beta^k\). The error \(O(k^{\ord} h^{\ord})\) is polynomial in \(k\), but this is not a problem because it turns into \(\sum_{k = 0}^n \beta^k O(k^{\ord} h^{\ord}) = O(h^{\ord})\) after the exponentially weighted summation. The coefficients \(\nomemcoef{m}{n}\prn[\big]{\btheta}\) are defined in such a way that the result is exactly 16 that we need, proving the local error bound ?? . A standard argument (10) for converting a local error bound to a global error bound gives ?? .
Let us write the first few terms \(\nomemcoef{m}{n}(\btheta)\) from 14 and describe the pattern that can be used to generate them. To declutter notation, we will omit the argument \(\btheta\) since it will be fixed. By definition, the first memoryless iteration coefficient is just the exponential average of past gradients: \[\nomemcoef{1}{n} = - \sum_{b = 0}^n \beta^b \nabla L^{(n - b)}.\]
Using the definition, we can also write the following expression for \(\nomemcoef{2}{n}\): \[\nomemcoef{2}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^{b} \nabla^2 L^{(n - 1 - b)} \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla L^{(n - l' - b')}.\] This triple sum on the right can be generated as follows. Write the rooted tree consisting of two nodes \((1, 2)\) with corresponding parents \((\varnothing, 1)\). Let us introduce a variable \(l\) with value \(1\), which we will call the memory distance variable. Corresponding to the root \(1\), write the symbolic “sum” \[\label{eq:lkjlkfjy} \sum_{b = 0}^{n - l} \beta^b \nabla^2 L^{(n - l - b)} \sum_{l' = 1}^{b + 1} \square = \sum_{b = 0}^{n - 1} \beta^b \nabla^2 L^{(n - 1 - b)} \sum_{l' = 1}^{b + 1} \square\tag{18}\] and call \(l'\) the new memory distance variable (it is now being summed over, so it does not have a fixed value). We write the second derivative tensor \(\nabla^2 L\) (matrix in this case) because the number of children is \(2\); later, the order of the derivative will be \(\ell + 1\) where \(\ell\) is the number of children. Let us go down the tree and consider node \(2\). Replace \(\square\) in 18 with the corresponding expression: \[\sum_{b' = 0}^{n - l'} \beta^{b'} \nabla L^{(n - l' - b')}.\] Again, the order of the derivative \(\nabla L\) is equal to the number of children (zero) plus one. The upper limit of the sum is \(n\) minus the current memory distance variable. We do not write a trailing sum at the end like in 18 because there are no children.
We will further illustrate this process by looking at \(\nomemcoef{3}{n}\). Using \(\eqref{eq:nomem-coef-def}\) again and after some algebra, we can write \[\begin{align} \nomemcoef{3}{n} = &- \beta \sum_{b = 0}^{n - 1} \beta^b \nabla^2 L^{(n - 1 - b)} \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla^2 L^{(n - l' - b')} \sum_{l'' = 1}^{l' + b'} \sum_{b'' = 0}^{n - l''} \beta^{b''} \nabla L^{(n - l'' - b'')}\tag{19}\\ &- \frac{\beta}{2} \sum_{b = 0}^{n - 1} \beta^b \nabla^3 L^{(n - b - 1)} \brk[\bigg]{\sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla L^{(n - l' - b')}, \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla L^{(n - l' - b')}}.\tag{20} \end{align}\]
There are two non-isomorphic rooted trees with \(3\) vertices. The first one is a “chain”: the nodes \((1, 2, 3)\) have corresponding parents \((\varnothing, 1, 2)\). Let us now describe a sum that corresponds to this tree. As previously, the current memory distance is \(l = 1\), and the root \(1\) (having one child) generates a symbolic expression \[\sum_{b = 0}^{n - 1} \beta^b \nabla^2 L^{(n - l - b)} \sum_{l' = 1}^{b + 1} \square = \sum_{b = 0}^{n - 1} \beta^b \nabla^2 L^{(n - 1 - b)} \sum_{l' = 1}^{b + 1} \square.\] The current memory distance variable is now \(l'\) (with no fixed value). Next comes node \(2\) with one child, generating a symbolic expression \[\sum_{b' = 0}^{n - l'} \beta^{b'} \nabla^2 L^{(n - l' - b')} \sum_{l'' = 1}^{l' + b'} \square.\] The current memory distance is \(l''\) (with no fixed value). Finally, node \(3\) has no children, so it closes the sum with the expression \[\sum_{b'' = 0}^{n - l''} \beta^{b''} \nabla L^{(n - l'' - b'')}.\] Up to coefficient \(-\beta\) in front, we have obtained 19 .
The second rooted tree with \(3\) vertices is the tree consisting of nodes \((1, 2, 3)\) with corresponding parents \((\varnothing, 1, 1)\) (root with two children). The initial memory distance variable is denoted as \(l\) and has value \(1\). The root has two children, so the order of the derivative corresponding to it will be \(3\), and we will write two \(\square\) signs corresponding to two subtrees: \[\sum_{b = 0}^{n - l} \beta^b \nabla^3 L^{(n - b - 1)} \brk[\bigg]{\sum_{l' = 1}^{b + 1} \square, \sum_{l' = 1}^{b + 1} \square}.\] Then we replace the first \(\square\) with the expression corresponding to node \(2\) (with initial memory variable \(l'\)): \[\sum_{b' = 0}^{n - l'} \beta^{b'} \nabla L^{(n - l' - b')},\] and the second \(\square\) with the same expression corresponding to node \(3\). Up to coefficient \(- \beta / 2\) in front, we have obtained 20 . The reason for the division by \(2\) is that this tree has a symmetry coefficient \(2\).
This consideration of special cases highlights a pattern in how the memoryless iteration coefficients are structured. The following result is a formalization of this pattern.
Theorem 3 (The form of memoryless iteration coefficients). For \(m \geq 2\), the memoryless iteration coefficient \(\nomemcoef{m}{n}\) is equal (up to a coefficient \(- \beta\)) to a sum over the set \(\tilde{\mathcal{A}}[m]\) of unlabeled rooted trees with \(m\) vertices: \[\nomemcoef{m}{n} = - \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{1}{n},\] where \(\sigma(\tau)\) is the symmetry coefficient of the tree (background in 1.5). The expression \(\EDef{\tau}{l}{n} := \E{\tau}{l}{n}(\beta)\), depending on the iteration number \(n\) and memory distance variable \(l\), is defined recursively by \[\label{eq:e-tau-def} \E{\tau}{l}{n} = \sum_{b = 0}^{n - l} \beta^b \nabla^{\ell + 1} L^{(n - l - b)} \brk[\bigg]{\sum_{l_1 = 1}^{l + b} \E{\tau_1}{l_1}{n}, \ldots, \sum_{l_{\ell} = 1}^{l + b} \E{\tau_{\ell}}{l_{\ell}}{n}}, \quad l \in \range{1}{n},\qquad{(4)}\] where \((\tau_1, \ldots, \tau_{\ell})\) are the subtrees rooted at the children of the root of \(\tau\) (with \(\ell \in \mathbb{Z}_{\geq 0}\)). In particular, \(\E{\rootedtree[]}{l}{n} = \sum_{b = 0}^{n - l} \beta^b \nabla L^{(n - l - b)}\).
The \(\E{\tau}{l}{n}\) expression is an exact translation of our informal observations above into the mathematical language. For example, \[\E{\rootedtree[[]]}{l}{n} = \sum_{b = 0}^{n - l} \beta^b \nabla^2 L^{(n - l - b)} \sum_{l_1 = 1}^{l + b} \sum_{b_1 = 0}^{n - l_1} \beta^{b_1} \nabla L^{(n - l_1 - b_1)}.\] We outline the main ideas underlying the proof next, but the formal details are deferred to 9.
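As an illustration of the recursion defining \(\E{\tau}{l}{n}\) (our own sketch for the one-dimensional full-batch case, with assumed scalar values for the derivatives of \(L\)), the following code evaluates \(\E{\tau}{l}{n}\) directly and recovers the \(n \to \infty\) limits of \(\nomemcoef{2}{n}\) and \(\nomemcoef{3}{n}\) from the full-batch example above.

```python
from functools import lru_cache

# Hedged sketch (1-D full-batch case, our own): evaluate E_tau^l(n) from the recursion,
# with assumed scalar values d1, d2, d3 for L', L'', L''' at a fixed theta (the same for
# every batch, since L^(s) == L), and check that -beta * sum_tau E_tau^1(n) / sigma(tau)
# reproduces the n -> infinity limits of f_2^n and f_3^n quoted in the full-batch example.
beta, n = 0.5, 60
d1, d2, d3 = 0.7, -1.3, 0.4
deriv = {1: d1, 2: d2, 3: d3}         # derivative order -> scalar value

leaf = ()
chain2 = (leaf,)
chain3 = (chain2,)
cherry = (leaf, leaf)

@lru_cache(maxsize=None)
def E(tree, l):
    total = 0.0
    for b in range(n - l + 1):
        prod = deriv[len(tree) + 1]   # order of the derivative = number of children + 1
        for child in tree:
            prod *= sum(E(child, li) for li in range(1, l + b + 1))
        total += beta**b * prod
    return total

f2 = -beta * E(chain2, 1)
f3 = -beta * (E(chain3, 1) + E(cherry, 1) / 2)   # symmetry coefficients 1 and 2
print(f2, -beta / (1 - beta) ** 3 * d2 * d1)                                      # should agree
print(f3, -beta * (1 + beta) / (1 - beta) ** 5 * (d2**2 * d1 + 0.5 * d3 * d1**2))  # should agree
```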
Before sketching the argument for establishing 3, we introduce some preliminary results. We employ the notation and concepts from 1.5.
Fix a labeled rooted tree \(\tau\) in \(\mathcal{A}\range{1}{m}\), and choose a marking \(\mrk \in \mrkset_{\tau}\), where the marked vertices are \(v_1, \ldots, v_{\abs{\mrk}}\) with corresponding subtrees \(\tau^{\mathcal{m}}_1, \ldots, \tau^{\mathcal{m}}_{\abs{\mrk}}\) and the remaining subtree \(\tau_0^{\mrk}\). Let \(p_1, \ldots, p_{\abs{\mrk}}\) be the parents of the marked vertices (not necessarily distinct). Consider the derivative of the loss corresponding to \(p_1\) in the symbolic expression for \(\E{\tau_0^{\mrk}}{a}{n - l}\). Add one to the order of the derivative and add \(\sum_{l' = 1}^l \E{\tau_1^{\mrk}}{l'}{n}\) as an argument. Continue this process (possibly increasing the order of the same derivative more than once) until all marked vertices are processed. We will denote the resulting expression by \[\EmarkDef{\tau}{\mrk}{a}{n - l \to n}.\]
For example, consider the tree \(\tau\) consisting of the root and two leaves, with one of the leaves marked: \[\bigrootedtree[1[2][3,text=red]]\]
Consider the loss derivative corresponding to vertex \(p_1 = 1\) in the symbolic expression \[\E{\rootedtree[[]]}{a}{n - l} = \sum_{b = 0}^{n - l - a} \beta^b \underbrace{\nabla^2 L^{(n - l - b - a)}} \sum_{l_1 = 1}^{a + b} \sum_{b_1 = 0}^{n - l - l_1} \beta^{b_1} \nabla L^{(n - l - l_1 - b_1)}.\] Increase the order of the derivative and insert \(\sum_{l' = 1}^l \E{\rootedtree[]}{l'}{n}\), giving \[\sum_{b = 0}^{n - l - a} \beta^b \nabla^3 L^{(n - l - b - a)} \brk[\bigg]{ \sum_{l_1 = 1}^{a + b} \sum_{b_1 = 0}^{n - l - l_1} \beta^{b_1} \nabla L^{(n - l - l_1 - b_1)}, \sum_{l' = 1}^l \E{\rootedtree[]}{l'}{n}}.\]
We give two lemmas about \(\E{\tau}{l}{n}\) that we will use in the argument. Both lemmas are proven in 9. The following important fact is the reason why the induction step in the main argument goes through.
Lemma 2. Let \(m \geq 2\). For any \(\tau \in \mathcal{A}\range{1}{m}\) we have \[\label{eq:sum-over-markings-of-one-tree} \E{\tau}{l + a}{n} = \sum_{\mrk \in \mrkset_{\tau}} \Emark{\tau}{\mrk}{a}{n - l \to n}.\qquad{(5)}\]
The reason why we invoke marked trees is that differentiation naturally creates such trees. The following establishes a connection between the sum over marked trees in ?? and the high-order derivative tensor which arises in the main argument for 3.
Lemma 3. We have for \(m \geq 2\) \[\begin{align} &\sum_{i = 0}^{m - 1} \sum_{j = 1}^{m - i} \sum_{(k_0, \ldots, k_{m - j - i}) \in \mathcal{K}_{i, m - j - i}} \frac{1}{k_0! \ldots k_{m - j - i}!} \sum_{\tau_0 \in \tilde{\mathcal{A}}[j]} \frac{1}{\sigma(\tau_0)} \\ &\qquad \times \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{ \underbrace{\sum_{\tau \in \tilde{\mathcal{A}}[1]}\frac{1}{\sigma(\tau)} \sum_{l' = 1}^l \E{\tau}{l'}{n}}_{\text{k_0 times}}, \ldots, \underbrace{\sum_{\tau \in \tilde{\mathcal{A}}[m - j - i + 1]}\frac{1}{\sigma(\tau)} \sum_{l' = 1}^l \E{\tau}{l'}{n}}_{\text{k_{m - j - i} times}} } = \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l + a}{n}. \end{align}\]
Note that from the definition of \(\E{\tau}{l}{n}\), \[\label{eq:nJdufq} \nabla^{\ell + 1} L^{(n - l)} \brk[\bigg]{ \sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} } + \beta \E{\tau}{l + 1}{n} = \E{\tau}{l}{n}.\tag{21}\]
Replacing \(\E{\tau}{l + 1}{n}\) with \(\sum_{\mrk \in \mrkset_{\tau}} \Emark{\tau}{\mrk}{1}{n - l \to n}\) by 2 and setting \(l = 1\), we get the following alternative recursion: \[\E{\tau}{1}{n} = \nabla^{\ell + 1} L^{(n - 1)} \brk[\big]{\E{\tau_1}{1}{n}, \ldots, \E{\tau_{\ell}}{1}{n}} + \beta \sum_{\mrk \in \mrkset_{\tau}} \Emark{\tau}{\mrk}{1}{n - 1 \to n}.\] The advantage of this form is that the memory distance variable is always \(1\), but the disadvantage is that the right-hand side contains a sum over all markings of \(\tau\).
The strategy is to prove the following two statements simultaneously by induction over \(m \geq 2\): \[\begin{align} \nomemcoef{m}{n} &= - \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{1}{n},\quad\text{and} \tag{22} \\ \histcoef{m}{n}{k} &= \sum_{l = 1}^k \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}.\tag{23} \end{align}\]
For \(m = 2\), they are already verified above. Note also that the second statement holds for \(m = 1\) as well: \[\histcoef{1}{n}{k} = \sum_{l = 1}^k \E{\rootedtree[]}{l}{n}.\]
By definition, \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \frac{1}{i_0! \ldots i_{m - 1 - \ell}!} \times\\ &\qquad \qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\underbrace{\histcoef{1}{n}{b + 1}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{m - \ell}{n}{b + 1}}_{\text{i_{m - 1 - \ell} times}}} \end{align}\] Insert the induction hypothesis (recall that 23 holds for \(m = 1\) too): \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \frac{1}{i_0! \ldots i_{m - 1 - \ell}!}\\ &\qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\underbrace{\sum_{l = 1}^{b + 1} \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}}_{\text{i_0 times}}, \ldots, \underbrace{\sum_{l = 1}^{b + 1} \sum_{\tau \in \tilde{\mathcal{A}}[m - \ell]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}}_{\text{i_{m - 1 - \ell} times}}} \end{align}\] Careful rearrangement is used to simplify this to \[\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{\substack{\tau = [\tau_1, \ldots, \tau_{\ell}] \in \tilde{\mathcal{A}}[m]}} \frac{1}{\sigma(\tau)} \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\sum_{l = 1}^{b + 1} \E{\tau_1}{l}{n}, \ldots, \sum_{l = 1}^{b + 1} \E{\tau_{\ell}}{l}{n}},\] and that is, by definition, equal to \(- \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{1}{n}\). Hence, under the induction hypothesis for smaller \(m\), 22 holds.
By definition of the history terms in 15 and the induction hypothesis, to prove 23 and complete the induction step, it is enough to show that \[\label{eq:aux-fact-in-nonmem-coef} \begin{align} &- \sum_{j = 1}^m \sum_{i = 0}^{m - j} \sum_{(k_0, \ldots, k_{m - i - j}) \in \mathcal{K}_{i, m - i - j}} \frac{1}{k_0! \ldots k_{m - i - j}!}\\ &\qquad \times \nabla^i \nomemcoef{j}{n - l} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i - j + 1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - j} times}} } \overset{?}{=} \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}. \end{align}\tag{24}\]
By the induction hypothesis and 22 already proven, we can replace \(\nabla^i \nomemcoef{j}{n - l}\) in the left-hand side of 24 by \(- \beta \nabla^i \sum_{\tau_0 \in \tilde{\mathcal{A}}[j]} \frac{1}{\sigma(\tau_0)} \E{\tau_0}{1}{n - l}\). The result will involve precisely the big sum that we saw in 3, applying which we will simplify the left-hand side of 24 to \[\begin{align} \sum_{i = 0}^{m - 1} \sum_{\substack{\tau \in \tilde{\mathcal{A}}[m]\\ \tau = [\tau_1, \ldots, \tau_i]}} \frac{1}{\sigma(\tau)} \nabla^{i + 1} L^{(n - l)} \brk[\bigg]{ \sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n} } + \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l + 1}{n}. \end{align}\] Combining this with 21 , we see that the left-hand side of 24 is equal to the right-hand side of 24 . This completes the induction step. Omitted technical details are given in 9.
The result in 3 can be connected to the invariant manifold perspective discussed in [sec:intro-hb-is-gd-on-manifold] and [sec:full-batch-hb-is-gd]. Recall that HB on that manifold can be rewritten as 3 where \(\bg_h\) satisfies the fixed point equation 11 . Then, write \(\bg_h(\btheta)\) as a formal power series butcher1972algebraic?, hairer1974butcher?, sometimes called a B-series, to obtain \[h^2 \bg_h(\btheta) = \sum_{\tau \in \mathcal{A}_{\varnothing}} \frac{h^{\abs{\tau}}}{\abs{\tau}!} g(\tau) \nabla^{\tau} L(\btheta),\] where \(\nabla^{\tau} L\) is the elementary differential defined recursively by \(\nabla^{\varnothing} L(\btheta) = \btheta\), \(\nabla^{\rootedtree[]} L = \nabla L\) and \(\nabla^{\tau} L = \nabla^{\ell + 1} L \brk[\big]{\nabla^{\tau_1} L, \ldots, \nabla^{\tau_{\ell}} L}\) for \(\tau = [\tau_1, \ldots, \tau_{\ell}]\), \(g\colon \tilde{\mathcal{A}}_{\varnothing} \to \mathbb{R}\) is the coefficient mapping (with the induced mapping \(g\colon \mathcal{A}_{\varnothing} \to \mathbb{R}\) denoted by the same symbol), \(g(\varnothing) = g(\rootedtree[]) = 0\).
In addition, define a mapping \(a\colon \tilde{\mathcal{A}}_{\varnothing} \to \mathbb{R}\) by putting \(a(\varnothing) := 1\), \(a(\rootedtree[]) = - (1 - \beta)^{-1}\), and \(a(\tau) = \beta g(\tau)\) for \(\abs{\tau} \geq 2\). Then, by the composition rule (e. g. faris2021rootedtreegraphs?), we have \[\label{eq:xadAIY} h^2 \bg_h\prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \bg_h(\btheta)} = \sum_{\tau \in \mathcal{A}_{\varnothing}} \frac{h^{\abs{\tau}}}{\abs{\tau}!} (a * g)(\tau) \nabla^{\tau} L(\btheta),\tag{25}\] where \(a * g\) is the subtree convolution, that is, \((a * g)(\varnothing) = g(\varnothing) = 0\) and \[(a * g)(\tau) = g(\varnothing) a(\tau) + \sum_{\mrk \in \mrkset_{\tau}} g(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk}) = \sum_{\mrk \in \mrkset_{\tau}} g(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk}).\] Similarly, by the composition rule, \[\label{eq:LcoekJ} - \frac{h}{1 - \beta} \nabla L\prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \bg_h(\btheta)} = \sum_{\tau \in \mathcal{A}_{\varnothing}} \frac{h^{\abs{\tau}}}{\abs{\tau}!} (a * l)(\tau) \nabla^{\tau} L(\btheta),\tag{26}\] where \(l\colon \tilde{\mathcal{A}}_{\varnothing} \to \mathbb{R}\) is defined by \(l(\varnothing) = 0\), \(l(\rootedtree[]) = - (1 - \beta)^{-1}\), \(l(\tau) = 0\) for \(\abs{\tau} \geq 2\), which means \[(a * l)(\tau) = \sum_{\mrk \in \mrkset_{\tau}} l(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk}) = - \frac{1}{1 - \beta} a(\tau_1) \ldots a(\tau_{\ell})\] for \(\tau = [\tau_1, \ldots, \tau_{\ell}]\); in particular, \((a * l)(\rootedtree[]) = - (1 - \beta)^{-1}\). Combining 25 and 26 gives \[\begin{align} &- \frac{h}{1 - \beta} \nabla L\prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \bg_h(\btheta)} + h^2 \bg_h\prn[\bigg]{\btheta - \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \bg_h(\btheta)}\\ &\quad = \sum_{\tau \in \mathcal{A}_{\varnothing}} \frac{h^{\abs{\tau}}}{\abs{\tau}!} \crl[\big]{(a * l)(\tau) + (a * g)(\tau)} \nabla^{\tau} L(\btheta). \end{align}\] By 11 , this should be equal to \[- \frac{h}{1 - \beta} \nabla L(\btheta) + h^2 \beta \bg_h(\btheta) = - \frac{h}{1 - \beta} \nabla L(\btheta) + \sum_{\tau \in \mathcal{A}_{\varnothing}} \frac{h^{\abs{\tau}}}{\abs{\tau}!} \beta g(\tau) \nabla^{\tau} L(\btheta).\] Matching the coefficients before equal powers of \(h\) gives for \(\tau = [\tau_1, \ldots, \tau_{\ell}]\) with \(\abs{\tau} \geq 2\) \[(a * l)(\tau) + (a * g)(\tau) = \beta g(\tau),\] that is, \[- \frac{1}{1 - \beta} a(\tau_1) \ldots a(\tau_{\ell}) + \sum_{\mrk \in \mrkset_{\tau} \setminus \crl{\varnothing}} g(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk}) + g(\tau) = \beta g(\tau).\] Hence, the coefficients \(g(\tau)\) satisfy the recursion \[g(\tau) = \frac{1}{(1 - \beta)^2} a(\tau_1) \ldots a(\tau_{\ell}) - \frac{1}{1 - \beta} \sum_{\mrk \in \mrkset_{\tau} \setminus \crl{\varnothing}} g(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk})\] with \(g(\varnothing) = g(\rootedtree[]) = 0\); the same is rewritten in terms of only the \(a\) mapping as \[a(\tau) = \frac{\beta}{(1 - \beta)^2} a(\tau_1) \ldots a(\tau_{\ell}) - \frac{1}{1 - \beta} \sum_{\substack{\mrk \in \mrkset_{\tau} \setminus \crl{\varnothing}\\\abs{\tau_0^{\mrk}} \geq 2}} a(\tau_0^{\mrk}) a(\tau_1^{\mrk}) \ldots a(\tau_{\abs{\mrk}}^{\mrk})\] with \(a(\varnothing) = 1\), \(a(\rootedtree[]) = - \frac{1}{1 - \beta}\). 
For example, \[a(\rootedtree[[]]) = - \frac{\beta}{(1 - \beta)^3},\quad a(\rootedtree[[[]]]) = a(\rootedtree[[][]]) = - \frac{\beta (1 + \beta)}{(1 - \beta)^5}.\] This is a similar-looking although not quite the same characterization of \(\bg_h\) as we would obtain by taking \(n \to \infty\) in 21 (when all losses are equal \(L^{(s)} \equiv L\)). Of course, they lead to the same results (despite the different recursions), for example, \[\begin{align} h^2 \bg_h(\btheta) = &- \frac{h^2}{(1 - \beta)^3} \nabla^2 L(\btheta) \nabla L(\btheta)\\ &- \frac{h^3}{2} \frac{1 + \beta}{(1 - \beta)^5} \nabla^3 L(\btheta) \brk[\big]{\nabla L(\btheta), \nabla L(\btheta)} - h^3 \frac{1 + \beta}{(1 - \beta)^5} \nabla^2 L(\btheta) \nabla^2 L(\btheta) \nabla L(\btheta) + O(h^4), \end{align}\] giving the memoryless update 3 \[\begin{align} &\btheta^{(n + 1)} = \btheta^{(n)} - \frac{h}{1 - \beta} \crl[\bigg]{ \nabla L(\btheta^{(n)}) + \frac{h \beta}{(1 - \beta)^2} \nabla^2 L(\btheta^{(n)}) \nabla L(\btheta^{(n)})\\ &\quad + \frac{h^2 \beta (1 + \beta)}{2 (1 - \beta)^4} \nabla^3 L(\btheta^{(n)}) \brk[\big]{\nabla L(\btheta^{(n)}), \nabla L(\btheta^{(n)})}\\ &\quad + \frac{h^2 \beta (1 + \beta)}{(1 - \beta)^4} \nabla^2 L(\btheta^{(n)}) \nabla^2 L(\btheta^{(n)}) \nabla L(\btheta^{(n)}) + O(h^3)}. \end{align}\] Here, in contrast to approximation theorems, by \(O(h^3)\) we just mean terms of order \(h^3\) and higher in the formal infinite sum.
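As a quick sanity check (this numerical sketch, including the step sizes and variable names, is ours and not part of the formal development), consider the one-dimensional quadratic loss \(L(\theta) = \theta^2 / 2\): all derivatives of order three and higher vanish, so the memoryless update above predicts the per-step contraction factor \(1 - \frac{h}{1 - \beta} - \frac{h^2 \beta}{(1 - \beta)^3} - \frac{h^3 \beta (1 + \beta)}{(1 - \beta)^5} + O(h^4)\) once the iterates have settled onto the invariant manifold. The sketch below runs full-batch HB from a generic initialization and compares the observed factor with this prediction.

```python
# Sanity check (ours): on L(theta) = theta^2 / 2 the memoryless update has no
# third-derivative terms, so the per-step factor should match the truncation
# below up to O(h^4) once the transient off the invariant manifold has decayed.
import numpy as np

def hb_ratio(h, beta, n_steps=400):
    """Run full-batch HB on L(theta) = theta^2/2 and return theta_{n+1}/theta_n
    at the last step (the transient decays geometrically)."""
    theta, v = 1.0, 0.0
    prev = theta
    for _ in range(n_steps):
        grad = theta                                  # nabla L(theta) = theta
        theta, v = theta - h * grad + h * beta * v, beta * v - grad
        ratio, prev = theta / prev, theta
    return ratio

beta = 0.5
for h in [0.02, 0.01, 0.005]:
    predicted = (1 - h / (1 - beta) - h**2 * beta / (1 - beta)**3
                 - h**3 * beta * (1 + beta) / (1 - beta)**5)
    # the discrepancy should shrink roughly like h^4 as h is halved
    print(h, abs(hb_ratio(h, beta) - predicted))
```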
Our main theoretical results (2 and 3) can be used to obtain useful additional results for the analysis of HB and variants thereof. This section focuses on deriving continuous modified equations of arbitrary approximation order (with rigorous bounds), as well as principal iteration and principal flow approximations capturing the HB dynamics. Our results generalize the work in rosca2023on?. Due to their practical importance, we consider both full-batch and mini-batch implementations.
The global approximation by a memoryless iteration (2) allows us to prove the existence of a modified equation or, in other words, an approximation by a continuous flow. Define the BEA coefficients \(\crl[\big]{\beacoef{j}{n}(\btheta)}_{j = 1}^{\infty}\) by \[\label{eq:bea-coef-def} \beacoef{j}{n}(\btheta) = \nomemcoef{j}{n}(\btheta) - \sum_{i = 2}^j \frac{1}{i!} \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = j}} (D^{(n)}_{k_1} \ldots D^{(n)}_{k_{i - 1}} \beacoef{k_i}{n})(\btheta),\tag{27}\] with the \(i\)th Lie derivative \(D^{(n)}_i := \sum_l \beacoefsc{i}{n}{l}(\btheta) \partial_l\). This formula is standard in the literature on backward error analysis applied to numerical methods hairer2006?.
We now state the continuous approximation result.
Corollary 1 (Modified equation). Assume the conditions of 2. Let \(\btheta(t) \equiv \btheta(\ord; t)\) be the unique continuous solution to the piecewise ODE in \(\mathcal{D}\) \[\label{eq:ode-def} \dot{\btheta}(t) = \sum_{i = 0}^{\ord - 1} h^i f^{(n)}_{i + 1}(\btheta(t))\qquad{(6)}\] on \(t \in [t_n, t_{n + 1}]\) with the initial condition \(\btheta(0) = \btheta^{(0)}\), assumed to exist, where we use the shortcut \(t_n := n h\). Then for each \(h \in (0, 1 / 2)\) \[\sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm{\btheta^{(n)} - \btheta(t_n)} \leq \oC{mod} h^{\ord},\] where \(\nC{mod}\) is some constant depending on \(T\).
The proof proceeds by Taylor-expanding \(\btheta(t)\) around \(t_n\), evaluating at \(t_{n + 1}\), and rearranging; this yields the local error bound, which is then converted into the global error bound. Full details are provided in 10.1.
For example, the second-order approximation is \[\label{eq:XbfyrE} \dot{\btheta}(t) = - \sum_{b = 0}^n \beta^b \nabla L^{(n - b)}(\btheta(t)) + h \beacoef{2}{n}(\btheta(t))\tag{28}\] with \[\begin{align} \beacoef{2}{n}(\btheta) = &- \sum_{b = 1}^n \beta^b \nabla^2 L^{(n - b)}(\btheta) \sum_{l = 1}^b \sum_{b' = 0}^{n - l} \beta^{b'} \nabla L^{(n - l - b')}(\btheta)\\ &- \frac{1}{2} \sum_{b = 0}^n \beta^b \nabla^2 L^{(n - b)}(\btheta) \sum_{b' = 0}^n \beta^{b'} \nabla L^{(n - b')}(\btheta). \end{align}\]
In the simpler full-batch case (where all \(L^{(s)} \equiv L\)), 28 becomes \[\begin{align} \dot{\btheta}(t) &= - \frac{1 - \beta^{n + 1}}{1 - \beta} \nabla L(\btheta(t))\\ &\quad - \frac{1 + \beta - 4 (n + 1) (\beta^{n + 1} - \beta^{n + 2}) - \beta^{2 n + 2} - \beta^{2 n + 3}}{2 (1 - \beta)^3} \nabla^2 L(\btheta(t)) \nabla L(\btheta(t)) \end{align}\] on the segment \(t \in [n h, (n + 1) h]\). As \(n\) becomes large, this trajectory approaches the smooth solution to \[\label{eq:modified-ode-smooth} \dot{\btheta}(t) = - \frac{1}{1 - \beta} \nabla L(\btheta(t)) - \frac{1 + \beta}{2 (1 - \beta)^3} \nabla^2 L(\btheta(t)) \nabla L(\btheta(t)).\tag{29}\] The equation found in kovachki2021continuous? (after fixing a small typo) is the same ODE after neglecting \(\beta^n\) terms, but for 29 the guarantee 8 (with \(\ord = 2\)) only holds if the initialization happened exactly on the attractive manifold. Importantly, we did not lose any information but gained a guarantee regardless of initialization.
Note that 29 can be rewritten as \[\dot{\btheta}(t) = - \frac{1}{1 - \beta} \nabla \crl[\bigg]{ L + \frac{1 + \beta}{4 (1 - \beta)^2} \norm{\nabla L}^2 }(\btheta(t)).\] This can be seen as gradient flow with a modified loss farazmand2020multiscale?, kovachki2021continuous?, ghosh2023implicit?. However, for \(\ord = 3\) this is no longer the case: \(\beacoef{3}{n}(\btheta)\) becomes close to \[- \frac{1 + 4 \beta + \beta^2}{3 (1 - \beta)^5} \nabla^2 L(\btheta) \nabla^2 L(\btheta) \nabla L(\btheta) - \frac{1 + 10 \beta + \beta^2}{12 (1 - \beta)^5} \nabla^3 L(\btheta) \brk[\big]{\nabla L(\btheta), \nabla L(\btheta)},\] which is in general not a gradient.
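As an illustration (the following sketch, including the toy loss, is ours), the identity \(\nabla \crl[\big]{\tfrac{1}{2} \norm{\nabla L}^2} = \nabla^2 L \nabla L\) underlying this rewriting can be checked numerically: for a non-quadratic loss on \(\mathbb{R}^2\), the gradient-flow form above and the right-hand side of 29 agree up to finite-difference error.

```python
# Numerical check (ours) that the right-hand side of 29 equals
# -(1/(1-b)) * grad{ L + (1+b)/(4(1-b)^2) ||grad L||^2 } for a toy loss.
import numpy as np

def L(theta):
    x, y = theta
    return 0.25 * x**4 + 0.5 * y**2 + np.sin(x) * y

def gradL(theta):
    x, y = theta
    return np.array([x**3 + np.cos(x) * y, y + np.sin(x)])

def hessL(theta):
    x, y = theta
    return np.array([[3 * x**2 - np.sin(x) * y, np.cos(x)],
                     [np.cos(x), 1.0]])

beta, theta = 0.3, np.array([0.7, -1.2])
c = (1 + beta) / (4 * (1 - beta)**2)
modified = lambda th: L(th) + c * np.sum(gradL(th)**2)   # modified loss

# central finite-difference gradient of the modified loss
eps, g_mod = 1e-6, np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = eps
    g_mod[i] = (modified(theta + e) - modified(theta - e)) / (2 * eps)

lhs = -g_mod / (1 - beta)                                 # gradient-flow form
rhs = -(gradL(theta) / (1 - beta)
        + (1 + beta) / (2 * (1 - beta)**3) * hessL(theta) @ gradL(theta))
print(np.max(np.abs(lhs - rhs)))   # small: only finite-difference error remains
```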
Consider full-batch HB 2 . Let us formally take \(\ord\) to infinity in 5 and write the formal series \(\sum_{m = 1}^{\infty} h^m \nomemcoef{m}{n}(\btheta)\). It is an infinite sum of terms of two types: terms containing only derivatives of the loss of order at most two, which we will call principal terms, and the remaining terms (containing derivatives of order at least three), which we will call non-principal ones. For example, the term in 19 is principal and the term in 20 is non-principal.
Write \[\sum_{m = 2}^{\infty} h^m \nomemcoef{m}{n}(\btheta) = - \beta \sum_{m = 2}^{\infty} v_m^{(n)} h^m \crl{\nabla^2 L(\btheta)}^{m - 1} \nabla L(\btheta) + \np,\] where the notation \(\np\) means “non-principal terms” and \(\crl{v_m^{(n)}}\) are coefficients not depending on \(h\). 3 and its proof give an easy way to write down a recursion for \(v_m^{(n)}\).
Corollary 2 (Principal iteration). Define \(\crl{v_m^{(n)}}_{m = 1}^{\infty}\) by putting \(v_1^{(n)} = \sum_{b = 0}^{n - 1} \beta^b\) and so that the principal part of \(\nomemcoef{m}{n}(\btheta)\) is \(- \beta v_m^{(n)} \crl{\nabla^2 L(\btheta)}^{m - 1} \nabla L(\btheta)\) for \(m \geq 2\). Then the coefficients \(v_m^{(n)}\) satisfy \[\label{eq:vm-1-rec} v_m^{(n)} = v_{m - 1}^{(n)} + \beta \sum_{j = 1}^{m - 1} v_j^{(n - 1)} v_{m - j}^{(n)} + \beta v_m^{(n - 1)},\quad m \geq 2\qquad{(7)}\] and there are limits \(v_m^{(\infty)} := \lim_{n \to \infty} v_m^{(n)}\), which satisfy \(v_1^{(\infty)} = (1 - \beta)^{-1}\) and \[\label{eq:lpnIyx} v_m^{(\infty)} = \frac{v_{m - 1}^{(\infty)}}{1 - \beta} + \frac{\beta}{1 - \beta} \sum_{j = 1}^{m - 1} v_j^{(\infty)} v_{m - j}^{(\infty)},\quad m \geq 2.\qquad{(8)}\] The generating function \(g_{\beta}(x) := \sum_{m = 0}^{\infty} v_{m + 1}^{(\infty)} x^m\) is given by \[\label{eq:mDeSYH} g_{\beta}(x) = \frac{1 - \beta - x - \sqrt{(1 - \beta - x)^2 - 4 \beta x}}{2 \beta x}.\qquad{(9)}\] In particular, for \(m \geq 1\) \[v_{m + 1}^{(\infty)} = \frac{N_m(\beta)}{(1 - \beta)^{2 m + 1}},\] where \[N_m(\beta) := \sum_{k = 1}^m \frac{1}{m} \binom{m}{k} \binom{m}{k - 1} \beta^{m - k},\quad m \geq 1\] are the Narayana polynomials.
The proof is given in 10.2.
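For illustration (this verification script is ours), the recursion for the limits \(v_m^{(\infty)}\) in 2 can be checked numerically against the stated closed form \(v_{m + 1}^{(\infty)} = N_m(\beta) / (1 - \beta)^{2 m + 1}\) for a fixed value of \(\beta\).

```python
# Numerical check (ours): the recursion for v_m^(infty) reproduces the closed
# form via Narayana polynomials, v_{m+1}^(infty) = N_m(beta) / (1-beta)^{2m+1}.
from math import comb

def v_limits(beta, M):
    v = [0.0, 1.0 / (1 - beta)]            # v[0] unused, v[1] = 1/(1-beta)
    for m in range(2, M + 1):
        s = sum(v[j] * v[m - j] for j in range(1, m))
        v.append(v[m - 1] / (1 - beta) + beta * s / (1 - beta))
    return v

def narayana(m, beta):
    return sum(comb(m, k) * comb(m, k - 1) * beta**(m - k)
               for k in range(1, m + 1)) / m

beta, M = 0.4, 8
v = v_limits(beta, M)
for m in range(1, M):                       # the two columns should agree
    print(m, v[m + 1], narayana(m, beta) / (1 - beta)**(2 * m + 1))
```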
Informally, 2 means \[\begin{align} \sum_{m = 1}^{\infty} h^m \nomemcoef{m}{n}(\btheta) &\approx - \frac{h}{1 - \beta} \nabla L(\btheta) - \beta \sum_{m = 1}^{\infty} v_{m + 1}^{(\infty)} h^{m + 1} \crl{\nabla^2 L(\btheta)}^{m} \nabla L(\btheta) + \np\\ &= - h \nabla L(\btheta) - \beta h \sum_{m = 0}^{\infty} v_{m + 1}^{(\infty)} h^m \crl{\nabla^2 L(\btheta)}^{m} \nabla L(\btheta) + \np\\ &= - h \nabla L(\btheta) - \beta h g_{\beta}(h \nabla^2 L(\btheta)) \nabla L(\btheta) + \np, \end{align}\] where \(\approx\) hides the fact that \(n\) is taken to infinity. Therefore, the “full-order” memoryless iteration is approximately \[\label{eq:aIctKj} \tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} - h \nabla L\prn[\big]{\tilde{\btheta}^{(n)}} - \beta h g_{\beta}\prn[\big]{h \nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}} \nabla L\prn[\big]{\tilde{\btheta}^{(n)}} + \np\tag{30}\] for large \(n\). The series \(g_{\beta}\prn[\big]{h \nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}}\) converges in Euclidean operator norm when \(\norm[\big]{\nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}} < R_{\beta} / h\), where \(R_{\beta} = (1 - \sqrt{\beta})^2\) is the convergence radius.
We thank Boris Hanin for the following interesting observation: \(g_{\beta}(x)\) is the Stieltjes transform of the standard Marchenko-Pastur law with parameter \(\beta\), which we can write as \[g_{\beta}(x) = \Expectlet_{\xi \sim \mathrm{MP}(\beta)} \brk{(\xi - x)^{-1}}.\] Hence, 30 becomes \[\tilde{\btheta}^{(n + 1)} = \tilde{\btheta}^{(n)} - h \Expectlet_{\xi \sim \mathrm{MP}(\beta)} \brk[\big]{\bI + \beta \prn[\big]{\xi \bI - h \nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}}^{-1}} \nabla L\prn[\big]{\tilde{\btheta}^{(n)}} + \np\] as long as \(\norm{\nabla^2 L\prn[\big]{\tilde{\btheta}^{(n)}}} < (1 - \sqrt{\beta})^2 / h\) as above.
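This observation is easy to verify numerically (the following sketch is ours): integrating \((\xi - x)^{-1}\) against the standard Marchenko-Pastur density with parameter \(\beta\) on \([(1 - \sqrt{\beta})^2, (1 + \sqrt{\beta})^2]\) reproduces the closed form for \(g_{\beta}(x)\) when \(x\) lies below the support.

```python
# Numerical check (ours) that g_beta is the Stieltjes transform of the
# Marchenko-Pastur law with parameter beta, E[(xi - x)^{-1}] for x below
# the support [(1 - sqrt(beta))^2, (1 + sqrt(beta))^2].
import numpy as np

def g_closed(x, beta):
    return (1 - beta - x - np.sqrt((1 - beta - x)**2 - 4 * beta * x)) / (2 * beta * x)

def g_mp(x, beta, n=20001):
    # substitution xi = a + (b - a) sin^2(t) removes the square-root endpoints
    a, b = (1 - np.sqrt(beta))**2, (1 + np.sqrt(beta))**2
    t = np.linspace(0.0, np.pi / 2, n)
    xi = a + (b - a) * np.sin(t)**2
    weight = (b - a)**2 * 2 * np.sin(t)**2 * np.cos(t)**2 / (2 * np.pi * beta * xi)
    f = weight / (xi - x)
    dt = t[1] - t[0]
    return np.sum((f[1:] + f[:-1]) / 2) * dt   # trapezoidal rule

beta, x = 0.3, 0.05                             # x below the support
print(g_closed(x, beta), g_mp(x, beta))         # the two values agree closely
```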
We see from 2 that the coefficients corresponding to \(\E{\tau}{1}{n}\), where \(\tau\) is a chain with \(m\) vertices (see 1.5 for the definition of a chain), are the rescaled Narayana polynomials.
2 and 57 show in particular that the Narayana polynomials can be defined as \(N_m(\beta) \equiv N_{m, 1}(\beta)\) where \(\crl{N_{m, l}(\beta)}\) satisfy the recursion \[N_{m, l}(\beta) = (1 - \beta)^2 \sum_{b = 0}^{\infty} \beta^b \sum_{l_1 = 1}^{l + b} N_{m - 1, l_1}(\beta),\quad m \geq 2, l \geq 1\] with initial condition \(N_{1, l}(\beta) = \beta + (1 - \beta) l\) for \(l \geq 1\). We are not aware of this characterization in the literature.
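As a sanity check (the truncation level and the script below are ours), the infinite sum over \(b\) can be truncated at a level where \(\beta^b\) is numerically negligible, after which the recursion indeed reproduces the Narayana polynomials \(N_m(\beta) \equiv N_{m, 1}(\beta)\) evaluated at a fixed \(\beta\).

```python
# Numerical check (ours) of the recursion for N_{m,l}: truncating the sum over
# b at B terms, N_{M,1} should match the Narayana polynomial N_M(beta).
import numpy as np
from math import comb

def N_via_recursion(beta, M, B=400):
    Lmax = 1 + (M - 1) * B                                   # largest l needed
    N = np.array([beta + (1 - beta) * l for l in range(1, Lmax + 1)])  # N_{1,l}
    powers = beta ** np.arange(B + 1)
    for m in range(2, M + 1):
        S = np.concatenate(([0.0], np.cumsum(N)))            # S[L] = sum_{l=1}^L N_{m-1,l}
        Lnew = 1 + (M - m) * B
        N = np.array([(1 - beta)**2 * np.dot(powers, S[l : l + B + 1])
                      for l in range(1, Lnew + 1)])           # N_{m,l}
    return N[0]                                               # N_{M,1}

def narayana(m, beta):
    return sum(comb(m, k) * comb(m, k - 1) * beta**(m - k)
               for k in range(1, m + 1)) / m

beta = 0.4
for m in range(1, 5):                                          # columns agree
    print(m, N_via_recursion(beta, m), narayana(m, beta))
```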
Let us now write the coefficient before another type of tree, namely, trees consisting only of the root and a number of leaves.
Corollary 3. Define \(q_{m, l}^{(n)}\) to be the coefficients such that \[\E{\mathcal{r}_m}{l}{n} = q_{m, l}^{(n)} \nabla^m L(\btheta) \brk[\big]{\underbrace{\nabla L(\btheta), \ldots, \nabla L(\btheta)}_{\text{m - 1 times}}}\] in ?? in the full-batch case, where \(\mathcal{r}_m\) consists of a root and \(m - 1\) leaves. Then the limit \(q_{m, 1}^{(\infty)} = \lim_{n \to \infty} q_{m, 1}^{(n)}\) satisfies \[q_{m + 1, 1}^{(\infty)} = \frac{1}{(1 - \beta)^{2 m + 1}} A_m(\beta),\quad m \geq 1,\] where \[A_m(\beta) = (1 - \beta)^{m + 1} \sum_{j = 1}^{\infty} j^m \beta^{j - 1}\] are the Eulerian polynomials.
The proof is given in 10.3.
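The formula for \(A_m(\beta)\) quoted above is a classical identity for Eulerian polynomials; as a quick check (the sketch is ours), truncating the series and comparing with the first few Eulerian polynomials \(A_1 = 1\), \(A_2 = 1 + \beta\), \(A_3 = 1 + 4 \beta + \beta^2\), \(A_4 = 1 + 11 \beta + 11 \beta^2 + \beta^3\) gives matching values.

```python
# Numerical check (ours) of A_m(beta) = (1-beta)^{m+1} * sum_{j>=1} j^m beta^{j-1},
# with the series truncated at J terms (beta^J is negligible for beta < 1).
import numpy as np

def A_series(m, beta, J=2000):
    j = np.arange(1, J + 1, dtype=float)
    return (1 - beta)**(m + 1) * np.sum(j**m * beta**(j - 1))

beta = 0.6
known = {1: [1], 2: [1, 1], 3: [1, 4, 1], 4: [1, 11, 11, 1]}  # Eulerian numbers
for m, coeffs in known.items():
    eulerian_poly = sum(c * beta**k for k, c in enumerate(coeffs))
    print(m, A_series(m, beta), eulerian_poly)                # columns agree
```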
2 and 3 motivate the following definition.
Definition 1. For \(\tau = [\tau_1, \ldots, \tau_{\ell}] \in \tilde{\mathcal{A}}[m]\), let \(e_{\tau, l}\) be the coefficient before \(\E{\tau}{l}{n}\) in 3 after taking \(n \to \infty\) and multiplying by \((1 - \beta)^{2 m - 1}\), that is, \[e_{\tau, l} \equiv e_{\tau, l}(\beta) = (1 - \beta)^{\ell + 1} \sum_{b = 0}^{\infty} \beta^b \sum_{l_1 = 1}^{l + b} e_{\tau_1, l_1} \ldots \sum_{l_{\ell} = 1}^{l + b} e_{\tau_{\ell}, l_{\ell}}, \quad l \in \mathbb{Z}_{\geq 1}, \quad \ell \in \mathbb{Z}_{\geq 0}.\] In particular, \(e_{\rootedtree[], l} = 1\).
By induction, \(e_{\tau, l}\) is a polynomial of degree no more than \(m - 1\) in the variable \(l (1 - \beta)\) with coefficients that are themselves polynomials only of \(\beta\) (not depending on \(l\)). In particular, \(e_{\tau, l}\) is a polynomial in \(\beta\) (as opposed to just a rational function).
By 2 and 3, \((1 - \beta)^{2 m - 1} v_m^{(\infty)} = e_{\mathcal{c}_m, 1}\) where \(\mathcal{c}_m\) is the chain with \(m\) vertices; \((1 - \beta)^{2 m - 1} q_{m, l}^{(\infty)} = e_{\mathcal{r}_m, l}\) where \(\mathcal{r}_m\) is the tree consisting of a root and \(m - 1\) leaves, so that \[e_{\mathcal{c}_m, 1} = N_{m - 1}(\beta),\quad e_{\mathcal{r}_m, 1} = A_{m - 1}(\beta).\] To conclude, \(e_{\tau, 1} \equiv e_{\tau, 1}(\beta)\) in 1 form a rich \(\tilde{\mathcal{A}}[m]\)-parametrized (\(m \geq 1\)) family of polynomials of \(\beta\), containing both the Narayana polynomials and the Eulerian polynomials, and many other polynomials “in-between”. This combinatorial digression may be of independent interest.
Using the same framework as in 5.2, we can derive a “full-order” modified equation up to non-principal terms.
Corollary 4 (Principal flow). Define \(\crl{z_m^{(n)}}\) to be the coefficients such that the principal part of \(\beacoef{m}{n}(\btheta)\) is \(z_m^{(n)} \crl{\nabla^2 L(\btheta)}^{m - 1} \nabla L(\btheta)\). Then, \[\label{eq:z-coef-rec} \zcoef{m}{n} := \sum_{l = 1}^m \frac{(-1)^{l + 1}}{l} \sum_{\substack{k_1, \ldots, k_l \geq 1\\k_1 + \ldots + k_l = m}} p^{(n)}_{k_1} \ldots p^{(n)}_{k_l},\qquad{(10)}\] where \[p^{(n)}_k := \begin{cases} - v_k^{(n + 1)}, &\text{if } k = 1,\\ - \beta v_k^{(n)}, &\text{if } k \geq 2. \end{cases}\] Moreover, the limiting sequence \(\zcoef{m}{\infty} := \lim_{n \to \infty} \zcoef{m}{n}\) admits a generating function \(\bar{g}_{\beta}(x) := \sum_{k = 0}^{\infty} \zcoef{k + 1}{\infty} x^k\) given by \[\bar{g}_{\beta}(x) = \frac{1}{x} \ln \prn[\bigg]{\frac{1 + \beta - x + \sqrt{(1 - \beta - x)^2 - 4 \beta x}}{2}}.\]
The proof is given in 10.4.
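For illustration (the script is ours), the coefficients \(\zcoef{m}{\infty}\) can be computed from the formula in 4 with \(n \to \infty\) (using the limits \(v_m^{(\infty)}\) from 2) and compared with the closed form for \(\bar{g}_{\beta}\): the truncated series \(\sum_{k = 0}^{\ord - 1} \zcoef{k + 1}{\infty} x^k\) matches \(\bar{g}_{\beta}(x)\) for small \(x\) up to an \(O(x^{\ord})\) error.

```python
# Numerical check (ours) of Corollary 4: compute z_m^(infty) from the
# composition sum (with n -> infty) and compare the truncated series with the
# closed form for bar g_beta(x) at small x.
import numpy as np
from itertools import product

def v_limits(beta, M):
    v = [0.0, 1.0 / (1 - beta)]
    for m in range(2, M + 1):
        s = sum(v[j] * v[m - j] for j in range(1, m))
        v.append(v[m - 1] / (1 - beta) + beta * s / (1 - beta))
    return v

def z_limits(beta, M):
    v = v_limits(beta, M)
    p = [0.0, -v[1]] + [-beta * v[k] for k in range(2, M + 1)]  # p_k^(infty)
    z = [0.0]
    for m in range(1, M + 1):
        total = 0.0
        for l in range(1, m + 1):              # compositions k_1 + ... + k_l = m
            for ks in product(range(1, m + 1), repeat=l):
                if sum(ks) == m:
                    total += (-1)**(l + 1) / l * np.prod([p[k] for k in ks])
        z.append(total)
    return z

def gbar_closed(x, beta):
    return np.log((1 + beta - x + np.sqrt((1 - beta - x)**2 - 4 * beta * x)) / 2) / x

beta, M = 0.3, 6
z = z_limits(beta, M)
for x in [0.02, 0.01]:
    series = sum(z[k + 1] * x**k for k in range(M))
    print(x, series, gbar_closed(x, beta))      # difference shrinks like x^M
```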
Informally, the “full-order” modified equation (called “principal flow” by rosca2023on?) is approximately \[\begin{align} \dot{\btheta}(t) &= \sum_{k = 0}^{\infty} z_{k + 1}^{(n)} h^k \crl{\nabla^2 L(\btheta(t))}^{k} \nabla L(\btheta(t)) + \np\\ &= \bar{g}_{\beta}\prn[\big]{h \nabla^2 L(\btheta(t))} \nabla L(\btheta(t)) + \np \end{align}\] for large \(n\). Taking \(\beta = 0\), we recover the result of rosca2023on? (for GD).
For example, consider the one-dimensional case and \(L(\theta) = \theta^2 / 2\). The HB iteration 1 is solved by \[\begin{pmatrix} \theta^{(n)}\\ v^{(n)} \end{pmatrix} = \begin{pmatrix} 1 - h & h \beta\\ - 1 & \beta \end{pmatrix}^n \begin{pmatrix} \theta^{(0)}\\ v^{(0)} \end{pmatrix}.\] The general solution is \[\begin{align} &\theta^{(n)} = \frac{((1 - h - \lambda_-) \theta^{(0)} + h \beta v^{(0)}) \lambda_+^n + ((\lambda_+ + h - 1) \theta^{(0)} - h \beta v^{(0)}) \lambda_-^n}{\lambda_+ - \lambda_-},\\ &v^{(n)} = \frac{\prn[\big]{(\beta - \lambda_-) v^{(0)} - \theta^{(0)}} \lambda_+^n + \prn[\big]{\theta^{(0)} + (\lambda_+ - \beta) v^{(0)}} \lambda_-^n}{\lambda_+ - \lambda_-} \end{align}\] with \[\lambda_{\pm} = \frac{1 + \beta - h \pm \sqrt{(1 - \beta - h)^2 - 4 \beta h}}{2},\] taking for simplicity the case \(\lambda_+ \neq \lambda_-\) and \(\lambda_{\pm} \in \mathbb{R}\). If the initialization happened on the invariant manifold we discussed above (attractive if \(h\) is small enough), that is, \(v^{(0)} = \theta^{(0)} (\lambda_+ + h - 1) / (h \beta)\), then \[\label{eq:oMWSQv} \theta^{(n)} = \lambda_+^n \theta^{(0)}.\tag{31}\]
The solution of the principal flow \(\dot{\theta}(t) = \bar{g}_{\beta}\prn[\big]{h} \theta(t)\) in this case is \[\label{eq:XtfJtO} \theta(t) = \theta(0) \exp \crl[\bigg]{\frac{t}{h} \ln \prn[\bigg]{\frac{1 + \beta - h + \sqrt{(1 - \beta - h)^2 - 4 \beta h}}{2}}},\tag{32}\] coinciding with 31 at points \(t = n h\).
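For completeness, a short simulation (ours) confirms the coincidence: initializing HB on the invariant manifold and iterating 1 reproduces \(\lambda_+^n \theta^{(0)}\), that is, the principal-flow solution 32 evaluated at \(t = n h\).

```python
# Numerical illustration (ours) of the quadratic example: HB initialized on the
# invariant manifold follows theta^(n) = lambda_+^n theta^(0), which equals the
# principal-flow solution (32) sampled at t = n h.
import numpy as np

beta, h, theta0, n_steps = 0.7, 0.02, 1.0, 50
lam_p = (1 + beta - h + np.sqrt((1 - beta - h)**2 - 4 * beta * h)) / 2
theta, v = theta0, theta0 * (lam_p + h - 1) / (h * beta)   # on the manifold

for n in range(1, n_steps + 1):
    grad = theta                                            # L(theta) = theta^2/2
    theta, v = theta - h * grad + h * beta * v, beta * v - grad
    flow = theta0 * np.exp((n * h / h) * np.log(lam_p))     # flow (32) at t = n h
    if n in (1, 10, 50):
        print(n, theta, lam_p**n * theta0, flow)            # coincide up to rounding
```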
We note in passing that for large step sizes \((1 + \sqrt{\beta})^2 < h < 2 + 2 \beta\) there is another attractive manifold of importance, \(v / \theta = (h + \lambda_- - 1) / (h \beta)\), where the right-hand side blows up as \(h \to 0\). If the initialization happened there, the solution will be \[\theta^{(n)} = \lambda_{-}^n \theta^{(0)}.\] This also coincides with the principal flow 32 by choosing appropriate values of the complex square root and logarithm. We illustrate both situations in 1.
Figure 1: Principal flow (\(\beta = 0.7\), \(\theta^{(0)} = 1\)) corresponding to the quadratic loss \(L(\theta) = \theta^2 / 2\) on the complex plane. It is purely real for small enough step size \(h\). For high step sizes, it is not purely real, and the real values \(\theta(n h)\) capture the oscillatory behavior of HB (with HB exactly matching the flow if initialized on the attractive invariant manifold). a — Small \(h = 0.02\), b — Large \(h = 3.38\).
We studied the theoretical properties of gradient descent with polyak1964some? heavy-ball momentum, the simplest and one of the most commonly used algorithms with memory in optimization. We established an approximation of the algorithm by a memoryless iteration to arbitrary precision. In the full-batch case, this memoryless iteration is (roughly speaking) plain gradient descent with a modified loss. This loss modification can be seen as implicit regularization by memory, and can sometimes explain good empirical performance barrett2021implicit?, smith2021on?, ghosh2023implicit?. In the stochastic (mini-batch) case, additional implicit regularization by mini-batch noise can be identified. These insights can be of practical importance in machine learning, not just for understanding existing algorithms, but also for proposing new ones, e.g. by introducing (or strengthening) similar regularization explicitly foret2021sharpnessaware?, zhao2022penalizinggradientnorm?. In addition, analyzing the terms of the memoryless iteration revealed rich combinatorics hidden inside the algorithm, which may be of independent theoretical interest.
bernstein2024old? notes that “the precise role of EMA [exponential moving averages] is perhaps still an open problem” in optimization. Even though we tailored the presentation to be specifically about HB, one of the strengths of our techniques is that they can be used to study other algorithms with decaying memory (defined by smooth enough functions of the parameter \(\btheta\)). Thus, our paper not only takes a step toward the larger goal of theoretically understanding memory and its effects for a specific optimization algorithm, but also outlines a more general framework that can be used to analyze other algorithms. For example, our techniques could be used to study other (ubiquitous) numerical methods with memory such as Adam kingma2017adammethodstochasticoptimization? or Shampoo gupta2018shampoopreconditionedstochastic?.
It will be convenient to define \[\begin{align} &\kf := D_1 / (1 - \beta) + h \beta \gamma,\tag{33}\\ &\ks := D_2 / (1 - \beta) + h \beta \delta.\tag{34} \end{align}\]
The constants \(\gamma\), \(\delta\) and \(\lambda\) are chosen as follows: \(\gamma\) is as in kovachki2021continuous?, that is, \(\gamma \in [\tau_1, \infty)\) with \[\tau_1 := \frac{(1 - \beta)^{-2} D_1 D_2}{1 - \beta - h D_2 \beta / (1 - \beta)},\] \(\delta\) is chosen at the end of the proof of 5: namely, it needs to be positive and satisfy 38 , and \(\lambda\) is chosen to lie in \([\tau_2(\delta), \infty)\) with \[\begin{align} \tau_2(\delta) := &\prn[\bigg]{\frac{D_3 (1 - \beta)^{-1 } \brk{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}\\ &\quad + \frac{(1 - \beta)^{-1} \brk{D_4 \kf + D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2}\\ &\quad + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} D_3}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3}}\\ &\times \prn[\bigg]{1 - \frac{\beta}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} - \frac{h \beta \brk{(1 - \beta)^{-1} D_2 + h \beta \delta}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3}}^{-1}. \end{align}\] Notice that \(\tau_2(\delta) > 0\) for small enough \(h\).
Lemma 4 (\(\btheta \mapsto \bzeta\) is bijective). For small enough \(h\) and for any \(\boldsymbol{g} \in \Gamma\), the function \(\btheta \mapsto \bzeta = \btheta - h (1 - \beta)^{-1} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}(\btheta)\) is a bijection \(\mathbb{R}^d \to \mathbb{R}^d\).
Proof. This is proven in kovachki2021continuous?. ◻
Lemma 5 (\(T\) maps from \(\Gamma\) to \(\Gamma\)). For small enough \(h\), the operator \(T\) defined in 12 maps from \(\Gamma\) to \(\Gamma\).
Proof. Take \(\boldsymbol{g} \in \Gamma\). By 4, the function \(\btheta \mapsto \bzeta = \btheta - h (1 - \beta)^{-1} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}(\btheta)\) is a bijection. This function is continuously differentiable: \[\nabla_{\btheta} \bzeta(\btheta) = I - \frac{h}{1 - \beta} \nabla^2 L(\btheta) + h^2 \beta \nabla \boldsymbol{g}(\btheta).\] For \(h\) small enough, this matrix is invertible. By the inverse mapping theorem, we see that the inverse mapping is continuously differentiable with Jacobian \[\label{eq:ljfhhh} \nabla_{\bzeta} \btheta(\bzeta) = \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}.\tag{35}\] Therefore, the function \(T \boldsymbol{g}\) is also continuously differentiable with derivative \[\begin{align} &\nabla_{\bzeta} T \boldsymbol{g}(\bzeta) = \frac{1}{h (1 - \beta)} \nabla_{\bzeta} \crl[\big]{\nabla L(\bzeta) - \nabla L(\btheta(\bzeta))} + \beta \nabla_{\bzeta} \crl[\big]{\boldsymbol{g}(\btheta(\bzeta))}\\ &\quad = \frac{1}{h (1 - \beta)} \crl[\big]{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta)) \nabla_{\bzeta} \btheta(\bzeta)} + \beta \nabla \boldsymbol{g}(\btheta(\bzeta)) \nabla_{\bzeta} \btheta(\bzeta)\\ &\quad = \frac{1}{h (1 - \beta)} \crl[\bigg]{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta)) \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}}\\ &\qquad + \beta \nabla \boldsymbol{g}(\btheta(\bzeta)) \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}\\ &\quad = \frac{1}{h (1 - \beta)} \nabla^2 L(\bzeta)\\ &\qquad + \frac{1}{h^2} \prn[\bigg]{- \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))} \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}\\ &\quad = \frac{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta))}{h (1 - \beta)} + \beta \nabla \boldsymbol{g}(\btheta(\bzeta)) + \frac{1}{h^2} \prn[\bigg]{\frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) - h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}\\ &\qquad - \frac{1}{h^2} \prn[\bigg]{\frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) - h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))} \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}\\ &\quad = \frac{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta))}{h (1 - \beta)} + \beta \nabla \boldsymbol{g}(\btheta(\bzeta))\\ &\qquad + \frac{1}{h^2} \prn[\bigg]{\frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) - h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}\\ & \times \crl[\bigg]{I - \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}}.\numberthis\label{eq:derivative-of-t} \end{align}\tag{36}\] This matrix is symmetric for any \(\boldsymbol{g} \in \Gamma\) because any square matrix \(A\) commutes with \((I - A)^{-1}\) (provided the latter exists). We have proven that \(T \boldsymbol{g}\) is a continuously differentiable function and its Jacobian is symmetric.
We need to show that the norm of the function \(T \boldsymbol{g}\) does not exceed \(\gamma\) for any \(\boldsymbol{g} \in \Gamma\). We have for any \(\bzeta \in \mathbb{R}^d\) \[\begin{align} \norm{T \boldsymbol{g}(\bzeta)} &\leq \frac{\sup_{\btheta} \norm{\nabla^2 L(\btheta)}}{h (1 - \beta)} \norm{\bzeta - \btheta(\bzeta)} + \beta \norm{\boldsymbol{g}(\btheta(\bzeta))}\\ \intertext{(where we used the definition of \(T\) in 12 )} &= \frac{\sup_{\btheta} \norm{\nabla^2 L(\btheta)}}{h (1 - \beta)} \norm[\bigg]{- \frac{h}{1 - \beta} \nabla L(\btheta(\bzeta)) + h^2 \beta \boldsymbol{g}(\btheta(\bzeta))} + \beta \norm{\boldsymbol{g}(\btheta(\bzeta))}\\ &\leq \frac{\sup_{\btheta} \norm{\nabla^2 L(\btheta)} \sup_{\btheta} \norm{\nabla L(\btheta)}}{(1 - \beta)^2} + \prn[\bigg]{\beta + h \frac{\beta}{1 - \beta} \sup_{\btheta} \norm{\nabla^2 L(\btheta)}} \gamma\\ &= \frac{D_1 D_2}{(1 - \beta)^2} + \prn[\bigg]{\beta + h \frac{\beta}{1 - \beta} D_2} \gamma\\ &\leq \gamma,\numberthis\label{eq:ldlkjfhh} \end{align}\tag{37}\] where the last inequality follows from the definition of \(\tau_1\) and the fact that \(\gamma\) was taken to be at least \(\tau_1\).
Consider the expression ending in 36 . We will use \[\begin{align} &\norm{\nabla_{\bzeta} T \boldsymbol{g}(\bzeta)} \leq \norm[\bigg]{\frac{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta))}{h (1 - \beta)}}\\ &\qquad + \norm[\bigg]{\frac{1}{h (1 - \beta)} \nabla^2 L(\btheta(\bzeta)) \crl[\bigg]{I - \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}}}\\ &\qquad + \norm[\bigg]{\beta \nabla \boldsymbol{g}(\btheta(\bzeta)) \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}} \end{align}\] and bound each of the three terms in the right-hand side separately.
The first term is easy to bound: \[\begin{align} &\norm[\bigg]{\frac{\nabla^2 L(\bzeta) - \nabla^2 L(\btheta(\bzeta))}{h (1 - \beta)}} \leq \frac{D_3}{h (1 - \beta)} \norm{\bzeta - \btheta(\bzeta)}\\ &\quad = \frac{D_3}{h (1 - \beta)} \norm[\bigg]{- \frac{h}{1 - \beta} \nabla L(\btheta(\bzeta)) + h^2 \beta \boldsymbol{g}(\btheta(\bzeta))}\\ &\quad \leq \frac{D_3}{h (1 - \beta)} \crl[\bigg]{\frac{h D_1}{1 - \beta} + h^2 \beta \gamma} = \frac{D_3 \kf}{1 - \beta}. \end{align}\]
Now we bound \[\begin{align} &\norm[\bigg]{\frac{1}{h (1 - \beta)} \nabla^2 L(\btheta(\bzeta)) \crl[\bigg]{I - \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}}}\\ &\quad \leq \frac{D_2}{h (1 - \beta)} \norm[\bigg]{I - \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}}. \end{align}\] Notice that the eigenvalues of the matrix \[I - \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}\] are of the form \(- h \lambda_i / (1 - h \lambda_i)\) where \(\crl{\lambda_i}\) are eigenvalues of the matrix \((1 - \beta)^{-1} \nabla^2 L(\btheta(\bzeta)) - h \beta \nabla \boldsymbol{g}(\btheta(\bzeta))\). For small enough \(h\), we have \(h \max_i \abs{\lambda_i} < 1\), so we can bound \[\begin{align} \frac{\abs{\lambda_i}}{\abs{1 - h \lambda_i}} &\leq \frac{\abs{\lambda_i}}{1 - h \abs{\lambda_i}} \leq \frac{\max_i \abs{\lambda_i}}{1 - h \max_i \abs{\lambda_i}} = \frac{\norm{(1 - \beta)^{-1} \nabla^2 L(\btheta(\bzeta)) + h \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}}{1 - h \norm{(1 - \beta)^{-1} \nabla^2 L(\btheta(\bzeta)) + h \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}}\\ &\leq \frac{(1 - \beta)^{-1} D_2 + h \beta \delta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}. \end{align}\]
Similarly, \[\begin{align} \norm[\bigg]{\beta \nabla \boldsymbol{g}(\btheta(\bzeta)) \prn[\bigg]{I - \frac{h}{1 - \beta} \nabla^2 L(\btheta(\bzeta)) + h^2 \beta \nabla \boldsymbol{g}(\btheta(\bzeta))}^{-1}} \leq \frac{\beta \delta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}. \end{align}\]
We conclude that \[\begin{align} &\norm{\nabla_{\bzeta} T \boldsymbol{g}(\bzeta)} \leq \frac{D_3 \kf}{1 - \beta} + \frac{D_2}{1 - \beta} \frac{(1 - \beta)^{-1} D_2 + h \beta \delta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} + \frac{\beta \delta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}\\ &\quad = \frac{D_3 \kf \prn[\big]{1 - h \crl[\big]{\frac{D_2}{1 - \beta} + h \beta \delta}} + D_2 \prn[\big]{\frac{D_2}{1 - \beta} + h \beta \delta} + \beta \delta (1 - \beta)}{(1 - \beta) [1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}]}\\ &\quad = \frac{(D_2 - h D_3 \kf) \prn[\big]{\frac{D_2}{1 - \beta} + h \beta \delta} + D_3 \kf + \beta \delta (1 - \beta)}{(1 - \beta) [1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}]}. \end{align}\]
The inequality \[\label{eq:constaint-on-delta} \frac{(D_2 - h D_3 \kf) \prn[\big]{\frac{D_2}{1 - \beta} + h \beta \delta} + D_3 \kf + \beta \delta (1 - \beta)}{(1 - \beta) [1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}]} \leq \delta\tag{38}\] can be rewritten as \[\begin{align} h^2 \beta \delta^2 + \prn[\bigg]{\beta -1 + h \frac{1 + \beta}{1 - \beta} D_2 - \frac{h^2 \beta}{1 - \beta} \kf D_3} \delta + \frac{D_2 (D_2 - h D_3 \kf)}{(1 - \beta)^2} + \frac{D_3 \kf}{1 - \beta} \leq 0. \end{align}\]
For small enough \(h\), the quadratic in \(\delta\) on the left-hand side has two positive roots, so we can take any \(\delta > 0\) between them (for instance, slightly less than the larger root), and the inequality will hold.
Consider the expression ending in 36 . To declutter notation, denote \(A_i := (1 - \beta)^{-1} \nabla^2 L(\btheta(\bzeta_i)) - h \beta \nabla \boldsymbol{g}(\btheta(\bzeta_i))\) for \(i \in \crl{1, 2}\). First, decompose \[\begin{align} &\nabla_{\bzeta} T \boldsymbol{g}(\bzeta_1) - \nabla_{\bzeta} T \boldsymbol{g}(\bzeta_2) = \frac{\nabla^2 L(\bzeta_1)}{h (1 - \beta)} - \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_1))}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_1))} \prn{I - h A_1}^{-1}\\ &\quad - \frac{\nabla^2 L(\bzeta_2)}{h (1 - \beta)} + \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_2))}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_2))} \prn{I - h A_2}^{-1}\\ &\quad = \frac{\nabla^2 L(\bzeta_1) - \nabla^2 L(\bzeta_2)}{h (1 - \beta)}\\ &\qquad - \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_1)) - \nabla^2 L(\btheta(\bzeta_2))}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_1)) + \beta \nabla \boldsymbol{g}(\btheta(\bzeta_2))} \prn{I - h A_1}^{-1}\\ &\qquad - \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_2))}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_2))} \crl[\big]{\prn{I - h A_1}^{-1} - \prn{I - h A_2}^{-1}}\\ &\quad = T_1 - T_2 - T_3, \end{align}\] where \[\begin{align} &T_1 := \frac{\nabla^2 L(\bzeta_1) - \nabla^2 L(\bzeta_2)}{h (1 - \beta)} \crl[\big]{I - (I - h A_1)^{-1}},\\ &T_2 := \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_1)) - \nabla^2 L(\bzeta_1) - \nabla^2 L(\btheta(\bzeta_2)) + \nabla^2 L(\bzeta_2)}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_1)) + \beta \nabla \boldsymbol{g}(\btheta(\bzeta_2))} \times\\ &\qquad \times \prn{I - h A_1}^{-1},\\ &T_3 := \prn[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_2))}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}(\btheta(\bzeta_2))} \crl[\big]{\prn{I - h A_1}^{-1} - \prn{I - h A_2}^{-1}}. \end{align}\]
To bound \(T_1\), use \(\norm{I - (I - h A_1)^{-1}} \leq h \frac{(1 - \beta)^{-1} D_2 + h \beta \delta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}\) and conclude \[\begin{align} &\norm{T_1} \leq \frac{D_3 (1 - \beta)^{-1 } \brk{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} \norm{\bzeta_1 - \bzeta_2}. \end{align}\]
To bound \(T_2\), note that \[\begin{align} &\nabla_{\bzeta} \crl[\big]{\nabla^2 L(\btheta(\bzeta)) - \nabla^2 L(\bzeta)} = \nabla^3 L(\btheta(\bzeta)) (I - h A)^{-1} - \nabla^3 L(\bzeta)\\ &\quad = \prn[\big]{\nabla^3 L(\btheta(\bzeta)) - \nabla^3 L(\bzeta)} (I - h A)^{-1} + \nabla^3 L(\bzeta) \prn[\big]{(I - h A)^{-1} - I} \end{align}\] where \(A := (1 - \beta)^{-1} \nabla^2 L(\btheta(\bzeta)) - h \beta \nabla \boldsymbol{g}(\btheta(\bzeta))\). Therefore, \[\begin{align} &\norm{\nabla^2 L(\btheta(\bzeta_1)) - \nabla^2 L(\bzeta_1) - \nabla^2 L(\btheta(\bzeta_2)) + \nabla^2 L(\bzeta_2)}\\ &\quad\leq \sup_{\bzeta} \norm[\big]{\nabla_{\bzeta} \crl[\big]{\nabla^2 L(\btheta(\bzeta)) - \nabla^2 L(\bzeta)}} \norm{\bzeta_1 - \bzeta_2}\\ &\quad \leq \frac{D_4 \sup_{\bzeta} \norm{\btheta(\bzeta) - \bzeta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} \norm{\bzeta_1 - \bzeta_2} + h \frac{D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} \norm{\bzeta_1 - \bzeta_2}\\ &\quad \leq h \frac{D_4 \kf + D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} \norm{\bzeta_1 - \bzeta_2}, \end{align}\] where the last inequality is obtained by bounding \(\sup_{\bzeta} \norm{\btheta(\bzeta) - \bzeta}\) like for 37 . Since also \[\begin{align} \beta \norm{\nabla \boldsymbol{g}(\btheta(\bzeta_1)) - \nabla \boldsymbol{g}(\btheta(\bzeta_2))} &\leq \beta \lambda \norm{\btheta(\bzeta_1) - \btheta(\bzeta_2)} \leq \beta \lambda \sup_{\bzeta} \norm{\nabla_{\bzeta} \btheta(\bzeta)} \norm{\bzeta_1 - \bzeta_2}\\ &\leq \frac{\beta \lambda}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} \norm{\bzeta_1 - \bzeta_2}, \end{align}\] we conclude \[\begin{align} \norm{T_2} \leq \frac{(1 - \beta)^{-1} \brk{D_4 \kf + D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} + \beta \lambda}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \norm{\bzeta_1 - \bzeta_2}. \end{align}\]
To bound \(T_3\), use \[\begin{align} &\norm{(I - h A_1)^{-1} - (I - h A_2)^{-1}} = h \norm{(I - h A_1)^{-1} \brk{A_1 - A_2} (I - h A_2)^{-1}}\\ &\quad \leq \frac{h}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2}\\ &\qquad \times \norm[\bigg]{\frac{\nabla^2 L(\btheta(\bzeta_1)) - \nabla^2 L(\btheta(\bzeta_2))}{1 - \beta} - h \beta \crl[\big]{\nabla \boldsymbol{g}(\btheta(\bzeta_1)) - \nabla \boldsymbol{g}(\btheta(\bzeta_2))}}\\ &\quad \leq \frac{h}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \brk[\bigg]{\frac{D_3}{1 - \beta} + h \beta \lambda} \norm{\btheta(\bzeta_1) - \btheta(\bzeta_2)}\\ &\quad \leq \frac{h \brk[\big]{(1 - \beta)^{-1} D_3 + h \beta \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3} \norm{\bzeta_1 - \bzeta_2}, \end{align}\] and conclude \[\norm{T_3} \leq \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} D_3 + h \beta \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3} \norm{\bzeta_1 - \bzeta_2}.\]
Combining, we get \[\begin{align} &\norm{\nabla_{\bzeta} T \boldsymbol{g}(\bzeta_1) - \nabla_{\bzeta} T \boldsymbol{g}(\bzeta_2)}\\ &\quad \leq \prn[\bigg]{\frac{D_3 (1 - \beta)^{-1 } \brk{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}\\ &\qquad + \frac{(1 - \beta)^{-1} \brk{D_4 \kf + D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} + \beta \lambda}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2}\\ &\qquad + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} D_3 + h \beta \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3}} \norm{\bzeta_1 - \bzeta_2}. \end{align}\]
It is left to note that \[\begin{align} &\frac{D_3 (1 - \beta)^{-1 } \brk{(1 - \beta)^{-1} D_2 + h \beta \delta}}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}\\ &\qquad + \frac{(1 - \beta)^{-1} \brk{D_4 \kf + D_3 \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} + \beta \lambda}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2}\\ &\qquad + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} D_3 + h \beta \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3} \leq \lambda \end{align}\] for small enough \(h\) because \(\lambda\) was chosen to be no less than \(\tau_2(\delta)\). ◻
Lemma 6 (\(T\) is a contraction). For any \(\boldsymbol{g}_1, \boldsymbol{g}_2 \in \Gamma\), we have \[\norm{T \boldsymbol{g}_1 - T \boldsymbol{g}_2}_{\Gamma} \leq \alpha \norm{\boldsymbol{g}_1 - \boldsymbol{g}_2}_{\Gamma}\] with some \(\alpha < 1\).
Proof. It is proven in kovachki2021continuous? that for \(h\) small enough \(h \ks < 1\) and \[\label{eq:jjfhhbc} \sup_{\bzeta \in \mathbb{R}^d}\norm{T \boldsymbol{g}_1(\bzeta) - T \boldsymbol{g}_2(\bzeta)} \leq \alpha_1 \sup_{\btheta \in \mathbb{R}^d} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\tag{39}\] with \[\alpha_1 := \beta + h \beta \frac{D_2 + h D_3 \kf}{1 - \beta} + \frac{h^2 \beta}{1 - h \ks} \prn[\Big]{\beta \delta + \frac{(D_2 + h D_3 \kf)(D_2 / (1 - \beta) + h \beta \delta) + D_3 \kf}{1 - \beta}}.\]
Next, fix \(\bzeta \in \mathbb{R}^d\), \(\boldsymbol{g}_1, \boldsymbol{g}_2 \in \Gamma\). Let \(\btheta_1\) be the preimage of \(\bzeta\) under the mapping \(\btheta \mapsto \bzeta_1(\btheta) = \btheta - h (1 - \beta)^{-1} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}_1(\btheta)\), and \(\btheta_2\) the preimage of \(\bzeta\) under the mapping \(\btheta \mapsto \bzeta_2(\btheta) = \btheta - h (1 - \beta)^{-1} \nabla L(\btheta) + h^2 \beta \boldsymbol{g}_2(\btheta)\). Recall that the form of the derivative of \(T \boldsymbol{g}_i(\bzeta)\) is given in 36 . Decompose \[\begin{align} &\nabla_{\bzeta} T \boldsymbol{g}_1(\bzeta) - \nabla_{\bzeta} T \boldsymbol{g}_2(\bzeta)\\ &\quad = \prn[\bigg]{\frac{\nabla^2 L(\btheta_2)}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}_2(\btheta_2)} (I - h A_2)^{-1} - \prn[\bigg]{\frac{\nabla^2 L(\btheta_1)}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}_1(\btheta_1)} (I - h A_1)^{-1}\\ &\quad = \underbrace{\brk[\bigg]{\frac{\nabla^2 L(\btheta_2) - \nabla^2 L(\btheta_1)}{h (1 - \beta)} - \beta \crl[\big]{\nabla \boldsymbol{g}_2(\btheta_2) - \nabla \boldsymbol{g}_1(\btheta_1)}} (I - h A_2)^{-1}}_{T_1}\\ &\qquad + \underbrace{\brk[\bigg]{\frac{\nabla^2 L(\btheta_1)}{h (1 - \beta)} - \beta \nabla \boldsymbol{g}_1(\btheta_1)} \prn[\big]{(I - h A_2)^{-1} - (I - h A_1)^{-1}}}_{T_2}, \end{align}\] where \(A_i = (1 - \beta)^{-1} \nabla^2 L(\btheta_i) - h \beta \nabla \boldsymbol{g}_i(\btheta_i)\), \(i \in \crl{1, 2}\).
To bound \(T_1\), first note \[\label{eq:lkjfhh} \norm[\bigg]{\frac{\nabla^2 L(\btheta_1) - \nabla^2 L(\btheta_2)}{h (1 - \beta)}} \leq \frac{D_3}{h (1 - \beta)} \norm{\btheta_1 - \btheta_2}.\tag{40}\] Since \[\bzeta = \btheta_1 - \frac{h}{1 - \beta} \nabla L(\btheta_1) + h^2 \beta \boldsymbol{g}_1(\btheta_1) = \btheta_2 - \frac{h}{1 - \beta} \nabla L(\btheta_2) + h^2 \beta \boldsymbol{g}_2(\btheta_2),\] we have \[\begin{align} \norm{\btheta_1 - \btheta_2} &\leq \frac{h}{1 - \beta} \norm{\nabla L(\btheta_1) - \nabla L(\btheta_2)} + h^2 \beta \norm{\boldsymbol{g}_1(\btheta_1) - \boldsymbol{g}_2(\btheta_2)}\\ &\leq \frac{h D_2}{1 - \beta} \norm{\btheta_1 - \btheta_2} + h^2 \beta \norm{\boldsymbol{g}_1(\btheta_1) - \boldsymbol{g}_1(\btheta_2)} + h^2 \beta \norm{\boldsymbol{g}_1(\btheta_2) - \boldsymbol{g}_2(\btheta_2)}\\ &\leq \prn[\bigg]{\frac{h D_2}{1 - \beta} + h^2 \beta \delta} \norm{\btheta_1 - \btheta_2} + h^2 \beta \norm{\boldsymbol{g}_1(\btheta_2) - \boldsymbol{g}_2(\btheta_2)}. \end{align}\] For \(h\) small enough, \(\frac{h D_2}{1 - \beta} + h^2 \beta \delta < 1\), and we obtain \[\label{eq:jfhhhfb} \norm{\btheta_1 - \btheta_2} \leq \frac{h^2 \beta}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \norm{\boldsymbol{g}_1(\btheta_2) - \boldsymbol{g}_2(\btheta_2)}.\tag{41}\] Continuing 40 , we get \[\begin{align} \norm[\bigg]{\frac{\nabla^2 L(\btheta_1) - \nabla^2 L(\btheta_2)}{h (1 - \beta)}} &\leq \frac{D_3}{h (1 - \beta)} \frac{h^2 \beta}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\\ &= \frac{(1 - \beta)^{-1} h \beta D_3}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}. \end{align}\] Next, use \[\begin{align} \norm{\nabla \boldsymbol{g}_1(\btheta_1) - \nabla \boldsymbol{g}_2(\btheta_2)} &\leq \norm{\nabla \boldsymbol{g}_1(\btheta_1) - \nabla \boldsymbol{g}_1(\btheta_2)} + \norm{\nabla \boldsymbol{g}_1(\btheta_2) - \nabla \boldsymbol{g}_2(\btheta_2)}\\ &\leq \lambda \norm{\btheta_1 - \btheta_2} + \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}\\ &\leq \frac{h^2 \beta \lambda}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)} + \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)} \end{align}\] with the last inequality by 41 . As previously, using \(\norm{I - h A_1}^{-1} \leq \brk{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^{-1}\), we can now conclude \[\begin{align} \norm{T_1} &\leq \prn[\bigg]{\frac{(1 - \beta)^{-1} h \beta D_3}{\prn[\big]{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta}^2} + \frac{h^2 \beta^2 \lambda}{\prn[\big]{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta}^2}} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\\ &\quad + \frac{\beta}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}. \end{align}\]
To bound \(T_2\), use \[\begin{align} &\norm[\big]{(I - h A_2)^{-1} - (I - h A_1)^{-1}} = h \norm{(I - h A_2)^{-1} \brk{A_2 - A_1} (I - h A_1)^{-1}}\\ &\quad \leq \frac{h}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \norm{A_2 - A_1}\\ &\quad = \frac{h}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \norm[\bigg]{\frac{\nabla^2 L(\btheta_2) - \nabla^2 L(\btheta_1)}{1 - \beta} - h \beta \crl[\big]{\nabla \boldsymbol{g}_2(\btheta_2) - \nabla \boldsymbol{g}_1(\btheta_1)}}\\ &\quad \leq \frac{h}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \brk[\bigg]{\frac{(1 - \beta)^{-1} h^2 \beta D_3 + h^3 \beta^2 \lambda}{1 - \frac{h D_2}{1 - \beta} - h^2 \beta \delta} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\\ &\quad + h \beta \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}}\\ &\quad \leq \frac{(1 - \beta)^{-1} h^3 \beta D_3 + h^4 \beta^2 \lambda}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\\ &\quad + \frac{h^2 \beta}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}, \end{align}\] and conclude \[\begin{align} \norm{T_2} &\leq \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} h^2 \beta D_3 + h^3 \beta^2 \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3} \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)}\\ &\quad + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} h \beta}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}. \end{align}\]
Combining the bounds on \(T_1\) and \(T_2\) gives \[\norm[\big]{\nabla_{\bzeta} T \boldsymbol{g}_1(\bzeta) - \nabla_{\bzeta} T \boldsymbol{g}_2(\bzeta)} \leq \tilde{\alpha}_1 \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)} + \tilde{\alpha}_2 \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)},\] where \[\begin{align} &\tilde{\alpha}_1 := \frac{(1 - \beta)^{-1} h \beta D_3}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} + \frac{h^2 \beta^2 \lambda}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2}\\ &\qquad + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} \brk[\big]{(1 - \beta)^{-1} h^2 \beta D_3 + h^3 \beta^2 \lambda}}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^3},\\ &\quad = \frac{\beta h \crl{D_3 / (1 - \beta) + h \beta \lambda}}{(1 - h \ks)^3},\\ &\tilde{\alpha}_2 := \frac{\beta}{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}} + \frac{\prn[\big]{(1 - \beta)^{-1} D_2 + h \beta \delta} h \beta}{\prn[\big]{1 - h \crl{(1 - \beta)^{-1} D_2 + h \beta \delta}}^2} = \frac{\beta}{(1 - h \ks)^2}. \end{align}\]
Combining with 39 , we get \[\begin{align} &\sup_{\bzeta} \norm{T \boldsymbol{g}_1(\bzeta) - T \boldsymbol{g}_2(\bzeta)} + \sup_{\bzeta} \norm[\big]{\nabla_{\bzeta} T \boldsymbol{g}_1(\bzeta) - \nabla_{\bzeta} T \boldsymbol{g}_2(\bzeta)}\\ &\quad \leq (\alpha_1 + \tilde{\alpha}_1) \sup_{\btheta} \norm{\boldsymbol{g}_1(\btheta) - \boldsymbol{g}_2(\btheta)} + \tilde{\alpha}_2 \sup_{\btheta} \norm{\nabla \boldsymbol{g}_1(\btheta) - \nabla \boldsymbol{g}_2(\btheta)}. \end{align}\]
For \(h\) small enough \((\alpha_1 + \tilde{\alpha}_1) \vee \tilde{\alpha}_2 < 1\). ◻
Lemma 7 (Exponential attractivity). The iterates of 1 satisfy \[\norm[\bigg]{\bv^{(n)} + \frac{1}{1 - \beta} \nabla L(\btheta^{(n)}) - h \bg_h(\btheta^{(n)})} \leq (\beta + h^2 \beta \delta)^n \norm[\bigg]{\bv^{(0)} + \frac{1}{1 - \beta} \nabla L(\btheta^{(0)}) - h \bg_h(\btheta^{(0)})}.\]
Proof. The same argument as in kovachki2021continuous? proves this result as well. ◻
In the proof of this theorem, \(O(\cdot)\) will denote a term that is bounded (in absolute value or in norm) by a constant times the argument, where the constant cannot depend on \(n\) or \(h\) but can depend on other fixed values (like \(\ord\), \(T\), \(\beta\)).
Lemma 8. For all \(\btheta \in \mathcal{D}\) and all \(m \in \range{1}{\ord}\), we have \[\label{eq:nomem-coef-bounds} \nabla^{r} \nomemcoef{m}{n}(\btheta) = O(1), \quad \nabla^{r} \histcoef{m}{n}{a}(\btheta) = O(a^m),\quad r \in \range{0}{2 \ord - m}.\qquad{(11)}\]
Proof. Note that if \(\histcoef{l + 1}{n}{s}(\btheta)\) actually appears in the sum on the right of 15 , then \(l + 1\) cannot exceed \(m - 1\), because \(l + 1 \geq m\) would mean \(i \leq 0\) and this term is actually absent (there are no history terms if \(i = 0\)). Therefore, \(\histcoef{m}{n}{a}(\btheta)\) is defined by \(\crl[\big]{\histcoef{j}{n}{s}(\btheta)}_{1 \leq j \leq m - 1, 1 \leq s \leq a}\) and \(\crl[\big]{\nomemcoef{j}{n - s}(\btheta)}_{1 \leq j \leq m}\). By the same logic, from 14 , \(\nomemcoef{m}{n}(\btheta)\) is defined through \(\crl[\big]{\histcoef{j}{n}{k}(\btheta)}_{1 \leq j \leq m - 1, 1 \leq k \leq n}\).
So, we prove ?? by induction in \(m\). For \(m = 1\) it is easy to check. To prove that the statement holds for \(m\) assuming that it holds for all \(m' < m\), we notice that since there is exponential averaging in the definition of \(\nomemcoef{m}{n}(\btheta)\) and the derivatives of the mini-batch losses are bounded, we have \[\begin{align} &\abs{\nomemcoef{m}{n}(\btheta)} \lesssim \sum_{k = 1}^n \beta^k \sum_{\substack{i, l \geq 0\\i + l = m - 1}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} k^{i_0 + \ldots + (l + 1) i_l}\\ &\quad \lesssim \sum_{k = 1}^n \beta^k \sum_{\substack{i, l \geq 0\\i + l = m - 1}} k^{i + l} \lesssim \sum_{k = 1}^n \beta^k k^{m - 1} = O(1). \end{align}\] Adding \(\nabla^{r}\) in front of \(\nomemcoef{m}{n}(\btheta)\) influences only the number of terms in the sum (but it remains bounded) and constant coefficients in the bounds above (the induction assumption is chosen such that we can still apply it if we add derivatives to the history terms), so the statement \(\nabla^{r} \nomemcoef{m}{n}(\btheta) = O(1)\) also holds.
Using this and bounds on the previous history terms (and their derivatives), we can also see \[\begin{align} &\abs{\histcoef{m}{n}{a}(\btheta)} \lesssim \sum_{s = 1}^a \sum_{\substack{j \geq 1, i, l \geq 0\\i + j + l = m}} \sum_{(k_0, \ldots, k_l) \in \mathcal{K}_{i, l}} s^{k_0 + \ldots + (l + 1) k_l}\\ &\quad \lesssim \sum_{s = 1}^a \sum_{\substack{j \geq 1, i, l \geq 0\\i + j + l = m}} s^{i + l} \lesssim \sum_{s = 1}^a (1 + \ldots + s^{m - 1}) \lesssim \sum_{s = 1}^a s^{m - 1} \lesssim a^m. \end{align}\] Again, adding \(\nabla^{r}\) in front of \(\histcoef{m}{n}{a}(\btheta)\) only influences the number of terms in the sum (though it remains bounded) and constant coefficients in these bounds. This proves \(\nabla^{r} \histcoef{m}{n}{a}(\btheta) = O(a^m)\), completing the induction step. ◻
Lemma 9. We have \(\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)} = O(k h)\) for any \(k \in \range{1}{n}\).
Proof. This follows from the fact that the right-hand side in 5 is \(O(h)\), because \(\nomemcoef{j}{n}(\btheta) = O(1)\) by 8 and \[\sum_{j = 0}^{\ord - 1} h^j = \frac{1 - h^{\ord}}{1 - h} \leq \frac{1}{1 - h} \labrel{1}[=] O(1),\] where the last step is because \(h\) is bounded away from \(1\). ◻
Recall that, by definition, ?? will follow from the following bound 16 on the error introduced by removing memory. By Taylor’s theorem, we have \[\begin{align} &\nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}} = \sum_{i = 0}^{\ord - 1} \frac{1}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}, \ldots, \tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}_{\text{i times}}} + \text{Rem}^{(n - k)}, \end{align}\] where \[\begin{gather} \text{Rem}^{(n - k)} = \frac{1}{(\ord - 1) !} \int_0^1 (1 - t)^{\ord - 1}\\ \times \nabla^{\ord + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)} + t \prn[\big]{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}, \ldots, \tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}_{\text{\ord times}}}\,\mathrm{d} t. \end{gather}\] Using the boundedness of derivatives of \(L^{(n - k)}(\cdot)\) and 9, we can write \[\begin{align} &\nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}} = \sum_{i = 0}^{\ord - 1} \frac{1}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}, \ldots, \tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}}_{\text{i times}}} + O(k^{\ord} h^{\ord}).\numberthis\label{eq:oiriy} \end{align}\tag{42}\]
We will now express the difference \(\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}\) in 42 through \(\tilde{\btheta}^{(n)}\).
Proof of 1. We induct over \(r\).
For \(r = 1\) this is the assertion of 9. From now on, let \(r \geq 2\).
Now make the induction assumption that ?? holds with \(r\) replaced by \(r - j\) for any \(j \in \range{1}{r - 1}\).
Like in the proof of 9, using \(\nomemcoef{r}{n}(\btheta) = O(1)\) by 8 and the fact that \(h\) is bounded away from \(1\), we write for \(s \in \range{1}{k}\) \[\tilde{\btheta}^{(n - s + 1)} = \tilde{\btheta}^{(n - s)} + \sum_{j = 1}^{r - 1} h^j \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}} + O(h^r).\] Inserting this, we get \[\begin{align} &\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)} = \sum_{s = 1}^k \crl[\big]{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n - s + 1)}}\\ &\quad = - \sum_{s = 1}^k \crl[\bigg]{\sum_{j = 1}^{r - 1} h^j \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}} + O(h^r)}\\ &\quad = - \sum_{s = 1}^k \sum_{j = 1}^{r - 1} h^j \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}} + O(k h^r).\numberthis\label{eq:wpoii} \end{align}\tag{43}\]
By Taylor’s theorem, \[\begin{align} \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}} &= \sum_{i = 0}^{r - 1 - j} \frac{1}{i!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n)}}_{\text{i times}}}\\ &\quad + \frac{1}{(r - 1 - j)!} \int_{0}^1 (1 - t)^{r - 1 - j} \\ &\qquad \times \nabla^{r - j} \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)} + t \prn[\big]{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n)}}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n)}}_{\text{r - j times}}}\,\mathrm{d} t\\ &\labrel{1}[=] \sum_{i = 0}^{r - 1 - j} \frac{1}{i!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\big]{\underbrace{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n)}}_{\text{i times}}} + O\prn[\big]{s^{r - j} h^{r - j}}\\ &\labrel{2}[=] \sum_{i = 0}^{r - 1 - j} \frac{1}{i!} \nabla^i \nomemcoef{j}{n - s}(\tilde{\btheta}^{(n)}) \prn[\big]{\underbrace{\tilde{\btheta}^{(n - s)} - \tilde{\btheta}^{(n)}}_{\text{i times}}} + O\prn[\big]{k^{r - j} h^{r - j}},\numberthis\label{eq:kjlkfy} \end{align}\tag{44}\] where in we used that the derivatives of \(\nomemcoef{j}{s}(\btheta)\) are bounded (8) and 9; is just because \(s \leq k\). By the induction assumption, \[\tilde{\btheta}^{(n - s)} = \tilde{\btheta}^{(n)} + \sum_{l = 1}^{r - 1 - j} h^l \histcoef{l}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}} + O(k^{r - j} h^{r - j}),\] which we insert into 44 , giving \[\begin{align} &\nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}}\\ &\quad = \sum_{i = 0}^{r - 1 - j} \frac{1}{i!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum_{l = 1}^{r - 1 - j} h^l \histcoef{l}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}} + O(k^{r - j} h^{r - j})}_{\text{i times}}} + O\prn[\big]{k^{r - j} h^{r - j}}\\ &\quad \labrel{1}[=] \sum_{i = 0}^{r - 1 - j} \frac{1}{i!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum_{l = 1}^{r - 1 - j} h^l \histcoef{l}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i times}}} + O\prn[\big]{k^{r - j} h^{r - j}}\\ &\quad = \sum_{i = 0}^{r - 1 - j} \frac{h^i}{i!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum^{r - 2 - j}_{l = 0} h^l \histcoef{l + 1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i times}}} + O\prn[\big]{k^{r - j} h^{r - j}}\\ &\quad \labrel{2}[=] \sum_{i = 0}^{r - 1 - j} \sum_{l = 0}^{r - 1 - j - i} h^{i + l}\\ &\qquad\quad \times \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}}\\ &\qquad + O\prn[\big]{k^{r - j} h^{r - j}} \end{align}\] where in we used that components of \(\nabla^i \nomemcoef{j}{n - s}(\btheta)\) are bounded (8) and for \(i \geq 1\) we have \((k h)^{i (r - j)} = O \prn[\big]{(k h)^{r - j}}\) because \(k h\) does not exceed \(T\); in we also used that \(\histcoef{l}{n}{s}(\btheta) = O(s^l)\) by 8. 
Inserting this into 43 gives \[\begin{align} &\tilde{\btheta}^{(n - k)} - \tilde{\btheta}^{(n)}\\ &\quad = - \sum_{s = 1}^k \sum_{j = 1}^{r - 1} h^j \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n - s)}} + O(k h^{r})\\ &\quad = - \sum_{s = 1}^k \sum_{j = 1}^{r - 1} h^j \sum_{i = 0}^{r - 1 - j} \sum_{l = 0}^{r - 1 - j - i} h^{i + l} \times \\ &\qquad \quad \times \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}}\\ &\qquad - \sum_{s = 1}^k \sum_{j = 1}^{r - 1} h^j O\prn[\big]{k^{r - j} h^{r - j}} + O(k h^{r})\\ &\quad = - \sum_{m = 1}^{r - 1} h^m \sum_{s = 1}^k \sum_{\substack{j \geq 1, i, l \geq 0\\i + j + l = m}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \nabla^i \nomemcoef{j}{n - s}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{s}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}}\\ &\qquad + O(k^{r} h^{r})\\ &\quad \labrel{1}[=] \sum_{m = 1}^{r - 1} h^m \histcoef{m}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}} + O(k^{r} h^{r}), \end{align}\] where in we used 15 . We have completed the induction step for ?? . ◻
We will now conclude the proof of 16 . Inserting ?? into 42 gives \[\begin{align} &\nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}\\ &\quad = \sum_{i = 0}^{\ord - 1} \frac{1}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum_{m = 1}^{\ord - 1} h^m \histcoef{m}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}} + O(k^{\ord} h^{\ord})}_{\text{i times}}} + O(k^{\ord} h^{\ord})\\ &\quad \labrel{1}[=] \sum_{i = 0}^{\ord - 1} \frac{1}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum_{m = 1}^{\ord - 1} h^m \histcoef{m}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i times}}} + O(k^{\ord} h^{\ord})\\ &\quad = \sum_{i = 0}^{\ord - 1} \frac{h^i}{i!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\sum_{m = 0}^{\ord - 2} h^m \histcoef{m + 1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i times}}} + O(k^{\ord} h^{\ord})\\ &\quad = \sum_{m = 0}^{\ord - 1} h^m \sum_{\substack{i, l \geq 0\\i + l = m}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}}\\ &\qquad + O(k^{\ord} h^{\ord}) \end{align}\] where in we used that for \(i \geq 1\) we have \((k h)^{i \ord} = O \prn[\big]{(k h)^{\ord}}\) because \(k h\) does not exceed \(T\). Using exponential summation gives \[\begin{align} &- \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}\\ &\quad = - \nabla L^{(n)}\prn[\big]{\tilde{\btheta}^{(n)}}\\ &\qquad - \sum_{m = 0}^{\ord - 1} h^m \sum_{k = 1}^n \beta^k \sum_{\substack{i, l \geq 0\\i + l = m}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \times\\ &\qquad \qquad \qquad \times \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}} + O(h^{\ord})\\ &\quad = - \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}}\\ &\qquad - \sum_{m = 1}^{\ord - 1} h^m \sum_{k = 1}^n \beta^k \sum_{\substack{i, l \geq 0\\i + l = m}} \sum_{(i_0, \ldots, i_l) \in \mathcal{K}_{i, l}} \frac{1}{i_0! \ldots i_l!} \times \\ &\qquad \qquad \qquad \times \nabla^{i + 1} L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n)}} \prn[\bigg]{\underbrace{\histcoef{1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{l + 1}{n}{k}\prn[\big]{\tilde{\btheta}^{(n)}}}_{\text{i_l times}}} + O(h^{\ord})\\ &\quad = \sum_{m = 0}^{\ord - 1} h^m \nomemcoef{m + 1}{n}\prn[\big]{\tilde{\btheta}^{(n)}} + O(h^{\ord}), \end{align}\] as desired. Since 16 was sufficient for ?? , we have proven ?? as well.
Lemma 10. ?? follows from ?? .
Proof. The argument is standard. Define the error at the \(n\)th step \[\be^{(n)} := \btheta^{(n)} - \tilde{\btheta}^{(n)}\] and local error \[\bdelta^{(n)} := \tilde{\btheta}^{(n + 1)} - \tilde{\btheta}^{(n)} + h \sum_{k = 0}^n \beta^k \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}},\] where \(n \in \range{0}{\lfloor T / h \rfloor}\). By definition and Taylor’s theorem, we have \[\begin{align} \be^{(n + 1)} &= \be^{(n)} - h \sum_{k = 0}^n \beta^k \prn[\Big]{ \nabla L^{(n - k)}\prn[\big]{\btheta^{(n - k)}} - \nabla L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)}}} - \bdelta^{(n)}\\ &= \be^{(n)} - h \sum_{k = 0}^n \beta^k \int_0^1 \nabla^2 L^{(n - k)}\prn[\big]{\tilde{\btheta}^{(n - k)} + t \prn[\big]{\btheta^{(n - k)} - \tilde{\btheta}^{(n - k)}} } \prn[\big]{\btheta^{(n - k)} - \tilde{\btheta}^{(n - k)}} \,\mathrm{d} t - \bdelta^{(n)}. \end{align}\]
Since the derivatives of the loss are bounded, this implies \[\norm{\be^{(n + 1)}} \leq \norm{\be^{(n)}} + h \oC{lglpr} \sum_{k = 0}^n \beta^k \norm{\be^{(n - k)}} + \norm{\bdelta^{(n)}}\] with some constant \(\nC{lglpr}\). Denoting \(s^{(n)} := \max_{0 \leq k \leq n} \norm{\be^{(k)}}\), we have \[s^{(n + 1)} \leq s^{(n)} + h \oC{lgl} s^{(n)} + \norm{\bdelta^{(n)}} = (1 + h \oC{lgl}) s^{(n)} + \norm{\bdelta^{(n)}} \labrel{1}[\leq] (1 + h \oC{lgl}) s^{(n)} + \oC{clocal} h^{\ord + 1},\] with some constant \(\nC{lgl}\), where the last inequality is by ?? . Applying this inequality iteratively and using \(s^{(0)} = 0\), we obtain \[\begin{align} s^{(n)} &\leq (1 + h \oC{lgl})^n s^{(0)} + \frac{(1 + h \oC{lgl})^n - 1}{\oC{lgl}} \oC{clocal} h^{\ord}\\ &= \frac{(1 + h \oC{lgl})^n - 1}{\oC{lgl}} \oC{clocal} h^{\ord} \labrel{1}[\leq] \frac{e^{\oC{lgl} n h} - 1}{\oC{lgl}} \oC{clocal} h^{\ord} \leq \oC{jjqh} e^{\oC{lgl} n h} h^{\ord} \leq \oC{jjqh} e^{\oC{lgl} T} h^{\ord}, \end{align}\] where we used the inequality \(1 + x \leq e^x\), valid for all \(x \geq 0\), and \(\nC{jjqh}\) is some constant. It remains to apply this bound with \(n = \lfloor T / h \rfloor\) and set \(\oC{cglobal} = \oC{jjqh} e^{\oC{lgl} T}\). ◻
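As a quick numerical illustration of the iterative unrolling above (with arbitrary constants standing in for \(\oC{lgl}\), \(\oC{clocal}\), \(h\), and \(\ord\); not part of the proof), the following sketch iterates the recursion with equality and recovers the closed form and the exponential relaxation.

```python
# Illustrative check of the iterative bound: iterate s_{k+1} = (1 + h*C) s_k + C_loc * h**(order+1)
# from s_0 = 0 and compare with the closed form and the e^{C n h} relaxation used above.
from math import exp

h, C, C_loc, order, n = 1e-3, 2.0, 3.0, 3, 500   # arbitrary illustrative constants

s = 0.0
for _ in range(n):
    s = (1 + h * C) * s + C_loc * h ** (order + 1)

closed = ((1 + h * C) ** n - 1) / C * C_loc * h ** order
bound = C_loc / C * exp(C * n * h) * h ** order
print(s, closed, bound)   # s equals `closed` up to rounding, and closed <= bound
```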
We have proven both ?? and ?? , concluding the proof of 2.
Proof of 2. We prove this by induction on \(m\).
For the tree \(\tau\) with vertices \(\crl{1, 2}\) where \(1\) is the root (the base case), we have \[\begin{align} \Emark{\tau}{\crl{2}}{a}{n - l \to n} + \Emark{\tau}{\varnothing}{a}{n - l \to n} &= \sum_{b = 0}^{n - l - a} \beta^b \nabla^2 L^{(n - l - a - b)} \sum_{l' = 1}^l \E{\rootedtree[]}{l'}{n} + \sum_{b = 0}^{n - l - a} \beta^b \nabla^2 L^{(n - l - a - b)} \sum_{l' = 1}^{a + b} \E{\rootedtree[]}{l'}{n - l}\\ &= \sum_{b = 0}^{n - l - a} \beta^b \nabla^2 L^{(n - l - a - b)} \sum_{l' = 1}^{l + a + b} \E{\rootedtree[]}{l'}{n} = \E{\rootedtree[[]]}{l + a}{n}, \end{align}\] where we used \[\begin{align} \sum_{l' = 1}^l \E{\rootedtree[]}{l'}{n} + \sum_{l' = 1}^{a + b} \E{\rootedtree[]}{l'}{n - l} &= \sum_{l' = 1}^l \sum_{b_1 = 0}^{n - l'} \beta^{b_1} \nabla L^{(n - l' - b_1)} + \sum_{l' = 1}^{a + b} \sum_{b_1 = 0}^{n - l - l'} \beta^{b_1} \nabla L^{(n - l - l' - b_1)}\\ &= \sum_{l' = 1}^l \sum_{b_1 = 0}^{n - l'} \beta^{b_1} \nabla L^{(n - l' - b_1)} + \sum_{l' = l + 1}^{l + a + b} \sum_{b_1 = 0}^{n - l'} \beta^{b_1} \nabla L^{(n - l' - b_1)}\\ &= \sum_{l' = 1}^{l + a + b} \sum_{b_1 = 0}^{n - l'} \beta^{b_1} \nabla L^{(n - l' - b_1)} = \sum_{l' = 1}^{l + a + b} \E{\rootedtree[]}{l'}{n}. \end{align}\]
Let \(r\) be the root of \(\tau\), \(v_1, \ldots, v_{\ell}\) its children, \(\tau_1, \ldots, \tau_{\ell}\) the corresponding subtrees rooted at the children, \(V_1, \ldots, V_{\ell}\) their vertex sets (partitioning \(\range{1}{m} \setminus \crl{r}\)). From the definition, \[\Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{b = 0}^{n - l - a} \beta^b \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{\sum_{l' = 1}^l \E{\tau_s}{l'}{n} \inde{\text{v_s \in \mrk}} + \sum_{l' = 1}^{a + b} \Emark{\tau_s}{\mrk \cap V_s}{l'}{n - l \to n} \inde{\text{v_s \notin \mrk}}}_{s = 1}^{\ell},\] which means \[\begin{align} &\sum_{\substack{\mrk \in \mrkset_{\tau}\\v_1 \notin \mrk}} \Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{\substack{\mrk \in \mrkset_{\tau}\\v_1 \notin \mrk}} \sum_{b = 0}^{n - l - a} \beta^b \times \\ &\qquad \quad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{ \sum_{l' = 1}^{a + b} \Emark{\tau_1}{\mrk \cap V_1}{l'}{n - l \to n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk \cap V_\ell}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk}} }\\ &\quad = \sum_{\mrk_1 \in \mrkset_{\tau_1}} \sum_{\mrk_2 \in \mrkset_{\tau \setminus \tau_1}} \sum_{b = 0}^{n - l - a} \beta^b \times \\ &\qquad \quad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{ \sum_{l' = 1}^{a + b} \Emark{\tau_1}{\mrk_1}{l'}{n - l \to n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk_2}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk_2 \cap V_{\ell}}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk_2}} }\\ &\quad = \sum_{\mrk_2 \in \mrkset_{\tau \setminus \tau_1}} \sum_{b = 0}^{n - l - a} \beta^b \times\\ &\qquad \quad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{\sum_{l' = 1}^{a + b} \sum_{\mrk_1 \in \mrkset_{\tau_1}} \Emark{\tau_1}{\mrk_1}{l'}{n - l \to n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk_2}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk_2 \cap V_\ell}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk_2}}}. \end{align}\] Here, we used the one-to-one correspondence between markings \(\mrk\) not containing \(v_1\) and pairs \((\mrk_1, \mrk_2)\), where \(\mrk_1\) is a marking of \(\tau_1\) and \(\mrk_2\) is a marking of the tree \(\tau \setminus \tau_1\). Now we apply the induction hypothesis (since \(\tau_1\) is a smaller tree) and replace \(\sum_{\mrk_1 \in \mrkset_{\tau_1}} \Emark{\tau_1}{\mrk_1}{l'}{n - l \to n}\) with \(\E{\tau_1}{l + l'}{n}\): \[\begin{align} &\sum_{\substack{\mrk \in \mrkset_{\tau}\\v_1 \notin \mrk}} \Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{\mrk_2 \in \mrkset_{\tau \setminus \tau_1}} \sum_{b = 0}^{n - l - a} \beta^b\\ &\qquad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{\sum_{l' = 1}^{a + b} \E{\tau_1}{l + l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk_2}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk_2 \cap V_{\ell}}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk_2}}}. \end{align}\]
We deal with the case \(v_1 \in \mrk\) similarly, obtaining \[\begin{align} &\sum_{\substack{\mrk \in \mrkset_{\tau}\\v_1 \in \mrk}} \Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{\mrk_2 \in \mrkset_{\tau \setminus \tau_1}} \sum_{b = 0}^{n - l - a} \beta^b\\ &\qquad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{\sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk_2}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk_2 \cap V_\ell}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk_2}}}. \end{align}\] Adding the latter two equations and using \(\sum_{l' = 1}^{a + b} \E{\tau_1}{l + l'}{n} + \sum_{l' = 1}^l \E{\tau_1}{l'}{n} = \sum_{l' = 1}^{l + a + b} \E{\tau_1}{l'}{n}\) gives \[\begin{align} &\sum_{\substack{\mrk \in \mrkset_{\tau}}} \Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{\mrk_2 \in \mrkset_{\tau \setminus \tau_1}} \sum_{b = 0}^{n - l - a} \beta^b\\ &\qquad \times \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{\sum_{l' = 1}^{l + a + b} \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_\ell}{l'}{n} \inde{\text{v_\ell \in \mrk_2}} + \sum_{l' = 1}^{a + b} \Emark{\tau_{\ell}}{\mrk_2 \cap V_\ell}{l'}{n - l \to n} \inde{\text{v_\ell \notin \mrk_2}}}. \end{align}\] Continuing the same argument with \(v_2, \ldots, v_{\ell}\), we will arrive at \[\begin{align} \sum_{\substack{\mrk \in \mrkset_{\tau}}} \Emark{\tau}{\mrk}{a}{n - l \to n} = \sum_{b = 0}^{n - l - a} \beta^b \nabla^{\ell + 1} L^{(n - l - a - b)} \brk[\bigg]{ \sum_{l' = 1}^{l + a + b} \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^{l + a + b} \E{\tau_\ell}{l'}{n} }. \end{align}\] By definition, the right-hand side is equal to \(\E{\tau}{l + a}{n}\), completing the induction step. ◻
Proof of 3. Consider the following operation. Choose a partition of \(\range{1}{m}\) into \(i + 1\) disjoint non-empty sets \((V_0, \crl{V_1, \ldots, V_i})\), where \(0 \leq i \leq m - 1\) (the set \(V_0\) is privileged, but the order among the remaining sets does not matter); choose a labeled rooted tree \(\tau_0 \in \mathcal{A}[V_0]\) with root \(v_0 \in V_0\), another labeled rooted tree \(\tau_1 \in \mathcal{A}[V_1]\) with root \(v_1 \in V_1\), and so on, up to \(\tau_i \in \mathcal{A}[V_i]\) with root \(v_i \in V_i\).
Now, assign to each of \(v_1, \ldots, v_i\) a parent among vertices of \(\tau_0\) in all possible ways, by choosing all mappings from \(\crl{v_1, \ldots, v_i}\) to \(V_0\). Write the following expression corresponding to the family of all such assignments: \[\label{eq:expr-corr-to-parent-assignm} \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{\sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n}}.\tag{45}\] This way, we have constructed a family of labeled rooted trees \(\tau \in \mathcal{A}\range{1}{m}\) with roots of \(\tau_1, \ldots, \tau_i\) marked.
For example, consider \(m = 3\). When \(i = 0\), there is only one partition and three corresponding marked trees (with no vertices marked): \[V_0 = \crl{1, 2, 3}\quad \bigrootedtree[1[2][3]]\quad \bigrootedtree[2[3][1]]\quad \bigrootedtree[3[1][2]]\] When \(i = 1\), there are six partitions listed below with corresponding marked trees \[\begin{align} &V_0 = \crl{1}, V_1 = \crl{2, 3}\quad \bigrootedtree[1[2,text=red[3]]]\quad \bigrootedtree[1[3,text=red[2]]]\\ &V_0 = \crl{2}, V_1 = \crl{1, 3}\quad \bigrootedtree[2[1,text=red[3]]]\quad \bigrootedtree[2[3,text=red[1]]]\\ &V_0 = \crl{3}, V_1 = \crl{1, 2}\quad \bigrootedtree[3[1,text=red[2]]]\quad \bigrootedtree[3[2,text=red[1]]]\\ &V_0 = \crl{1, 2}, V_1 = \crl{3}\quad \bigrootedtree[1[2][3,text=red]]\quad \bigrootedtree[1[2[3,text=red]]]\quad \bigrootedtree[2[1][3,text=red]]\quad \bigrootedtree[2[1[3,text=red]]]\\ &V_0 = \crl{1, 3}, V_1 = \crl{2}\quad \bigrootedtree[1[3][2,text=red]]\quad \bigrootedtree[1[3[2,text=red]]]\quad \bigrootedtree[3[1][2,text=red]]\quad \bigrootedtree[3[1[2,text=red]]]\\ &V_0 = \crl{2, 3}, V_1 = \crl{1}\quad \bigrootedtree[2[3][1,text=red]]\quad \bigrootedtree[2[3[1,text=red]]]\quad \bigrootedtree[3[2][1,text=red]]\quad \bigrootedtree[3[2[1,text=red]]] \end{align}\] When \(i = 2\), there are three partitions with corresponding marked trees \[\begin{align} &V_0 = \crl{1}, V_1 = \crl{2}, V_2 = \crl{3}\quad \bigrootedtree[1[2,text=red][3,text=red]]\\ &V_0 = \crl{2}, V_1 = \crl{1}, V_2 = \crl{3}\quad \bigrootedtree[2[1,text=red][3,text=red]]\\ &V_0 = \crl{3}, V_1 = \crl{1}, V_2 = \crl{2}\quad \bigrootedtree[3[1,text=red][2,text=red]] \end{align}\]
Sum the expression 45 over the choices of a partition and the trees \(\tau_0, \ldots, \tau_i\): \[\sum_{i = 0}^{m - 1} \sum_{\substack{(V_0, \crl{V_1\ldots, V_i})\\\text{partition of \range{1}{m}}}} \sum_{\tau_0 \in \mathcal{A}[V_0], \ldots, \tau_i \in \mathcal{A}[V_i]} \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{\sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n}}.\] Since all marked trees in \(\mathcal{A}\range{1}{m}\) can be constructed this way and are counted exactly once in this sum, this equals \[\sum_{\tau \in \mathcal{A}\range{1}{m}} \sum_{\mrk \in \mrkset_{\tau}} \Emark{\tau}{\mrk}{a}{n - l \to n}.\] But \(\sum_{\mrk \in \mrkset_{\tau}} \Emark{\tau}{\mrk}{a}{n - l \to n}\) is \(\E{\tau}{l + a}{n}\) by ?? . Hence, we have obtained \[\sum_{i = 0}^{m - 1} \sum_{\substack{(V_0, \crl{V_1\ldots, V_i})\\\text{partition of \range{1}{m}}}} \sum_{\tau_0 \in \mathcal{A}[V_0], \ldots, \tau_i \in \mathcal{A}[V_i]} \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{\sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n}} = \sum_{\tau \in \mathcal{A}\range{1}{m}} \E{\tau}{l + a}{n}.\] Recall that \(m! / \sigma(\tau)\) labeled rooted trees correspond to the same unlabeled rooted tree. Hence, this can be rewritten as \[\label{eq:ljdshh} \begin{align} &\sum_{i = 0}^{m - 1} \sum_{\substack{(V_0, \crl{V_1\ldots, V_i})\\\text{partition of \range{1}{m}}}} \sum_{\tau_0 \in \tilde{\mathcal{A}}[\abs{V_0}], \ldots, \tau_i \in \tilde{\mathcal{A}}[\abs{V_i}]} \frac{\abs{V_0}!}{\sigma(\tau_0)}\\ &\qquad \times \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{\frac{\abs{V_1}!}{\sigma(\tau_1)} \sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \frac{\abs{V_i}!}{\sigma(\tau_i)} \sum_{l' = 1}^l \E{\tau_i}{l'}{n}} = \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{m!}{\sigma(\tau)} \E{\tau}{l + a}{n}. \end{align}\tag{46}\]
Let us fix \(j\), \((k_0, \ldots, k_{m - j - i}) \in \mathcal{K}_{i, m - j - i}\) and count the number of partitions with \(\abs{V_0} = j\), where among \(\crl{V_1, \ldots, V_i}\) there are \(k_0\) sets of size \(1\), \(\ldots\), \(k_{m - j - i}\) sets of size \(m - j - i + 1\). First, we choose the elements of \(V_0\) in \(\binom{m}{j}\) ways, then we order the remaining elements in \((m - j)!\) ways and assign the first \(k_0\) elements in this ordering as singletons, the next \(k_1\) pairs as two-sets, and so on. Notice that each partition of \(V \setminus V_0\) will be counted \(k_0! \ldots k_{m - j - i}! 1!^{k_0} \ldots (m - j - i + 1)!^{k_{m - j - i}}\) times (because the order of sets with the same length does not matter, and the order within each set does not matter). So, the required number of partitions is \[\begin{align} &\binom{m}{j} \frac{(m - j)!}{k_0! \ldots k_{m - j - i}! 1!^{k_0} \ldots (m - j - i + 1)!^{k_{m - j - i}}}\\ &\quad = \frac{m!}{j! k_0! \ldots k_{m - j - i}! 1!^{k_0} \ldots (m - j - i + 1)!^{k_{m - j - i}}}. \end{align}\] Thus, 46 can be rewritten as \[\begin{align} &\sum_{i = 0}^{m - 1} \sum_{j = 1}^{m - i} \sum_{(k_0, \ldots, k_{m - j - i}) \in \mathcal{K}_{i, m - j - i}} \frac{m!}{k_0! \ldots k_{m - j - i}!} \sum_{\tau_0 \in \tilde{\mathcal{A}}[j]} \frac{1}{\sigma(\tau_0)} \\ &\qquad \times \nabla^i \E{\tau_0}{a}{n - l} \brk[\bigg]{ \underbrace{\sum_{\tau \in \tilde{\mathcal{A}}[1]}\frac{1}{\sigma(\tau)} \sum_{l' = 1}^l \E{\tau}{l'}{n}}_{\text{k_0 times}}, \ldots, \underbrace{\sum_{\tau \in \tilde{\mathcal{A}}[m - j - i + 1]}\frac{1}{\sigma(\tau)} \sum_{l' = 1}^l \E{\tau}{l'}{n}}_{\text{k_{m - j - i} times}} } = \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{m!}{\sigma(\tau)} \E{\tau}{l + a}{n}, \end{align}\] where \(j! 1!^{k_0} \ldots (m - j - i + 1)!^{k_{m - j - i}}\) canceled out with \(\abs{V_0}! \abs{V_1}! \ldots \abs{V_i}!\). Dividing both sides by \(m!\) completes the proof. ◻
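The partition count above is easy to confirm by brute force on a small set; the sketch below (with the illustrative choice \(r = 6\) for the size of the ambient set and an ad hoc helper `set_partitions`) compares, for every block-size profile, the enumerated count with the multinomial formula.

```python
# Brute-force check (r = 6, illustrative) of the partition count used above:
# the number of partitions of an r-element set with k_s blocks of size s equals
# r! / (prod_s k_s! * (s!)^{k_s}).  `set_partitions` is an ad hoc helper.
from math import factorial
from collections import Counter
from itertools import combinations

def set_partitions(elems):
    # all partitions of the tuple `elems` into non-empty blocks
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for size in range(len(rest) + 1):
        for others in combinations(rest, size):
            remaining = tuple(e for e in rest if e not in others)
            for tail in set_partitions(remaining):
                yield [(first,) + others] + tail

r = 6
counts = Counter()
for part in set_partitions(tuple(range(1, r + 1))):
    counts[tuple(sorted(len(block) for block in part))] += 1

for profile, observed in sorted(counts.items()):
    denom = 1
    for size, k in Counter(profile).items():
        denom *= factorial(k) * factorial(size) ** k
    predicted = factorial(r) // denom
    print(profile, observed, predicted)   # observed == predicted for every profile
```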
Proof of 3. We prove 22 and 23 by induction on \(m \geq 2\).
For \(m = 2\), they are already verified above. Note also that the second statement holds for \(m = 1\) as well: \[\histcoef{1}{n}{k} = \sum_{l = 1}^k \E{\rootedtree[]}{l}{n}.\]
By definition, \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \frac{1}{i_0! \ldots i_{m - 1 - \ell}!} \times \\ &\qquad \qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\underbrace{\histcoef{1}{n}{b + 1}}_{\text{i_0 times}}, \ldots, \underbrace{\histcoef{m - \ell}{n}{b + 1}}_{\text{i_{m - 1 - \ell} times}}} \end{align}\] Insert the induction hypothesis (recall that 23 holds for \(m = 1\) too): \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \frac{1}{i_0! \ldots i_{m - 1 - \ell}!}\\ &\qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\underbrace{\sum_{l = 1}^{b + 1} \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}}_{\text{i_0 times}}, \ldots, \underbrace{\sum_{l = 1}^{b + 1} \sum_{\tau \in \tilde{\mathcal{A}}[m - \ell]} \frac{1}{\sigma(\tau)} \E{\tau}{l}{n}}_{\text{i_{m - 1 - \ell} times}}} \end{align}\]
The sum over \(\tilde{\mathcal{A}}[s]\), repeated \(i_{s - 1}\) times, generates a list of \(i_{s - 1}\)-tuples of trees with \(s\) vertices. On this list, each multiset of \(i_{s - 1}\) trees with multiplicities \(\mu_1^s, \ldots, \mu^s_{\abs{\tilde{\mathcal{A}}[s]}}\) appears \(\binom{i_{s - 1}}{\mu_1^s \ldots \mu^s_{\abs{\tilde{\mathcal{A}}[s]}}}\) times. Therefore, in the large sum above each multiset of \(\ell\) trees with the total number of vertices \(m - 1\) and matching the vertex-count multiplicities \(i_0, \ldots, i_{m - 1 - \ell}\) (equivalently, each tree \(\tau\) with \(m\) vertices whose root has \(\ell\) children matching these vertex-count multiplicities) appears \[\prod_{s = 1}^{m - \ell} \frac{i_{s - 1}!}{\mu_1^s(\tau)! \ldots \mu_{\abs{\tilde{\mathcal{A}}[s]}}^s(\tau)!} = \frac{i_0! \ldots i_{m - 1 - \ell}!}{\prod_{s = 1}^{m - \ell} \prn[\big]{\mu_1^s(\tau)! \ldots \mu_{\abs{\tilde{\mathcal{A}}[s]}}^s(\tau)!}}\] times, and we can rewrite \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \sum_{\substack{\tau = [\tau_1, \ldots, \tau_{\ell}] \in \tilde{\mathcal{A}}[m]:\\ \text{\tau matches i_0, \ldots, i_{m - 1 - \ell}}}} \frac{1}{\prod_{s = 1}^{m - \ell} \prn[\big]{\mu_1^s(\tau)! \ldots \mu_{\abs{\tilde{\mathcal{A}}[s]}}^s(\tau)!}}\\ &\qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\frac{1}{\sigma(\tau_1)} \sum_{l = 1}^{b + 1} \E{\tau_1}{l}{n}, \ldots, \frac{1}{\sigma(\tau_{\ell})} \sum_{l = 1}^{b + 1} \E{\tau_{\ell}}{l}{n}}. \end{align}\] Using 10 simplifies this further: \[\begin{align} &\nomemcoef{m}{n} = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{(i_0, \ldots, i_{m - 1 - \ell}) \in \mathcal{K}_{\ell, m - 1 - \ell}} \sum_{\substack{\tau = [\tau_1, \ldots, \tau_{\ell}] \in \tilde{\mathcal{A}}[m]:\\ \text{\tau matches i_0, \ldots, i_{m - 1 - \ell}}}} \frac{1}{\sigma(\tau)}\\ &\qquad \times \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\sum_{l = 1}^{b + 1} \E{\tau_1}{l}{n}, \ldots, \sum_{l = 1}^{b + 1} \E{\tau_{\ell}}{l}{n}}\\ &\quad = - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{\ell = 0}^{m - 1} \sum_{\substack{\tau = [\tau_1, \ldots, \tau_{\ell}] \in \tilde{\mathcal{A}}[m]}} \frac{1}{\sigma(\tau)} \nabla^{\ell + 1} L^{(n - 1 - b)} \prn[\bigg]{\sum_{l = 1}^{b + 1} \E{\tau_1}{l}{n}, \ldots, \sum_{l = 1}^{b + 1} \E{\tau_{\ell}}{l}{n}}\\ &\quad = - \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{1}{n}. \end{align}\]
We have proven that under the induction hypothesis for smaller \(m\), 22 holds.
By definition of the history terms in 15 and the induction hypothesis, to prove 23 it is enough to show 24 .
By the induction hypothesis and 22 already proven, the left-hand side of 24 is \[\begin{align} &\sum_{i = 0}^{m - 1} \sum_{(k_0, \ldots, k_{m - i - 1}) \in \mathcal{K}_{i, m - i - 1}} \frac{1}{k_0! \ldots k_{m - i - 1}!} \tag{47}\\ &\qquad \times \sum_{b = 0}^{n - l} \beta^b \nabla^{i + 1} L^{(n - l - b)} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - 1} times}} } \tag{48}\\ &\qquad + \beta \sum_{j = 2}^m \sum_{i = 0}^{m - j} \sum_{(k_0, \ldots, k_{m - i - j}) \in \mathcal{K}_{i, m - i - j}} \frac{1}{k_0! \ldots k_{m - i - j}!} \tag{49}\\ &\qquad \times \nabla^i \sum_{\tau_0 \in \tilde{\mathcal{A}}[j]} \frac{1}{\sigma(\tau_0)} \E{\tau_0}{1}{n - l} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i - j + 1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - j} times}} } \tag{50}\\ &\quad = \sum_{i = 0}^{m - 1} \sum_{(k_0, \ldots, k_{m - i - 1}) \in \mathcal{K}_{i, m - i - 1}} \frac{1}{k_0! \ldots k_{m - i - 1}!} \tag{51}\\ &\qquad \times \sum_{b = 0}^{n - l} \beta^b \nabla^{i + 1} L^{(n - l - b)} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - 1} times}} } \tag{52}\\ &\qquad - \beta \sum_{i = 0}^{m - 1} \sum_{(k_0, \ldots, k_{m - i - 1}) \in \mathcal{K}_{i, m - i - 1}} \frac{1}{k_0! \ldots k_{m - i - 1}!} \tag{53}\\ &\qquad \times \nabla^i \E{\rootedtree[]}{1}{n - l} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - 1} times}} } \tag{54}\\ &\qquad + \beta \sum_{j = 1}^m \sum_{i = 0}^{m - j} \sum_{(k_0, \ldots, k_{m - i - j}) \in \mathcal{K}_{i, m - i - j}} \frac{1}{k_0! \ldots k_{m - i - j}!} \tag{55}\\ &\qquad \times \sum_{\tau_0 \in \tilde{\mathcal{A}}[j]} \frac{1}{\sigma(\tau_0)} \nabla^i \E{\tau_0}{1}{n - l} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i - j + 1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - j} times}} }.\tag{56} \end{align}\] Inserting \(\E{\rootedtree[]}{1}{n - l} = \sum_{b = 0}^{n - l - 1} \beta^b \nabla L^{(n - l - 1 - b)}\), we see that the sum in 51 52 53 54 evaluates to \[\begin{align} &\sum_{i = 0}^{m - 1} \sum_{(k_0, \ldots, k_{m - i - 1}) \in \mathcal{K}_{i, m - i - 1}} \frac{1}{k_0! 
\ldots k_{m - i - 1}!} \\ &\qquad \times \nabla^{i + 1} L^{(n - l)} \brk[\bigg]{ \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[1]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_0 times}}, \ldots, \underbrace{ \sum_{l' = 1}^l \sum_{\tau \in \tilde{\mathcal{A}}[m - i]} \frac{1}{\sigma(\tau)} \E{\tau}{l'}{n} }_{\text{k_{m - i - 1} times}} }\\ &\quad = \sum_{i = 0}^{m - 1} \sum_{\substack{\tau = [\tau_1, \ldots, \tau_i] \in \tilde{\mathcal{A}}[m]}} \frac{1}{\sigma(\tau)} \nabla^{i + 1} L^{(n - l)} \brk[\bigg]{ \sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n} }, \end{align}\] By 3, the sum in 55 56 is \[\beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l + 1}{n}.\]
We have obtained that the left-hand side of 24 is equal to \[\begin{align} \sum_{i = 0}^{m - 1} \sum_{\substack{\tau \in \tilde{\mathcal{A}}[m]\\ \tau = [\tau_1, \ldots, \tau_i]}} \frac{1}{\sigma(\tau)} \nabla^{i + 1} L^{(n - l)} \brk[\bigg]{ \sum_{l' = 1}^l \E{\tau_1}{l'}{n}, \ldots, \sum_{l' = 1}^l \E{\tau_i}{l'}{n} } + \beta \sum_{\tau \in \tilde{\mathcal{A}}[m]} \frac{1}{\sigma(\tau)} \E{\tau}{l + 1}{n}. \end{align}\] Combining this with 21 , we see that the left-hand side of 24 is equal to the right-hand side of 24 . This completes the induction step and the whole proof of 3. ◻
Lemma 11. For all \(\btheta \in \mathcal{D}\) and all \(m \in \range{1}{\ord}\), we have \[\nabla^{r} \beacoef{m}{n}(\btheta) = O(1),\quad r \in \range{0}{2 \ord - m}.\]
Proof. If we express \(\beacoef{m}{n}(\btheta)\) through \(\crl[\big]{\nomemcoef{j}{n}}\) and their derivatives, every derivative that appears is of the form \(\nabla^l \nomemcoef{j}{n}\) with \(l + j \leq m\). Therefore, the derivatives appearing in \(\nabla^{r} \beacoef{m}{n}(\btheta)\) for \(r \in \range{0}{2 \ord - m}\) are of the form \(\nabla^l \nomemcoef{j}{n}\) with \(l + j \leq 2 \ord\), and the result follows immediately from 8. ◻
Lemma 12. For all \(r \in \range{1}{\ord}\) \[\label{eq:hb-no-memory-iteration} \btheta(t_{n + 1}) = \btheta(t_n) + \sum_{j = 1}^{r} h^j \nomemcoef{j}{n}(\btheta(t_n)) + O(h^{r + 1}).\qquad{(12)}\]
Proof. Differentiating, we get the exact equality on \(t \in [t_n, t_{n + 1}]\) \[\frac{\mathrm{d}^i \btheta}{\mathrm{d} t^i}(t) = \sum_{k_1, \ldots, k_i = 1}^{\ord} h^{k_1 + \ldots + k_i - i} (D_{k_1}^{(n)} \ldots D_{k_{i - 1}}^{(n)} \beacoef{k_i}{n}) (\btheta(t)),\quad1 \leq i \leq \ord + 1.\]
Taylor’s theorem gives \[\begin{align} &\btheta(t_{n + 1}) = \btheta(t_n) + \sum_{i = 1}^{\ord} \frac{h^i}{i!} \frac{\mathrm{d}^i \btheta}{\mathrm{d} t^i}(t_n^+) + \frac{h^{\ord + 1}}{(\ord + 1)!} \frac{\mathrm{d}^{\ord + 1} \btheta}{\mathrm{d} t^{\ord + 1}}(\tilde{t})\\ &\quad = \btheta(t_n) + \sum_{i = 1}^{\ord} \frac{1}{i!} \sum_{k_1, \ldots, k_i = 1}^{\ord} h^{k_1 + \ldots + k_i} (D_{k_1}^{(n)} \ldots D_{k_{i - 1}}^{(n)} \beacoef{k_i}{n}) (\btheta(t_n))\\ &\qquad+ \frac{1}{(\ord + 1)!} \sum_{k_1, \ldots, k_{\ord + 1} = 1}^{\ord} h^{k_1 + \ldots + k_{\ord + 1}} (D_{k_1}^{(n)} \ldots D_{k_{\ord}}^{(n)} \beacoef{k_{\ord + 1}}{n}) (\btheta(\tilde{t}))\\ &\quad = \btheta(t_n) + \sum_{i = 1}^{\ord} \frac{1}{i!} \sum_{\substack{1 \leq k_1, \ldots, k_i \leq \ord\\i \leq k_1 + \ldots + k_i \leq \ord}} h^{k_1 + \ldots + k_i} (D_{k_1}^{(n)} \ldots D_{k_{i - 1}}^{(n)} \beacoef{k_i}{n}) (\btheta(t_n))\\ &\qquad + \sum_{i = 1}^{\ord} \frac{1}{i!} \sum_{\substack{1 \leq k_1, \ldots, k_i \leq \ord\\\ord + 1 \leq k_1 + \ldots + k_i \leq i \ord}} h^{k_1 + \ldots + k_i} (D_{k_1}^{(n)} \ldots D_{k_{i - 1}}^{(n)} \beacoef{k_i}{n}) (\btheta(t_n))\\ &\qquad + \frac{1}{(\ord + 1)!} \sum_{k_1, \ldots, k_{\ord + 1} = 1}^{\ord} h^{k_1 + \ldots + k_{\ord + 1}} (D_{k_1}^{(n)} \ldots D_{k_{\ord}}^{(n)} \beacoef{k_{\ord + 1}}{n}) (\btheta(\tilde{t}))\\ &\quad \labrel{1}[=] \btheta(t_n) + \sum_{i = 1}^{\ord} \frac{1}{i!} \sum_{j = i}^{\ord} h^j \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = j}} (D^{(n)}_{k_1} \ldots D^{(n)}_{k_{i - 1}} \beacoef{k_i}{n})(\btheta(t_n)) + O(h^{\ord + 1})\\ &\quad = \btheta(t_n) + \sum_{j = 1}^{\ord} h^j \sum_{i = 1}^j \frac{1}{i!} \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = j}} (D^{(n)}_{k_1} \ldots D^{(n)}_{k_{i - 1}} \beacoef{k_i}{n})(\btheta(t_n)) + O(h^{\ord + 1})\\ &\quad \labrel{2}[=] \btheta(t_n) + \sum_{j = 1}^{\ord} h^j \nomemcoef{j}{n}(\btheta(t_n)) + O(h^{\ord + 1}), \end{align}\] where \(\tilde{t}\) is between \(t_n\) and \(t_{n + 1}\); in we used 11 and in we used 27 .
It remains to use the boundedness of \(\nomemcoef{j}{n}(\btheta)\) in the region of interest (8). ◻
Lemma 13. The following global bound holds: \[\sup_{n \in \range{0}{\lfloor T / h \rfloor}} \norm[\big]{\tilde{\btheta}^{(n)} - \btheta(t_n)} \leq \oC{mod-proof-gl} h^{\ord},\] where \(\nC{mod-proof-gl}\) is some constant.
Proof. Note that \(\btheta(0) = \tilde{\btheta}^{(0)}\), the local error bound is already proven in 12, and each \(\nomemcoef{j}{n}(\cdot)\) is Lipschitz because their derivatives are bounded by 8. So this result follows by the same standard argument as for 10 with \(\tilde{\btheta}^{(n)}\) replaced by \(\btheta(t_n)\) and \(\btheta^{(n)}\) replaced by \(\tilde{\btheta}^{(n)}\). ◻
To conclude the proof of 1, it remains to combine 13 with ?? .
Define \(\crl{v_{m, l}^{(n)}}\) by the recursion \[\label{eq:vm-l-rec} \begin{align} &v_{m, l}^{(n)} = \sum_{b = 0}^{n - l} \beta^b \sum_{l_1 = 1}^{l + b} v_{m - 1, l_1}^{(n)},\quad m \in \mathbb{Z}_{\geq 2},\\ &v_{1, l}^{(n)} = \sum_{b = 0}^{n - l} \beta^b. \end{align}\tag{57}\] It is immediate from 3 that \(v_m^{(n)} \equiv v_{m, 1}^{(n)}\).
Next, let \(\mathcal{c}_m \in \tilde{\mathcal{A}}[m]\) be the chain with \(m\) vertices (defined in 1.5). Applying 21 with \(l = 1\) gives \[\nabla^2 L \E{\mathcal{c}_{m - 1}}{1}{n} + \beta \E{\mathcal{c}_m}{2}{n} = \E{\mathcal{c}_m}{1}{n}.\] By 2 with \(l = a = 1\), we also have \[\E{\mathcal{c}_m}{2}{n} = \sum_{\mrk \in \mrkset_{\mathcal{c}_m}} \Emark{\mathcal{c}_m}{\mrk}{a}{n - 1 \to n}.\] Combining these and using the definition of \(\Emark{\mathcal{c}_m}{\mrk}{a}{n - 1 \to n}\) gives ?? .
It is clear from ?? that for each fixed \(m\) the sequence \(v_m^{(n)}\) can be bounded by a constant not depending on \(n\) (but depending on \(m\)), and from 57 that each \(v_{m, l}^{(n)}\) (and in particular \(v_m^{(n)}\)) is non-decreasing in \(n\). Hence, there is a limit \(v_m^{(\infty)} := \lim_{n \to \infty} v_m^{(n)}\). The initial condition is \(v_1^{(\infty)} = \lim_{n \to \infty} \sum_{b = 0}^{n - 1} \beta^b = (1 - \beta)^{-1}\). ?? is now immediate from ?? by taking the limit \(n \to \infty\).
From ?? , the generating function \(g_{\beta}(x)\) must satisfy the quadratic equation \(g_{\beta}(x) - (1 - \beta)^{-1} = (1 - \beta)^{-1} x g_{\beta}(x) + \beta (1 - \beta)^{-1} x g_{\beta}(x)^2\) together with \(g_{\beta}(0) = (1 - \beta)^{-1}\), which gives ?? .
The Narayana polynomials defined above satisfy the recurrence (e.g., coker2003enumerating?) \[N_m(\beta) = (1 + \beta) N_{m - 1}(\beta) + \beta \sum_{k = 1}^{m - 2} N_k(\beta) N_{m - k - 1}(\beta)\] for all \(m \geq 3\), which means that \(\tilde{N}_m(\beta) := N_m(\beta) / (1 - \beta)^{2 m + 1}\) satisfies \[\tilde{N}_m(\beta) = \frac{1 + \beta}{(1 - \beta)^2} \tilde{N}_{m - 1}(\beta) + \frac{\beta}{1 - \beta} \sum_{k = 1}^{m - 2} \tilde{N}_k(\beta) \tilde{N}_{m - k - 1}(\beta).\] But ?? can be rewritten as \[v_{m' + 1}^{(\infty)} = \frac{1 + \beta}{(1 - \beta)^2} v_{(m' - 1) + 1}^{(\infty)} + \frac{\beta}{1 - \beta} \sum_{j' = 1}^{m' - 2} v_{j' + 1}^{(\infty)} v_{(m' - j' - 1) + 1}^{(\infty)},\quad m' \geq 3,\] so \(\tilde{N}_m(\beta)\) and \(v_{m + 1}^{(\infty)}\) satisfy the same recurrence. It is easy to check that they coincide for \(m \in \crl{1, 2}\), which concludes the result.
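This identification can also be checked numerically. The sketch below (with an illustrative \(\beta\) and a large truncation index \(n\), both arbitrary) computes \(v_{m, l}^{(n)}\) directly from recursion 57 and compares \(v_{m + 1}^{(n)} = v_{m + 1, 1}^{(n)}\) with \(N_m(\beta) / (1 - \beta)^{2 m + 1}\).

```python
# Numerical check (illustrative beta and truncation n) that the limits of recursion (57)
# match the rescaled Narayana polynomials N_m(beta) / (1 - beta)^(2m + 1).
from math import comb

beta, n, m_max = 0.5, 400, 6

# v[m][l] = v_{m, l}^{(n)} computed from recursion (57); entries are indexed by l = 1..n.
v = {1: [None] + [sum(beta**b for b in range(n - l + 1)) for l in range(1, n + 1)]}
for m in range(2, m_max + 1):
    prefix = [0.0] * (n + 1)                      # prefix[L] = sum_{l1 <= L} v[m-1][l1]
    for L in range(1, n + 1):
        prefix[L] = prefix[L - 1] + v[m - 1][L]
    v[m] = [None] + [
        sum(beta**b * prefix[l + b] for b in range(n - l + 1)) for l in range(1, n + 1)
    ]

def narayana(m, t):
    # N_m(t) = sum_{k=1}^{m} (1/m) C(m, k) C(m, k-1) t^(k-1)
    return sum(comb(m, k) * comb(m, k - 1) * t ** (k - 1) for k in range(1, m + 1)) / m

for m in range(1, m_max):
    print(m, v[m + 1][1], narayana(m, beta) / (1 - beta) ** (2 * m + 1))  # columns agree
```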
From 3, \[\begin{align} q_{m, l}^{(n)} = \sum_{b = 0}^{n - l} \beta^b \prn[\bigg]{\sum_{l_1 = 1}^{l + b} \frac{1 - \beta^{n - l_1 + 1}}{1 - \beta}}^{m - 1}, \end{align}\] so their limit is \[q_{m, l}^{(\infty)} = \frac{1}{(1 - \beta)^{m - 1}} \sum_{b = 0}^{\infty} \beta^b (l + b)^{m - 1}.\] In particular, \(q_{m + 1, 1}^{(\infty)} = \frac{1}{(1 - \beta)^m} \sum_{b = 0}^{\infty} \beta^b (1 + b)^m = \frac{1}{(1 - \beta)^m} \frac{1}{(1 - \beta)^{m + 1}} A_m(\beta)\).
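The underlying classical identity \(\sum_{b \geq 0} \beta^b (1 + b)^m = A_m(\beta) / (1 - \beta)^{m + 1}\) can likewise be checked numerically; the sketch below (illustrative \(\beta\) and truncation) evaluates \(A_m\) via the standard alternating-sum formula for Eulerian numbers.

```python
# Numerical check (illustrative parameters) of
# sum_{b>=0} beta^b (1 + b)^m = A_m(beta) / (1 - beta)^(m+1),
# which gives q_{m+1, 1}^{(infty)} = A_m(beta) / (1 - beta)^(2m + 1) as stated above.
from math import comb

beta, B = 0.5, 400                                # B truncates the infinite sum over b

def eulerian_poly(m, t):
    # A_m(t) = sum_{k=0}^{m-1} <m, k> t^k, via the explicit alternating-sum formula
    return sum(
        sum((-1) ** j * comb(m + 1, j) * (k + 1 - j) ** m for j in range(k + 1)) * t**k
        for k in range(m)
    )

for m in range(1, 7):
    lhs = sum(beta**b * (1 + b) ** m for b in range(B))
    rhs = eulerian_poly(m, beta) / (1 - beta) ** (m + 1)
    print(m, lhs, rhs)                            # the two columns should agree
```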
Lemma 14. Neglecting non-principal terms, the following formula holds: \[\beacoef{m}{n}(\btheta) = \sum_{l = 1}^m \frac{(-1)^{l + 1}}{l} \sum_{\substack{k_1, \ldots, k_l \geq 1\\k_1 + \ldots + k_l = m}} \nabla \nomemcoef{k_1}{n} \ldots \nabla \nomemcoef{k_{l - 1}}{n} \nomemcoef{k_l}{n}(\btheta) + \np.\]
Proof. In other words, we need to show \[\begin{align} \beacoef{m}{n}(\btheta) &= \sum_{\mathcal{s}: w(\mathcal{s}) = m} C_{l(\mathcal{s})} \mathcal{s} + \np, \end{align}\] with \[\label{eq:def-of-tree-coefs} C_l := \begin{cases} 1 & \text{if l = 1,}\\ (-1)^{l + 1} \sum_{i = 2}^l \frac{(-1)^i}{i!} \sum_{\substack{l_1, \ldots, l_i \geq 1\\l_1 + \ldots + l_i = l}} \frac{1}{l_1 \ldots l_i} & \text{if l \geq 2,} \end{cases}\tag{58}\] where by \(\mathcal{s}\) we denote expressions of the form \(\nabla \nomemcoef{k_1}{n} \ldots \nabla \nomemcoef{k_{l - 1}}{n} \nomemcoef{k_l}{n}(\btheta)\), by \(w(\mathcal{s})\) their weight, which is defined as \(k_1 + \ldots + k_l\), and by \(l(\mathcal{s})\) their length, which is the number of nodes \(\nomemcoef{k_i}{n}\) (\(l\) in this case). For two such expressions \(\mathcal{s}_1\) and \(\mathcal{s}_2\), we will write \(\mathcal{s}_1 \mathcal{s}_2\) for their concatenation, for example if \(\mathcal{s}_1 = \nomemcoef{3}{n}(\btheta)\) and \(\mathcal{s}_2 = \nabla \nomemcoef{1}{n}(\btheta) \nomemcoef{2}{n}(\btheta)\), then \(\mathcal{s}_1 \mathcal{s}_2 = \nabla \nomemcoef{3}{n}(\btheta) \nabla \nomemcoef{1}{n}(\btheta) \nomemcoef{2}{n}(\btheta)\).
We will argue by induction over \(m\). For \(m = 1\) the statement is obvious. Ignoring non-principal terms, we can write \[\beacoef{m}{n}(\btheta) = \nomemcoef{m}{n}(\btheta) - \sum_{i = 2}^m \frac{1}{i!} \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = m}} \nabla \beacoef{k_1}{n} \ldots \nabla \beacoef{k_{i - 1}}{n} \beacoef{k_i}{n}(\btheta) + \np.\] Now using the induction assumption, we rewrite it as \[\begin{align} \beacoef{m}{n}(\btheta) &= \nomemcoef{m}{n}(\btheta) - \sum_{i = 2}^m \frac{1}{i!} \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = m}} \sum_{\substack{\mathcal{s}_1, \ldots, \mathcal{s}_i:\\w(\mathcal{s}_1) = k_1, \ldots, w(\mathcal{s}_i) = k_i}} C_{l(\mathcal{s}_1)} \ldots C_{l(\mathcal{s}_i)} \mathcal{s}_1 \ldots \mathcal{s}_i + \np\\ &\labrel{1}[=] \nomemcoef{m}{n}(\btheta) - \sum_{i = 2}^m \frac{1}{i!} \sum_{\substack{k_1, \ldots, k_i \geq 1\\k_1 + \ldots + k_i = m}} \sum_{\substack{\mathcal{s}_1, \ldots, \mathcal{s}_i:\\w(\mathcal{s}_1) = k_1, \ldots, w(\mathcal{s}_i) = k_i}} \frac{(-1)^{l(\mathcal{s}_1) + \ldots + l(\mathcal{s}_i) + i}}{l(\mathcal{s}_1) \ldots l(\mathcal{s}_i)} \mathcal{s}_1 \ldots \mathcal{s}_i + \np\\ &= \nomemcoef{m}{n}(\btheta) - \sum_{\mathcal{s}: w(\mathcal{s}) = m} \sum_{i = 2}^m \frac{(-1)^i}{i!} \sum_{\substack{\mathcal{s}_1, \ldots, \mathcal{s}_i\text{ non-empty}:\\\mathcal{s} = \mathcal{s}_1\ldots \mathcal{s}_i}} \frac{(-1)^{l(\mathcal{s}_1) + \ldots + l(\mathcal{s}_i)}}{l(\mathcal{s}_1) \ldots l(\mathcal{s}_i)} \mathcal{s} + \np\\ &\labrel{2}[=] \sum_{\mathcal{s}: w(\mathcal{s}) = m} C_{l(\mathcal{s})} \mathcal{s} + \np. \end{align}\] where in we used that \(C_l = (-1)^{l + 1} / l\) which is proven below, in we used the definition of \(C_l\) given in 58 .
It remains to prove \(C_l = (-1)^{l + 1} / l\), or, equivalently, that for \(l \geq 2\) we have \[\label{eq:jhfhnbc} \sum_{i = 1}^l \frac{(-1)^i}{i!} \sum_{\substack{l_1, \ldots, l_i \geq 1\\l_1 + \ldots + l_i = l}} \frac{1}{l_1 \ldots l_i} \overset{?}{=} 0.\tag{59}\] To do this, note that \[- \ln(1 - x) = \sum_{n = 1}^{\infty} \frac{x^n}{n},\] which means that the coefficient before \(x^l\) in the power series of \(\brk[\big]{- \ln(1 - x)}^i\) is \[\sum_{\substack{l_1, \ldots, l_i \geq 1\\l_1 + \ldots + l_i = l}} \frac{1}{l_1 \ldots l_i},\] and therefore the left-hand side of 59 is the coefficient before \(x^l\) in the power series \[\sum_{i = 1}^{\infty} \frac{(-1)^i}{i!} \brk[\big]{- \ln(1 - x)}^i = \exp \crl{\ln(1 - x)} - 1 = -x\] (terms with \(i > l\) do not contribute to this coefficient, since \(\brk[\big]{- \ln(1 - x)}^i = O(x^i)\)), which is zero for \(l \geq 2\). ◻
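The vanishing in 59 is also easy to confirm by brute force for small \(l\); the sketch below sums directly over all compositions of \(l\) (the helper `compositions` is ad hoc).

```python
# Brute-force check of identity (59) for small l: the alternating sum over
# compositions of l vanishes, matching the generating-function argument above.
from math import factorial

def compositions(total):
    # ordered tuples of positive integers summing to `total` (ad hoc helper)
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first,) + rest

for l in range(2, 9):
    s = 0.0
    for comp in compositions(l):
        prod = 1
        for part in comp:
            prod *= part
        s += (-1) ** len(comp) / (factorial(len(comp)) * prod)
    print(l, s)   # zero up to floating-point rounding
```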
?? is immediate from 14 and 2. Taking the limit as \(n \to \infty\), \[\zcoef{m}{\infty} := \lim_{n \to \infty} \zcoef{m}{n} = \sum_{l = 1}^m \frac{(-1)^{l + 1}}{l} \sum_{\substack{k_1, \ldots, k_l \geq 1\\k_1 + \ldots + k_l = m}} p^{(\infty)}_{k_1} \ldots p^{(\infty)}_{k_l},\] where \[p^{(\infty)}_k := \begin{cases} - (1 - \beta)^{-1}, &\text{if } k = 1,\\ - \beta v_k^{(\infty)}, &\text{if } k \geq 2. \end{cases}\]
The generating function \(\bar{g}_{\beta}(x) := \sum_{k = 0}^{\infty} \zcoef{k + 1}{\infty} x^k\) satisfies \[\brk{\bar{g}_{\beta}(x)}_k = \sum_{l = 0}^k \frac{(-1)^l}{l + 1} \brk[\big]{\crl{-1 - \beta g_{\beta}(x)}^{l + 1}}_{k - l},\] where \(\brk{g(x)}_k\) denotes the coefficient before \(x^k\) in the power series of \(g(x)\). Multiplying both sides by \(x^k\), summing over \(k\) and changing the order of summation gives \[\begin{align} \sum_{k = 0}^{\infty} \brk{\bar{g}_{\beta}(x)}_k x^k &= - \sum_{k = 0}^{\infty} \sum_{l = 0}^k \frac{1}{l + 1} \brk[\big]{\crl{1 + \beta g_{\beta}(x)}^{l + 1}}_{k - l} x^k\\ &= - \sum_{l = 0}^{\infty} \frac{1}{l + 1} x^l \sum_{k = l}^{\infty} \brk[\big]{\crl{1 + \beta g_{\beta}(x)}^{l + 1}}_{k - l} x^{k - l}\\ &= - \sum_{l = 0}^{\infty} \frac{1}{l + 1} x^l \crl{1 + \beta g_{\beta}(x)}^{l + 1} = - \frac{1}{x} \sum_{l = 1}^{\infty} \frac{1}{l} \crl{x + \beta x g_{\beta}(x)}^l\\ &= \frac{1}{x} \ln \prn[\bigg]{\frac{1 + \beta - x + \sqrt{(1 - \beta - x)^2 - 4 \beta x}}{2}}, \end{align}\] as desired.
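As a numerical cross-check (with an illustrative \(\beta\), a small evaluation point \(x\), and a finite truncation order, all arbitrary), the sketch below rebuilds \(\zcoef{m}{\infty}\) from the composition formula above, using \(v_{m + 1}^{(\infty)} = N_m(\beta) / (1 - \beta)^{2 m + 1}\) established earlier, and compares the truncated series with the closed form.

```python
# Numerical cross-check (illustrative beta, small x, truncation order M) of the closed form
# for the generating function of z_m^(infty) derived above.  Uses the Narayana formula for
# v_{m+1}^(infty) proven earlier; `compositions` is an ad hoc helper.
from math import comb, log, sqrt

beta, x, M = 0.3, 0.02, 14

def narayana(m, t):
    return sum(comb(m, k) * comb(m, k - 1) * t ** (k - 1) for k in range(1, m + 1)) / m

v = {1: 1 / (1 - beta)}                           # v_1^(infty)
for m in range(1, M):
    v[m + 1] = narayana(m, beta) / (1 - beta) ** (2 * m + 1)

p = {1: -1 / (1 - beta)}                          # p_k^(infty) as defined above
for k in range(2, M + 1):
    p[k] = -beta * v[k]

def compositions(total):
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first,) + rest

def z_inf(m):
    # z_m^(infty) = sum over compositions (k_1, ..., k_l) of m of (-1)^(l+1)/l * p_{k_1}...p_{k_l}
    total = 0.0
    for ks in compositions(m):
        term = 1.0
        for k in ks:
            term *= p[k]
        total += (-1) ** (len(ks) + 1) / len(ks) * term
    return total

series = sum(z_inf(m + 1) * x**m for m in range(M))
closed = log((1 + beta - x + sqrt((1 - beta - x) ** 2 - 4 * beta * x)) / 2) / x
print(series, closed)                             # should agree to high accuracy
```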
Let \(\batchcntDef := n + 1\) be the batch count.
Recall that \[L^{(k)}(\btheta) = \frac{1}{\batchsize} \sum_{r = k \batchsize + 1}^{k \batchsize + \batchsize} \ell_{\pi(r)}(\btheta),\quad k \in \range{0}{n},\] where \(\crl{\ell_s}_{s = 1}^{(n + 1) \batchsize}\) are per-sample losses and \(\pi\) is a random permutation of \(\range{1}{(n + 1) \batchsize}\), distributed uniformly over all \(((n + 1) \batchsize)!\) such permutations.
Note that \[\begin{align} \Eletpi \nabla^2 (\ell_{\pi(1)} - L) \nabla (\ell_{\pi(2)} - L) &= - \frac{\nabla \trace \mSigma}{2 (\batchcnt \batchsize - 1)},\\ \Eletpi \nabla^2 (\ell_{\pi(1)} - L) \nabla (\ell_{\pi(1)} - L) &= \frac{\nabla \trace \mSigma}{2}, \end{align}\] where \(\mSigma\) is the empirical covariance matrix of per-sample gradients defined in 7 .
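Both identities can be verified exactly on synthetic quadratic per-sample losses, for which gradients and Hessians are available in closed form. The sketch below does this with arbitrary sizes; it assumes \(\mSigma\) is the empirical covariance of per-sample gradients normalized by \(\batchcnt \batchsize\), which is how we read 7.

```python
# Exact check of the two permutation expectations above on synthetic quadratic per-sample
# losses (illustrative sizes).  Assumes Sigma is the empirical covariance of per-sample
# gradients with 1/N normalization, N = batchcnt * batchsize.
import numpy as np

rng = np.random.default_rng(0)
N, d = 12, 4
A = rng.standard_normal((N, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                    # Hessians of the quadratic losses l_s
b = rng.standard_normal((N, d))
theta = rng.standard_normal(d)

grads = A @ theta + b                                 # per-sample gradients at theta
G = grads - grads.mean(0)                             # centered gradients:  grad(l_s - L)
H = A - A.mean(0)                                     # centered Hessians:   hess(l_s - L)

grad_tr_sigma = 2 * np.einsum('sij,sj->i', H, G) / N  # grad of tr Sigma in this model

# E over pi of hess(l_{pi(1)} - L) grad(l_{pi(2)} - L) = average over ordered distinct pairs
pair_mean = (np.einsum('sij,tj->i', H, G) - np.einsum('sij,sj->i', H, G)) / (N * (N - 1))
same_mean = np.einsum('sij,sj->i', H, G) / N          # the pi(1) = pi(1) case

print(np.allclose(pair_mean, -grad_tr_sigma / (2 * (N - 1))))   # True
print(np.allclose(same_mean, grad_tr_sigma / 2))                 # True
```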
Lemma 15. The following expressions hold: \[\Eletpi \nabla^2 L^{(0)} \nabla L^{(1)} = \nabla^2 L \nabla L - \frac{\nabla \trace \mSigma}{2 (\batchcnt \batchsize - 1)},\quad \Eletpi \nabla^2 L^{(0)} \nabla L^{(0)} = \nabla^2 L \nabla L + \frac{\batchcnt - 1}{2 (\batchcnt \batchsize - 1)} \nabla \trace \mSigma.\]
Proof. Indeed, \[\begin{align} &\Eletpi \nabla^2 L^{(0)} \nabla L^{(1)} = \Eletpi \frac{1}{\batchsize} \sum_{r = 1}^{\batchsize} \nabla^2 \ell_{\pi(r)} \frac{1}{\batchsize} \sum_{s = \batchsize + 1}^{2 \batchsize} \nabla \ell_{\pi(s)}\\ &\quad = \Eletpi \nabla^2 \ell_{\pi(1)} \nabla \ell_{\pi(2)} = \nabla^2 L \nabla L + \Eletpi \nabla^2 (\ell_{\pi(1)} - L) \nabla (\ell_{\pi(2)} - L)\\ &\quad = \nabla^2 L \nabla L - \frac{\nabla \trace \mSigma}{2 (\batchcnt \batchsize - 1)} \end{align}\] and \[\begin{align} &\Eletpi \nabla^2 L^{(0)} \nabla L^{(0)} = \Eletpi \frac{1}{\batchsize} \sum_{r = 1}^{\batchsize} \nabla^2 \ell_{\pi(r)} \frac{1}{\batchsize} \sum_{s = 1}^{\batchsize} \nabla \ell_{\pi(s)}\\ &\quad = \frac{1}{\batchsize} \Eletpi \nabla^2 \ell_{\pi(1)} \nabla \ell_{\pi(1)} + \frac{\batchsize - 1}{\batchsize} \Eletpi \nabla^2 \ell_{\pi(1)} \nabla \ell_{\pi(2)}\\ &\quad = \frac{1}{\batchsize} \prn[\bigg]{\nabla^2 L \nabla L + \frac{\nabla \trace \mSigma}{2}} + \frac{\batchsize - 1}{\batchsize} \prn[\bigg]{\nabla^2 L \nabla L - \frac{\nabla \trace \mSigma}{2 (\batchcnt \batchsize - 1)}}\\ &\quad = \nabla^2 L \nabla L + \frac{\batchcnt - 1}{2 (\batchcnt \batchsize - 1)} \nabla \trace \mSigma \end{align}\] completing the proof. ◻
Lemma 16. We have \[\begin{align} \Eletpi \nomemcoef{2}{n}(\btheta) &= \prn[\big]{\twoeq(\beta) + \twoneq(\beta)} \nabla^2 L \nabla L + \prn[\bigg]{\twoeq(\beta) \frac{\batchcnt - 1}{2 (\batchcnt \batchsize - 1)} - \twoneq(\beta) \frac{1}{2 (\batchcnt \batchsize - 1)}} \nabla \trace \mSigma\\ &= \prn[\bigg]{- \frac{\beta}{(1 - \beta)^3} + o_n(1)} \nabla^2 L \nabla L + \prn[\bigg]{- \frac{\beta}{2 (1 - \beta)^2 (1 + \beta)} + o_n(1)} \frac{\nabla \trace \mSigma}{\batchsize}, \end{align}\] where \(o_n(1)\) are terms that go to zero as \(n \to \infty\) (for fixed \(\beta\)) regardless of \(\batchsize \in \range{1}{n + 1}\), and \[\begin{align} \twoeq(\beta) &:= - \frac{\beta \brk{1 - \beta^n (1 + \beta) + \beta^{2 n + 1}}}{(1 - \beta)^2 (1 + \beta)},\label{eq:twoeq-def}\\ \twoneq(\beta) &:= \frac{- 2 \beta^2 + 2 n (1 - \beta^2) \beta^{n + 1} + 2 \beta^{2 n + 2}}{(1 - \beta)^3 (1 + \beta)}.\label{eq:twoneq-def} \end{align}\]
Proof. Using 15, we can write \[\begin{align} &\Eletpi \nomemcoef{2}{n}(\btheta) = - \beta \Eletpi \sum_{b = 0}^{n - 1} \beta^{b} \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} \nabla^2 L^{(n - 1 - b)} \nabla L^{(n - l' - b')}\\ &\quad = \twoeq(\beta) \Eletpi \nabla^2 L^{(0)} \nabla L^{(0)} + \twoneq(\beta) \Eletpi \nabla^2 L^{(0)} \nabla L^{(1)}\\ &\quad = \twoeq(\beta) \prn[\bigg]{\nabla^2 L \nabla L + \frac{\batchcnt - 1}{2 (\batchcnt \batchsize - 1)} \nabla \trace \mSigma} + \twoneq(\beta) \prn[\bigg]{\nabla^2 L \nabla L - \frac{\nabla \trace \mSigma}{2 (\batchcnt \batchsize - 1)}}\\ &\quad = \prn[\big]{\twoeq(\beta) + \twoneq(\beta)} \nabla^2 L \nabla L + \prn[\bigg]{\twoeq(\beta) \frac{\batchcnt - 1}{2 (\batchcnt \batchsize - 1)} - \twoneq(\beta) \frac{1}{2 (\batchcnt \batchsize - 1)}} \nabla \trace \mSigma\\ &\quad = \prn[\bigg]{- \frac{\beta}{(1 - \beta)^3} + o_n(1)} \nabla^2 L \nabla L + \prn[\bigg]{- \frac{\beta}{2 (1 - \beta)^2 (1 + \beta)} + o_n(1)} \frac{\nabla \trace \mSigma}{\batchsize}, \end{align}\] where \(\twoeq(\beta)\) and \(\twoneq(\beta)\) can be calculated as \[\begin{align} \twoeq(\beta) &:= - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{l' = 1}^{b + 1} \beta^{b + 1 - l'} = - \frac{\beta \brk{1 - \beta^n (1 + \beta) + \beta^{2 n + 1}}}{(1 - \beta)^2 (1 + \beta)} = \underset{n \to \infty}{\longrightarrow} - \frac{\beta}{(1 - \beta)^2 (1 + \beta)},\\ \twoneq(\beta) &:= - \beta \sum_{b = 0}^{n - 1} \beta^b \sum_{l' = 1}^{b + 1} \sum_{b' = 0}^{n - l'} \beta^{b'} - \twoeq(\beta)\\ &= \frac{- 2 \beta^2 + 2 n (1 - \beta^2) \beta^{n + 1} + 2 \beta^{2 n + 2}}{(1 - \beta)^3 (1 + \beta)} \underset{n \to \infty}{\longrightarrow} - \frac{2 \beta^2}{(1 - \beta)^3 (1 + \beta)}.\qedhere \end{align}\] ◻
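The closed forms for \(\twoeq(\beta)\) and \(\twoneq(\beta)\) can be checked directly against their defining finite sums; below is a quick numerical check with arbitrary \(\beta\) and \(n\).

```python
# Quick check (illustrative beta and n) of the closed forms for the two coefficients above.
beta, n = 0.7, 25

direct_eq = -beta * sum(
    beta**b * sum(beta ** (b + 1 - lp) for lp in range(1, b + 2)) for b in range(n)
)
closed_eq = -beta * (1 - beta**n * (1 + beta) + beta ** (2 * n + 1)) / ((1 - beta) ** 2 * (1 + beta))

triple = -beta * sum(
    beta**b * sum(sum(beta**bp for bp in range(n - lp + 1)) for lp in range(1, b + 2))
    for b in range(n)
)
direct_neq = triple - direct_eq
closed_neq = (-2 * beta**2 + 2 * n * (1 - beta**2) * beta ** (n + 1) + 2 * beta ** (2 * n + 2)) / (
    (1 - beta) ** 3 * (1 + beta)
)

print(direct_eq, closed_eq)     # should agree
print(direct_neq, closed_neq)   # should agree
```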
Authors are listed in alphabetical order. We thank Boris Hanin for insightful comments and discussions. Cattaneo gratefully acknowledges financial support from the National Science Foundation through grants SES-2019432 and SES-2241575.