June 04, 2024
This paper proposes a ridgeless kernel method for solving infinite-horizon, deterministic, continuous-time models in economic dynamics, formulated as systems of differential-algebraic equations with asymptotic boundary conditions (e.g., transversality). Traditional shooting methods enforce the asymptotic boundary conditions by targeting a known steady state—which is numerically unstable, hard to tune, and unable to address cases with steady-state multiplicity. Instead, our approach solves the underdetermined problem without imposing the asymptotic boundary condition, using regularization to select the unique solution fulfilling transversality among admissible trajectories. In particular, ridgeless kernel methods recover this path by selecting the minimum norm solution, coinciding with the non-explosive trajectory. We provide theoretical guarantees showing that kernel solutions satisfy asymptotic boundary conditions without imposing them directly, and we establish a consistency result ensuring convergence within the solution concept of differential-algebraic equations. Finally, we illustrate the method in canonical models and demonstrate its ability to handle problems with multiple steady states.
Keywords: Kernel Methods; Economic Dynamics; Machine Learning; Growth; Inductive Bias.
JEL codes: E13, C45, C63.
This paper proposes using ridgeless kernel regression to solve a broad class of infinite-horizon, deterministic, continuous-time models in economic dynamics. The main computational challenge in these models is satisfying asymptotic boundary conditions, typically arising from transversality conditions in the embedded optimal control problem. Unlike shooting algorithms that aim toward a known steady state, we show that kernel methods can satisfy these asymptotic conditions by selecting the least explosive trajectory among all candidate solutions—and without even calculating the steady-state, let alone imposing it as a condition. This yields robust and computationally efficient algorithms, even under steady-state multiplicity and hysteresis, more stable than parametric approaches such as deep learning, and often easier to tune than classic shooting methods.
Beyond capturing forward-looking behavior, asymptotic boundary conditions are essential for well-posedness in the sense of Hadamard [1], [2]—ensuring existence, uniqueness, and continuous dependence on initial conditions. Without them, the system may admit a continuum of trajectories that satisfy both the initial conditions and the DAEs, thereby violating well-posedness.
The central idea of this paper is to treat the problem as underdetermined: impose initial conditions and laws of motion, but not the asymptotic boundary conditions, and then use regularization to select among the resulting trajectories.1 For a large class of models, the unique valid solution is also the only non-explosive one; all others fail transversality. By choosing an appropriate function norm and penalizing it, kernel methods recover this non-explosive solution—thereby satisfying the asymptotic boundary conditions.
Kernel Methods. Kernel methods are a natural fit for this task since they provide a formal framework for defining and penalizing function norms through the theory of the Reproducing Kernel Hilbert Space (RKHS) induced by the choice of the kernel. In particular, we focus on ridgeless kernel methods [6], which select the minimum norm solution among all solutions that perfectly satisfy the differential equation at some finite number of points.
A “kernel machine” is a non-parametric approximation in which a function is represented as a weighted sum of a distance-like function (the kernel) evaluated against all existing data.2 For goals ranging from empirical risk minimization to solving systems of functional equations, kernels and the RKHS map an infinite-dimensional problem in a function space to a finite-dimensional problem in the weights attached to each data point.
Related Work. Our approach is connected both to traditional methods for solving linearized systems and to a recent literature that uses ML to solve nonlinear optimal control problems.
Perturbation solutions to models of economic dynamics use classic stability analysis from linear-quadratic (LQ) control to find solutions that satisfy asymptotic boundary conditions (e.g., [9]). These methods exploit the linearity of time-invariant policies in LQ control, ensuring stability by ruling out explosive roots inconsistent with transversality and selecting the unique stabilizing solution. Our paper draws inspiration from this broader approach: algorithms that select non-explosive roots lie on the solution manifold and automatically satisfy transversality conditions.
The use of ML methods is becoming increasingly popular for solving and estimating economic models. Applications span a wide range, including wealth inequality [10], financial frictions [11], the heterogeneous impacts of climate change [12], portfolio choice problems [13], heterogeneous agent New Keynesian models [14], human capital accumulation in the labor market [15], and labor market dynamics in search and matching environments [16]. These models often rely on neural networks [5], [17]–[19] and even some work with Gaussian Processes [20].
Recent work in optimal control explores how to achieve stable solutions using ML-based methods [21]–[24]. More directly within economics, [5] discusses the intuitive connection between the inductive bias of deep neural networks and turnpikes [25] in dynamic economic models, but does not provide a formal theory or consider kernel methods. Our paper contributes to this literature, providing a formal argument on why the inductive bias of ML algorithms promotes stability in infinite-horizon control when using particular kernel machines.
Contributions. Core results of our paper include:
Inductive bias alignment: theoretical and empirical evidence that the minimum-norm implicit bias of kernel methods aligns with many asymptotic boundary conditions found in economic problems;
Learning the right set of steady states: evidence that kernel machines identify the steady states of dynamical systems—the ones corresponding to the optimal solution—which leads to highly accurate generalization outside the training data, even without enforcing asymptotic boundary conditions;
Consistency of ML estimates: guarantees that the approximation error of our kernel methods can be bounded, with the method converging to the true minimum norm solution (and thus solutions satisfying asymptotic boundary conditions) as training data increases; and
Robustness and speed: demonstrations that kernel machines can be competitive in both speed and robustness with traditional methods for modeling economic systems, even on small-scale problems.
Structure. The remainder of this paper is structured as follows. Section 2 describes the class of economic models and provides assumptions on primitives under which non-solutions are explosive and violate transversality. Section 3 describes how kernel methods map to our class of problems and discusses cases where a min norm solution is sufficient to fulfill transversality. [thm:consistency] provides a consistency result showing how kernel methods converge to the min norm solution, aligning with the solution concept of the DAE itself. Results from our core applications and future directions are presented in Section 4, while Section 5 concludes the paper.
We focus on an important class of dynamic problems in economics and finance: deterministic, continuous-time systems that arise from discounted infinite-horizon optimal control problems, together with algebraic constraints that encode conditions such as instantaneous market clearing.3
Problem class. In these models, the first-order necessary conditions of the underlying decision problems can be stacked into a system of ordinary differential and algebraic equations (DAEs). The variables can be partitioned into three vectors: state variables \({\boldsymbol{x}}(t)\in\mathbb{R}^{{M}}\), with an initial condition \({\boldsymbol{x}}_0\); co-state variables \({\boldsymbol{\mu}}(t)\in\mathbb{R}^{{M}}\), with accompanying asymptotic boundary conditions that typically arise from the embedded control problem; and jump variables \({\boldsymbol{y}}(t)\in\mathbb{R}^{{P}}\), which relate the state and co-state variables (e.g., co-state \(=\) the marginal utility of consumption) and/or impose intratemporal constraints (e.g., market clearing conditions).4 Canonical examples of such equations arise from applying Pontryagin’s Maximum Principle to the present-value Hamiltonian of a dynamic optimization problem, as in [29]. In that case, \({\boldsymbol{\mu}}(t)\) represents the present-value Lagrange multipliers associated with the state variables \({\boldsymbol{x}}(t)\). The dynamical system with primitives \({\boldsymbol{F}}, {\boldsymbol{G}}: \mathbb{R}^{{M}}\times \mathbb{R}^{{M}}\times\mathbb{R}^{{P}} \to \mathbb{R}^{{M}}\) and \({\boldsymbol{H}}: \mathbb{R}^{{M}}\times \mathbb{R}^{{M}}\times\mathbb{R}^{{P}} \to \mathbb{R}^{{P}}\) with a discount rate \(r > 0\) is \[\begin{align} \dot{{\boldsymbol{x}}}(t) &= {\boldsymbol{F}}({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)),\tag{1}\\ \dot{{\boldsymbol{\mu}}}(t) &= r {\boldsymbol{\mu}}(t) - {\boldsymbol{\mu}}(t) \odot {\boldsymbol{G}}({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)),\tag{2}\\ \mathbf{0} &= {\boldsymbol{H}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)),\tag{3}\\ \intertext{subject to {M} initial values, and {M} asymptotic boundary conditions (i.e., transversality conditions)} {\boldsymbol{x}}(0) &= {\boldsymbol{x}}_0 \tag{4}\\ \mathbf{0} &= \lim_{t\to\infty}e^{-r t} {\boldsymbol{x}}(t) \odot {\boldsymbol{\mu}}(t) \tag{5} \end{align}\] The symbol \(\odot\) denotes element-wise multiplication between vectors of the same dimension.
This system is an autonomous, semi-explicit differential-algebraic equation (DAE) with a mix of initial conditions, 4 , and asymptotic boundary conditions, 5 .
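To fix ideas computationally, the sketch below stacks the residuals of 1 2 3 at a single point in time for user-supplied primitives \({\boldsymbol{F}}\), \({\boldsymbol{G}}\), and \({\boldsymbol{H}}\); the function names and signatures are illustrative, not taken from any particular library.

```python
# A schematic residual for the semi-explicit DAE (1)-(3): any candidate
# trajectory (x, mu, y) with derivatives (xdot, mudot) must drive these
# stacked residuals to zero at every point in time. The function names and
# signatures are illustrative, not taken from any particular library.
import numpy as np

def dae_residual(x, mu, y, xdot, mudot, F, G, H, r):
    """Stacked residuals of eqs. (1)-(3) at a single time point."""
    res_state = xdot - F(x, mu, y)                       # eq. (1)
    res_costate = mudot - (r * mu - mu * G(x, mu, y))    # eq. (2)
    res_algebraic = H(x, mu, y)                          # eq. (3)
    return np.concatenate([res_state, res_costate, res_algebraic])
```

Shooting methods drive these residuals to zero by integrating forward from a guessed \({\boldsymbol{\mu}}(0)\); the kernel method of Section 3 instead enforces them on a finite set of training points.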
Example: Neoclassical Growth Model. The canonical example of this class in macroeconomics is the neoclassical growth model. Fitting to our notation define: capital \(x(t)\), consumption \(y(t)\), flow utility \(\log(y)\), present-value co-state variable \(\mu(t)\), discount rate \(r > 0\), depreciation rate \(0 < \delta < 1\), and a monotonically increasing and strictly concave production function \(f(x)\) where \(f(0) = 0\). Writing the standard equations in our notation with \(x, \mu, y \in \mathbb{R}^1\),5
\[\begin{align} \dot{x}(t) &= f\left(x(t)\right) - \delta x(t) - y(t):= F(x(t), \mu(t), y(t))\tag{6} \\ \dot{\mu}(t) &= r \mu(t) - \mu(t) \underbrace{\left[f'\left(x(t)\right)- \delta \right]}_{:= G(x(t), \mu(t), y(t))}\tag{7} \\ 0 &= \mu(t)y(t) -1:= H(x(t), \mu(t), y(t))\tag{8}\\ x(0) &= x_0\tag{9}\\ 0 &= \lim_{t \to \infty} e^{-rt} \mu(t) x(t)\tag{10} \end{align}\] Next we will analyze the key conditions required for our algorithm, and demonstrate the intuition with this running example.
A key necessary condition for the DAE to be well-posed is that, for the given \({\boldsymbol{x}}_0\), there exists a unique \({\boldsymbol{\mu}}(0)\) fulfilling the asymptotic boundary condition 5 . Assumption 1 provides further conditions to ensure that there exists a unique \({\boldsymbol{y}}(0)\) fulfilling 3 given \({\boldsymbol{x}}(0)\) and \({\boldsymbol{\mu}}(0)\).
Assumption 1 (Conditions for Well-posedness and Regularity). Assume that
\({\boldsymbol{F}}\), \({\boldsymbol{G}}\), and \({\boldsymbol{H}}\) in 1 2 3 are Lipschitz with respect to \(\Vert \cdot \Vert_\infty\);
\({\boldsymbol{F}}\) and \({\boldsymbol{G}}\) have Lipschitz first derivatives and \({\boldsymbol{H}}\) has Lipschitz first and second derivatives;
The Jacobian of \({\boldsymbol{H}}\) with respect to \({\boldsymbol{y}}\) is nonsingular along the relevant trajectories: \(\det\!\big(\nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}})\big)\neq 0\), and its inverse is Lipschitz continuous with a Lipschitz first derivative in the domain \(t \in [0, T]\).
By the implicit function theorem, the last condition in Assumption 1 implies a unique, locally Lipschitz map \({\boldsymbol{y}}={\boldsymbol{y}}({\boldsymbol{x}},{\boldsymbol{\mu}})\) fulfilling \({\boldsymbol{H}}({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}})=\mathbf{0}\)—so no initial or boundary condition for \({\boldsymbol{y}}(0)\) is required. This ensures that the system is a semi-explicit DAE with index \(1\).6
The conditions in Assumption 1 ensure uniqueness given an initial condition—eliminating sources of multiplicity that may arise from the jump variables, \({\boldsymbol{y}}(t)\).7
In addition, we add further standard assumptions on the problem formulation to ensure that both the state and co-state variables in a solution are bounded and strictly positive.8
Assumption 2 (Bounded Solutions). Assume that:
For any given \({\boldsymbol{x}}_0\), there is a unique solution, \({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)\) to 1 2 3 4 5 .
There exist bounds such that, \(\mathbf{0} < \underline{{\boldsymbol{x}}} < {\boldsymbol{x}}(t) < \bar{{\boldsymbol{x}}} < \infty, \mathbf{0} < \underline{{\boldsymbol{\mu}}} < {\boldsymbol{\mu}}(t)< \bar{{\boldsymbol{\mu}}} < \infty,\) and \(0 < \underline{{\boldsymbol{y}}}< {\boldsymbol{y}}(t) < \bar{{\boldsymbol{y}}} < \infty\) for all \(t > 0\) on the solution path for a given \({\boldsymbol{x}}_0\).
Before we explore the kernel-based methods in Section 3 and solve the neoclassical growth model in Section 4.1, we will analyze shooting methods to provide intuition for our solution algorithm.
The central computational challenge is that imposing the boundary condition 5 numerically is not directly feasible, as it is asymptotic. Shooting methods, and related algorithms that solve an implicit equation with finite-horizon BVP methods, use the insight that if we could replace the asymptotic boundary condition in 5 with the right \({\boldsymbol{\mu}}(0) = {\boldsymbol{\mu}}_0\), then Assumptions 1 and 2 ensure the system is a well-posed initial value problem (IVP)—routinely solved for DAEs with millions of equations using software such as [32].
First, we need to contrast two types of functions fulfilling the ODEs: a solution trajectory fulfills the full set of well-posed equations 1 2 3 4 5 ; a non-solution trajectory fulfills the ill-posed system 1 2 3 4 but fails the asymptotic boundary condition 5 . Assumptions 1 and 2 ensure that we have a unique bounded solution, but there is usually a continuum of non-solutions, each associated with a different \({\boldsymbol{\mu}}(0)\). The solution is indexed by its initial condition \({\boldsymbol{\mu}}_0\), a single point inside an \({M}\)-dimensional set of non-solutions.
Shooting methods. In many applications it is straightforward to compute a steady state \({\boldsymbol{x}}(\infty),\, {\boldsymbol{\mu}}(\infty),\, {\boldsymbol{y}}(\infty)\), which is bounded by Assumption 2. A shooting algorithm takes an initial co-state \(\tilde{{\boldsymbol{\mu}}}_0\), integrates the dynamics to a large \(T\) using a standard DAE/ODE IVP solver, and evaluates \[\Psi(\tilde{{\boldsymbol{\mu}}}_0;T) \equiv \begin{bmatrix} {\boldsymbol{x}}(T;\tilde{{\boldsymbol{\mu}}}_0) - {\boldsymbol{x}}(\infty)\\ {\boldsymbol{\mu}}(T;\tilde{{\boldsymbol{\mu}}}_0) - {\boldsymbol{\mu}}(\infty)\\ {\boldsymbol{y}}(T;\tilde{{\boldsymbol{\mu}}}_0) - {\boldsymbol{y}}(\infty) \end{bmatrix}^{\top}.\] One then finds \(\tilde{{\boldsymbol{\mu}}}_0\) such that \(\Psi(\tilde{{\boldsymbol{\mu}}}_0;T) \approx \mathbf{0}\) using a root–finding method (e.g., bisection, Newton). The transversality condition in 5 is used only implicitly, through the calculation of the steady state, but it is always present in the background. The stability of this procedure is governed by the Jacobian \(\nabla\Psi(\tilde{{\boldsymbol{\mu}}}_0;T)\).9
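To fix ideas, the following is a minimal sketch of this loop for the neoclassical growth model 6 7 8 9 10 , using the baseline parameters later adopted in 4.1; the steady-state formulas, the bracketing interval for \(\tilde{\mu}_0\), the horizon \(T\), and the safeguards are our own illustrative choices and typically need tuning.

```python
# A minimal shooting-method sketch for the neoclassical growth model,
# eqs. (6)-(10), with the baseline parameters later used in Section 4.1.
# The bracket for mu(0), the horizon T, and the safeguards are illustrative.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

a, delta, r, x0, T = 1 / 3, 0.1, 0.11, 1.0, 30.0
f = lambda x: x**a                        # production function
fp = lambda x: a * x ** (a - 1.0)         # its derivative

def rhs(t, z):
    x, mu = max(z[0], 1e-12), max(z[1], 1e-12)   # crude positivity guard
    y = 1.0 / mu                          # jump variable from eq. (8)
    return [f(x) - delta * x - y,                  # eq. (6)
            r * mu - mu * (fp(x) - delta)]         # eq. (7)

# Steady state (used only as the shooting target): f'(x*) = r + delta.
x_ss = (a / (r + delta)) ** (1.0 / (1.0 - a))
mu_ss = 1.0 / (f(x_ss) - delta * x_ss)

def collapse(t, z):                       # stop early if capital collapses
    return z[0] - 1e-6
collapse.terminal = True

def Psi(mu0):
    """Terminal deviation of capital from its steady state."""
    sol = solve_ivp(rhs, (0.0, T), [x0, mu0], events=collapse,
                    rtol=1e-9, atol=1e-11)
    return sol.y[0, -1] - x_ss

# Root-find on the initial co-state; the bracket was found by trial and error.
mu0_star = brentq(Psi, 0.5 * mu_ss, 5.0 * mu_ss)
print("saddle-path mu(0) is approximately", mu0_star)
```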
Challenges with shooting methods. Even when the steady state is easily calculated, in practice the algorithm is unstable and highly sensitive to both the initial guess \(\tilde{{\boldsymbol{\mu}}}_0\) and the chosen horizon \(T\).10
If \(T\) is too small, the algorithm may converge but produces a biased approximation because it forces trajectories to approach the steady state too quickly. If \(T\) is too large, then non-solution trajectories—those failing 5 —separate rapidly from the true solution. In that case the Jacobian \(\nabla\Psi(\tilde{{\boldsymbol{\mu}}}_0;T)\) becomes ill-conditioned: \(\Psi(\tilde{{\boldsymbol{\mu}}}_0;T)\) is close to zero in a small neighborhood around the true \({\boldsymbol{\mu}}_0\) and diverges sharply outside it. This lack of smoothness renders root-finding highly sensitive in higher dimensions. Thus one faces a tradeoff: a small \(T\) yields stability but bias, while a large \(T\) is unbiased but numerically unstable.
This sensitivity is inherent to discounted optimal control problems with a saddle-path structure: for a given \({\boldsymbol{x}}_0\), the stable manifold is lower-dimensional, so any initialization \(\tilde{{\boldsymbol{\mu}}}_0\neq {\boldsymbol{\mu}}_0\) generates a non-solution trajectory that violates transversality. For our problem class the effect is even sharper: [thm:divergence-speed] shows that, under our assumptions, any non-solution trajectory has co-states growing at least at the discount rate \(r\) appearing in the transversality condition 5 . Hence small errors in \(\tilde{{\boldsymbol{\mu}}}_0\) accumulate at least as fast as \(e^{rt}\), making \(\nabla\Psi(\tilde{{\boldsymbol{\mu}}}_0;T)\) numerically unstable as \(T\) increases.11
Theorem [thm:divergence-speed] (Divergence Rate of Non-Solutions). Let \(\tilde{{\boldsymbol{\mu}}}_0\) be the initial condition associated with a non-solution, and let \({\boldsymbol{y}}({\boldsymbol{x}},\tilde{{\boldsymbol{\mu}}})\) denote the solution of 3 given Assumption 1, i.e., \({\boldsymbol{H}}\!\left({\boldsymbol{x}}, \tilde{{\boldsymbol{\mu}}}, {\boldsymbol{y}}({\boldsymbol{x}},\tilde{{\boldsymbol{\mu}}})\right) = \mathbf{0}.\) Suppose there exist points \(\tilde{{\boldsymbol{x}}}^* \in \mathbb{R}^{M}\) and \(\tilde{{\boldsymbol{y}}}^* \in \mathbb{R}^{P}\) such that \[\lim_{\tilde{{\boldsymbol{\mu}}} \to \infty} {\boldsymbol{y}}(\tilde{{\boldsymbol{x}}}^*,\tilde{{\boldsymbol{\mu}}}) = \tilde{{\boldsymbol{y}}}^*, \quad \lim_{\tilde{{\boldsymbol{\mu}}} \to \infty} {\boldsymbol{F}}\!\left(\tilde{{\boldsymbol{x}}}^*, \tilde{{\boldsymbol{\mu}}}, {\boldsymbol{y}}(\tilde{{\boldsymbol{x}}}^*,\tilde{{\boldsymbol{\mu}}})\right) = \mathbf{0}, \quad \lim_{\tilde{{\boldsymbol{\mu}}} \to \infty} {\boldsymbol{G}}\!\left(\tilde{{\boldsymbol{x}}}^*, \tilde{{\boldsymbol{\mu}}}, {\boldsymbol{y}}(\tilde{{\boldsymbol{x}}}^*,\tilde{{\boldsymbol{\mu}}})\right) \leq \mathbf{0}.\] Then \[\lim_{t \to \infty} \frac{\dot{\tilde{{\boldsymbol{\mu}}}}^{(m)}(t)}{\tilde{{\boldsymbol{\mu}}}^{(m)}(t)} \;\geq\; r, \qquad \text{ for some } m = 1,\ldots,M.\] Furthermore, if \(\lim_{{\boldsymbol{\mu}}\to \infty} {\boldsymbol{G}}\!\left(\tilde{{\boldsymbol{x}}}^*, {\boldsymbol{\mu}}, {\boldsymbol{y}}(\tilde{{\boldsymbol{x}}}^*,{\boldsymbol{\mu}})\right)^{(m)} < 0\) for some \(m\), then \[\lim_{t \to \infty} \frac{\dot{\tilde{{\boldsymbol{\mu}}}}^{(m)}(t)}{\tilde{{\boldsymbol{\mu}}}^{(m)}(t)} \;> \; r.\]
Proof. Rewrite 2 componentwise and take limits \[\begin{align} \lim_{t \to \infty}\frac{\dot{{\boldsymbol{\mu}}}^{(m)}(t)}{{\boldsymbol{\mu}}^{(m)}(t)} &= \lim_{t \to \infty} \left(r - {\boldsymbol{G}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t))^{(m)}\right) = \lim_{{\boldsymbol{\mu}}\to \infty} r - {\boldsymbol{G}}(\tilde{{\boldsymbol{x}}}^*, {\boldsymbol{\mu}}, \tilde{{\boldsymbol{y}}}^*)^{(m)}. \end{align}\] Note that if \(\lim_{{\boldsymbol{\mu}}\to \infty}{\boldsymbol{G}}(\tilde{{\boldsymbol{x}}}^*, {\boldsymbol{\mu}}, \tilde{{\boldsymbol{y}}}^*)^{(m)} < 0\) for some \(m\), then \(\lim_{t \to \infty} \frac{\dot{\tilde{{\boldsymbol{\mu}}}}^{(m)}(t)}{\tilde{{\boldsymbol{\mu}}}^{(m)}(t)} > r\) ◻
This result confirms the intuition for why shooting methods are difficult to use in practice. Not only do the non-solutions diverge, they do so exponentially, at least as fast as the discount rate \(r\) in the transversality condition 5 . Consequently, a small error in the initial condition at time zero, \(\tilde{{\boldsymbol{\mu}}}_0\), accumulates exponentially over time—which leads to the numerical instability of calculating \(\nabla \Psi(\cdot;T)\) in shooting methods.
Verification for neoclassical growth. The condition in [thm:divergence-speed] is easily verified in the neoclassical growth model. If \(\tilde{\mu}(t)\to\infty\), then by 8 we have \(\tilde{y}^*=\lim_{\tilde{\mu}\to\infty}\tilde{\mu}^{-1} = 0\). Use this result with 6 to find that \(\tilde{x}^*=\lim_{t\to\infty}\tilde{x}(t)\) must satisfy \(f(\tilde{x}^*)=\delta\,\tilde{x}^*\). Rearrange using \(f(0) = 0\) to get \(\frac{f(\tilde{x}^*) - f(0)}{\tilde{x}^*}=\delta\). By strict concavity \(f'(\tilde{x}^*)<\frac{f(\tilde{x}^*)-f(0)}{\tilde{x}^*}=\delta\). Finally, 7 gives us \(G(\tilde{x}^*,\tilde{\mu}(t),\tilde{y}^*)=f'(\tilde{x}^*)-\delta < 0\). The rate of divergence, by [thm:divergence-speed], is strictly faster than \(r\).
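A quick numerical check of this rate, assuming the baseline parameters used later in 4.1, integrates the model from a deliberately too-high co-state and reports the implied growth rate of \(\tilde{\mu}\):

```python
# A numerical illustration of [thm:divergence-speed] in the growth model:
# start from a deliberately too-high co-state, integrate eqs. (6)-(8), and
# report the growth rate of mu, which settles above the discount rate r.
# Baseline parameters from Section 4.1 are assumed.
import numpy as np
from scipy.integrate import solve_ivp

a, delta, r, x0, T = 1 / 3, 0.1, 0.11, 1.0, 200.0

def rhs(t, z):
    x, mu = z
    y = 1.0 / mu                                       # eq. (8)
    return [x**a - delta * x - y,                      # eq. (6)
            r * mu - mu * (a * x ** (a - 1) - delta)]  # eq. (7)

sol = solve_ivp(rhs, (0.0, T), [x0, 3.0], rtol=1e-9, atol=1e-12)
x_T = sol.y[0, -1]
rate = r - (a * x_T ** (a - 1) - delta)                # mudot/mu at the endpoint
print(f"x(T) -> {x_T:.2f}; asymptotic mudot/mu ~ {rate:.3f} > r = {r}")
```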
The results of [thm:divergence-speed] only partially characterize the set of non-solutions—concentrating on the primary case which makes shooting methods difficult and our own methods successful. Other cases which pose no challenge for shooting methods (such as a divergent \({\hat{\boldsymbol{x}}}(t)\) with a stationary \(\hat{\boldsymbol{\mu}}(t)\)) tend to be problem dependent and rely on Assumption 2 (see standard saddle-path analysis in [29]). For example, in the growth model a divergent \(\tilde{x}(t)\) leads to a negative \(\tilde{y}(t)\), which contradicts Assumption 2—and can be eliminated by adding extra bounds to the optimization problem.
The key takeaway from these results is that shooting methods, and similar approaches imposing the steady-state as a terminal condition, are inherently sensitive to the choice of \(T\) since all non-solutions explode exponentially, leading to instability when evaluating the system pointwise at the terminal condition.
The key insight behind our methods is that we can use the rapid divergence separating the unique solution from the non-solutions to our advantage.
While [thm:divergence-speed] shows that the pointwise evaluation of transversality at a large \(T\) is sensitive, it hints that more global approaches to contrast solutions from non-solutions might be effective. In particular, many function norms and semi-norms can provide a numerically stable alternative. If a non-solution diverges, the norm of the corresponding state or co-state variables will eventually exceed that of the optimal solution for a large enough \(T\), thereby violating 5 . Therefore, any algorithm that solves 1 2 3 4 while controlling a norm of the state and co-state variables over a finite time horizon can get arbitrarily close to the optimal solution.
While we could use this insight for different types of function approximation and norms (e.g., [5] empirically shows similar results arising from inductive bias in deep learning), kernel methods have the advantage that the norms are precisely defined in the Reproducing Kernel Hilbert Space (RKHS) and are numerically stable even for large \(T\).
Sobolev norm solutions. Before discussing function approximation in the RKHS space and determining whether it aligns with the solution concept of the DAE, we can consider the natural function space in which solutions to our problem class reside. This provides us with a precise way to compare degrees of divergence that is not pointwise.
Differential equations of this sort typically align with a space defined by the Sobolev norm of the solution’s derivative. In particular, we consider the Sobolev-\(2,2\) space of functions, which is defined as the space of functions \(w(\cdot)\) where the function, as well as its first and second (weak) derivatives are square integrable over the domain \([0, T]\): \[\begin{align} \mathcal{W}^{2,2}([0,T]) &= \{ w(\cdot) : w(\cdot), \dot{w}(\cdot), \ddot{w}(\cdot) \in L^2([0,T]) \}. \end{align}\] For any \(w(\cdot) \in \mathcal{W}^{2,2}([0,T])\), we have that \(\dot{w}(\cdot)\) lives in the Sobolev-\(1,2\) space-the space of square-integrable functions with square-integrable first (weak) derivatives. The Sobolev-\(1,2\) norm of \(\dot{w}(\cdot)\), also defined as the Sobolev-\(2,2\) semi-norm of \(w(\cdot)\), can be viewed as a measure of complexity of the differential equation solution \(w(\cdot)\): \[\begin{align} \vert w(\cdot) \vert_{\mathcal{W}^{2,2}([0,T])} := \Vert \dot{w}(\cdot) \Vert_{\mathcal{W}^{1,2}([0,T])} := \left( \int_0^T \left( \vert \dot{w}(t) \vert^2 + \vert \ddot{w}(t) \vert^2 \right) dt \right)^{1/2}. \end{align}\]
In other words: solutions (and non-solutions on a finite domain) and their derivatives belong to a Sobolev space with a well-defined semi-norm.12 These are reasonable assumptions given the connections between infinite-horizon optimal solutions in economic growth models and Sobolev spaces; see [28] and [33].
Back to our setup: intuitively, trajectories satisfying 1 2 3 4 either converge and fulfill 5 , or diverge, for example with \(\dot{\tilde{{\boldsymbol{\mu}}}}(t) > 0\); the convergent solution therefore has a lower semi-norm. As \(T\) increases, the solution and non-solutions increasingly separate, since [thm:divergence-speed] shows \(\frac{\dot{\tilde{{\boldsymbol{\mu}}}}^{(m)}(t)}{\tilde{{\boldsymbol{\mu}}}^{(m)}(t)} \;\geq\; r\) for some \(m = 1,\cdots,M\). Therefore, given the bounds in 2 , to find the solution that fulfills transversality it is sufficient to choose, among all trajectories satisfying 1 2 3 4 , the one for which \(\Vert {\boldsymbol{x}}\Vert_{\mathcal{W}^{1,2}([0,T])} + \Vert {\boldsymbol{\mu}}\Vert_{\mathcal{W}^{1,2}([0,T])}\) is minimized.
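To illustrate this separation, the sketch below computes the semi-norm by simple quadrature for a path converging to a steady state and for a path growing like \(e^{rt}\); the two test functions are illustrative stand-ins, not model solutions.

```python
# A rough quadrature illustration of the separation argument: the
# W^{2,2}-semi-norm (the W^{1,2} norm of the derivative) of a path converging
# to a steady state stays bounded as T grows, while that of a path growing
# like e^{r t} increases rapidly. The test functions are illustrative only.
import numpy as np

r = 0.11

def seminorm(w_dot, w_ddot, T, n=20_000):
    t = np.linspace(0.0, T, n)
    integrand = w_dot(t) ** 2 + w_ddot(t) ** 2
    return np.sqrt(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

# converging path: w(t) = 2 - e^{-0.1 t}, so its derivatives decay to zero
conv = lambda T: seminorm(lambda t: 0.1 * np.exp(-0.1 * t),
                          lambda t: -0.01 * np.exp(-0.1 * t), T)
# explosive path: w(t) = e^{r t}
expl = lambda T: seminorm(lambda t: r * np.exp(r * t),
                          lambda t: r**2 * np.exp(r * t), T)

for T in (10.0, 20.0, 40.0):
    print(f"T = {T:4.0f}: converging {conv(T):7.3f}   explosive {expl(T):10.3f}")
```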
In this section we first define our approximation class and present an algorithm for solving the underdetermined DAE, 1 2 3 4 , using ridgeless kernel regression. Next, [thm:bound-norms] shows that a minimum-norm solution is a sufficient condition to satisfy the pointwise transversality requirement—ensuring that the unique solution to the underdetermined kernel regression is the one fulfilling transversality. Finally, we establish in [thm:consistency] that ridgeless kernel regression is consistent and therefore asymptotically enforces the minimum norm condition and, by [thm:bound-norms], solves the full system 1 2 3 4 5 . Although intuitive, the result is nontrivial because it requires demonstrating alignment and mathematical consistency between the RKHS of the approximate solution, its derivatives, and the solution concept of the DAE itself.
Our algorithm models \({\boldsymbol{x}}(t)\), \({\boldsymbol{\mu}}(t)\), and \({\boldsymbol{y}}(t)\), together with their derivatives, using kernel machines—an exemplar-based approximation in which functions are expressed as combinations of their values at the data, weighted by a function that encodes a notion of distance [8]. More specifically, our approximation ensures that the DAE is satisfied on some finite set of (possibly irregular) points \(\mathcal{D}:= \{t_1,\ldots,t_N\}\), which we refer to as the “training data,” while guaranteeing that it has the minimum function norm amongst all possible solutions satisfying this condition. As a non-parametric method, the number of parameters defining our approximation grows with the data. Given a kernel function \(k(\cdot,\cdot)\), our approximations take the form: \[\begin{gather} {\hat{\dot{\boldsymbol{x}}}}(t) = \mathop{\mathrm{{\textstyle \sum}}}_{j=1}^N \boldsymbol{\alpha}^x_j k(t, t_j), \qquad {\hat{\dot{\boldsymbol{\mu}}}}(t) = \mathop{\mathrm{{\textstyle \sum}}}_{j=1}^N \boldsymbol{\alpha}^\mu_j k(t, t_j), \qquad {\hat{\dot{\boldsymbol{y}}}}(t) = \mathop{\mathrm{{\textstyle \sum}}}_{j=1}^N \boldsymbol{\alpha}^y_j k(t, t_j),\\ {\hat{\boldsymbol{x}}}(t) = {\boldsymbol{x}}_0 + \mathop{\mathrm{{\textstyle \int}}}_0^t \dot{\hat{\boldsymbol{x}}}(\tau) d\tau, \qquad \hat{\boldsymbol{\mu}}(t) = \hat{\boldsymbol{\mu}}_0 + \mathop{\mathrm{{\textstyle \int}}}_0^t \dot{\hat{\boldsymbol{\mu}}}(\tau) d\tau, \qquad {\hat{\boldsymbol{y}}}(t) = {\hat{\boldsymbol{y}}}_0 + \mathop{\mathrm{{\textstyle \int}}}_0^t \dot{\hat{\boldsymbol{y}}}(\tau) d\tau, \qquad \end{gather} \label{eq:kernel}\tag{11}\] where \(\boldsymbol{\alpha}^x_j\), \(\boldsymbol{\alpha}^\mu_j\), \(\boldsymbol{\alpha}^y_j\), \(\hat{\boldsymbol{\mu}}_0\), and \({\hat{\boldsymbol{y}}}_0\) are parameters fit to fulfill our DAE.13 We approximate the time-derivative, and integrate to obtain the function values. While it is possible to approximate the function itself, the benefits of this formulation are that the initial condition \({\hat{\boldsymbol{x}}}(0) = {\boldsymbol{x}}_0\) is directly enforced, and extrapolation performance is significantly improved.14
In this paper, we use a Matérn kernel for \(k(\cdot, \cdot)\), with smoothness \(\nu\) and lengthscale \(\ell\) (see Definition 1 in 7). The choice of the Matérn kernel family is driven by theoretical considerations to formally align with the solution concept of the DAE, but many other kernels (e.g., Gaussian kernels) have similar performance in practice, even if they impose more smoothness than is strictly necessary.
Function Norms. Central to these methods is that the kernel function, \(k(\cdot,\cdot)\) has an associated function space, its Reproducing Kernel Hilbert Space, \(\mathcal{H}\) which provides an inner product and an associated function norm. Moreover, the norm of functions in this space can be calculated as a quadratic form of its coefficients. In particular, for the approximated \({\hat{\dot{\boldsymbol{x}}}}(t)\) and \({\hat{\dot{\boldsymbol{\mu}}}}(t)\) in 11 , \(\Vert {\hat{\dot{\boldsymbol{x}}}}^{(m)} \Vert^2_{\mathcal{H}}\) and \(\Vert {\hat{\dot{\boldsymbol{\mu}}}}^{(m)} \Vert^2_{\mathcal{H}}\), are constructed as follows: \[\begin{gather} \Vert {\hat{\dot{\boldsymbol{x}}}}^{(m)} \Vert^2_{\mathcal{H}} = \sum_{i=1}^N \sum_{j=1}^N \alpha^{x^{(m)}}_i \, \alpha^{x^{(m)}}_j \, k(t_i, t_j), \\ \Vert {\hat{\dot{\boldsymbol{\mu}}}}^{(m)} \Vert^2_{\mathcal{H}} = \sum_{i=1}^N \sum_{j=1}^N \alpha^{\mu^{(m)}}_i \, \alpha^{\mu^{(m)}}_j \, k(t_i, t_j) \end{gather}\label{eq:kernel-norms}\tag{12}\] where superscript \((m)\) denotes the coefficients corresponding to the \(m\)-th state or co-state variable. For instance, \(\boldsymbol{\alpha}^{x^{(m)}}_i\) is the \(i\)-th element of the learnable coefficients for the \(m\)-th state variable.
The subscript \(\mathcal{H}\) denotes the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel \(k(\cdot,\cdot)\). The RKHS norm \(\Vert\cdot\Vert_{\mathcal{H}}\) measures the complexity of a function in a way determined by \(k(\cdot,\cdot)\) and is used to control the smoothness of the approximating functions. For more details, see [34]. In the case of Matérn kernels, as will be discussed later, this norm has a concrete smoothness interpretation: it directly controls the magnitude of the function’s derivatives up to a certain order.
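As a concrete illustration of 11 and 12 , the sketch below evaluates a scalar kernel machine and its squared RKHS norm under the Matérn-\(1/2\) kernel used in our experiments; the coefficient values are arbitrary placeholders.

```python
# Evaluating a scalar kernel machine (eq. 11) and its squared RKHS norm
# (eq. 12) under the Matern-1/2 kernel k(t,t') = sigma^2 exp(-|t-t'|/ell).
# The coefficients alpha are arbitrary placeholders, not fitted values.
import numpy as np

ell, sigma = 10.0, 1.0
t_train = np.arange(0.0, 41.0)                       # training points D
k = lambda s, u: sigma**2 * np.exp(-np.abs(s - u) / ell)

alpha = np.random.default_rng(0).normal(scale=0.1, size=t_train.size)

def xdot_hat(s):
    """Kernel-machine evaluation: sum_j alpha_j k(s, t_j)."""
    return k(s, t_train) @ alpha

K = k(t_train[:, None], t_train[None, :])            # Gram matrix
norm_sq = alpha @ K @ alpha                          # eq. (12)
print(xdot_hat(2.5), norm_sq)
```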
Ridgeless Kernel Regression. Kernel methods are tailored to solve empirical risk minimization (ERM) style objectives in this function space, such as \(\min_{h\in\mathcal{H}}\{\sum_{i=1}^N \Phi(h(t_i)) + \lambda ||h||_{\mathcal{H}}^2\}\) for some loss \(\Phi(\cdot)\) and regularization term \(\lambda > 0\). The Representer Theorems show this has a unique solution of the form \(h(t) = \sum_{j=1}^N \alpha_j k(t, t_j)\) for some \(\alpha_j \in \mathbb{R}\) [34], [35], with norm \(\Vert h \Vert^2_{\mathcal{H}} = \sum_{i=1}^N \sum_{j=1}^N \alpha_i \, \alpha_j \, k(t_i, t_j)\). Furthermore, standard results (e.g., [36] and [37]) show that in the limit as \(\lambda \to 0\) this is equivalent to directly minimizing that norm subject to interpolating the data, i.e., \(\min_{h\in\mathcal{H}} ||h||_{\mathcal{H}}^2 \text{ s.t. }\Phi(h(t_i)) = 0\) for all \(t_i \in \mathcal{D}\).
Mapping to our problem, the loss will interpolate the system of equations in 1 2 3 4 and minimize the sum of RKHS norms of the approximating functions to solve, \[\begin{align} \min_{\substack{ {\hat{\dot{\boldsymbol{x}}}}, {\hat{\dot{\boldsymbol{\mu}}}},{\hat{\dot{\boldsymbol{y}}}} }}\, & \left( \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert {\hat{\dot{\boldsymbol{x}}}}^{(m)} \Vert^2_{\mathcal{H}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert {\hat{\dot{\boldsymbol{\mu}}}}^{(m)} \Vert^2_{\mathcal{H}}\right)\label{eq:erm-norm}\\ \text{s.t.}~\, & {\hat{\dot{\boldsymbol{x}}}}(t_i) = {\boldsymbol{F}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)),\quad\text{for all } t_i \in \mathcal{D}\nonumber\\ & {\hat{\dot{\boldsymbol{\mu}}}}(t_i) = r\hat{\boldsymbol{\mu}}(t_i)- \hat{\boldsymbol{\mu}}(t_i)\odot{\boldsymbol{G}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)),\quad\text{for all } t_i \in \mathcal{D}\nonumber\\ & \mathbf{0} = {\boldsymbol{H}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)),\quad\text{for all } t_i \in \mathcal{D}.\nonumber \end{align}\tag{13}\] The power of kernel methods is that the Representer Theorems shows that this infinite-dimensional non-parametric problem in a function space becomes tractable and finite-dimensional in \(\boldsymbol{\alpha}^x_j\), \(\boldsymbol{\alpha}^\mu_j\), \(\boldsymbol{\alpha}^y_j\), \(\hat{\boldsymbol{\mu}}_0\), and \({\hat{\boldsymbol{y}}}_0\) since solutions can be expressed as 11 12 .15 When the scale is such that constrained optimizers are insufficient, one can instead solve the optimization problem for a fixed, small regularization penalty using the more standard ERM formulation. For details, see 6.
Intuitively, minimizing 13 finds the function with the minimum RKHS norm among all possible interpolating solutions. Next we need to show why this minimum norm solution is the one which fulfills transversality.
While it is intuitive that a minimum function norm solution is sufficient to enforce transversality, we need to proceed cautiously to relate the norms of the DAE solution concept to that of the RKHS. While many different kernels, and associated norms, could be used in practice, we will emphasize Matérn kernels due to their formal connection to Sobolev spaces.
Matérn kernels. For functions defined over compact domains, it is well established that the RKHS of the Matérn kernel with \(\nu = P - 1/2\) coincides with \(\mathcal{W}^{P,2}([0, T])\) for all \(P \geq 1\), and for a specific value of \(\ell\) the norms are exactly equal (see 7). Thus, modelling the derivatives with the \(\nu = 1/2\) Matérn kernel (with appropriate lengthscale) produces minimum-\(\mathcal{W}^{1,2}([0, T])\)-norm derivatives (and thus minimum \(\mathcal{W}^{2,2}([0, T])\)-semi-norm solutions). We are assuming more curvature than is strictly necessary, given that the solution concept of the DAE only requires a first derivative. However, targeting the \(\mathcal{W}^{2,2}([0, T])\)-semi-norm provides the necessary control to ensure transversality, as we will now show.
Given that 5 is expressed in terms of the product \({\boldsymbol{x}}(t)\odot {\boldsymbol{\mu}}(t)\), but that 13 and the Representer Theorems are written in terms of function norms, we first need to show how penalized norms are sufficient to control the product \({\boldsymbol{x}}(t) \odot {\boldsymbol{\mu}}(t)\) globally—and consequently will bound the pointwise transversality condition.
Theorem [thm:bound-norms] (Bounding with Norms). Let \(f\) and \(g\) be elements of \(\mathcal{W}^{2,2}([0, T])\). Then, for some \(D_1 < \infty\), \[\begin{align} \sup_{t} \left| f(t) g(t) \right| &\leq D_1 \left( \left\Vert f\right\Vert_{\mathcal{H}}^2 + \left\Vert g \right\Vert_{\mathcal{H}}^2 \right),\tag{14}\\ \intertext{and from the Sobolev embedding theorem, there exists some \(D_2 < \infty\) such that} \sup_{t} \left| f(t) g(t) \right| &\leq D_2 \left( \left\Vert \dot{f}\right\Vert_{\mathcal{H}}^2 + \left\Vert \dot{g} \right\Vert_{\mathcal{H}}^2 \right),\tag{15} \end{align}\] where \(\mathcal{H}\) is the Matérn RKHS with \(\nu = \frac{1}{2}\).
Proof. See 8. ◻
This theorem connects the norm of the derivatives to the maximum value the product of two functions can attain. Intuitively, this result shows why controlling the sum of the RKHS norms of the derivatives can rule out unbounded and explosive functions. Functions that diverge pointwise on \([0,T]\) must have an RKHS norm that becomes unbounded as \(T\) grows.
To reiterate, this shows that a minimum norm of the approximated derivatives is sufficient to control the pointwise transversality condition and aligns with the underlying Sobolev norm of the solution concept of the DAE. While we have assumed more smoothness than is strictly necessary, e.g., \(\mathcal{W}^{2,2}([0, T])\) instead of \(\mathcal{W}^{1,2}([0, T])\), this is not a significant limitation in practice as we demonstrate in 4.16
While applying [thm:bound-norms] shows that the min norm solution of the DAE will be the one that fulfills 5 , we need to show that our kernel solution minimizing 13 approximates the true solution despite only satisfying the DAE on a finite number of points. To that end, we demonstrate that our approximation is a consistent estimator of the true minimum norm solution, meaning that as \(N \to \infty\) the empirical solutions \({\hat{\boldsymbol{x}}}_N, \hat{\boldsymbol{\mu}}_N, {\hat{\boldsymbol{y}}}_N\) converge to the solution of 1 2 3 4 5 .
Theorem [thm:consistency] (Consistency). Given some \(0 < K < \infty\), let \({\mathbb{S}}\) be the set of functions \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}})\) that satisfy 1 2 3 4 and 1 with \({\boldsymbol{x}}(0) = {\boldsymbol{x}}_0\) and \(\Vert {\boldsymbol{\mu}}(0) \Vert_\infty, \Vert {\boldsymbol{y}}(0) \Vert_\infty \leq K\). Then the minimum norm solution \[({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*) = \textstyle{\inf_{({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}}} \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{x}^{(m)} \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{\mu}^{(m)} \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert \dot{y}^{(m)} \Vert^2_\mathcal{H}\] exists and has bounded \(\mathcal{H}\) norm. Moreover, if \(t \in \mathcal{D}\) are drawn uniformly i.i.d. from \([0, T]\) then the solutions \({\hat{\boldsymbol{x}}}_N, \hat{\boldsymbol{\mu}}_N, {\hat{\boldsymbol{y}}}_N\) from 13 with the Matérn-\(1/2\) kernel satisfy 1 2 3 4 almost everywhere in the limit as \(N \to \infty\) and \[\begin{align} \textstyle{\lim_{N \to \infty}} & \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \hat{\dot{x}}^{(m)}_N \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \hat{\dot{\mu}}^{(m)}_N \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert \hat{\dot{y}}^{(m)}_N \Vert^2_\mathcal{H} \\ \overset{\mathrm{a.s.}}{=} \: \textstyle{\inf_{({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}}} & \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{x}^{(m)} \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{\mu}^{(m)} \Vert^2_\mathcal{H} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert \dot{y}^{(m)} \Vert^2_\mathcal{H}. \end{align}\]
Proof. See 9. ◻
Though this proof borrows techniques from the statistical learning literature on kernel methods [40], it is non-trivial for several reasons. First, we need to show that solutions to the DAE exist in the RKHS induced by the Matérn kernel, and that a minimum norm solution exists. Second, we need to prove that uniform convergence of derivatives implies uniform convergence of the DAE solution functions themselves. Given these results, we can be confident that with enough data and a large enough \(T\), the min semi-norm solution exists, will be consistently approximated by the empirical solutions \({\hat{\boldsymbol{x}}}_N, \hat{\boldsymbol{\mu}}_N, {\hat{\boldsymbol{y}}}_N\) fulfilling 1 2 3 4 , and that it will be sufficient to ensure the transversality condition 5 holds.
We solve two standard baselines in dynamic economics: the neoclassical growth model in 4.1, and a model of risk-neutral asset pricing in 4.3. These problems are chosen because they are standard examples in textbooks [41], they admit reference solutions from classical methods, and they have established results regarding the set of solutions to the ill-posed versions without the asymptotic boundary conditions. In 4.2, we present a case with multiple steady states—a challenging setting where our methods are particularly useful—and summarize additional experiments in 4.4.
In all cases, we use a Matérn kernel with \(\nu = \frac{1}{2}\), \(\ell = 10\), and \(\sigma = 1\) (see 1) and use \(\mathcal{D}:= \{0,1,2,\ldots,40\}\). Where possible we compare the relative errors of a kernel solution to a benchmark obtained with classic methods, e.g., \(\varepsilon_{{\boldsymbol{w}}(t)} := \left|\tfrac{{\hat{\boldsymbol{w}}}(t) - {\boldsymbol{w}}(t)}{{\boldsymbol{w}}(t)}\right|\) with corresponding definitions for the other variables. In our baseline model, all cases are solved using open-source constrained optimizers and execute in less than a second.
In this section, we solve the neoclassical growth model (also known as the Ramsey–Cass–Koopmans model) summarized by 6 7 8 9 10 . Our baseline parameters are \(x_0 = 1.0, \delta = 0.1, r = 0.11, f(x) = x^a\) and \(a= \frac{1}{3}\).
Results. Figure 1 shows consumption and capital relative to a benchmark.17 The kernel approximation recovers the optimal solution almost perfectly, fulfilling 10 despite not being provided the steady-state as a boundary condition.
The vertical dotted lines mark the end of the training set. While not our primary goal, the solution method extrapolates accurately, allowing it to learn the steady state. As discussed in 3, this is possible since we approximate the derivatives rather than the functions themselves and Matérn kernels are zero-reverting.
Sufficiency of the minimum norm solution. As discussed in 2.2, as long as \(f(x)\) is monotone and strictly concave, \(f(0) = 0\), and \(\delta > 0\), this model fulfills the conditions of [thm:divergence-speed]. That is, non-solutions of \(\tilde{\mu}(t)\) diverge asymptotically at a rate greater than \(r\). Since these are the key failures of transversality, the only modification to the optimization problem 13 required in practice was to impose the bound \(\hat{y}_0 > 0\), which prevented all other violations of 2.18
Robustness. While [thm:consistency] demonstrates consistency, it is helpful to check our methods’ sensitivity to hyperparameters and features of \(\mathcal{D}\): Section 1.1 of the Supplemental Appendix shows that our methods perform well with a much sparser and irregular \(\mathcal{D}\); Section 1.2 of the Supplemental Appendix indicates low sensitivity to different kernel hyperparameters; and Section 1.3 of the Supplemental Appendix demonstrates that the approximation remains effective in the short to medium term, even if \(\mathcal{D}\) does not contain large time values, which correspond to the solution getting very close to the steady state—an important consideration for problems where the convergence rate is unknown and the appropriate choice of \(T\) is not clear a priori.
We now turn to a more complex version of the neoclassical growth model where \(f(x) := A\max\{x^a,b_1 x^a-b_2\}\), as in [42], [43], with \(A = 0.5, b_1 = 3.0,\) and \(b_2 = 2.5\). The derivative of the production function, \(f'(x)\), exhibits a discontinuity at \(\bar{x} = \left(\frac{b_2}{b_1-1}\right)^{\frac{1}{a}}\). As a result, this problem has two steady states but a unique transition path for any given initial condition. This poses a significant challenge for classical algorithms, such as shooting and BVP methods, because these algorithms rely on an analytic characterization of the steady-state value to impose the correct boundary value for a particular initial condition \({\boldsymbol{x}}_0\). In practice, the set of steady states and the partitioning of the initial conditions into domains of attraction are not known a priori and are rarely computed outside of simple problems.
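For reference, the two candidate steady states can be computed analytically from \(f'(x) = r + \delta\) on each branch of the kinked technology; the sketch below does so, assuming the curvature \(a\) and the rates \(r\) and \(\delta\) carry over from the baseline in 4.1 (they are not restated here, so treat the numbers as illustrative).

```python
# The two candidate steady states of the kinked technology
# f(x) = A*max(x**a, b1*x**a - b2), obtained from f'(x) = r + delta on each
# branch. The curvature a and the rates r, delta are assumed to carry over
# from the Section 4.1 baseline, so treat the numbers as illustrative.
A, b1, b2 = 0.5, 3.0, 2.5
a, r, delta = 1 / 3, 0.11, 0.1

x_kink = (b2 / (b1 - 1.0)) ** (1.0 / a)                    # branches cross here
x_low = (A * a / (r + delta)) ** (1.0 / (1.0 - a))         # branch f = A x^a
x_high = (A * b1 * a / (r + delta)) ** (1.0 / (1.0 - a))   # branch f = A(b1 x^a - b2)

print("kink at x =", round(x_kink, 3))
print("low steady state :", round(x_low, 3), "valid:", x_low < x_kink)
print("high steady state:", round(x_high, 3), "valid:", x_high > x_kink)
```

Even with these values in hand, a classical method would still need the basins of attraction to know which steady state to target for a given \(x_0\).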
Results. Figure 2 shows the results using Matérn kernels for 70 different initial conditions across the basins of attraction for the two steady states.
While [thm:divergence-speed] does not hold globally, it holds locally around both steady states—eliminating the key non-solutions. However, in this case there is an additional concern that an approximate solution might jump into a different basin of attraction and go to the wrong steady state. While possible, we found that all of the examples converge to the “correct” steady state, i.e., the steady state that corresponds to the supplied initial condition \(x_0\), despite the algorithm not being provided the set of steady states or their basins of attraction.
Intuitively, this behavior is also a consequence of the minimum norm solution. Consider the two possible trajectories from \(x_0\) to each of the two steady states: the trajectory with smaller gradients and less steep dynamics will have a smaller norm.
Models of asset pricing and rational bubbles are relatively simple and often admit closed-form solutions. These models have traditionally served a pedagogical role in exploring transversality conditions; for instance, see [41]. In this context, the transversality condition is also referred to as the “no-bubble” condition.19
Model. The model values a stream of dividends, where a “bubble” is defined as a price path whose dynamics cannot be explained by the dividends process.
Let \(x(t) \in \mathbb{R}\) be the flow payoffs from a claim to an asset, and \(\mu(t) \in \mathbb{R}\) be the price of a claim to that asset. For simplicity, we assume that the flow payoffs, \(x(t)\), follow a deterministic linear process. For a given \(x_0\), the key equations are \[\begin{align} \dot{x}(t) &= c + g x(t), \tag{16}\\ \dot{\mu}(t) &= r \mu(t) - x(t) := r \mu(t) - \mu(t) \frac{x(t)}{\mu(t)},\tag{17}\\ 0 &= \lim_{t\rightarrow \infty} e^{-r t}\mu(t)x(t),\tag{18} \end{align}\] where \(c\) and \(g\) are constants governing the dividend process, and \(r>0\) denotes the discount rate of a risk-neutral investor. 18 is the “no-bubble” condition. The set of solutions to 16 and 17 , without imposing 18 , can be found analytically (for \(r > g\)): \[\begin{align} \mu(t) &= \mu_f(t) + \zeta e^{r t},\label{eq:asset-pricing-ode-general}\\ \mu_f(t) &:= \mathop{\mathrm{{\textstyle \int}}}_0^{\infty} e^{-r \tau} x(t+\tau) \mathrm{d}\tau = \tfrac{c}{r(r-g)} + \tfrac{x(t)}{r-g},\label{eq:asset-pricing-fundamental} \end{align}\] where \(\mu_f(t)\) is interpreted as the “fundamental” price of the asset, and \(\zeta \geq 0\) is indeterminate. However, when the “no-bubble” condition (i.e., 18 ) is imposed, this problem is well-posed, with a unique solution of \(\mu(t) = \mu_f(t)\) (i.e., \(\zeta = 0\)).
Selecting the “no-bubble” solution. A key advantage of this example is that the general solution above characterizes the full set of deterministic non-solutions to 16 17 that do not impose 18 . For a given function norm, apply the triangle inequality to this set of solutions to yield \(\Vert\mu_f\Vert_{\mathcal{W}} \leq \Vert\mu\Vert_{\mathcal{W}} \leq \Vert\mu_f\Vert_{\mathcal{W}} + \zeta \Vert e^{rt}\Vert_{\mathcal{W}}\). Differentiating yields \(\Vert\dot{\mu}_f\Vert_{\mathcal{W}} \leq \Vert\dot{\mu}\Vert_{\mathcal{W}} \leq \Vert\dot{\mu}_f\Vert_{\mathcal{W}} + \zeta r \Vert e^{rt}\Vert_{\mathcal{W}}\), where \(\mathcal{W}\) is a norm such as \(\mathcal{W}^{1,2}([0, T])\). Finally, note that this semi-norm is minimized when \(\zeta = 0\).
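As a quick check at the baseline parameters of this section, the sketch below evaluates this semi-norm by quadrature for several bubble sizes \(\zeta\); the grid of \(\zeta\) values and the horizon are illustrative.

```python
# Semi-norm comparison for the asset-pricing example at the baseline
# parameters of this section (x0=1, c=0.02, g=-0.2, r=0.1), where
# mu(t) = mu_f(t) + zeta * e^{r t}. The quadrature grid and the zeta values
# are illustrative.
import numpy as np

x0, c, g, r, T = 1.0, 0.02, -0.2, 0.1, 40.0
t = np.linspace(0.0, T, 20_000)

x = -c / g + (x0 + c / g) * np.exp(g * t)        # dividend path from eq. (16)
mu_f = c / (r * (r - g)) + x / (r - g)           # fundamental price
mu_f_dot = (c + g * x) / (r - g)                 # since mu_f' = x'/(r - g)
mu_f_ddot = g * (c + g * x) / (r - g)
assert np.allclose(mu_f_dot, r * mu_f - x)       # sanity check: eq. (17) holds

def seminorm(zeta):
    mu_dot = mu_f_dot + zeta * r * np.exp(r * t)
    mu_ddot = mu_f_ddot + zeta * r**2 * np.exp(r * t)
    integrand = mu_dot**2 + mu_ddot**2
    return np.sqrt(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

for zeta in (0.0, 0.01, 0.1, 1.0):
    print(f"zeta = {zeta:5.2f}: semi-norm on [0, {T:.0f}] = {seminorm(zeta):9.4f}")
```

On this grid the semi-norm is smallest at \(\zeta = 0\) and grows quickly with the bubble size, consistent with the minimum-norm criterion selecting the fundamental solution.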
To verify the divergence rate in [thm:divergence-speed], note from 17 that if \({\boldsymbol{x}}(t)\) converges to a finite steady state while \({\boldsymbol{\mu}}(t)\) diverges, then \(\lim_{t\to\infty}{\boldsymbol{G}}({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t))=\lim_{t\to\infty}\frac{{\boldsymbol{x}}(t)}{{\boldsymbol{\mu}}(t)}=0\), so \(\dot{{\boldsymbol{\mu}}}(t)/{\boldsymbol{\mu}}(t)\to r\) and \({\boldsymbol{\mu}}(t)\) grows asymptotically at rate \(r\). However, note that the speed of separating solutions from non-solutions is slower than our previous example, where the non-solutions asymptotically diverged at rates strictly faster than \(r\).
Results. Our baseline parameters are \(x_0 = 1.0, c = 0.02, g = -0.2\), and \(r = 0.1\). Figure 3 shows the results of the ridgeless kernel machine alongside the fundamental price \(\mu_f(t)\). Even without imposing the long-run “no-bubble” condition in 18 , the kernel method recovers the “fundamental” price with high accuracy. As before, the minimum norm still selects the correct solution, and as \(T\) grows the norms of non-solutions grow quickly due to [thm:divergence-speed].
To show that our method can handle problems of increasing dimensionality, we test it on a standard model of human capital and economic growth in section 2.1 of the Supplemental Appendix. In our formulation of the model, there are two jump variables and three co-state variables. With classical methods such as shooting, finding the optimal solution requires a five-dimensional search. By comparison, our algorithm computes the solution in under a second. To demonstrate the applicability of our algorithm beyond macro and finance, in section 2.2 of the Supplemental Appendix, we test it on the optimal advertising model, a classic framework in the marketing literature.
There are several natural directions for future applications and extensions. One is extending kernel methods to handle inequality and complementarity constraints, which would make them applicable to models such as lifecycle consumption and macroeconomic models with financial frictions. Another direction is adapting kernel methods to stochastic and recursive settings, where conditions like transversality must hold globally and involve expectations over random trajectories. Finally, it is important to study the performance of kernel methods in very high-dimensional settings, such as those found in the trade and spatial economics literature.
The complexity of kernel methods depends on sample size rather than the dimensionality of the state space. If a kernel captures similarity in the state space, only a modest number of samples may suffice to solve dynamic economic problems, making these methods well-suited for both very high-dimensional and recursive stochastic models.
This paper introduces ridgeless kernel methods for solving deterministic, infinite-horizon, continuous-time dynamic models formulated as DAEs. For a broad class of models, we show that selecting the minimum-norm solution guarantees transversality, addressing the main computational obstacle of traditional approaches. Kernel methods are natural for this task, since their function norms are explicit, allowing us to establish sufficiency results linking kernel solutions to the DAE solution concept.
The optimization problem in 13 is equivalent to the limit of ridge regression, as follows: \[\begin{align} &\lim_{\lambda \to 0} \Bigg\{ \min_{\substack{{\hat{\dot{\boldsymbol{x}}}}, \, {\hat{\dot{\boldsymbol{\mu}}}}, \, {\hat{\dot{\boldsymbol{y}}}}}} \; \Bigg\{ \sum_{t_i \in \mathcal{D}} \Bigg[ \left\Vert {\hat{\dot{\boldsymbol{x}}}}(t_i) - {\boldsymbol{F}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)) \right\Vert_2^2 \\ + &\left\Vert {\hat{\dot{\boldsymbol{\mu}}}}(t_i) - r \, \hat{\boldsymbol{\mu}}(t_i) + \hat{\boldsymbol{\mu}}(t_i) \odot {\boldsymbol{G}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)) \right\Vert_2^2 + \left\Vert {\boldsymbol{H}}({\hat{\boldsymbol{x}}}(t_i), \hat{\boldsymbol{\mu}}(t_i), {\hat{\boldsymbol{y}}}(t_i)) \right\Vert_2^2 \Bigg] \\ &\quad \quad + \left\Vert {\hat{\boldsymbol{x}}}(0) - {\boldsymbol{x}}_0 \right\Vert_2^2 + \lambda \Big( \sum_{m=1}^{{M}} \Vert {\hat{\dot{\boldsymbol{x}}}}^{(m)} \Vert^2_{\mathcal{H}} + \sum_{m=1}^{{M}} \Vert {\hat{\dot{\boldsymbol{\mu}}}}^{(m)} \Vert^2_{\mathcal{H}} \Big) \Bigg\} \Bigg\}, \end{align}\]
The formulation in 13 is referred to as ridgeless due to the vanishing ridge penalty term as \(\lambda \to 0\). In practice, rather than taking the limit, one can simply choose a small fixed \(\lambda\). For the applications presented in this paper, we solved the problem for \(\lambda\) in the range \(10^{-4}\) to \(10^{-6}\). The results are omitted, as they were nearly identical to the solution of the optimization problem in 13 .
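To make the mapping from 13 to a computation concrete, the following is a minimal sketch of this small-\(\lambda\) formulation for the neoclassical growth model, with the jump variable substituted out via 8 so that only the \(\dot{x}\) and \(\dot{\mu}\) kernel machines are fitted; the optimizer, starting values, and positivity safeguards are our own illustrative choices rather than the paper's implementation, which uses constrained optimizers.

```python
# A minimal sketch of the small-lambda penalized formulation for the
# neoclassical growth model (Section 4.1 baseline). The derivatives of x and
# mu are Matern-1/2 kernel machines; x and mu are recovered by integrating
# the kernel in closed form; y is substituted out via eq. (8) as y = 1/mu.
# A general-purpose optimizer (finite-difference gradients) minimizes the
# squared DAE residuals on the training grid plus the RKHS penalty. This is
# an illustrative reimplementation, not the paper's code.
import numpy as np
from scipy.optimize import minimize

a, delta, r, x0 = 1 / 3, 0.1, 0.11, 1.0
ell, sigma, lam = 10.0, 1.0, 1e-5
t = np.arange(0.0, 41.0)                       # training points D = {0,...,40}
N = t.size

K = sigma**2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)   # Gram matrix

def K_int(s, tj):
    """Closed form of int_0^s k(tau, t_j) dtau for the Matern-1/2 kernel."""
    s, tj = s[:, None], tj[None, :]
    below = ell * (np.exp(-(tj - np.minimum(s, tj)) / ell) - np.exp(-tj / ell))
    above = ell * (1.0 - np.exp(-np.maximum(s - tj, 0.0) / ell))
    return sigma**2 * (below + above)

Ki = K_int(t, t)                               # integrated kernel on the grid

def unpack(theta):
    ax, amu, mu0 = theta[:N], theta[N:2 * N], theta[-1]
    x = x0 + Ki @ ax                           # x(0) = x0 holds by construction
    mu = mu0 + Ki @ amu
    return ax, amu, K @ ax, K @ amu, x, mu

def loss(theta):
    ax, amu, xdot, mudot, x, mu = unpack(theta)
    x = np.clip(x, 1e-6, None)                 # crude positivity safeguard
    mu = np.clip(mu, 1e-6, None)               # (cf. Assumption 2)
    y = 1.0 / mu                               # jump variable from eq. (8)
    res_x = xdot - (x**a - delta * x - y)      # eq. (6) residuals on the grid
    res_mu = mudot - (r * mu - mu * (a * x ** (a - 1) - delta))   # eq. (7)
    penalty = ax @ K @ ax + amu @ K @ amu      # RKHS norms of the derivatives
    return res_x @ res_x + res_mu @ res_mu + lam * penalty

theta0 = np.concatenate([np.zeros(2 * N), [1.0]])   # start from flat paths
fit = minimize(loss, theta0, method="L-BFGS-B", options={"maxiter": 5000})
*_, x_hat, mu_hat = unpack(fit.x)
print("x_hat(0) =", x_hat[0], " x_hat(40) =", x_hat[-1])
```

In practice a constrained solver applied directly to 13 , or an analytic gradient, is preferable; the sketch above trades speed and robustness for brevity.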
Definition 1 (Matérn Kernel). Let \(k_{\nu,\ell}(\cdot, \cdot): \mathbb{R}\times \mathbb{R}\to \mathbb{R}\) denote the Matérn covariance function with smoothness \(\nu\) and lengthscale \(\ell\). For the purposes of this paper, we will define \(k_{\nu,\ell}\) as: \[k_{\nu,\ell}(t, t') = \kappa(\vert t - t' \vert), \qquad \widehat{\kappa}(\omega) := \left( 1 + \ell^2\omega^2 \right)^{-\nu - 1/2}, \label{eq:matern}\tag{19}\] where \(\widehat{(\cdot)}\) corresponds to the Fourier transform.
Note that 19 corresponds to the standard Matérn kernel definition [7] after appropriately scaling the inputs and outputs. \(\nabla^{(r)}k_{\nu,\ell}(\cdot, \cdot)\) will denote the \(r^\mathrm{th}\) derivative of \(k_{\nu,\ell}\) with respect to its first argument.
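For reference, and up to the input/output scaling noted above, the half-integer members of this family have the standard closed forms below; \(\nu = 1/2\) is the kernel used in our experiments.

```python
# Standard closed forms of the Matern kernel for the half-integer smoothness
# values used in this paper (up to the input/output scaling noted above).
# nu = 1/2 is the kernel used in the experiments; nu = 3/2 is for comparison.
import numpy as np

def matern12(t, tp, ell=10.0, sigma=1.0):
    d = np.abs(t - tp)
    return sigma**2 * np.exp(-d / ell)

def matern32(t, tp, ell=10.0, sigma=1.0):
    d = np.sqrt(3.0) * np.abs(t - tp) / ell
    return sigma**2 * (1.0 + d) * np.exp(-d)

print(matern12(0.0, 1.0), matern32(0.0, 1.0))
```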
Given some interval \([0, T] \subseteq \mathbb{R}\), \(\mathcal{H}^{\nu,\ell}([0, T])\) denotes the RKHS of \([0, T] \to \mathbb{R}\) functions where the reproducing kernel is equal to \(k_{\nu,\ell}\). \(\mathcal{W}^{P,2}([0, T])\) denotes the Sobolev-\(P,2\) space of \([0, T] \to \mathbb{R}\) functions. Whenever possible we will drop the superscripts for \(\mathcal{H}^{\nu,\ell}([0, T])\).
Theorem 1 (Equivalence of \(\mathcal{W}^{P,2}\) and \(\mathcal{H}^{\nu,\ell}\)). For any positive integer \(P\), any \(\ell > 0\), and \(\nu = P - 1/2\), there exist positive constants \(C_1(\nu, \ell)\) and \(C_2(\nu, \ell)\) so that \[C_1(\nu, \ell) \Vert w \Vert_{\mathcal{W}^{P,2}([0, T])} \leq \Vert w \Vert_{\mathcal{H}^{\nu,\ell}([0, T])} \leq C_2(\nu, \ell) \Vert w \Vert_{\mathcal{W}^{P,2}([0, T])}\] for all \(w \in \mathcal{W}^{P,2}\). In other words, the Sobolev norm \(\Vert \cdot \Vert_{\mathcal{W}^{P,2}([0,T])}\) and the RKHS norm \(\Vert \cdot \Vert_{\mathcal{H}^{\nu,\ell}([0, T])}\) are equivalent, and thus \(\mathcal{W}^{P,2}([0, T]) = \mathcal{H}^{\nu,\ell}([0, T])\).
Proof. We begin by first establishing an equivalence between \(\mathcal{W}^{P,2}(\mathbb{R})\) and \(\mathcal{H}^{\ell,\nu}(\mathbb{R})\). In the Fourier domain, the Sobolev norm for \(\mathbb{R}\to \mathbb{R}\) functions is given by \[\Vert w \Vert_{\mathcal{W}^{P,2}(\mathbb{R})} = \left\Vert \widehat w(\cdot) \left( 1 + ( \cdot )^2 \right)^{P/2} \right\Vert_{L_2(\mathbb{R})},\] and, for any RKHS \(\mathcal{H}(\mathbb{R})\) with stationary reproducing kernels, the RKHS norm for \(\mathbb{R}\to \mathbb{R}\) functions is given by \[\Vert w \Vert_{\mathcal{H}(\mathbb{R})} = \left\Vert \widehat w(\cdot) \left( \widehat{\kappa}(\cdot) \right)^{-1/2} \right\Vert_{L_2(\mathbb{R})}.\] The Matérn kernel, as defined in 19 , has a Fourier transform that decays at a rate of \((1 + \vert\cdot\vert^2)^{-P}\) and thus \(\Vert \cdot \Vert_{\mathcal{H}^{\nu,\ell}(\mathbb{R})}\) is bounded above and below by a constant multiple of \(\Vert \cdot \Vert_{\mathcal{W}^{P,2}(\mathbb{R})}\) where the constant only depends on \(\ell\). This argument can be generalized to prove an equivalence between \(\mathcal{W}^{P,2}([0, T])\) and \(\mathcal{H}^{\ell,\nu}([0, T])\), as the domain \([0, T]\) trivially has a Lipschitz boundary. See [46] for details. ◻
Corollary 1 (Equality of \(\mathcal{W}^{P,2}\) and \(\mathcal{H}^{\nu,\ell}\) with specific \(\nu,\ell\) values). For any \(P\), there exists a value of \(\ell\) so that, for all \(w \in \mathcal{W}^{P,2}([0, T])\): \[\Vert w \Vert_{\mathcal{W}^{P,2}([0, T])} = \Vert w \Vert_{\mathcal{H}^{P - 1/2,\ell}([0, T])}.\]
Restatement of Theorem [thm:bound-norms].
Proof. Let \(k(\cdot,\cdot)\) be the kernel associated with the RKHS \(\mathcal{H}\), where \(\sup_{t}k(t,t) \leq K < \infty\). For \(f,g \in \mathcal{H}\), by the reproducing property, \[f(t) = \langle f,k(t,\cdot) \rangle_{\mathcal{H}} \quad \text{and} \quad g(t) = \langle g,k(t,\cdot)\rangle _{\mathcal{H}}.\label{eq:f-g-rep}\tag{20}\] Substituting 20 , applying the Cauchy-Schwarz inequality, and then using the kernel bound gives \[\begin{align} |f(t) g(t)| &\leq |f(t)| |g(t)|\leq ||f||_{\mathcal{H}} ||k(t,\cdot)||_{\mathcal{H}}\, ||g||_{\mathcal{H}} ||k(t,\cdot)||_{\mathcal{H}}\\ &\leq K ||f||_{\mathcal{H}} ||g||_{\mathcal{H}} \intertext{Then, by the Arithmetic-Geometric Mean Inequality,} &\leq \frac{K}{2}\left( ||f||_{\mathcal{H}}^2 + ||g||_{\mathcal{H}}^2 \right) \intertext{By \ref{propo:A1}, there exists a constant $K'$ such that} &\frac{K}{2}\left( ||f||_{\mathcal{H}}^2 + ||g||_{\mathcal{H}}^2 \right)\leq K'\left( ||f||_{\mathcal{W}^{1,2}([0, T])}^2 + ||g||_{\mathcal{W}^{1,2}([0, T])}^2 \right)\\ \intertext{From the Sobolev embedding theorem,} & K'\left( ||f||_{\mathcal{W}^{1,2}([0, T])}^2 + ||g||_{\mathcal{W}^{1,2}([0, T])}^2 \right) \leq K'\left( ||\dot{f}||_{\mathcal{W}^{1,2}([0, T])}^2 + ||\dot{g}||_{\mathcal{W}^{1,2}([0, T])}^2 \right)\\ \intertext{Finally, from \ref{propo:A1}, there exists some $D$ such that} &\leq D \left( \left\Vert \dot{f}\right\Vert_{\mathcal{H}}^2 + \left\Vert \dot{g} \right\Vert_{\mathcal{H}}^2 \right). \end{align}\] ◻
Restatement of Theorem [thm:consistency].
Throughout this proof, we drop the domain \([0, T]\) from \(\mathcal{W}\) for brevity. For notational simplicity, we write \(\dot{{\boldsymbol{\mu}}}(t) = r {\boldsymbol{\mu}}(t) - {\boldsymbol{\mu}}(t) \odot {\boldsymbol{G}}({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t))\equiv {\boldsymbol{B}}({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t))\). It is straightforward to see that \({\boldsymbol{B}}\) is Lipschitz with a Lipschitz first derivative if \({\boldsymbol{G}}\) is.
We break up this proof into a series of lemmas.
Lemma 1. For every \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}\), the functions \({x}^{(1)}, \ldots, {x}^{({M})}\), \(\mu^{(1)}, \ldots, \mu^{({M})}\), and \({y}^{(1)}, \ldots, {y}^{({P})}\), are all elements of \(\mathcal{W}^{2,2}\).
Proof. First, we note that 1 allows us to rewrite the DAE as an ODE. Specifically, 4 can be rewritten as an equation for \(\dot{\boldsymbol{y}}(t)\) by differentiating both sides with respect to \(t\): \[\begin{align} {\boldsymbol{0}}&= \nabla_{{\boldsymbol{x}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) \dot{\boldsymbol{x}}(t) + \nabla_{{\boldsymbol{\mu}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) \dot{\boldsymbol{\mu}}(t) + \nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) \dot{\boldsymbol{y}}(t) \\ &= \nabla_{{\boldsymbol{x}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) {\boldsymbol{F}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) + \nabla_{{\boldsymbol{\mu}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) {\boldsymbol{B}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \\ &\:\:\: + \nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) \dot{\boldsymbol{y}}(t). \end{align}\] By assumption, \(\nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}})\) is non-singular on all relevant trajectories, so we have that \[\begin{align} \dot{\boldsymbol{y}}(t) &= -\nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}})^{-1} \Bigl( \nabla_{{\boldsymbol{x}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) {\boldsymbol{F}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \\ &\qquad\qquad\qquad\qquad+ \nabla_{{\boldsymbol{\mu}}}{\boldsymbol{H}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}) {\boldsymbol{B}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \Bigr) \\ & =: {\boldsymbol{C}}({\boldsymbol{x}},{\boldsymbol{\mu}},{\boldsymbol{y}}). \end{align}\] By Lipschitz continuity of \({\boldsymbol{F}}\), \({\boldsymbol{B}}\), the derivative of \({\boldsymbol{H}}\), the inverse of \(\nabla_{{\boldsymbol{y}}}{\boldsymbol{H}}\), and all their respective derivatives, we have that \({\boldsymbol{C}}\) is Lipschitz continuous with a Lipschitz first derivative over the interval \([0, T]\). (A numerical sketch of this index-reduction step is given after the proof.)
We can thus rewrite the DAE as the following ODE system: \[\dot{\boldsymbol{x}}(t) = {\boldsymbol{F}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)), \quad \dot{\boldsymbol{\mu}}(t) = {\boldsymbol{B}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)), \quad \dot{\boldsymbol{y}}(t) = {\boldsymbol{C}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)),\] where \({\boldsymbol{F}}, {\boldsymbol{B}}, {\boldsymbol{C}}\) are all Lipschitz continuous with Lipschitz first derivatives. By Lipschitz continuity of \({\boldsymbol{F}}\), \({\boldsymbol{B}}\), \({\boldsymbol{C}}\) and the Cauchy-Lipschitz-Picard theorem [47], every \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}\) is continuous and continuously differentiable over \([0, \infty)\). This fact implies that \({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}\) and \(\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}\) are bounded over \([0, T]\). Now consider \(\ddot{\boldsymbol{x}}\), \(\ddot{\boldsymbol{\mu}}\), \(\ddot{\boldsymbol{y}}\): \[\begin{align} \ddot{{\boldsymbol{x}}}(t) &= \dot{\boldsymbol{F}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \\ \ddot{{\boldsymbol{\mu}}}(t) &= \dot{\boldsymbol{B}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \\ \ddot{{\boldsymbol{y}}}(t) &= \dot{\boldsymbol{C}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)). \end{align}\] Because \({\boldsymbol{F}}\), \({\boldsymbol{B}}\), and \({\boldsymbol{C}}\) admit Lipschitz first derivatives, we have that \(\ddot {\boldsymbol{x}}\), \(\ddot {\boldsymbol{\mu}}\), and \(\ddot {\boldsymbol{y}}\) are also bounded over \([0, T]\). Since bounded functions are trivially square integrable, we have that
\(\Vert {x}^{(1)} \Vert_{\mathcal{W}^{2,2}}, \ldots, \Vert {x}^{({M})} \Vert_{\mathcal{W}^{2,2}} < \infty\),
\(\Vert \mu^{(1)} \Vert_{\mathcal{W}^{2,2}}, \ldots, \Vert \mu^{({M})} \Vert_{\mathcal{W}^{2,2}} < \infty\), and
\(\Vert {y}^{(1)} \Vert_{\mathcal{W}^{2,2}}, \ldots, \Vert {y}^{({P})} \Vert_{\mathcal{W}^{2,2}} < \infty\).
◻
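As referenced in the proof above, the index-reduction step, differentiating the algebraic equation and solving for \(\dot{\boldsymbol{y}}(t)\), can also be carried out numerically with automatic differentiation. A minimal sketch, assuming user-supplied functions \({\boldsymbol{F}}\), \({\boldsymbol{B}}\), \({\boldsymbol{H}}\) that take and return arrays (all names below are illustrative, not the paper's code):

```python
import jax
import jax.numpy as jnp

def make_ydot(F, B, H):
    """Build C(x, mu, y) = -(grad_y H)^{-1} (grad_x H @ F + grad_mu H @ B),
    the ODE for the jump variables implied by differentiating H = 0."""
    def C(x, mu, y):
        Jx = jax.jacobian(H, argnums=0)(x, mu, y)   # shape (P, M)
        Jmu = jax.jacobian(H, argnums=1)(x, mu, y)  # shape (P, M)
        Jy = jax.jacobian(H, argnums=2)(x, mu, y)   # shape (P, P), assumed non-singular
        rhs = Jx @ F(x, mu, y) + Jmu @ B(x, mu, y)
        return -jnp.linalg.solve(Jy, rhs)
    return C
```

In practice one might prefer a hand-derived expression for \(\dot{\boldsymbol{y}}\) when it is available in closed form; the automatic-differentiation route is simply a convenience for larger systems.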
Lemma 2. Let \(M'\) be the total dimension of the DAE (i.e., \(M' = 2 {M}+ {P}\)), and define \(\mathcal{W}^{2,2}_{M'}\) as the Hilbert space of functions \(({\boldsymbol{x}}(\cdot), {\boldsymbol{\mu}}(\cdot), {\boldsymbol{y}}(\cdot)): [0, T] \to \mathbb{R}^{M'}\) equipped with the norm: \[\Vert ({\boldsymbol{x}}(\cdot), {\boldsymbol{\mu}}(\cdot), {\boldsymbol{y}}(\cdot)) \Vert_{\mathcal{W}^{2,2}_{M'}}^2 = \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert {x}^{(m)} \Vert^2_{\mathcal{W}^{2,2}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert {\mu}^{(m)} \Vert^2_{\mathcal{W}^{2,2}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert {y}^{(m)} \Vert^2_{\mathcal{W}^{2,2}}.\] Then the set \({\mathbb{S}}\) is closed in \(\mathcal{W}^{2,2}_{M'}\).
Proof. Consider a Cauchy sequence \(({\boldsymbol{x}}_n, {\boldsymbol{\mu}}_n, {\boldsymbol{y}}_n) \in {\mathbb{S}}\). By 1, we have that \({x}_n^{(m)} \in \mathcal{W}^{2,2}\) for all \(m \in [1, {M}]\). By completeness of \(\mathcal{W}^{2,2}\), we have that \({x}_n^{(m)} \to {x}^{(m)}\) for some \({x}^{(m)} \in \mathcal{W}^{2,2}\); i.e. for every \(\epsilon\), there exists some \(n'\) such that \(\Vert {x}_{n'}^{(m)} - {x}^{(m)} \Vert_{\mathcal{W}^{2,2}} < \epsilon\). \[\begin{align} \sup_{t \in [0, T]} \left\vert {x}^{(m)}_{n'}(t) - {x}^{(m)}(t) \right\vert &= \sup_{t \in [0, T]} \left\vert \left\langle k(t, \cdot), {x}^{(m)}_{n'} - {x}^{(m)} \right\rangle_{\mathcal{W}^{2,2}} \right\vert \\ &\leq \sup_{t \in [0, T]} \Vert k(t, \cdot) \Vert_{\mathcal{W}^{2,2}} \: \Vert {x}^{(m)}_{n'} - {x}^{(m)} \Vert_{\mathcal{W}^{2,2}} \\ &\leq C \Vert {x}^{(m)}_{n'} - {x}^{(m)} \Vert_{\mathcal{W}^{2,2}} < C \epsilon, \end{align}\] where \(k(\cdot, \cdot)\) is the reproducing kernel associated with \(\mathcal{W}^{2,2}\) and \(C\) is some universal constant. The penultimate inequality comes from the fact that \(\mathcal{W}^{2,2}\) is equivalent to a Matérn RKHS, which has a bounded-everywhere reproducing kernel. Thus, \({x}^{(m)}_{n'}\) converges uniformly to \({x}^{(m)}\), and so \[\begin{align} \dot{\boldsymbol{x}}(t) = \lim_{n\to\infty} \dot{\boldsymbol{x}}_n(t) &= \lim_{n\to\infty} {\boldsymbol{F}}({\boldsymbol{x}}_n(t), {\boldsymbol{\mu}}_n(t), {\boldsymbol{y}}_n(t)) \\ &= {\boldsymbol{F}}(\lim_{n\to\infty} {\boldsymbol{x}}_n(t), \lim_{n\to\infty} {\boldsymbol{\mu}}_n(t), \lim_{n\to\infty} {\boldsymbol{y}}_n(t)) = {\boldsymbol{F}}({\boldsymbol{x}}(t), {\boldsymbol{\mu}}(t), {\boldsymbol{y}}(t)) \end{align}\] where the penultimate equality comes from the continuity of \({\boldsymbol{F}}\). Analogous results hold for \({\boldsymbol{\mu}}_n\) and \({\boldsymbol{y}}_n\). Moreover, by uniform convergence we have that \(\Vert {\boldsymbol{\mu}}(0) - {\boldsymbol{\mu}}_{n}(0) \Vert_\infty\) and \(\Vert {\boldsymbol{y}}(0) - {\boldsymbol{y}}_{n}(0) \Vert_\infty\) are arbitrarily small, and thus \(\Vert {\boldsymbol{\mu}}(0) \Vert_\infty, \Vert {\boldsymbol{y}}(0) \Vert_\infty < K\). Therefore \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}\). ◻
Lemma 3. Denote \(M'\) and \(\mathcal{W}^{2,2}_{M'}\) as in 2. Let \(\mathcal{H}_{M'}\) be the Hilbert space of functions \((\dot{\boldsymbol{x}}(\cdot), \dot{\boldsymbol{\mu}}(\cdot), \dot{\boldsymbol{y}}(\cdot)): [0, T] \to \mathbb{R}^{M'}\) equipped with the norm \[\Vert (\dot{\boldsymbol{x}}(\cdot), \dot{\boldsymbol{\mu}}(\cdot), \dot{\boldsymbol{y}}(\cdot)) \Vert_{\mathcal{H}_{M'}}^2 = \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{x}^{(m)} \Vert^2_{\mathcal{H}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{\mu}^{(m)} \Vert^2_{\mathcal{H}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert \dot{y}^{(m)} \Vert^2_{\mathcal{H}},\] where \(\mathcal{H}\) is the RKHS associated with the Matérn-\(1/2\) kernel with some lengthscale \(0 < \ell < \infty\). Let \[B := \inf_{({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}} \Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}}.\] Then there exists some \(({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*) \in {\mathbb{S}}\) that achieves this infimum.
Proof. By [prop:sobolev_matern], \(\mathcal{W}^{1,2}\) is equivalent to the Matérn-\(1/2\) RKHS \(\mathcal{H}\). Note that the elements of \(\dot{\boldsymbol{w}}(\cdot)\) for any \({\boldsymbol{w}}(\cdot) \in \mathcal{W}^{2,2}_{M'}\) are \(\mathcal{W}^{1,2}\) functions, and thus \[{\boldsymbol{w}}(\cdot) \in \mathcal{W}^{2,2}_{M'} \:\: \Longrightarrow \:\: \dot{\boldsymbol{w}}(\cdot) \in \mathcal{H}_{M'}.\]
Define the operator \(D: \mathcal{W}^{2,2}_{M'} \to \mathcal{H}_{M'}\) as \(D({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) = (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}})\), where here \(\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}\) denote weak derivatives. Note that \(D\) is a surjective and bounded linear operator between two Hilbert spaces, and that the nullspace of \(D\) (i.e. the set of constant functions) is a closed set. Since \({\mathbb{S}}\subset \mathcal{W}^{2,2}_{M'}\) is a closed subset of a Hilbert space (2), \(\dot{\mathbb{S}}:= D({\mathbb{S}}) \subset \mathcal{H}_{M'}\) is also a closed subset of a Hilbert space [47]. By the existence portion of the Hilbert projection theorem, there exists a (potentially non-unique) minimum \(\mathcal{H}_{M'}\)-norm element of \(\dot{\mathbb{S}}\); i.e. there exists some \((\dot{\boldsymbol{x}}^*, \dot{\boldsymbol{\mu}}^*, \dot{\boldsymbol{y}}^*) \in \dot{\mathbb{S}}\) such that \(\Vert (\dot{\boldsymbol{x}}^*, \dot{\boldsymbol{\mu}}^*, \dot{\boldsymbol{y}}^*) \Vert_{\mathcal{H}_{M'}} = \inf_{(\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \in \dot{\mathbb{S}}} \Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}}\). We conclude the proof by setting \(({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*)\) to be some element in \({\mathbb{S}}\) such that \((\dot{\boldsymbol{x}}^{*}, \dot{\boldsymbol{\mu}}^{*}, \dot{\boldsymbol{y}}^{*}) = D({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*)\). ◻
Lemma 4. For any \(0 < C < \infty\), define the sets \[\begin{gather} {\mathbb{F}}^C := \{ w \in \mathcal{H}\: : \: \Vert w \Vert_{\mathcal{H}} \leq C \} \\ \mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C := \{ \mathop{\mathrm{{\textstyle \int}}}_{0}^{(\cdot)} w(\tau) d\tau \: : \: w \in {\mathbb{F}}^C\}. \end{gather}\] Denoting \(\widehat\mathcal{R}_N\) as the empirical Rademacher complexity for some dataset \(t_1, \ldots, t_N \in [0, T]\), we have that \[\widehat\mathcal{R}_N({\mathbb{F}}^C) \lesssim C N^{-1/2}, \qquad \widehat\mathcal{R}_N(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C) \lesssim T C N^{-1/2}.\]
Proof. The bound \(\widehat\mathcal{R}_N({\mathbb{F}}^C) \lesssim C N^{-1/2}\) follows from a standard result for reproducing kernel Hilbert spaces, using the fact that \(\mathcal{H}\) (the Matérn-\(1/2\) RKHS) has a bounded-everywhere reproducing kernel. Bounding the Rademacher complexity of \(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C\) mirrors the standard proof of the \(\widehat\mathcal{R}_N({\mathbb{F}}^C)\) bound. With \(\epsilon_1, \ldots, \epsilon_N \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Rad}\), \[\begin{align} \widehat\mathcal{R}(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C) &:= \mathbb{E}_{\epsilon_i} \left[ \sup_{w \in {\mathbb{F}}^C} \frac{1}{N} \sum_{i=1}^N \epsilon_i \int_{0}^{t_i} w(\tau) d\tau \right] \\ &= \mathbb{E}_{\epsilon_i} \left[ \sup_{w \in {\mathbb{F}}^C} \left\langle \frac{1}{N} \sum_{i=1}^N \epsilon_i \int_{0}^{t_i} k(\tau, \cdot) d\tau, \: w(\cdot) \right\rangle_{\mathcal{H}} \right] \\ &\leq \mathbb{E}_{\epsilon_i} \left[ \left\Vert \frac{C}{N} \sum_{i=1}^N \epsilon_i \int_{0}^{t_i} k(\tau, \cdot) d\tau \right\Vert_{\mathcal{H}} \right]\\ &\leq \sqrt{ \mathbb{E}_{\epsilon_i} \left[ \left\Vert \frac{C}{N} \sum_{i=1}^N \epsilon_i \int_{0}^{t_i} k(\tau, \cdot) d\tau \right\Vert_{\mathcal{H}}^2 \right] }\\ &= \sqrt{ \mathbb{E}_{\epsilon_i} \left[ \frac{C^2}{N^2} \sum_{i,j=1}^N \epsilon_i \epsilon_j \left\langle \int_{0}^{t_i} k(\tau, \cdot) d\tau, \: \int_{0}^{t_j} k(\tau, \cdot) d\tau \right\rangle_{\mathcal{H}} \right] } \\ &= \sqrt{ \frac{C^2}{N^2} \sum_{i}^N \left\Vert \int_{0}^{t_i} k(\tau, \cdot) d\tau \right\Vert_{\mathcal{H}}^2 }\\ &\leq \sqrt{ \frac{C^2}{N^2} \sum_{i}^N \left( \int_{0}^{t_i} \left\Vert k(\tau, \cdot) \right\Vert_{\mathcal{H}} d\tau \right)^2 }\\ &\leq \sqrt{ \frac{C^2}{N^2} \sum_{i}^N \left( \int_{0}^{T} \sup_{t \in [0, T]} \left\Vert k(t, \cdot) \right\Vert_{\mathcal{H}} d\tau \right)^2 } \\ &= TC N^{-1/2} \sup_{t \in [0, T]} \left\Vert k(t, \cdot) \right\Vert_{\mathcal{H}}. \end{align}\] Recognizing that \(k(t, \cdot)\) is a bounded-everywhere reproducing kernel completes the proof. ◻
Lemma 5. Define \(\mathcal{H}_{M'}\) as in 3. For any \(0 < C < \infty\), define the sets \[\begin{gather} {\mathbb{F}}^C_{M'} := \left\{ (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \in \mathcal{H}_{M'} \: : \: \Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} \leq C \right\} \\ \mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C_{M'} := \left\{ \mathop{\mathrm{{\textstyle \int}}}_{0}^{(\cdot)} (\dot{\boldsymbol{x}}(\tau), \dot{\boldsymbol{\mu}}(\tau), \dot{\boldsymbol{y}}(\tau)) d\tau \: : \: (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \in {\mathbb{F}}^C_{M'}\right\}. \end{gather}\] Then \(\widehat\mathcal{R}_N({\mathbb{F}}^C_{M'}) \lesssim C M N^{-1/2}\) and \(\widehat\mathcal{R}_N(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C_{M'}) \lesssim T M C N^{-1/2}\).
Proof. The proof follows a standard summation argument for Rademacher complexity. With \(\epsilon_{ij_x}, \epsilon_{ij_\mu}, \epsilon_{ij_y} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Rad}\), \[\begin{align} &\widehat\mathcal{R}({\mathbb{F}}^C_{M'}) \\ :=& \mathbb{E}\left[ \sup_{(\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \in {\mathbb{F}}^C_{M'}} \frac{1}{N} \sum_{i=1}^N \left( \sum_{j_x=1}^{{M}} \epsilon_{ij_x} \dot{x}^{(j_x)}(t_i) + \sum_{j_\mu=1}^{{M}} \epsilon_{ij_\mu} \dot{\mu}^{(j_\mu)}(t_i) + \sum_{j_y=1}^{{P}} \epsilon_{ij_y} \dot{y}^{(j_y)}(t_i) \right) \right] \\ \leq& \sum_{j_x=1}^{{M}} \mathbb{E}\left[ \sup_{\Vert \dot{x}^{(j_x)} \Vert_{\mathcal{H}} \leq C} \frac{1}{N} \sum_{i=1}^N \epsilon_{ij_x} \dot{x}^{(j_x)}(t_i) \right] + \sum_{j_\mu=1}^{{M}} \mathbb{E}\left[ \sup_{\Vert \dot{\mu}^{(j_\mu)} \Vert_{\mathcal{H}} \leq C} \frac{1}{N} \sum_{i=1}^N \epsilon_{ij_\mu} \dot{\mu}^{(j_\mu)}(t_i) \right] \\ &+ \sum_{j_y=1}^{{P}} \mathbb{E}\left[ \sup_{\Vert \dot{y}^{(j_y)} \Vert_{\mathcal{H}} \leq C} \frac{1}{N} \sum_{i=1}^N \epsilon_{ij_y} \dot{y}^{(j_y)}(t_i) \right] \\ \lesssim& \: C M N^{-1/2}, \end{align}\] where the last inequality comes from the fact that \(\Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} \leq C\) implies that \(\Vert \dot{x}^{(m)} \Vert_{\mathcal{H}}\), \(\Vert \dot{\mu}^{(m)} \Vert_{\mathcal{H}}\), \(\Vert \dot{y}^{(m)} \Vert_{\mathcal{H}} \leq C\) for all \(m\), combined with 4. An analogous proof holds for \(\widehat\mathcal{R}(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C_{M'})\). ◻
Lemma 6. Denote \(\mathcal{H}_{M'}\) as in 3. For any \((\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \in \mathcal{H}_{M'}\), \(\hat{\boldsymbol{\mu}}_0 \in \mathbb{R}^{{M}}\), \(\hat{\boldsymbol{y}}_0 \in \mathbb{R}^{{P}}\), define the differential equation error function \[\begin{gather} e_{\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\cdot) := \left\Vert \begin{bmatrix} \dot{\boldsymbol{x}}(\cdot) - {\boldsymbol{F}}\left( {\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}\right) \\ \dot{\boldsymbol{\mu}}(\cdot) - {\boldsymbol{B}}\left( {\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}\right) \\ {\boldsymbol{H}}\left( {\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}\right) \end{bmatrix} \right\Vert_\infty \\ {\boldsymbol{x}}(t) := {\boldsymbol{x}}_0 + \int_0^t \dot{\boldsymbol{x}}(\tau) d\tau, \quad {\boldsymbol{\mu}}(t) := \hat{\boldsymbol{\mu}}_0 + \int_0^t \dot{\boldsymbol{\mu}}(\tau) d\tau, \quad {\boldsymbol{y}}(t) := \hat{\boldsymbol{y}}_0 + \int_0^t \dot{\boldsymbol{y}}(\tau) d\tau. \end{gather} \label{eqn:diffeq_err_fn}\tag{21}\] For any \(0 < C, K < \infty\), define \({\mathbb{G}}^{C,K}\) as the set of error functions \[\left\{ e_{\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\cdot) \: : \: \Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} \leq C, \Vert \hat{\boldsymbol{\mu}}_0 \Vert_\infty \leq K, \Vert \hat{\boldsymbol{y}}_0 \Vert_\infty \leq K \right\}.\] Then every error function in \({\mathbb{G}}^{C,K}\) is bounded by some constant \(\tilde{C}\) and \(\widehat\mathcal{R}({\mathbb{G}}^{C,K}) \lesssim C M N^{-1/2}\).
Proof. Note that each error function in \({\mathbb{G}}^{C,K}\) is a Lipschitz function (the \(\Vert \cdot \Vert_\infty\) norm) applied to the sum of two sub-functions:
a \((\dot{\boldsymbol{x}}(\cdot), \dot{\boldsymbol{\mu}}(\cdot), \dot{\boldsymbol{y}}(\cdot)) \in \mathcal{H}_{M'}\) with norm less than \(C\) (i.e. an element of \({\mathbb{F}}^C_{M'}\), as defined in 5), and
Lipschitz functions (\({\boldsymbol{F}}\), \({\boldsymbol{B}}\), \({\boldsymbol{H}}\)) applied to the integral of a vector-valued RKHS function (\(\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}\)) with norm less than \(C\) (i.e. a Lipschitz function applied to an element of \(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C_{M'}\), as defined in 5).
Boundedness of the error functions follows from the fact that \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}})\) is bounded, \(\hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0\) are bounded, and \({\boldsymbol{F}}\), \({\boldsymbol{B}}\), \({\boldsymbol{H}}\) are continuous (and thus bounded over \([0, T]\)). The Rademacher complexity bound follows from standard Lipschitz and summation rules: \(\widehat\mathcal{R}({\mathbb{G}}^{C,K}) \lesssim \widehat\mathcal{R}({\mathbb{F}}^C_{M'}) + \widehat\mathcal{R}(\mathop{\mathrm{{\textstyle \int}}}{\mathbb{F}}^C_{M'}) \lesssim C M N^{-1/2}.\) ◻
Now we are ready to prove [thm:consistency].
Proof of [thm:consistency]. We begin by noting that \[\begin{align} B :=& \inf_{({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}} \left( \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{{x}}^{(m)} \Vert^2_{\mathcal{H}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{M}}\, \Vert \dot{{\mu}}^{(m)} \Vert^2_{\mathcal{H}} + \mathop{\mathrm{{\textstyle \sum}}}_{m=1}^{{P}}\, \Vert \dot{{y}}^{(m)} \Vert^2_{\mathcal{H}} \right)^{1/2} \\ =& \inf_{({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}}) \in {\mathbb{S}}} \Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} < \infty, \end{align}\] where finiteness is implied by 1. Let \(({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*)\) be some element in \({\mathbb{S}}\) that achieves this infimum (the existence of which is guaranteed by 3). We know that \(\Vert (\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N) \Vert_{\mathcal{H}_{M'}} \leq \Vert (\dot{\boldsymbol{x}}^*, \dot{\boldsymbol{\mu}}^*, \dot{\boldsymbol{y}}^*) \Vert_{\mathcal{H}_{M'}} = B\), since \(({\boldsymbol{x}}^*, {\boldsymbol{\mu}}^*, {\boldsymbol{y}}^*)\) satisfies the constraints of 13 , and thus \((\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N) \in {\mathbb{F}}^B_{M'}\) (as defined by 5).
Defining \(e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\cdot)\) as in 6, we have that \(e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(t_i) = 0\) for each \(t_i\) in \(\mathcal{D}\). Applying a standard uniform law of large numbers argument [40], we have that, for any \(\delta > 0\), \[\begin{align} \int_0^T e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\tau) d\tau &= \left\vert \frac{1}{N} \sum_{i=1}^N e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(t_i) - \int_0^T e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\tau) d\tau \right\vert \\ &\leq 2\mathbb{E}_{t_i} \left[ \widehat\mathcal{R}({\mathbb{G}}^{B,K}) \right] + \delta \end{align}\] with probability at least \(1 - 2\exp(-\frac{N\delta^2}{8 \tilde{C}^2})\) (where \(\tilde{C}\) is the constant defined in 6). Since \(\widehat\mathcal{R}({\mathbb{G}}^{B,K}) \lesssim B M N^{-1/2}\) (6), we have that \(\int_0^T e_{\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N, \hat{\boldsymbol{\mu}}_0, \hat{\boldsymbol{y}}_0}(\tau) d\tau \overset{\mathrm{a.s.}}{\longrightarrow} 0\), which implies that \(\lim_{N\to\infty} ({\hat{\boldsymbol{x}}}_N, \hat{\boldsymbol{\mu}}_N, {\hat{\boldsymbol{y}}}_N)\) satisfies the differential equation almost everywhere.
Define \((\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}})\) as the continuously-differentiable representative of the function \(\lim_{N\to\infty} (\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N)\) ([47]). Since the corresponding trajectory \(({\boldsymbol{x}}, {\boldsymbol{\mu}}, {\boldsymbol{y}})\) is continuously differentiable and satisfies the differential equation everywhere, it must also be an element of \({\mathbb{S}}\). All together, this implies that \(\Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} \geq B\), and so \(\Vert (\dot{\boldsymbol{x}}, \dot{\boldsymbol{\mu}}, \dot{\boldsymbol{y}}) \Vert_{\mathcal{H}_{M'}} = \Vert \lim_{N\to\infty} (\dot{\hat{\boldsymbol{x}}}_N, \dot{\hat{\boldsymbol{\mu}}}_N, \dot{\hat{\boldsymbol{y}}}_N) \Vert_{\mathcal{H}_{M'}} = B\). ◻
Supplement to “Solving Models of Economic Dynamics with Ridgeless Kernel Regressions”
Mahdi Ebrahimi Kahou1, Jesse Perla2, Geoff Pleiss3,4
1Department of Economics, Bowdoin College.
2Vancouver School of Economics, University of British Columbia.
3Department of Statistics, University of British Columbia.
4Vector Institute.
This supplement contains the details of the robustness checks on our algorithm and provides additional applications alongside those presented in the paper.
This section provides robustness checks and an exploration of sample efficiency for the neoclassical growth model.
Since the number of parameters in the optimization grows linearly with the number of grid points, the cardinality of the training set, \(\mathcal{D}\), becomes an impediment in higher dimensions with many state and jump variables. Therefore, obtaining accurate approximate solutions with sparse training data is crucial.
4 shows the result of the neoclassical growth model (i.e., 6 7 8 of the main paper) solved with \(\mathcal{D} := \{0, 1, 3, 5, 10, 15, 20, 25, 30, 35, 38, 40\}\). The top-left panel shows the approximate and benchmark capital paths, denoted by \(\hat{x}(t)\) and \(x(t)\), respectively. The top-right panel shows the relative errors between the approximate and benchmark solutions for capital, denoted by \(\varepsilon_x(t)\). The bottom-left panel shows the approximate and benchmark consumption paths, denoted by \(\hat{y}(t)\) and \(y(t)\), respectively. The bottom-right panel shows the relative errors between the approximate and benchmark solutions for consumption, denoted by \(\varepsilon_y(t)\).
These results show that one can obtain very accurate approximate solutions, even with very sparse training data.
[tab:robustness] shows the result of the approximate solution of the neoclassical growth model, described in 6 7 8 of the main paper, for different Matérn kernels and kernel parameters.
The first three rows report the performance of the approximate solutions using three different kernels. We present the maximum and minimum absolute values of the relative errors for both the capital path, \(\hat{x}(t)\), and the consumption path, \(\hat{y}(t)\). The first row shows the baseline solution using the Matérn kernel with \(\nu = \frac{1}{2}\). The second and third rows present results for the Matérn kernels with \(\nu = \frac{3}{2}\) and \(\nu = \frac{5}{2}\), respectively.
| \(\nu\) | \(\ell\) | Max of Rel. Error: \(\hat{x}(t)\) | Max of Rel. Error: \(\hat{y}(t)\) | Min of Rel. Error: \(\hat{x}(t)\) | Min of Rel. Error: \(\hat{y}(t)\) |
|---|---|---|---|---|---|
| \(1/2\) | 10 | 1.8e-03 | 2.9e-03 | 7.1e-05 | 9.6e-07 |
| \(3/2\) | 10 | 5.9e-04 | 3.0e-02 | 1.5e-05 | 3.4e-06 |
| \(5/2\) | 10 | 1.4e-04 | 2.4e-02 | 2.7e-05 | 6.0e-08 |
| \(1/2\) | 2 | 3.1e-03 | 2.8e-03 | 1.3e-07 | 5.5e-07 |
| \(1/2\) | 20 | 1.9e-03 | 8.2e-02 | 7.6e-05 | 2.9e-05 |
The last two rows show the performance of the approximate solutions for two different length scales, \(\ell = 2\), and \(\ell = 20\).
Throughout these experiments, we achieve highly accurate approximate solutions. The results therefore demonstrate insensitivity to the choice of Matérn kernel and length scale.
One might suspect that achieving an accurate optimal solution, which does not violate the transversality condition, is only possible if one uses a large time horizon in the training data. For instance, we use \(\mathcal{D}= \{0,1,2,\ldots,30\}\) to obtain the results depicted in 1. In this experiment, we establish that we can still achieve accurate short-run dynamics by using a smaller time horizon.
5 shows the approximate solutions for the neoclassical growth model (i.e., 6 7 8 of the main paper) for training data with a smaller time horizon, defined as \(\mathcal{D}:=\{0,1,2,\cdots,10\}\). The top-left panel shows the approximate and benchmark capital paths, denoted by \(\hat{x}(t)\) and \(x(t)\), respectively. The top-right panel shows the relative errors between the approximate and benchmark solutions for capital, denoted by \(\varepsilon_x(t)\). The bottom-left panel shows the approximate and benchmark consumption paths, denoted by \(\hat{y}(t)\) and \(y(t)\), respectively. The bottom-right panel shows the relative errors between the approximate and benchmark solutions for consumption, denoted by \(\varepsilon_y(t)\).
In this section, we discuss additional applications. In 11.1, we solve a model that incorporates the time evolution of human capital and its interaction with physical capital in economic growth. The model includes seven variables: two state variables along with their co-state variables, and three jump variables. In 11.2, we solve an optimal advertising model based on the work of [48].
In this example, we solve the neoclassical growth model with human and physical capital as illustrated in [29]. The optimal paths for the state variables \({\boldsymbol{x}}(t) := \left[x_k(t),x_h(t)\right]\), co-state variables \({\boldsymbol{\mu}}(t) := \left[\mu_k(t),\mu_h(t)\right]\), and jump variables \({\boldsymbol{y}}(t) := \left[y_c(t), y_{k}(t), y_{h}(t)\right]\) solve
\[\begin{align} \dot{x}_k(t) &= y_k(t) - \delta_k x_k(t),\tag{22}\\ \dot{x}_h(t) &= y_h(t) - \delta_h x_h(t),\tag{23}\\ \dot{\mu}_k(t) &= r \mu_k(t) - \mu_k(t) \bigl[ f_1(x_k(t),\,x_h(t)) - \delta_k \bigr],\tag{24}\\ \dot{\mu}_h(t) &= r \mu_h(t) - \mu_h(t) \bigl[ f_2(x_k(t),\,x_h(t)) - \delta_h \bigr],\tag{25}\\ 0 &= \mu_k(t) y_c(t) - 1,\tag{26}\\ 0 &= \mu_k(t) - \mu_h(t),\tag{27}\\ 0 &= f(x_k(t),\,x_h(t)) - y_c(t) - y_k(t) - y_h(t),\tag{28} \intertext{with two transversality conditions} 0 &= \lim_{t\rightarrow \infty} e^{-rt} x_k(t) \mu_k(t),\tag{29}\\ 0 &= \lim_{t\rightarrow \infty} e^{-rt} x_h(t) \mu_h(t),\tag{30} \end{align}\]
for given initial conditions \(x_k(0) = x_{k_0}\), \(x_h(0) = x_{h_0}\).
The production function is defined as \(f\left(x_k(t),x_h(t)\right) = x_k(t)^{a_k}x_h(t)^{a_h}\). Here, \(f_1\left(\cdot,\cdot\right)\) is the derivative with respect to the first input and \(f_2\left(\cdot,\cdot\right)\) is the derivative with respect to the second input. The two constants in the production function, \(a_k\) and \(a_h\), are positive numbers, such that \(a_k+a_h <1\). Additionally, \(\delta_k>0\), \(\delta_h>0\), and \(r>0\).
Human capital is denoted by \(x_h(t)\), physical capital by \(x_k(t)\), consumption by \(y_c(t)\), investment in human capital by \(y_{h}(t)\), and investment in physical capital by \(y_{k}(t)\). Here \(\mu_k(t)\) and \(\mu_{h}(t)\) are the co-state variables.
This problem is more challenging than the neoclassical growth model introduced in 4 of the main paper because it has a higher dimension and involves two algebraic equations, 27 28 .
In this experiment, we use \(\delta_k = 0.1\), \(\delta_h = 0.05\), \(a_k = \frac{1}{3}\), \(a_h = \frac{1}{4}\), \(r = 0.11\), \(x_{k_0} = 1.5\), and \(x_{h_0} = 1.37\) as the numerical values for the economic parameters. We use \(\mathcal{D}=\{0,1,\cdots,80\}\) as the training data.
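To illustrate how the system 22 –28 can be encoded for a collocation-style solver, the sketch below stacks the DAE residuals at a single point in time using the parameter values above; the function and variable names are ours and purely illustrative, not the paper's code.

```python
import numpy as np

# Parameter values used in this experiment.
delta_k, delta_h, a_k, a_h, r = 0.1, 0.05, 1 / 3, 1 / 4, 0.11

def f(xk, xh):            # production function
    return xk**a_k * xh**a_h

def f1(xk, xh):           # derivative of f with respect to x_k
    return a_k * xk**(a_k - 1) * xh**a_h

def f2(xk, xh):           # derivative of f with respect to x_h
    return a_h * xk**a_k * xh**(a_h - 1)

def dae_residuals(x, mu, y, xdot, mudot):
    """Residuals of equations (22)-(28) at one point in time.
    x = (x_k, x_h), mu = (mu_k, mu_h), y = (y_c, y_k, y_h)."""
    xk, xh = x
    muk, muh = mu
    yc, yk, yh = y
    return np.array([
        xdot[0] - (yk - delta_k * xk),                        # (22)
        xdot[1] - (yh - delta_h * xh),                        # (23)
        mudot[0] - (r * muk - muk * (f1(xk, xh) - delta_k)),  # (24)
        mudot[1] - (r * muh - muh * (f2(xk, xh) - delta_h)),  # (25)
        muk * yc - 1.0,                                       # (26)
        muk - muh,                                            # (27)
        f(xk, xh) - yc - yk - yh,                             # (28)
    ])
```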
6 shows the approximate paths of physical capital (\(\hat{x}_k(t)\)), human capital (\(\hat{x}_h(t)\)), consumption (\(\hat{y}_c(t)\)), investment in physical capital (\(\hat{y}_k(t)\)), and investment in human capital (\(\hat{y}_h(t)\)), along with the co-state variables (\(\hat{\mu}_k(t)\) and \(\hat{\mu}_h(t)\)). The vertical dashed line shows the boundary between the interpolation and extrapolation regions.
This result establishes that even as the dimensionality increases with more complex algebraic equations, the approximate solutions converge to the correct set of steady states. Therefore, the solutions satisfy the transversality conditions.
The correct set of steady states. How do we know the approximate solutions converge to the correct set of steady states? Solutions that violate the transversality conditions are characterized by paths for which \(\displaystyle\lim_{t\rightarrow \infty} e^{-r t}\mu_k(t)x_k(t) \neq 0\) or \(\displaystyle\lim_{t \rightarrow \infty} e^{-r t}\mu_h(t)x_h(t) \neq 0\). Since the approximate solutions shown in 6 do not exhibit this behavior, we can be confident that they converge to the correct set of steady states.
In this example, we solve an optimal advertising model based on the classical model of advertising expenditure introduced in [49]. The optimal paths \({\boldsymbol{x}}(t)\), \({\boldsymbol{y}}(t)\), and \({\boldsymbol{\mu}}(t)\) solve \[\begin{align} \dot{x}(t) & = \left[1-x(t)\right]y(t)-\beta x(t), \tag{31}\\ \dot{\mu}(t) &= r\mu(t) -\gamma + \beta\mu(t)+ \mu(t)y(t), \tag{32}\\ 0 & = y(t)^{\frac{1-\kappa}{\kappa}}- \kappa \mu(t)\left[1-x(t)\right],\tag{33} \end{align}\] for a given initial condition \({\boldsymbol{x}}(0)= {\boldsymbol{x}}_0\), and a transversality condition \[\begin{align} \lim_{t\rightarrow \infty} e^{-r t}x(t) \mu(t) = 0.\label{eq:optima-advertising-TVC} \end{align}\tag{34}\] Here, \(x\) represents the market share of the company, \(y\) corresponds to advertising expenditure, and \(\mu\) is the co-state variable.
The parameter \(\kappa\) is a constant between \(0\) and \(1\), \(\beta\) is strictly positive, \(r\) is the discount rate, the constant \(\gamma\) is defined as \(\gamma := \frac{\beta+r}{c}\), and \(c\) is the cost of advertising. See [48], [50] for a detailed treatment of this problem.
In this example we use \(x_0 = 0.4\), \(r = 0.11\), \(c = 0.5\), \(\beta = 0.05\), \(\kappa = 0.5\), and \(\gamma = \frac{\beta+r}{c} = 0.32\) as the numerical values for the parameters. We use \(\mathcal{D} = \{0,1,\cdots,40\}\) as the training data.
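As with the previous example, the system 31 –33 can be encoded as a stacked residual function for a collocation-style solver; the sketch below uses the parameter values above (the function and variable names are ours and purely illustrative).

```python
import numpy as np

# Parameter values used in this example.
r, c, beta, kappa = 0.11, 0.5, 0.05, 0.5
gamma = (beta + r) / c            # = 0.32
x0 = 0.4

def advertising_residuals(x, mu, y, xdot, mudot):
    """Residuals of equations (31)-(33) at one point in time."""
    return np.array([
        xdot - ((1.0 - x) * y - beta * x),                    # (31)
        mudot - (r * mu - gamma + beta * mu + mu * y),        # (32)
        y**((1.0 - kappa) / kappa) - kappa * mu * (1.0 - x),  # (33)
    ])
```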
Results. 7 shows the approximate market share, denoted by \(\hat{x}(t)\), and the approximate co-state variable, denoted by \(\hat{\mu}(t)\), obtained using a Matérn kernel with \(\nu = \frac{1}{2}\). The vertical dashed line indicates the boundary between the interpolation and extrapolation regions. Despite not applying the transversality condition (34 ), the kernel approximation accurately recovers the optimal solution.
The correct set of steady states. How do we know the approximate solution recovers the optimal solution? As shown in 7, the approximate state and co-state variables approach finite values. Therefore, \(\lim_{t\rightarrow \infty} e^{-r t}\hat{\mu}(t) \hat{x}(t) = 0\).
See [3] and [4] for classic treatments of regularization in ill-posed problems. Our approach is also related to implicit and explicit regularization in deep learning [5].↩︎
See [7] and [8] for more details on kernel methods, Reproducing Kernel Hilbert Spaces, and the Representer Theorems.↩︎
An inherent characteristic of discounted infinite-horizon optimal control problems, whether deterministic or stochastic, is the presence of asymptotic boundary conditions such as transversality conditions. These conditions can be formulated sequentially or recursively in a state space, and in continuous or discrete time (see discussions of necessary conditions in [26]–[28]).↩︎
When present, jump variables constrain the solution manifold and are not matched with boundary or initial values. The connection between the number of jump variables and stability local to a steady state is discussed in [9]. While this paper analyzes convergence for DAEs, the computational methods can be used for systems augmented with inequality constraints in addition to 3 , as well as with differential inclusions in 1 2 .↩︎
The derivation follows the standard planning problem, which maximizes the lifetime discounted utility of consumption, \(\int_0^\infty e^{-rt} u\left(c(t)\right) dt\) with \(u(c) = \log(c)\), subject to the law of motion for capital \(\dot{k}(t) = f\left(k(t)\right) - \delta k(t) - c(t)\), where the production function is \(f(k) = k^a\) with \(0 < a < 1\) and the depreciation rate satisfies \(0 < \delta < 1\). In this case, the present-value Hamiltonian is \(u\left(c(t)\right) + \mu(t)\left[f\left(k(t)\right) - \delta k(t) - c(t)\right]\). The DAE follows from applying Pontryagin’s Maximum Principle along with a standard transversality condition, mapped to our notation (see [29]).↩︎
Algorithms that reduce the index of a semi-explicit DAE to an ODE, such as the Pantelides algorithm, may augment \({\boldsymbol{x}}(t)\) and \({\boldsymbol{\mu}}(t)\) to eliminate the algebraic equation (cf. 3 ). While this procedure is often carried out by hand in macroeconomics, we will instead work directly with the more natural DAE formulation.↩︎
For example, some models have multiple \(\left({\boldsymbol{x}}(t),{\boldsymbol{\mu}}(t),{\boldsymbol{y}}(t)\right)\) fulfilling 1 2 3 5 4 for a given \({\boldsymbol{x}}_0\). These often arise from coordination failures in monetary economics and multiplicity in rational expectations equilibria. While such models with multiplicity are not considered in our paper, there may be hysteresis and multiplicity of steady states; i.e., \(\lim_{t\to\infty} {\boldsymbol{x}}(t)\) may depend on \({\boldsymbol{x}}_0\). For example, these methods could solve transition paths and steady states in dynamic models of trade, such as [30], where the steady state depends on the current account initial conditions.↩︎
See [31] for when these assumptions hold. In practice, it is usually sufficient to use a present-value rather than current-value Hamiltonian, and solve a de-trended model in cases with growth.↩︎
Related approaches convert the DAE into a finite-horizon boundary-value problem (BVP) by discretizing time (e.g., via finite differences) and solving a nonlinear system that enforces the initial conditions, artificial terminal conditions, and the dynamics. Although often preferable to shooting, the resulting Jacobian has similar conditioning because the system effectively embeds the same \(\Psi(\cdot;T)\) structure.↩︎
In many important applications in growth, trade, international economics, and spatial economics, calculating the steady state itself is difficult. Moreover, there may be initial-condition dependence and multiple steady states. In such cases, shooting methods require first solving for all candidate steady states \(x(\infty)\) given \(x_0\), analyzing fixed-point stability via the Hessians of 1 2 3 , and then partitioning \(\mathbb{R}^{{M}}\) into basins of attraction. This process is infeasible outside of the simplest, low-dimensional settings.↩︎
Formally, the condition number of \(\nabla \Psi(\cdot;T)\), the ratio of the largest to smallest singular values, will diverge with \(T\). It is worth noting that if the focus is only on transition dynamics, the explicit calculation of the steady state is not strictly required beyond providing a target for the shooting method. While \(\Psi(\cdot;T)\) uses the steady state for guidance, its main role is to steer the algorithm away from explosive non-solutions—and it can be discarded once a valid solution is found. We will exploit this observation in our kernel-based methods.↩︎
We note that the semi-norm \(\vert w(\cdot) \vert_{\mathcal{W}^{2,2}}\) is a more natural measure of complexity than the norm \(\Vert w(\cdot) \Vert_{\mathcal{W}^{2,2}}\). Consider for example a problem where \(w(T)\) converges to some steady state \(w(\infty)\) as \(T \to \infty\). Convergence to a steady state implies that \(\dot{w}(T) \to 0\) as \(T \to \infty\), and so it is possible for \(\Vert \dot{w}(\cdot) \Vert_{L^2([0,T])}\) to be bounded. However, if \(w(\infty) > 0\), then \(\Vert w(\cdot) \Vert_{\mathcal{W}^{1,2}([0,T])}\) will diverge as \(T \to \infty\). In other words, norms of \(w(\cdot)\) are sensitive to the scaling and location of the steady state, while norms of \(\dot{w}(\cdot)\) are not.↩︎
For a given kernel, the integration in 11 can be done in closed form or numerically, i.e. \(\int_0^t k(\tau, t_j)\mathrm{d}\tau\) for each \(t_j\).↩︎
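For instance, assuming the Matérn-\(1/2\) (exponential) kernel \(k(\tau, t_j) = e^{-\vert \tau - t_j \vert/\ell}\), a convention adopted here purely for illustration, the integral has a simple closed form; a minimal sketch with a numerical sanity check:

```python
import numpy as np

def integrated_matern12(t, tj, ell):
    """Closed form of  int_0^t exp(-|tau - tj| / ell) dtau  for t, tj >= 0."""
    t = np.asarray(t, dtype=float)
    below = ell * (np.exp(-(tj - t) / ell) - np.exp(-tj / ell))                        # t <= tj
    above = ell * (1.0 - np.exp(-tj / ell)) + ell * (1.0 - np.exp(-(t - tj) / ell))    # t > tj
    return np.where(t <= tj, below, above)

# Sanity check against a simple Riemann sum.
tau = np.linspace(0.0, 5.0, 200001)
approx = np.sum(np.exp(-np.abs(tau - 2.0) / 1.5)) * (tau[1] - tau[0])
print(integrated_matern12(5.0, 2.0, 1.5), approx)
```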
In particular, since most kernels will have \(\lim_{t\to\infty}k(t,t_j) = 0\) for any \(t_j\), approximating the derivatives leads to \(\lim_{t\to\infty}\hat{\boldsymbol{\mu}}(t) = \hat{\boldsymbol{\mu}}_0\) for any fixed \(\mathcal{D}\). Alternatively, if the function itself was approximated then, for any fixed \(\mathcal{D}\), \(\lim_{t\to\infty}\hat{\boldsymbol{\mu}}(t) =0\). That said, given that our goal is to solve for only short- and medium-run behavior, and our consistency results in [thm:consistency] only apply on \([0,T]\), approximating the function itself often works in practice.↩︎
The primary tradeoff for kernel methods is that they require pairwise evaluation across all of the data (i.e., \(k(t_i, t_j)\) for all \(i,j\)), and an out-of-sample function evaluation requires evaluating \(k(t,t_i)\) for all data points. Computational methods can address these challenges approximately up to millions of observations [38], [39], but as with all non-parametric methods this eventually becomes a limitation. Another consideration is that, while essential in our current applications, the strong dependence of solutions on the RKHS norm may or may not be a benefit if the solution concept of the underlying problem is poorly aligned with the RKHS induced by the kernel.↩︎
However, this also shows where these methods are likely to break down: for example, in cases where solutions are not twice differentiable almost everywhere, or where the level of the function, rather than just its derivatives, is essential for selecting the trajectory fulfilling 5 .↩︎
In this case, we solved this as a mixed initial-boundary value problem, imposing the analytically calculated boundary condition, as discussed in 2.↩︎
In general, as problems get larger, it is helpful to add additional non-negativity or box-bounding constraints from 2 to the optimization problem minimizing 13 . While these will often be non-binding at the optimal solution, they can help optimizers converge.↩︎
For asset pricing models, see [44], [45], which characterize the set of solutions not fulfilling transversality and connect them to economic bubbles.↩︎