Minimizing Residuals in ODE Integration
Using Optimal Control


Abstract

Given the set of discrete solution points or nodes, called the skeleton, generated by an ODE solver, we study the problem of fitting a curve that passes through the nodes of the skeleton while minimizing a norm of the residual vector of the ODE. We reformulate this interpolation problem as a multi-stage optimal control problem and, for the minimization of two different norms, we apply the associated maximum principle to obtain the necessary conditions of optimality. We solve the problem analytically for the Dahlquist test problem and a variant of the leaky bucket problem, in terms of the given skeleton. We also consider the Van der Pol equation, for which we obtain interpolating curves with minimal residual norms by numerically solving a direct discretization of the problem through optimization software. With the skeletons obtained by various Matlab ODE solvers, we compare the residuals obtained by our approach with those obtained by the Matlab function deval.

Key words: ODE integration, ODE residual, ODE defect, Optimal control, Dahlquist test problem.

2020 Mathematics Subject Classification: Primary 65L05, 49K15; Secondary 65K10, 49M25

1 Introduction

1.1 Background

The numerical solution of differential equations, both initial-value problems and boundary-value problems, is by now a very old subject. In particular, the subject underwent significant development in the 1960’s and 1970’s as computers gained in speed and power. Much of that development has been captured in the classic texts by Hairer, Nørsett and Wanner [1], Hairer and Wanner [2], Butcher [3], and for boundary-value problems, Ascher, Mattheij and Russell [4]; not forgetting the earlier work by Stetter [5]; or the myriads of other texts.

Yet some pieces of that development were not captured in those texts, or not wholly, or if they were, they may lie undisturbed as the texts (such as Stetter’s) fall out of use. An example of that is the relationship of defect (what we will call the residual in this paper, following Corless and Fillion [6]) to the “local error per unit step," which is mentioned in Stetter’s book but nowhere else (outside of, say, Wayne Enright’s lecture notes [7]) that we know of.

Another item of interest is that some design decisions for the methods we use today were taken in a time when computer memory was much more constrained than it is now, and moreover were taken at a time when the relationship between forward error (called “global error" in the numerical ODE community) and backward error (e.g., the residual) was not as widely understood as perhaps it ought to have been.

Another factor is that modern codes use heuristics for variable stepsize and variable order, and the heuristics and controls for stepsize introduce nonlinearities and feedback into the solution. There is a tension involved: the error control uses theoretical results that are asymptotically correct as the stepsize \(h\) goes to zero, but for efficiency the code is designed to use stepsizes \(h\) that are as large as possible, consistent with the tolerances. It is possible, especially for loose tolerances, that the solution is not as accurate as the theoretical model underlying the code predicts that it will be. Even linear ODE can have numerical solutions whose behaviour is difficult to predict, or even to explain afterwards. One important motive for this paper is to provide another set of tools for “retrospective diagnostics” of what the code actually solved.

A final motive for this paper and perhaps the most salient one is a question of validation, reproducibility, of what we call “computational epistemology.” Just because a numerical method and its implementation has been tested on problems from test sets3, say, A, B, C and D, it does not necessarily follow that the methods will work on problems from classes E, F, or G, or, say, Z. Yet people rely on the results of widely—and freely—available solvers, sometimes blithely, and sometimes blindly. We are aware of published papers containing incorrect results obtained with such naïve reliance—some examples are discussed in [9]. Given the ubiquity of ODE solvers nowadays, what can a skeptical user do to address the validity of the computed numerical solution?

1.2 Setting

We believe that one tool that can be helpful is independent computation of the residual (also known as the defect, especially in the older literature). If our differential equation is \(\dot{y} = f(t,y)\) where \(y: t \mapsto \mathbb{R}^N\) and \(f: (t,y) \mapsto \mathbb{R}^N\), then a putative solution to the differential equation (call it \(z(t)\)) would have a residual \(r(t)\) defined at all but exceptional points by \[r(t) := \dot{z}(t) - f(t, z(t)) \label{eq:residual}\tag{1}\] which trivially rearranges to show that \(z(t)\) is the exact solution of the perturbed equation \[\dot{y}(t) = f(t,y(t)) + r(t)\>. \label{eq:perturbed}\tag{2}\]
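As a concrete toy illustration of 1 and 2 , the following short Matlab fragment computes the residual of a putative solution; the choice of \(f\) and of the deliberately imperfect candidate \(z(t)\) is ours and purely illustrative.

% Residual of a putative solution z(t) of ydot = f(t,y), per equation (1).
% Here f(t,y) = -y and z(t) = exp(-t) + 1e-3*sin(10 t) is an assumed,
% deliberately imperfect, candidate solution.
f  = @(t,y) -y;
z  = @(t) exp(-t) + 1e-3*sin(10*t);
dz = @(t) -exp(-t) + 1e-2*cos(10*t);      % exact derivative of z
t  = linspace(0, 5, 500);
r  = dz(t) - f(t, z(t));                  % r(t) = zdot(t) - f(t, z(t))
plot(t, r), xlabel('t'), ylabel('r(t)')   % z solves ydot = f(t,y) + r(t) exactly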

This can then often be interpreted as a meaningful perturbation of the model equations. If this is small compared to real effects that have been neglected, then the computed solution can be regarded as valid. The effects of neglecting small perturbations—real, or numerical—of course have to be understood, perhaps by using variational analysis, the theory of conditioning of ODE, or even the Gröbner–Alexeev nonlinear variation of constants formula [1], but that has to be done anyway. For reference, the Gröbner–Alexeev nonlinear variation of constants formula is, assuming that the initial condition \(y(t_0)=y_0\) is not perturbed, \[z(t)-y(t) = \int_{\tau=t_0}^t G(t,\tau)r(\tau)\,d\tau \>. \label{eq:GroebnerAlexeev}\tag{3}\] The function \(G(t,\tau)\), whose existence the theorem guarantees, is essentially \(\partial y /\partial y_0\), the dependence of the solution on the initial conditions. This acts as a kind of “condition number” for the differential equation. If \(G\) is large, then even a small residual can have a dramatic effect on the solution of the differential equation.

Notice that the Gröbner–Alexeev formula also lets us predict the effect of physical disturbances, not just the effect of approximate computation. By putting the approximations of the numerical method on the same footing as that of meaningful modelling choices, we will have succeeded in validating the numerical computation.

This paper advocates for an independent method to do this. Given a putative skeleton of a numerical solution to an ODE, we ask for the minimal perturbation of the ODE that would fit the numerical skeleton. If this minimal perturbation is “too large” in size, we reject the solution. If it is “small enough,” we accept it. These terms are loose and depend on the problem context and its structure, and so we advocate giving tools to the users to allow them to decide for themselves. In this paper, we formulate several such perturbations and show how to use techniques from optimal control theory to find minimal ones. Knowing the minimal possible residual gives access to sufficient conditions for accepting the numerical solution as valid.

It turns out that the problem formulation, and the problem scaling, have a great impact on the interpretation of the residual. We presume in this paper that a proper scaling has been done, and moreover that the ODE is reasonable to pose as a system of first order equations. Most codes require higher-order systems to be rewritten as a first-order system, typically using the standard chain of new variables \(y_1 = y\), \(y_2 = y_1'\), \(y_3 = y_2'\), and so on; in these equations, if they are artificial and introduced only to make an existing code applicable, a residual or defect can be quite hard to interpret, because the standard chain equations do not really permit perturbation. For the purposes of this paper, we ignore this concern. In a future paper we examine optimal residuals for higher-order equation systems. Interestingly, the issue of the numerical difference between solving an ODE rewritten as a first order system versus solving it in its original higher-order formulation is discussed only, so far as we know, in the classic text [4] and nowhere else in the literature; but we leave it for now.

The problem of our interest can be viewed as a problem of interpolation, that is, the problem of finding (in some sense) the “best solution curve” that passes through the nodes in the numerical skeleton computed by some ODE solver. Most modern codes choose to use polynomial approximations valid between the nodes of the skeleton, which are designed to be “asymptotically correct,” in the limit as the stepsize \(h\) goes to zero. Typically we would want the approximation to an \(O(h^p)\) accurate numerical solution to also be \(O(h^p)\) accurate, for “small enough” \(h\).

But for efficiency, one wants \(h\) to be “as large as possible” consistent with the error tolerances, and we therefore cannot be sure that the code is operating “in the asymptotic regime” and indeed there are documented cases where it was not. There is also the phenomenon of “order reduction” (see [2]) to consider, in which the observed order of convergence as \(h\) goes to zero is smaller than the expected \(p\)th order. The reasons for that reduction in any given problem can be quite obscure, and even specialists can be confused about it. We would like to provide tools for retrospective diagnostics that can reassure the user that the code, even if not operating in the asymptotic regime, and even if suffering from order reduction, still produced a good solution. To do this we will relax the idea of using piecewise polynomial approximation.

We will, however, tighten the notion of “approximation” to be that of “interpolation.” It turns out that not all polynomial approximations supplied by modern codes actually interpolate the skeleton. Some do: for example, the Matlab code ode45 is supplied with a piecewise polynomial interpolant that is actually continuously differentiable (at least in the absence of rounding error). Others do not: for instance, the Matlab code ode113 comes with a piecewise polynomial approximation that is connected continuously to the skeleton at only one of the two nodes of each step (the right endpoint). The lack of continuity at the left is not a difficulty if all one wants from the approximation is dense output for a graph, and indeed the piecewise polynomial is “almost” continuous there, having a jump discontinuity of size similar to the tolerance supplied to the integration. But for other uses, this lack of continuity is, as Shampine says in [8], a serious drawback.

1.3 Contributions

We will insist, in this paper, on a piecewise interpolant that is continuous at every interior node of the skeleton. It will turn out that we cannot insist on the interpolant being continuously differentiable4, but the \(C^0\) continuity turns out to be crucial.

The reason that we insist our piecewise interpolant be globally \(C^0\) is that if we permit discontinuous approximate solutions, then approximation by “local exact solutions” will have residual \(r(t)\) that is identically zero. That is, if we approximate \(y(t)\) by the piecewise-defined function \(y(t) = y_k(t)\) on \(t_k \le t \le t_{k+1}\) with \(\dot{y}_k(t)=f(t,y_k(t))\) and either \(y(t_k) = y_k\) or, perhaps, \(y(t_{k+1}) = y_{k+1}\) at the other end, then \(r(t)\) will be zero in \((t_k,t_{k+1})\) but a delta function at one end. Putting that \(r(t)\) into equation 3 produces a sequence of jumps at the node; this is a classical way of relating the “local error” of the method with the “global error” in the computed skeleton.
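For instance, for the Dahlquist equation \(\dot y = a y\), taking the left-anchored local exact solutions \(y_k(t) = z_k e^{a(t-t_k)}\) on each \([t_k,t_{k+1}]\) gives the piecewise residual \[r(t) = \sum_{k} \left(z_{k+1} - z_k\,e^{a(t_{k+1}-t_k)}\right)\delta(t-t_{k+1})\,,\] whose weights are exactly the local errors of the individual steps; substituting this \(r\) into equation 3 , with \(G(t,\tau) = e^{a(t-\tau)}\) for this equation, shows how those per-step jumps are propagated and accumulated into the global error of the computed skeleton.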

Instead we insist on a globally continuous interpolant to the skeleton, and ask for the interpolant that gives the “minimal residual” in various senses.

The interpolant in our case is required to minimize a measure of the residual; in particular, we choose to minimize the \(L^2\)-norm or some “convenient” variant of the \(L^\infty\)-norm of the residual, resulting in two variational interpolation problems. The particular variant mentioned here is simply the sum of the \(L^\infty\)-norms taken over each stage (defined as the interval between any two consecutive nodes5). This norm variant helps in getting analytical solutions for some relatively simple example ODEs that we consider in this paper. We call this variant the stage \(L^\infty\)-norm.

It is well known that interpolation problems can be reformulated as optimal control problems, and a specialized maximum principle (for multi-stage or multi-process optimal control) can be employed with the aim of finding a solution; see, for example, [12], [13] and the references therein. We subsequently establish in Lemmas 1 and 2 that the respective optimal control problems we derive in the current paper are normal, and in Theorems 1 and 2 we present the resulting necessary conditions of optimality.

The Dahlquist test problem is conceivably the simplest ODE that is still rich enough to be widely used for analyzing certain properties of ODE solvers, such as stability; see, for example, [14] and the references therein. We obtain closed-form expressions for the optimal residuals of the Dahlquist test problem in Propositions 1 and 2. These expressions also allow us to make comparisons between the two different norms of the residual in Proposition 3.

We consider another relatively simple but sufficiently interesting ODE, which is a variant of the celebrated “leaky bucket problem” (see [15]), and obtain analytical expressions for the optimal residuals in Propositions 4 and 5.

The third example we study is the Van der Pol system, which is not tractable analytically; however, we discretize the ensuing optimal control problem directly [16] and numerically solve the resulting large-scale optimization problem using the mathematical programming modelling language AMPL [17] paired with the popular optimization software Knitro [18]. We gain some insight from the numerical solutions for the optimal residual. In each of the three examples mentioned above, we use a suite of Matlab’s ODE solvers [19] to construct the numerical skeleton.

The paper is organized as follows. In Section 2 we amplify our motivational remarks above by a detailed consideration of a single example. In Section 3, we formulate the problem of finding interpolants minimizing the \(L^2\)- and stage \(L^\infty\)-norms and transform these problems into optimal control problems. In Section 4, we apply a maximum principle to obtain the necessary conditions of optimality, after showing that both optimal control problems are normal. We study the three example ODEs in Section 5, also presenting numerical experiments. We briefly discuss the experiments in Section 6. Finally, in Section 7, we provide some concluding remarks.

2 Motivation

“\(\ldots\) in theory, there is no difference between theory and practice,
while in practice, there is.”
—Benjamin Brewster, The Yale Literary Magazine, October 1881/June 1882 issue

One motivation for this paper is that the polynomial approximations extending the skeleton to “continuous” functions used in current codes can behave suboptimally. We demonstrate this with an example, the numerical solution of the simple harmonic oscillator \(\ddot y + y = 0\) on a representative time interval with representative initial conditions.

Consider the following piece of Matlab code, which solves the simple harmonic oscillator \(\ddot y + y = 0\) on \(0 \le t \le 30\) with \(y(0)=3.5\) and \(\dot{y}(0) = 0\) using many different tolerances, and compares the maximum residual error (which we define in Section 3; note that this is not the forward or global error, which would require knowing a reference solution to the equation, which we do not use here) to the mean stepsize that the code takes. The mean stepsize tells us something directly about the amount of work that the code does, because the work is proportional to the number of steps taken. Note that in the papers [20]–[22] we find an explanation of the experimentally observed fact that modern stepsize heuristics, which approximately equidistribute the local error, can usually be expected to produce solutions which are accurate to \(O(h^p)\), where \(h\) is the arithmetic mean of all the stepsizes taken and \(p\) is the order of the method. Most texts say that it is the maximum stepsize which must be used, but this is not true: for the codes to be asymptotically correct, only the mean stepsize needs to go to zero, given the assumption of equidistribution. Therefore, this script will produce a graph, analogous to a “work-precision diagram,” which will tell us something about the behaviour of the error in solving the simple harmonic oscillator with ode45, and in particular will tell us how the polynomial interpolant behaves.

The results are plotted in Figure 1. While a naïve user might have expected that the observed error would decrease as the mean stepsize decreases, over the whole range of tolerances, we see hints that for more complicated problems this might not really be so for large tolerances, only for “small enough” tolerances. We also see that the order of accuracy is not \(O(h^5)\) (the red curve) but rather seems to be reduced to \(O(h^4)\) (the blue curve), once the error starts to decrease. Neither of these behaviours is likely to be expected by a naïve user.

The behaviour of modern codes is not always simple, and therefore we believe we should provide more tools for retrospective diagnostics.

In fairness to ode45, we note that it was not designed to achieve small residual error, but instead to have small local error for modest tolerances. The two notions are related, but not identical. We remark that residual error is easier to explain to students and users of the code, and is useful in other contexts as well, while the concept of “local error” is useful only in the context of historical codes for the solution of ODEs.

Figure 1: A kind of work-precision diagram for the solution of the simple harmonic oscillator \(\ddot y + y = 0\) on \(0 \le t \le 30\) with reasonable initial conditions, given to ode45 in the form of a first-order system as usual. On the horizontal axis we have the mean stepsize \(h\). The code ode45 is a variable-stepsize method, and the stepsize at the tightest tolerance varied systematically between \(9\times 10^{-3}\) and \(1.2\times 10^{-2}\). Smaller mean \(h\) implies that the code had to work harder to get to \(t=30\). On the vertical axis, we have the maximum residual norm observed over the solution on \(0 \le t \le 30\). We see that the expected order of the error, namely \(O(h^5)\), plotted in red, is not attained; instead, the observed order of the residual error is very clearly \(O(h^4)\) (plotted in blue).
nsamp = 40; err = zeros(1,nsamp);   % max residual observed at each tolerance
h = zeros(1,nsamp);                 % mean stepsize taken at each tolerance
for k=1:nsamp
    disp(k)
    opts = odeset('RelTol',2.^(-k),'AbsTol',2.^(-k-1));
    sol = ode45( @sho, [0, 30], [3.5;0], opts);  % sho: the oscillator as a first-order system (see the sketch after this listing)
    h(k) = mean(diff(sol.x));                    % mean stepsize of this run
    t = RefineMesh(sol.x, 8);                    % sample points within each step of the skeleton
    [y,dy] = deval(sol,t);                       % interpolant and its derivative from ode45
    nt = numel(t);
    res = zeros(size(y));
    for j=1:nt
      res(:,j) = dy(:,j)-sho(t(j),y(:,j));       % residual r(t_j) = z'(t_j) - f(t_j, z(t_j))
    end
    err(k) = max(abs(res(:)));                   % maximum residual over the whole interval
end
k2 = floor(nsamp/2);
const4 = err(k2)/h(k2)^4;           % reference constants for the O(h^4) and O(h^5) guide lines
const5 = err(k2)/h(k2)^5;
figure(1), loglog( h, err, 'k.', ...
    h, const4*h.^4, 'b-', ...
    h, const5*h.^5, 'r--')
xlabel('h'), ylabel('max residual')
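The script above calls two helper functions that are not listed in this paper. The following are minimal sketches (ours) of what they could look like: sho encodes the oscillator as a first-order system, and RefineMesh subdivides each step of the skeleton into eight equal pieces, which is our reading of the call RefineMesh(sol.x, 8); the helpers actually used may differ in detail.

function dydt = sho(~, y)
    % Simple harmonic oscillator y'' + y = 0 written as a first-order system:
    % y(1) = y, y(2) = y', so y(1)' = y(2) and y(2)' = -y(1).
    dydt = [y(2); -y(1)];
end

function tt = RefineMesh(x, m)
    % Insert m-1 equally spaced points inside each interval [x(k), x(k+1)],
    % so that each step of the skeleton is sampled with m subintervals.
    tt = [];
    for k = 1:numel(x)-1
        pieces = linspace(x(k), x(k+1), m+1);
        tt = [tt, pieces(1:end-1)]; %#ok<AGROW>
    end
    tt = [tt, x(end)];
end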

The question arises, then, what should we do in order to understand what the numerical solver has done? We advocate in this paper that we should not rely solely on the interpolants provided by the code (which are not independent of the method used) but instead do a separate assessment of the “minimal possible” maximum residual.

3 Interpolants Minimizing a Measure of the Residual

The problem is to find an interpolating curve \(x:[t_0,t_f]\to{\rm {I\ \kern -0.54emR}}^n\) passing through a set of points \(z_i\) in \({\rm {I\ \kern -0.54emR}}^n\), called the skeleton, generated at \(t=t_i\), \(i = 0,\ldots,N\), where \(t_0 < t_1 < \cdots < t_N := t_f\), by an ODE solver for the initial value problem \(\dot{z}(t) = f(z(t),t)\), \(z(t_0) = z_0\). We assume that \(z_i \neq z_{i-1}\), \(i = 1,\ldots,N\). The interpolating curve is required to minimize some measure, for example the \(L^2\)- or the \(L^\infty\)-norm, of the residual \((\dot{x}(t) - f(x(t),t))\).

3.1 Reformulations of the minimum residual problems

The problem of \(L^2\)-minimization, or minimization of the square \(L^2\)-norm, of the residual by a curve \(x(t)\) interpolating the skeleton \(\{z_0,z_1,\ldots,z_N\}\) can be written as \[(PL2) \left\{\begin{array}{rl} \displaystyle\min & \;\;\displaystyle\frac{1}{2}\,\|\dot{x}(t) - f(x(t),t)\|_{L^2}^2 = \frac{1}{2}\,\sum_{i=1}^{N} \int_{t_{i-1}}^{t_i} \|\dot{x}(t) - f(x(t),t)\|^2\,dt \\[6mm] subject to & \;\;x(t_i) = z_i\,,\;\;i = 0,\ldots,N\,, \end{array} \right.\] where \(\|\cdot\|\) is the Euclidean norm in \({\rm {I\ \kern -0.54emR}}^n\). Define the control variable \(u:[t_0,t_f]\to{\rm {I\ \kern -0.54emR}}^n\) such that \(u(t):=\dot{x}(t)-f(x(t),t)\). Now Problem (PL2) can equivalently be re-written as an optimal control problem as follows. \[(PL2a) \left\{\begin{array}{rl} \displaystyle\min & \;\;\displaystyle\frac{1}{2}\,\sum_{i=1}^{N} \int_{t_{i-1}}^{t_i} \|u(t)\|^2\,dt \\[5mm] subject to & \;\;\dot{x}(t) = f(x(t),t) + u(t)\,, a.e.t\in[t_0,t_f]\,, \\[2mm] & \;\;x(t_i) = z_i\,,\;\;i = 0,\ldots,N\,, \end{array} \right.\] where \(x(t)\) is referred to as the state variable. We note that an optimal control \(u\) can be indeterminate at isolated points in \([t_0,t_f]\); therefore, the ODE in Problem (PL2a) is stated for a.e. \(t\in[t_0,t_f]\). We call the time interval \([t_{i-1}, t_i]\), \(i = 1,\ldots, N\), the \(i\)th stage.

As can be seen in Problem (PL2), the square \(L^2\)-norm can be written as the sum of the square \(L^2\)-norms in each stage. Subsequently, we define the problem of stage \(L^\infty\)-minimization, or minimization of the stage \(L^\infty\)-norm, of the residual by defining the objective functional (similarly) as the sum of the \(L^\infty\)-norms in each stage, since this will eventually allow us to obtain closed-form expressions for the optimal residuals of some of the example problems we investigate. We point out that this sum of the \(L^\infty\)-norms in stages \([t_{i-1}, t_i]\) is obviously different from the (classical) \(L^\infty\)-norm over the whole interval \([t_0, t_f]\), with which it is not possible to obtain any analytical solution for the optimal residual, and we leave this outside the scope of the present paper. So, we alter Problem (PL2a) slightly to get the stage \(L^\infty\)-minimization problem as follows. \[(PLinf) \left\{\begin{array}{rl} \displaystyle\min & \;\;\displaystyle\sum_{i=1}^{N}\,\max_{t_{i-1}\le t\le t_i} \|u(t)\|_\infty \\[5mm] subject to & \;\;\dot{x}(t) = f(x(t),t) + u(t)\,, a.e.t\in[t_0,t_f]\,, \\[2mm] & \;\;x(t_i) = z_i,\;\;i = 0,\ldots,N\,, \end{array} \right.\] where \(\|\cdot\|_\infty\) is the \(\ell_\infty\)-norm in \({\rm {I\ \kern -0.54emR}}^n\). Now, using a standard reformulation in mathematical programming (see, for example, [23] and [12]), Problem (PLinf) can be re-cast as \[(PLinfa) \left\{\begin{array}{rl} \displaystyle\min & \;\;\displaystyle\sum_{i=1}^{N} \alpha^{[i]} \\[5mm] subject to & \;\;\dot{x}(t) = f(x(t),t) + u(t)\,, a.e.t\in[t_0,t_f]\,,\\[2mm] & \;\;x(t_i) = z_i,\;\;i = 0,\ldots,N\,, \\[2mm] & \;\;\|u(t)\|_\infty \le\alpha^{[i]}\,, a.e.t\in[t_{i-1},t_i)\,,\;\;i = 1,\ldots,N\,, \end{array} \right.\] where \(\alpha^{[i]} \ge 0\) are new parameters that are to be determined for each stage \(i\). Note that, for a.e.\(t\in[t_{i-1},t_i]\), \(\|u(t)\|_\infty \le\alpha\) if, and only if, \(|u_j(t)| \le\alpha^{[i]}\), \(i = 1,\ldots,N\), \(j = 1,\ldots,n\). Let \(u := \alpha^{[i]}\,v\), for \(t\in[t_{i-1},t_i)\), \(i = 1,\ldots,N\), where \(v:[t_0,t_f]\to{\rm {I\ \kern -0.54emR}}^n\) is a new control variable. Then Problem (PLinfa) can be rewritten as the following differentiable, or smooth, parametric optimal control problem. \[(PLinfb) \left\{\begin{array}{rl} \displaystyle\min & \;\;\displaystyle\sum_{i=1}^{N} \alpha^{[i]} \\[5mm] subject to & \;\;\dot{x}(t) = f(x(t),t) + \alpha^{[i]}\,v(t)\,, a.e.t\in[t_{i-1},t_i)\,,\;i = 1,\ldots,N\,, \\[2mm] & \;\;x(t_i) = z_i\,,\;\;i = 0,\ldots,N\,, \\[2mm] & \;\;|v_j(t)| \le 1\,, a.e.t\in[t_0,t_f]\,,\;\;j = 1,\ldots,n\,. \end{array} \right.\] Problem (PLinfb) can further be transformed into a more standard or classical form by defining a new state variable, \(y(t) := \alpha^{[i]}\), for all \(t\in[t_{i-1},t_i)\), \(i = 1,\ldots,N\), as follows. \[(PLinfc) \left\{\begin{array}{rl} \displaystyle\min & \;\displaystyle\int_{t_0}^{t_f} y(t)\,dt \\[5mm] subject to & \;\;\dot{x}(t) = f(x(t),t) + y(t)\,v(t)\,, a.e.t\in[t_{i-1},t_i)\,,\;i = 1,\ldots,N\,, \\[2mm] & \;\;x(t_i) = z_i\,,\;\;i = 0,\ldots,N\,, \\[2mm] & \;\;\dot{y}(t) = 0\,, \\[2mm] & \;\;|v_j(t)| \le 1\,, a.e.t\in[t_0,t_f]\,,\;\;j = 1,\ldots,n\,. \end{array} \right.\]

4 Necessary conditions of optimality

Both Problems (PL2a) and (PLinfc) are optimal control problems with point-state constraints, \(x(t_i) = z_i\), \(i = 1,\ldots,N-1\), given in the interior of \([t_0,t_f]\). They can be further transformed into multi-process, or multi-stage, optimal control problems, in a fashion similar to that done for the interpolation problems in [12], [13], as well as for density estimation in [24]. The necessary conditions of optimality for general multi-process or multi-stage optimal control problems can be found in [25]–[27].

We first define a new independent variable \(s\) in terms of \(t\) to map the “duration” \(\tau_i\) of each stage \(i\) to unity: \[t = t_{i-1} + s\,\tau_i\,,\quad s\in[0,1]\,,\quad \tau_i := t_i - t_{i-1}\,,\quad i = 1,\ldots,N\,.\] With this definition, the time horizon of each stage \(i\) is rescaled as \([0,1]\) in the new independent (time) variable \(s\). Note that \(\tau_i > 0\), \(i=1,\ldots,N\). Let \[x^{[i]}(s) := x(t)\,,\quad y^{[i]}(s) := y(t)\quad\text{and}\quad u^{[i]}(s) := u(t)\,,\quad \text{for }s\in[0,1],\;t\in[t_{i-1},t_{i}]\,,\] for \(i = 1,\ldots,N\). Here \(x^{[i]}(s) := (x^{[i]}_1(s),\ldots,x^{[i]}_n(s))\in{\rm {I\ \kern -0.54emR}}^n\) denotes the value of the state variable vector \(x(t)\) in stage \(i\), and the other stage variables are to be interpreted similarly. With the usage of stages, it is necessary to ensure continuity of the state variables \(x(t)\) at the junction of any two consecutive stages, but this is readily ascertained since \(x^{[i]}(1) = x^{[i+1]}(0) = z_i\), \(i = 1,\ldots,N-1\).

4.1 Necessary conditions of optimality for \(L^2\)-minimization

Using the definitions above, a multi-stage optimal control problem associated with (PL2a) can now be written as \[(PL2b)\left\{\begin{array}{rl} \min &\;\;\displaystyle\frac{1}{2}\,\sum_{i=1}^{N}\tau_i\int_{0}^{1} \|u^{[i]}(s)\|^2\,ds \\[6mm] s.t. &\;\;\dot{x}^{[i]}(s) = \tau_i \left(f(x^{[i]}(s),s) + u^{[i]}(s)\right), a.e.s\in[0,1]\,, \\[2mm] &\;\; x^{[i]}(0) = z_{i-1}\,,\;x^{[i]}(1) = z_i\,, \;\;\;i = 1,\ldots,N\,. \end{array}\right.\]

In what follows, we will state a maximum principle, i.e., necessary conditions of optimality, for Problem (PL2b), using [26] and [25]. A relevant maximum principle can also be found in [27]. First, define the Hamiltonian function for the \(i\)th stage of Problem (PL2b) as \[H^{[i]}(x^{[i]},\lambda_0,\lambda^{[i]},u^{[i]},s) := \tau_i \left(\frac{1}{2}\,\lambda_0\,\|u^{[i]}\|^2 + (\lambda^{[i]})^T \left(f(x^{[i]},s) + u^{[i]}\right)\right),\] where \(\lambda_0\) is a scalar (multiplier) parameter, and \(\lambda^{[i]}:[0,1]\rightarrow{\rm {I\ \kern -0.54emR}}^n\), such that \(\lambda^{[i]} := (\lambda^{[i]}_1,\ldots,\lambda^{[i]}_n)\), is the adjoint variable vector (or the vector of multiplier functions) in the \(i\)th stage. For neatness of the above expression, the time dependence of the variables has been suppressed. For further neatness, let \[H^{[i]}[s] := H^{[i]}(x^{[i]}(s),\lambda_0,\lambda^{[i]}(s),u^{[i]}(s),s)\,.\] First, we recall formally that \(L^\infty(0,t_f;{\rm {I\ \kern -0.54emR}}^n)\) denotes the space of essentially bounded measurable functions equipped with the essential supremum norm. Furthermore, \(W^{1,\infty}(0,t_f;{\rm {I\ \kern -0.54emR}}^n)\) is the Sobolev space consisting of functions \(x : [0,t_f] \to {\rm {I\ \kern -0.54emR}}^n\) whose first derivatives lie in \(L^\infty\). Suppose that \(x^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}}^n)\) and \(u^{[i]}\in L^\infty(0,1;{\rm {I\ \kern -0.54emR}}^n)\), \(i = 1,\ldots,N\), solve Problem (PL2b). Then there exist a number \(\lambda_0\ge0\) and the vector function \(\lambda^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}}^n)\) such that
\(\overline{\lambda}^{[i]}(s):= (\lambda_0,\lambda^{[i]}(s)) \neq \boldsymbol{0}\), \(i = 1,\ldots,N\), for every \(s\in[0,1]\), and, in addition to the state differential equations and the other constraints given in Problem (PL2b), the following conditions hold: \[\begin{align} && \dot{\lambda}_j^{[i]}(s) = -H^{[i]}_{x^{[i]}_j}[s]\,, a.e.s\in[0,1],\;\;i = 1,\ldots,N,\;j = 1,\ldots,n\,, \tag{4} \\[1mm] && \lambda_j^{[i+1]}(0) = \lambda_j^{[i]}(1) + \delta_j^{[i]}\,,\;\;i = 1,\ldots,N-1,\;j = 1,\ldots,n\,, \tag{5} \\[1mm] && u^{[i]}(s)\in\mathop{\mathrm{argmin}}_{w\in{\rm {I\ \kern -0.54emR}}^n} H^{[i]}(x^{[i]}(s),\lambda_0,\lambda^{[i]}(s),w,s)\,,\;a.e.s\in[0,1]\,, \tag{6} \end{align}\] where \(H^{[i]}_{x^{[i]}_j} := \partial{H^{[i]}}/\partial{x^{[i]}_j}\), and \(\delta_j^{[i]}\), \(i = 1,\ldots,N-1\), \(j = 1,\ldots,n\), are real constants.

We define the “overall" adjoint variable vector \(\lambda_j(t)\), \(j=1,\ldots,n\), to be formed by concatenating the stage adjoint variable vectors as follows. \[\label{overall} \lambda(t) := \lambda^{[i]}(s)\,,\quad t = t_{i-1} + s\,\tau_i,\quad s\in[0,1]\,,\quad i = 1,\ldots,N\,.\tag{7}\] Condition 5 states that the adjoint variables \(\lambda_j(t)\), \(j = 1,\ldots,n\), may have jumps as they go from one stage to the other.

The optimality conditions 4 –6 can now be re-written rather more explicitly, along with the state equations, as follows. \[\begin{align} \dot{x}(t) &=& f(x(t),t) + u(t)\,,\quad x(t_0) = z_0\,,\;x(t_1) = z_1\,,\ldots,\;x(t_N) = z_N\,,\;\;\text{a.e.}\;t\in[t_0,t_N]\,, \tag{8} \\[1mm] \dot{\lambda}(t) &=& -f_x(x(t),t)^T\,\lambda(t)\,,\;\;\text{for all}\;t\in[t_{i-1},t_i)\,,\;i = 1,\ldots,N\,, \tag{9} \\[1mm] \lambda_j(t_i^+) &=& \lambda_j(t_i^-) + \delta_j^{[i]}\,,\;\;i = 1,\ldots,N-1,\;j = 1,\ldots,n\,, \tag{10} \\[1mm] \lambda_0\,u(t) &=& -\lambda(t)\,,\;\;\text{for all}\;t\in[t_{i-1},t_i)\,,\;i = 1,\ldots,N\,, \tag{11} \end{align}\] where \(f_x(x(t),t)\) is the Jacobian of \(f(x(t),t)\) with respect to \(x\). In the jump condition 10 , \(\lambda_j(t_i^+) := \lim_{t\to t_i^+}\lambda_j(t)\) and \(\lambda_j(t_i^-) := \lim_{t\to t_i^-}\lambda_j(t)\).

The problems which result in \(\lambda_0 = 0\) are called abnormal in the optimal control theory literature, for which the necessary conditions in 8 –11 are independent of the objective functional and therefore not fully informative. The problems that result in \(\lambda_0 > 0\) are referred to as normal, which is the case for Problem (PL2b), as asserted by Lemma 1 below.

Lemma 1 (Normality). Problem (PL2b) is normal. Consequently, one can set \(\lambda_0 = 1\) for Problem (PL2b) and express the optimal control as \(u(t) = -\lambda(t)\), for a.e. \(t\in[t_0,t_f)\).

Proof. Suppose that \(\lambda_0 = 0\). Then Condition 11 implies that \(\lambda(t) = 0\) for all \(t\in[t_0,t_f]\), and thus \((\lambda_0, \lambda(t)) = 0\), which is not allowed by the maximum principle cited above. Therefore \(\lambda_0 >0\), so the problem is normal and, without loss of generality, one can take \(\lambda_0 = 1\). Then the substitution of \(\lambda_0 = 1\) in 11 completes the proof.

Remark 1 (Optimal control). We re-iterate that, by Lemma 1, \(u(t) = -\lambda(t)\), i.e., the residual is the negative of the adjoint variable. Furthermore, by 10 , the adjoint variable \(\lambda\) will typically be discontinuous at \(t_i\), \(i = 1,\ldots,N-1\); so will be the residual \(u\).

The necessary conditions of optimality for Problem (PL2), or equivalently those for Problem (PL2b), which are listed in 8 –11 , can now be simply written, also using Lemma 1, as a two-point boundary-value problem (TPBVP) in each of the \(N\) stages, in the following theorem.

Theorem 1 (\(L^2\)-residual). Suppose that \(x(t)\) and \(u(t)\) solve Problem (PL2). Then they solve \[\begin{align} &&\dot{x}(t) = f(x(t),t) - \lambda(t)\,,\;\;x(t_{i-1}) = z_{i-1},\;\; x(t_i) = z_i\,, \label{L2_state_eqn} \\[2mm] &&\dot{\lambda}(t) = -f_x(x(t),t)^T\,\lambda(t)\,, \label{L2_costate_eqn} \end{align}\] for all \(t\in[t_{i-1},t_i]\), \(i = 1,\ldots,N\), with possible jumps in the values of the adjoint variable vector \(\lambda(t)\) at \(t_i\), \(i = 1,\ldots,N-1\).

Remark 2. Theorem 1 asserts that the TPBVP in ?? –?? should be solved in \([t_{i-1},t_i]\) (in each of the \(N\) stages) for unknown functions \(x\) and \(\lambda\).
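As an illustration of Remark 2, the stage TPBVP can be handed directly to a boundary-value solver. The following minimal Matlab sketch (ours, not taken from the paper) does this with bvp4c for a scalar ODE; the data \(a\), \(t_{i-1}\), \(t_i\), \(z_{i-1}\) and \(z_i\) below are assumptions chosen purely for illustration.

% Solve the stage TPBVP of Theorem 1 for the scalar ODE xdot = a*x on one stage.
a = 3; f = @(x,t) a*x; fx = @(x,t) a;        % ODE and its Jacobian (illustrative)
ti_1 = 0; ti = 0.1; zi_1 = 1;                % stage endpoints and left node (assumed)
zi = 1.001*exp(a*(ti - ti_1))*zi_1;          % right node, slightly off the exact trajectory

% Unknowns w = [x; lambda]; by Lemma 1 the optimal residual is u = -lambda.
odefun  = @(t,w) [ f(w(1),t) - w(2);         % xdot      = f(x,t) - lambda
                  -fx(w(1),t)*w(2) ];        % lambdadot = -f_x(x,t)*lambda
bcfun   = @(wa,wb) [ wa(1) - zi_1;           % x(t_{i-1}) = z_{i-1}
                     wb(1) - zi ];           % x(t_i)     = z_i
solinit = bvpinit(linspace(ti_1, ti, 10), [zi_1; 0]);
sol     = bvp4c(odefun, bcfun, solinit);

t = linspace(ti_1, ti, 200);
w = deval(sol, t);
u = -w(2,:);                                 % minimal L2 residual on this stage
plot(t, u), xlabel('t'), ylabel('u(t)')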

4.2 Necessary conditions of optimality for stage \(L^\infty\)-minimization

Using the rescaling and related definitions made at the beginning of Section 4, a multi-stage optimal control problem for (PLinfc) can be written as \[(PLinfd)\left\{\begin{array}{rl} \min &\;\;\displaystyle\sum_{i=1}^N \tau_i \int_{0}^{1} y^{[i]}(s)\,ds \\[6mm] s.t. &\;\;\dot{x}^{[i]}(s) = \tau_i \left(f(x^{[i]}(s),s) + y^{[i]}(s)\,v^{[i]}(s)\right), a.e.s\in[0,1]\,, \\[2mm] &\;\; x^{[i]}(0) = z_{i-1}\,,\;x^{[i]}(1) = z_i\,, \;\;\;i = 1,\ldots,N\,,\\[2mm] & \;\;\dot{y}^{[i]}(s) = 0\,, alls\in[0,1]\,, \\[2mm] & \;\;|v^{[i]}_j(s)| \le 1\,, a.e.s\in[0,1]\,, \;\;\;i = 1, \ldots, N,\;\;j = 1,\ldots,n\,. \end{array}\right.\]

The Hamiltonian function for the \(i\)th stage of Problem (PLinfd) is \[H^{[i]}(x^{[i]},y^{[i]},\lambda_0,\lambda^{[i]},\mu^{[i]},v^{[i]}) := \tau_i\left(\lambda_0\,y^{[i]} + (\lambda^{[i]})^T \left(f(x^{[i]},s) + y^{[i]}\,v^{[i]}\right) + \mu^{[i]}\cdot 0\right),\] where \(\mu^{[i]}:[0,1]\rightarrow{\rm {I\ \kern -0.54emR}}\) are the additional adjoint variables (compared to the \(L^2\)-minimization formulation in Section 4.1) in the \(i\)th stage. Let \[H^{[i]}[s] := H^{[i]}(x^{[i]}(s),y^{[i]}(s),\lambda_0,\lambda^{[i]}(s),\mu^{[i]}(s),v^{[i]}(s),s)\,.\] We invoke the maximum principle as in [26] and [25] again, this time for Problem (PLinfd): suppose that \(x^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}}^n)\), \(y^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}})\) and \(v^{[i]}\in L^\infty(0,1;{\rm {I\ \kern -0.54emR}}^n)\) solve Problem (PLinfd). Then there exist a number \(\lambda_0\ge0\), the vector function \(\lambda^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}}^n)\), and \(\mu^{[i]}\in W^{1,\infty}(0,1;{\rm {I\ \kern -0.54emR}})\), such that \(\overline{\lambda}^{[i]}(s):= (\lambda_0,\lambda^{[i]}(s),\mu^{[i]}(s)) \neq \boldsymbol{0}\), for every \(s\in[0,1]\), \(i = 1,\ldots,N\), and, in addition to the state differential equations and the other constraints given in Problem (PLinfd), the following conditions hold: \[\begin{align} && \dot{\lambda}_j^{[i]}(s) = -H^{[i]}_{x_j}[s]\,, a.e.s\in[0,1],\;\;i = 1,\ldots,N,\;j = 1,\ldots,n\,, \tag{12} \\[1mm] && \lambda_j^{i+1}(0) = \lambda_j^{[i]}(1) + \delta_j^{[i]}\,,\;\;i = 1,\ldots,N-1,\;j = 1,\ldots,n\,, \tag{13} \\[1mm] && \dot{\mu}^{[i]}(s) = -H^{[i]}_{y^{[i]}}[s]\,, a.e.s\in[0,1],\;\;i = 1,\ldots,N\,,\tag{14} \\[1mm] && \mu^{[i]}(0) = 0 = \mu^{[i]}(1)\,,\;\;i = 1,\ldots,N,\;j = 1,\ldots,n\,, \tag{15} \\[1mm] && v^{[i]}(s)\in\mathop{\mathrm{argmin}}_{|w|\le 1} H^{[i]}(x^{[i]}(s),y^{[i]}(s),\lambda_0,\lambda^{[i]}(s),\mu^{[i]}(s),w,s)\,,\;a.e.s\in[0,1]\,,\tag{16} \end{align}\] where \(\delta_j^{[i]}\) are real constants. As in the case of optimality conditions for \(L^2\)-minimization, here one also has a jump condition on the adjoint variable \(\lambda\) at the junction/skeleton points, as given in 13 . The new adjoint variables \(\mu^{[i]}\), on the other hand, are continuous in each stage, and this is reflected by 14 and 15 , the latter providing the transversality conditions in each stage. We define the “overall" adjoint variable \(\mu(t)\), in the same way as we had defined \(\lambda(t)\) in 7 : \[\mu(t) := \mu^{[i]}(s)\,,\quad t = t_{i-1} + s\,\tau_i,\quad s\in[0,1]\,,\quad \tau_i := t_i - t_{i-1}\,,\quad i = 1,\ldots,N\,.\] The adjoint variable \(\mu(t)\) is continuous in view of 15 . The optimality conditions 1216 can now be re-written, along with the state equations, as follows. \[\begin{align} \dot{x}(t) &=& f(x(t),t) + \alpha\,v(t)\,, a.e.t\in[t_0,t_f],\;x(0) = z_0,\;x(t_1) = z_1,\ldots,\;x(t_N) = z_N, \tag{17} \\[1mm] \dot{\lambda}(t) &=& -f_x(x(t),t)^T\,\lambda(t)\,, a.e.t\in[t_{i-1},t_i)\,,\;i = 1,\ldots,N\,, \tag{18} \\[1mm] \lambda_j(t_i^+) &=& \lambda_j(t_i^-) + \delta_j^{[i]}\,,\;\;i = 1,\ldots,N-1,\;j = 1,\ldots,n\,, \tag{19} \\[1mm] \dot{\mu}(t) &=& -\lambda_0 - \lambda(t)^T v(t)\,, a.e.t\in[t_0,t_f],\;\;\mu(t_{i-1}) = 0\,,\;\mu(t_i) = 0\,, \tag{20} \\[1mm] v_j(t) &=& \left\{\begin{array}{ll} \;\;1\,, & if\; \lambda_j(t) < 0\,, \\[2mm] -1\,, & if\;\lambda_j(t) > 0\,, \\[2mm] \small undetermined\,, & if\;\lambda_j(t) = 0\,, \end{array}\right. 
\;\;a.e.t\in[t_{i-1},t_i),\;i = 1,\ldots,N,\;j = 1,\ldots,n, \tag{21} \end{align}\] where, in obtaining 21 from 16 , we have used \(y(t) = \alpha^{[i]} > 0\), for \(t\in[t_{i-1},t_i)\).

The following lemma can be considered as the companion of Lemma 1, for the stage \(L^\infty\)-minimization of the residual.

Lemma 2 (Normality). Problem (PLinfd) is normal. Consequently, one can set \(\lambda_0 = 1\) for Problem (PLinfd).

Proof. From 21 , one has \(\lambda_j(t)\,v_j(t) \le 0\), \(j=1,\ldots,n\), and so \(\lambda(t)^T v(t) \le 0\), for a.e.\(t\in[t_0,t_N]\). Therefore, \(\dot{\mu}(t) = -\lambda_0 + |\lambda(t)^T v(t)|\). Suppose that \(\lambda_0 = 0\). Then \(\dot{\mu}(t) = |\langle\lambda(t),v(t)\rangle| \ge 0\), and so the boundary conditions in 20 are satisfied only if \(|\langle\lambda(t),v(t)\rangle| = 0\), or, also using 21 , only if \(\lambda(t) = 0\), for all \(t\in[t_0,t_N]\). This results in \(\mu(t) = 0\) and thus the adjoint variable vector \((\lambda_0,\lambda(t),\mu(t)) = 0\), for a.e.\(t\in[t_0,t_N]\), which is not allowed by the maximum principle. Therefore, \(\lambda_0 \neq 0\), and so one can set \(\lambda_0 = 1\).

The optimality conditions 17 –21 can now be written as a two-point boundary-value problem in the following theorem, which can be considered as the companion of Theorem 1, for the stage \(L^\infty\)-minimization of the residual.

Theorem 2 (Stage \(L^\infty\)-residual). Suppose that \(x(t)\) and \(u(t) := \alpha^{[i]}\,v(t)\), for \(t\in[t_{i-1},t_i]\), \(i = 1,\ldots,N\), solve Problem (PLinf). Then \(x(t)\), \(v(t)\) and \(\alpha^{[i]}\) solve \[\begin{align} \dot{x}(t) &=& f(x(t),t) + \alpha^{[i]}\,v(t)\,,\;\;\;x(t_{i-1}) = z_{i-1}\,,\;\;x(t_i) = z_i\,, \label{Linf_x_eqn2} \\[1mm] \dot{\lambda}(t) &=& -f_x(x(t),t)^T\,\lambda(t)\,, \label{Linf_adjoint_DE2} \\[1mm] \dot{\mu}(t) &=& -1 - \lambda(t)^T v(t)\,,\;\;\mu(t_{i-1}) = 0\,,\;\mu(t_i) = 0\,, \label{Linf_mu2a} \end{align}\] where \[\label{Linf_opt_control_u} v_j(t) = \left\{\begin{array}{ll} \;\;1\,, & \text{if}\;\lambda_j(t) < 0\,, \\[2mm] -1\,, & \text{if}\;\lambda_j(t) > 0\,, \\[2mm] \small \text{undetermined}\,, & \text{if}\;\lambda_j(t) = 0\,, \end{array}\right.\qquad{(1)}\] for a.e.\(t\in[t_{i-1},t_i]\), \(i = 1,\ldots,N\), and \(j = 1,\ldots,n\), with possible jumps in the values of the adjoint variable vector \(\lambda(t)\) at \(t_i\), \(i = 1,\ldots,N-1\).

Remark 3. Theorem 2 asserts that the TPBVP in ?? –?? can be independently solved in \([t_{i-1},t_i]\) (in each of the \(N\) stages) for unknown functions \(x\), \(\lambda\) and \(\mu\), and the parameter \(\alpha^{[i]}\).

Bang–Bang and Singular Control. If \(\lambda_j(t) = 0\) only at isolated points in \([s_1,s_2]\subset[t_0,t_f]\), then \(v_j(t)\in\{-1,1\}\) for a.e.\(t\in[s_1,s_2]\) by ?? . In this case, \(v_j(t)\) is called bang–bang control over \([s_1,s_2]\), since either the value of \(v_j(t)\) changes between \(1\) and \(-1\), or simply \(v_j(t) \equiv 1\) or \(v_j(t) \equiv -1\), over this interval. However, if \(\lambda_j(t) = 0\), for a.e.\(t\in[s_1,s_2]\subset[t_0,t_f]\), then, by ?? , \(v_j(t)\) is undetermined over \([s_1,s_2]\) and is referred to as singular control. If \(v_j(t)\) is of singular type throughout the interval \([t_0,t_f]\), then it is called totally singular.

Lemma 3 (Partial singularity). Suppose that \(n=1\). Then the optimal control \(v(t)\) cannot be singular over the entire interval \([t_{i-1},t_i]\) for any given \(i = 1,\ldots,N\). Therefore, \(v(t)\) can also not be totally singular.

Proof. Suppose that (the only control variable) \(v(t)\) is singular, i.e., \(\lambda(t) = 0\), for a.e.\(t\in[t_{i-1},t_i]\). Then the ODE in ?? becomes \(\dot{\mu}(t) = -1\), which does not satisfy the boundary conditions \(\mu(t_{i-1}) = 0\) and \(\mu(t_i) = 0\), violating the maximum principle and thus proving the lemma. The second statement of the lemma follows immediately from the definition of total singularity.

5 Examples

In this section, we consider three initial value problems of the form \(\dot{x}(t) = f(x(t),t)\), \(x(t_0)=x_0\), \(t_0\le t\le t_f\). For the analytical results (in particular, for Examples 1 and 2), we suppose that a numerical skeleton \(\{z_0,z_1,\ldots,z_N\}\) of the solution of the IVP, with \(z_0 = x(t_0)\) and with the final node placed at \(t_f\), is obtained by a numerical technique, for example, a Runge–Kutta or a multi-step method. For graphical illustrations of the optimal residuals obtained using our optimal control approach, we obtain the numerical skeleton with the Matlab ODE solvers ode15s, ode45 and ode113. For each of these solvers, we set RelTol = 1.0e-8 and AbsTol = 1.0e-8.

For comparison, we also use the Matlab function deval, which uses a given skeleton to find the values of \(x\) at a specified array of interior points of the domain by a polynomial approximation technique specific to each solver [9]. The particular approximation methods used in deval are briefly discussed in that reference, where we learn that the interpolant for ode45 is \(O(h^4)\) accurate, but the messy details are left to the code ntrp45. Close inspection of ntrp45 shows that the polynomial interpolant for ode45, which was supplied by Dormand and Prince by private communication to the authors of [9], is of degree at most 4, uses all stages6 of the RK method (including the first stage needed for the following step) and only those, so it is “free”, and has residual \(O(h^4)\). We give some details here because they are not easily available elsewhere (one has to read the code itself) and because writing the interpolant explicitly underscores the difference between the polynomial approach and the approach we propose.

The method used by ode45 is a \((5,4)\) Runge–Kutta pair by Dormand and Prince, which uses “local extrapolation” meaning that the solution is advanced by the fifth-order method while the error estimate in the fourth-order method is used for stepsize control. The method has the FSAL (First Same As Last) property, which means that while it uses \(7\) stages on any given step, the final stage will be re-used on the next step if the current step is successful, making the process almost equivalent to a six-stage method in computational cost.

The Butcher tableau for the method is of the form \[\begin{align} \renewcommand\arraystretch{1.5} \begin{array}{c|ccccccc} 0\; \\ \tfrac{1}{5}\; & \;\tfrac{1}{5}\; \\ \tfrac{3}{10}\; & \tfrac{3}{40} & \;\tfrac{9}{40}\; \\ \tfrac{4}{5}\; & \tfrac{44}{45} & -\tfrac{56}{15} & \;\tfrac{32}{9}\; \\ \tfrac{8}{9}\; & \tfrac{19372}{6561} & -{\tfrac{25360}{2187}} & \tfrac{64448}{6561} & \;-{\tfrac{212}{729}}\; \\ 1\; & \tfrac{9017}{3168} & -{\tfrac{355}{33}} & \tfrac{46732}{5247} & \tfrac{49}{176} & \;-{\tfrac{5103}{18656}}\; \\ 1\; & {\tfrac{35}{384}} & 0 & {\tfrac{500}{1113}} & {\tfrac{125}{192}} & -{ \tfrac{2187}{6784}} & {\tfrac{11}{84}} \\ \hline & {\tfrac{35}{384}} & 0 & {\tfrac{500}{1113}} & {\tfrac{125}{192}} & -{ \tfrac{2187}{6784}} & \;{\tfrac{11}{84}}\; \\ & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 & \;d_7\; \end{array} \end{align}\] The stage values \(k_j\) are given by \(k_1 = f(t_n, y_n)\) and \[k_j = f\left( t_n+c_j h, y_n + h\sum_{\ell=1}^{j-1} a_{j,\ell} k_\ell \right)\] for \(j=2\), \(\ldots\), \(6\). The seventh stage, \(k_7\), which will be re-used on the next step if the current step is accepted, is \[k_7 = f\left( t_n + h, y_n + h\sum_{\ell=1}^6 b_\ell k_\ell \right)\>.\] The vector \(b = \left[{\tfrac{35}{384}}, 0, {\tfrac{500}{1113}}, {\tfrac{125}{192}}, -{\tfrac{2187}{6784}}, {\tfrac{11}{84}}\right]\) collects the fifth-order weights used to advance the solution; the companion fourth-order formula of the pair is used only to form the error estimate for stepsize control.

The continuous extension is the polynomial interpolant given by \[z(s) = y_n + h\sum_{j=1}^7 d_j(s) k_j \label{eq:dormandprincecext}\tag{22}\] where \(s = (t-t_n)/h\) and \[\begin{align} d_1(s) &= -\tfrac{1}{384}{s \left(435 s^{3}-1184 s^{2}+1098 s -384\right)} \nonumber \\ d_2(s) &= 0 \nonumber \\ d_3(s) &= \tfrac{500}{1113}{ s^{2} \left(6 s^{2}-14 s +9\right)} \nonumber \\ d_4(s) &= -\tfrac{125}{192} s^{2} \left(9 s^{2}-16 s +6\right) \nonumber \\ d_5(s) &= \tfrac{729}{6784} s^{2} \left(35 s^{2}-64 s +26\right) \nonumber \\ d_6(s) &= -\frac{11}{84} s^{2} \left(3 s -2\right) \left(5 s -6\right) \nonumber \\ d_7(s) &= \tfrac{1}{2}s^{2} \left(5 s -3\right) \left(s -1\right) \>. \end{align}\] We give this interpolant in factored form so that it is intelligible for the reader, who is not expected to be an expert in interpolation schemes for Runge–Kutta methods. Note that all of the \(d_i\) are zero when \(s=0\), ensuring that the polynomial interpolates the start of the step. Note that \(d_7\) is zero also when \(s=1\), meaning that the seventh stage is not involved in the value of \(y\) at the end of the step (this is the FSAL property). Note also that the derivatives of each of the polynomials are zero at \(s=0\), except for the first polynomial (it is \(1\) there, which is not evident without computation). This means that the derivative at \(s=0\) of the interpolant will match the value of the first stage, i.e. the derivative of \(y\) at \(s=0\). Finally, computation shows that the derivative of all the polynomials except \(d_7\) is zero at \(s=1\), while the derivative of \(d_7\) is \(1\) at \(s=1\), which means that the derivative of the interpolant matches the derivative of the computed solution \(y\) at \(s=1\), ensuring that the interpolant is continuously differentiable as a piecewise function, apart from rounding errors.
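For concreteness, the continuous extension 22 is easy to evaluate once the stage values of an accepted step are at hand. The following Matlab function (our sketch; the inputs yn, h and the \(n\times 7\) stage matrix K are assumed to be available from the step) simply codes the polynomials \(d_j(s)\) above.

function z = dp45_cext(yn, h, K, s)
    % Evaluate the Dormand-Prince continuous extension (22) at s in [0,1].
    % yn: solution at the start of the step (n-by-1); h: stepsize;
    % K : n-by-7 matrix whose columns are the stage values k_1,...,k_7;
    % s : row vector of evaluation points in [0,1].
    d = [ -(1/384)*s.*(435*s.^3 - 1184*s.^2 + 1098*s - 384);   % d_1(s)
           zeros(size(s));                                      % d_2(s)
           (500/1113)*s.^2.*(6*s.^2 - 14*s + 9);                % d_3(s)
          -(125/192)*s.^2.*(9*s.^2 - 16*s + 6);                 % d_4(s)
           (729/6784)*s.^2.*(35*s.^2 - 64*s + 26);              % d_5(s)
          -(11/84)*s.^2.*(3*s - 2).*(5*s - 6);                  % d_6(s)
           (1/2)*s.^2.*(5*s - 3).*(s - 1) ];                    % d_7(s)
    z = yn + h*(K*d);                                           % equation (22)
end

Evaluating dp45_cext(yn, h, K, 0) returns \(y_n\) and dp45_cext(yn, h, K, 1) returns \(y_{n+1}\), in line with the observations about the \(d_j\) made above.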

Because the interpolant for ode45 is continuously differentiable and interpolates the skeleton, it is in particular globally continuous, and hence must have a residual no smaller than the minimal residual. In fact, because the skeleton is \(O(h^5)\) accurate and the interpolant is only \(O(h^4)\) accurate, we usually see that the residual using the interpolant of deval is substantially larger than the minimal residual.

The approximations for ode15s and ode113 are not continuously differentiable, by the way. They are not even continuous at both the mesh points \(t_n\) and \(t_{n+1}\). This implies that computing a residual using these approximations may not give an upper bound on the minimal residual. That is, the computed residual using these approximations is only an estimate of the residual, and may not be reliable in any consistent way.

We used both Maple and the B-series package in Julia [28], [29] to analyze the interpolant for ode45. Computation with the B-series package confirmed that the interpolant is \(4\)th order accurate, as we shall see in a moment. In detail, according to the B-series package, the leading terms of the residual are \[\begin{align} \frac{dz}{dt} - f(z) = s \left( 1-s \right)\Biggl( &\frac{1}{120} \left( 5 s^{2} - 9 s + 3 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.[.[.]]]]] \right)\mathclose{} + \frac{1}{120} \left( 5 s^{2} - 5 s + 1 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.[.][.]]]] \right)\mathclose{} \nonumber\\&{}+ \frac{1}{40} \left( 5 s^{2} - 5 s + 1 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.[.]][.]]] \right)\mathclose{} + \frac{1}{30} \left( 5 s^{2} - 5 s + 1 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.[.]]][.]] \right)\mathclose{} \nonumber\\&{}+ \frac{1}{120}\left( 5 s^{2} - 5 s + 1 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.][.][.]]] \right)\mathclose{} + \frac{1}{540} \left( 90 s^{2} - 92 s + 19 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.][.]][.]] \right)\mathclose{} \nonumber\\&{}+ \frac{1}{720} \left( 90 s^{2} - 92 s + 19 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.]][.[.]]] \right)\mathclose{} + \frac{1}{360} \left( 90 s^{2} - 92 s + 19 \right) F_{f}\mathopen{}\left( \rootedtree[.[.[.]][.][.]] \right)\mathclose{} \nonumber\\&{}+ \frac{1}{2160} \left( 90 s^{2} - 92 s + 19 \right) F_{f}\mathopen{}\left( \rootedtree[.[.][.][.][.]] \right)\mathclose{}\Biggr) h^4 + O(h^5) \end{align}\] Remember that \(t = t_n + sh\) so \(dz/dt = (dz/ds)/h\) by the chain rule.

The output of the B-series script has been tidied up by use of Maple so that one can see at a glance that the residual is zero at either end of the step (\(s=0\) and \(s=1\)). We use a common notation for elementary differentials and rooted trees. We are not aware of any published residual analysis of this method. We only print the fourth-order terms here, but more are available if needed. The B-series package is quite effective and efficient!

We refer back to Figure 1. The theory above predicts that the residual will behave like \(O(h^4)\) as the mean stepsize goes to zero (assuming that ode45 equidistributes the error, which it approximately does). We see that for the simple harmonic oscillator, this is true: the actual residual error diminishes like \(O(h^4)\), instead of \(O(h^5)\). [We do not show this here, but for the Van der Pol oscillator a phenomenon known as “order reduction” happens and we only get \(O(h^3)\) decay of the residual error.] We remind the reader that ode45 uses local extrapolation and is expected to have (“forward” or “global”) error that diminishes like \(O(h^5)\); we do not show it here, but that is indeed what happens. What we are seeing here is that the interpolant supplied by deval for ode45 is only \(O(h^4)\) accurate, and so the residual it yields is larger than it needs to be.

In what follows we demonstrate how the tools of optimal control can be used to understand the behaviour of numerical methods. For the first two examples the analysis can be carried out “by hand.” In Example 3, which involves the Van der Pol system of two first-order ODEs, an analytical solution is not possible to obtain; therefore, the optimal control problems are solved by the first-discretize-then-optimize approach [16]: direct discretization of an (infinite-dimensional) optimal control problem gives rise to a large-scale finite-dimensional optimization problem, which is then solved by using standard optimization software. We use the mathematical programming modelling language AMPL [17] employing the optimization software Knitro [18], version 13.0.1. We set AMPL’s tolerance parameters as feastol = 1e-12 and opttol = 1e-12 for Knitro.
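For readers who do not have AMPL or Knitro at hand, the following Matlab sketch (ours; it does not reproduce the authors' AMPL model) illustrates the same first-discretize-then-optimize idea on a single stage of Problem (PL2a) for the Van der Pol system \(\dot x_1 = x_2\), \(\dot x_2 = \mu(1-x_1^2)x_2 - x_1\); the value of \(\mu\), the stage data and the mesh size M below are illustrative assumptions.

mu = 1; t0 = 0; t1 = 0.1;                     % illustrative parameter and stage
z0 = [2; 0]; z1 = [1.995; -0.097];            % assumed skeleton nodes at t0 and t1
M  = 20; h = (t1 - t0)/M;                     % trapezoidal mesh within the stage
f  = @(x) [x(2); mu*(1 - x(1)^2)*x(2) - x(1)];

nx = 2*(M+1);                                  % w = [x(:); u(:)], each 2-by-(M+1)
unpack = @(w) deal(reshape(w(1:nx),2,M+1), reshape(w(nx+1:end),2,M+1));

w0   = [repmat(z0, M+1, 1); zeros(nx,1)];      % crude initial guess
opts = optimoptions('fmincon','Algorithm','interior-point','Display','off');
w    = fmincon(@(w) objfun(w,unpack,h,M), w0, [],[],[],[],[],[], ...
               @(w) confun(w,unpack,f,h,M,z0,z1), opts);
[~,u] = unpack(w);
fprintf('max |u| over the stage: %g\n', max(abs(u(:))));

function J = objfun(w, unpack, h, M)
    % Composite trapezoidal rule for (1/2) * int ||u(t)||^2 dt over the stage.
    [~,u] = unpack(w);
    s = sum(u.^2, 1);
    J = 0.5*h*(0.5*s(1) + sum(s(2:M)) + 0.5*s(M+1));
end

function [c,ceq] = confun(w, unpack, f, h, M, z0, z1)
    % Trapezoidal collocation of xdot = f(x) + u, plus interpolation at both ends.
    [x,u] = unpack(w);
    c = [];
    ceq = [x(:,1) - z0; x(:,M+1) - z1];
    for k = 1:M
        ceq = [ceq; x(:,k+1) - x(:,k) ...
               - (h/2)*( f(x(:,k)) + u(:,k) + f(x(:,k+1)) + u(:,k+1) )]; %#ok<AGROW>
    end
end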

5.1 Example 1: the Dahlquist test problem

Consider the Dahlquist test problem, \[\label{simpleeqn} \dot{z}(t) = a\,z(t)\,,\quad z(0) = z_0\,,\tag{23}\] where \(a\) is a nonzero real constant.

We recall that in [14] the \(L^\infty\)-norm of the relative residual is minimized in a given stage for the Dahlquist test problem by also using an optimal control approach. In Sections 5.1.1 and 5.1.2 below, we obtain analytical expressions for the absolute residuals that minimize the \(L^2\)- and stage \(L^\infty\)-norms, respectively, for the same ODE, and compare the two residuals.

5.1.1 \(L^2\)-minimization of the residual

Proposition 1 (\(L^2\)-residual of 23 ). For \(i = 1,\ldots,N\), the residual \[\label{simp_res_L2} u(t) = -2a\,\frac{z_i - e^{a(t_i-t_{i-1})}z_{i-1}}{1 - e^{2a(t_i-t_{i-1})}}\,e^{a(t_i-t)}\,,\qquad{(2)}\] for all \(t\in[t_{i-1},t_i]\), solves (PL2) with \(f(x(t),t) = a\,x(t)\). The resulting interpolant is given by \[\label{simp_x_L2} x(t) = -\frac{z_i - e^{-a(t_i-t_{i-1})} z_{i-1}}{e^{-a(t_i-t_{i-1})} - e^{a(t_i-t_{i-1})}}\,e^{a(t-t_{i-1})} - \frac{1}{2a}\,u(t)\,,\tag{24}\] for all \(t\in[t_{i-1},t_i]\).

Proof. The two-point boundary-value problem in ?? –?? in Theorem 1 reduces, for 23 and \(i=1,\ldots,N\), to \[\begin{align} &&\dot{x}(t) = a\,x(t) - \lambda(t),\quad x(t_{i-1}) = z_{i-1}\,,\;\;x(t_i) = z_i\,, \\ &&\dot{\lambda}(t) = -a\,\lambda(t)\,, \end{align}\] for all \(t\in[t_{i-1},t_i]\). Also, recall from Remark 1 that \(u(t) = -\lambda(t)\), and that the residual function \(u\) will typically have jumps at \(t_i\), \(i=1,\ldots,N-1\). The adjoint equation gives \(\lambda(t) = c\,e^{-a(t-t_{i-1})}\) for some constant \(c\); substituting \(u(t) = -\lambda(t)\) into the state equation and solving the resulting linear constant-coefficient ODE, the two constants of integration are determined by the boundary conditions \(x(t_{i-1}) = z_{i-1}\) and \(x(t_i) = z_i\). Straightforward manipulations then yield the expressions given in the proposition.

Remark 4. Note from ?? that the residual \(u(t)\) is an exponential in \(t\). The modulus of the residual \(|u(t)| = r\,e^{a(t_i-t)}\), with \(r>0\) a constant, is decreasing if \(a>0\) and increasing if \(a<0\), in any stage \(i = 1,\ldots,N\). In particular, the square of the \(L^2\)-norm of the residual over \([t_{i-1},t_i]\), in the \(i\)th stage, \(i=1,\ldots,N\), is given by \[\|u\|_{L^2}^2 = -2a\,\frac{\left(z_i - e^{a(t_i-t_{i-1})} z_{i-1}\right)^2}{1 - e^{2a(t_i-t_{i-1})}}\,.\]
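A quick numerical sanity check (ours) of Proposition 1 on a single stage is straightforward; the data \(a\), \(t_{i-1}\), \(t_i\), \(z_{i-1}\), \(z_i\) below are assumed, with the right node deliberately taken slightly off the exact trajectory.

a = 3; t0 = 0; t1 = 0.1; z0 = 1;
z1 = (1 + 1e-4)*exp(a*(t1 - t0))*z0;          % node slightly off the exact solution
d = t1 - t0;
u = @(t) -2*a*(z1 - exp(a*d)*z0)/(1 - exp(2*a*d)) .* exp(a*(t1 - t));   % residual of Proposition 1
x = @(t) -(z1 - exp(-a*d)*z0)/(exp(-a*d) - exp(a*d)) .* exp(a*(t - t0)) ...
         - u(t)/(2*a);                                                   % interpolant 24
fprintf('interpolation errors: %g, %g\n', x(t0) - z0, x(t1) - z1);       % both are zero
t = linspace(t0, t1, 7); dh = 1e-6;
xdot = (x(t + dh) - x(t - dh))/(2*dh);                                   % central difference
fprintf('max |xdot - a*x - u| = %g\n', max(abs(xdot - a*x(t) - u(t))));  % ~ 0 up to finite-difference error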

5.1.2 Stage \(L^\infty\)-minimization of the residual

Proposition 2 (Stage \(L^\infty\)-residual of 23 ). For \(i = 1,\ldots,N\), the residual \[\label{simp_res_Linfty} u(t) = -a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{a(t_i-t_{i-1})}} =: \overline{u}^{[i]}\,,\qquad{(3)}\] for all \(t\in[t_{i-1},t_i]\), solves (PLinf) with \(f(x(t),t) = a\,x(t)\). The resulting interpolant is given by \[\label{simp_x_Linfty} x(t) = \left(z_{i-1} + \frac{\overline{u}^{[i]}}{a}\right)\,e^{a(t-t_{i-1})} - \frac{\overline{u}^{[i]}}{a}\,,\tag{25}\] for all \(t\in[t_{i-1},t_i]\).

Proof. The DE ?? for the adjoint variable in Theorem 2 can be written for this example simply as \(\dot{\lambda}(t) = -a\,\lambda(t)\), the solution of which is \(\lambda(t) = c\,e^{-a\,t}\), where \(c\) is a real constant. Clearly, \(c\neq0\), as otherwise \(\lambda(t) = 0\) for all \(t\in[t_{i-1},t_i]\), which, by Lemma 3, is not permissible. The function \(\lambda(t)\) does not change sign, either; therefore, the optimal control cannot be singular, i.e., the optimal control is bang–bang, with \(u(t) = -\mathop{\mathrm{sgn}}(\lambda(t))\,\alpha^{[i]} = \overline{u}^{[i]}\), which is constant. The state equation in ?? can then be written as \[\dot{x}(t) - a\,x(t) = \overline{u}^{[i]}\,,\quad x(t_{i-1}) = z_{i-1}\,,\;\;x(t_i) = z_i\,,\] the solution of which yields 25 . Evaluating 25 at \(t=t_i\) and re-arranging gives ?? .

Remark 5 (Constancy of the residual). The residual \(u(t)\) in ?? is constant for all \(t\in[t_{i-1},t_i]\), with the constant value typically changing from one stage to the next. We observe that \[\label{alpha_i} \alpha^{[i]} = \left|\overline{u}^{[i]}\right| = \left|a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{a(t_i-t_{i-1})}}\right|\,.\qquad{(4)}\]

It is interesting to compare the residuals which solve Problems (PL2) and (PLinf), respectively. So we next provide a companion to Propositions 1 and 2.

Let \(u_{L^2}(t)\) denote \(u(t)\) given in ?? . Recall from ?? that \(\alpha^{[i]} = |\overline{u}^{[i]}|\). The following fact establishes a relationship between the (pointwise) residuals minimizing their \(L^2\)- and stage \(L^\infty\)-norms, respectively.

Proposition 3 (Comparison of stage \(L^\infty\)- and \(L^2\)-residuals). For \(i = 1,\ldots,N\), one has that \[\label{relation} \max_{t_{i-1}\le t\le t_i} |u_{L^2}(t)| = \left\{\begin{array}{ll} \displaystyle\frac{2}{1 + e^{a(t_i-t_{i-1})}}\,\alpha^{[i]}, &\text{if}\;\; a<0\,, \\[4mm] \displaystyle\frac{2\,e^{a(t_i-t_{i-1})}}{1 + e^{a(t_i-t_{i-1})}}\,\alpha^{[i]}, &\text{if}\;\;a>0\,. \end{array}\right.\qquad{(5)}\] It follows that \[\label{eqn:compare} \alpha^{[i]} < \max_{t_{i-1}\le t\le t_i} |u_{L^2}(t)|\qquad{(6)}\] for any \(a\neq0\).

Proof. Suppose that \(a<0\). Then, since \(|u_{L^2}|\) is increasing over \([t_{i-1},t_i]\) by Remark 4, \[\begin{align} \max_{t_{i-1}\le t\le t_i} |u_{L^2}(t)| &=& |u_{L^2}(t_i)| = 2\,\left|a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{2a(t_i-t_{i-1})}}\right| \\ &=& \frac{2}{1 + e^{a(t_i-t_{i-1})}}\,\left|a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{a(t_i-t_{i-1})}}\right| \\ &=& \frac{2}{1 + e^{a(t_i-t_{i-1})}}\,\alpha^{[i]}\,, \end{align}\] by using ?? . Suppose that \(a>0\). In this case, since \(|u_{L^2}|\) is decreasing over \([t_{i-1},t_i]\) by Remark 4, \[\begin{align} \max_{t_{i-1}\le t\le t_i} |u_{L^2}(t)| &=& |u_{L^2}(t_{i-1})| = 2\,\left|a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{2a(t_i-t_{i-1})}}\right|\,e^{a(t_i-t_{i-1})} \\ &=& \frac{2\,e^{a(t_i-t_{i-1})}}{1 + e^{a(t_i-t_{i-1})}}\,\left|a\,\frac{z_i - e^{a(t_i-t_{i-1})} z_{i-1}}{1 - e^{a(t_i-t_{i-1})}}\right| \\ &=& \frac{2\,e^{a(t_i-t_{i-1})}}{1 + e^{a(t_i-t_{i-1})}}\,\alpha^{[i]}\,. \end{align}\] This proves ?? . The comparison made in ?? can simply be furnished by showing that the coefficients of \(\alpha^{[i]}\) displayed in ?? are greater than 1: if \(a<0\), then \(1 + e^{a(t_i-t_{i-1})}<2\); if \(a>0\), then \(2\,e^{a(t_i-t_{i-1})} = e^{a(t_i-t_{i-1})} + e^{a(t_i-t_{i-1})} > 1 + e^{a(t_i-t_{i-1})}\).

Remark 6 (Residuals for large \(|a|(t_i-t_{i-1})\)). As \(|a|(t_i-t_{i-1})\to\infty\), we have that \(\max_{t_{i-1}\le t\le t_i} |u_{L^2}(t)|\to 2\alpha^{[i]}\). Therefore, for relatively large values of the growth rate \(|a|\) in the Dahlquist test problem 23 and of the step-size \((t_i-t_{i-1})\), stage \(L^\infty\)-minimization of the residual (with pointwise values of at most \(\alpha^{[i]}\)) is a far better approach than \(L^2\)-minimization of the residual (with pointwise values approaching \(2\,\alpha^{[i]}\)).
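A quick numerical check (plain Julia; the sample values are ours) confirms that the ratio in Proposition 3 approaches 2 as the product of \(|a|\) and the step-size grows:

# Ratio  max|u_{L²}| / α^{[i]}  from eq. (5), as a function of w = a (t_i - t_{i-1})
ratio_pos(w) = 2 * exp(w) / (1 + exp(w))    # case a > 0, so w > 0
ratio_neg(w) = 2 / (1 + exp(w))             # case a < 0, so w < 0
@show ratio_pos(0.1)    # ≈ 1.05 : for small steps the two criteria are comparable
@show ratio_pos(5.0)    # ≈ 1.99 : the L² residual peaks at nearly 2 α^{[i]}
@show ratio_neg(-5.0)   # ≈ 1.99 : same conclusion for a < 0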

5.1.3 A graphical illustration of the residuals↩︎

Figure 2 depicts the residual graphs for various criteria with \(a=3\), \(z_0 = 1\), and \(t_f = 1\), for the Dahlquist test problem stated in 23 . To obtain the numerical skeleton, the popular Matlab solvers ode15s, ode45 and ode113 (see [9]) have been used. The number of mesh points corresponding to each solver turns out to be \(N = 78\), \(37\) and \(37\), respectively. With each numerical skeleton, approximating curves were generated using Matlab’s deval, as well as through the \(L^2\)- and stage \(L^\infty\)-minimization of the ODE residual, as given in 24 and 25 .

The residuals for each case and approach, nine of them altogether, are depicted in Figure 2, where the residual \(u_M(t)\) is produced using deval, and the residuals \(u_{L^2}(t)\) and \(u_{L^\infty}(t)\) of \(L^2\)- and stage \(L^\infty\)-minimization are computed as expressed in ?? and ?? , respectively.

First of all, we observe that ?? is verified in each of the graphs in Figure 2 (e)–(j). In the graphs in Figure 2 (e)–(h), the worst residuals look indistinguishable. The graphs in Figure 2 further show that, with ode15s and ode45, both \(\max_{0\le t\le 1}|u_{L^2}(t)|\) and \(\max_{0\le t\le 1}|u_{L^\infty}(t)|\) are better, i.e., smaller, than \(\max_{0\le t\le 1}|u_M(t)|\). With ode113, however, \(\max_{0\le t\le 1}|u_M(t)|\) turns out to be the smallest.

This is surprising: the polynomial approximation from the code has produced a residual that is (apparently) smaller than the minimum possible. The resolution is that the approximation from deval for the solution by ode113 is not continuous. As previously stated, the minimum possible residual from a discontinuous approximation is, in fact, zero, by using the local exact solution as the approximation.

Using discontinuous approximations \(z(t)\) introduces Dirac delta functions into the derivative \(\dot{z}(t)\), which means that the Gröbner–Alexeev nonlinear variation of constants formula (equation 3 ) will also have jumps at the nodes. This is, indeed, one classical analysis of the connection of “local error” to “global error.”

For the present context, it means that the residual computed from the approximation produced by deval for the solution by ode113 is not reliable. Our analysis, in contrast, gives a reliable estimate.

All cases in Figure 2 considered, the ode45 skeleton with the interpolant minimizing the stage \(L^\infty\)-norm of the residual in part (i) stands out as the most satisfactory.

Figure 2: Example 1 – (a) interpolant for \dot{z}(t) = 3\,z(t), 0\le t\le1, z(0)=1, and interpolant residuals computed using (b),(c),(d) Matlab’s deval, (e),(f),(g) the L^2-residual in ?? and (h),(i),(j) the stage L^\infty-residual in ?? . The skeletons in each row were obtained using the Matlab packages ode15s, ode45 and ode113, as indicated on the graphs. We see incidentally that the polynomial interpolant for ode45 is indeed continuously differentiable, while those for ode15s and ode113 are not, although the residual (in (d)) for ode113 is zero as s \to 1^- on each step.

5.2 Example 2: a scalar nonlinear ODE↩︎

Now, we consider another scalar but nonlinear ODE given by \[\label{sqrteqn} \dot{z}(t) = \sqrt{z(t)}\,,\quad z(0) = 1\,.\tag{26}\] This ODE is reminiscent of the model for the celebrated “leaky bucket problem” given by \(\dot{z}(t) = -\sqrt{z(t)}\). An elementary relative residual analysis of the leaky bucket problem can be found in [15], intended for classroom use. Because this scalar autonomous problem is so simple, all the optimization can be done analytically, and this can help students to understand the effect of numerical methods. In this present paper we first carry out an analytical investigation similar to that of [15], in order to explain the idea, but then generalize the approach by using numerical optimization in order to deal with more complicated problems.

For the numerical methods ode15s and ode45 this is an informative example, as we will see. But “unfortunately” the numerical method that ode113 uses actually gets the exact solution to this problem (apart from roundoff)—this is a coincidence, and arises in part because the exact solution is a polynomial of degree two. Other methods, such as a second-order Taylor series method, also would get the exact solution. This means that the minimal residual would be zero, if exact computation were used. In practice, when we show the residuals that we actually compute, we will see only rounding error for the results of ode113 on this example. Nonetheless we have included the graphs for completeness.

5.2.1 \(L^2\)-minimization of the residual↩︎

Proposition 4 (\(L^2\)-residual of 26 ). The interpolant that solves Problem (PL2) with \(f(x(t),t) = \sqrt{x(t)}\) is a quadratic, namely \[\label{sqrt_x_L2} x(t) = \frac{t^2}{4} + c_1\,t + c_2\,,\tag{27}\] for all \(t\in[t_{i-1},t_i]\), and all \(i = 1,\ldots,N\), where \[c_1 = \frac{1}{t_i-t_{i-1}}\,\left(z_i - z_{i-1} - \frac{t_i^2 - t_{i-1}^2}{4}\right)\quad\text{and}\quad c_2 = z_{i-1} - \frac{t_{i-1}^2}{4} - t_{i-1}\,c_1\,.\] The residual is in turn given by \[\label{sqrt_res_L2} u(t) = \frac{t}{2} + c_1 - \sqrt{x(t)}\,,\tag{28}\] for all \(t\in[t_{i-1},t_i]\), and all \(i = 1,\ldots,N\).

Proof. The two-point boundary-value problem in ?? –?? in Theorem 1 reduces, for 26 and \(i=1,\ldots,N\), to \[\begin{align} &&\dot{x}(t) = \sqrt{x(t)} - \lambda(t)\,,\quad x(t_{i-1}) = z_{i-1}\,,\;\;x(t_i) = z_i\,, \\ &&\dot{\lambda}(t) = -\frac{1}{2\,\sqrt{x(t)}}\,\lambda(t)\,, \end{align}\] for all \(t\in[t_{i-1},t_i]\). Also, recall from Remark 1 that \(u(t) = -\lambda(t)\), and that the residual function \(u\) will typically have jumps at \(t_i\), \(i=1,\ldots,N-1\). Taking derivatives of both sides of the state equation, one gets \[\ddot{x}(t) = \frac{1}{2\,\sqrt{x(t)}}\,(\sqrt{x(t)} - \lambda(t)) - \dot{\lambda}(t) = \frac{1}{2} - \frac{1}{2\,\sqrt{x(t)}}\,\lambda(t) - \dot{\lambda}(t) = \frac{1}{2}\,,\] integration of which, twice, yields 27 , with the constants \(c_1\) and \(c_2\) obtained from the boundary conditions for the state equation. Finally, \(u(t) = -\lambda(t) = \dot{x}(t) - \sqrt{x(t)}\), with the substitution of \(\dot{x}(t) = t/2 + c_1\), furnishes 28 .
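The formulas in Proposition 4 can be evaluated directly on a given skeleton. The Julia sketch below (the function name is ours, for illustration only) builds the quadratic 27 and the residual 28 on one stage; the exact solution \(z(t)=(1+t/2)^2\) of 26 is used only to generate sample nodes, with a small perturbation at the right node so that the residual is not identically zero.

# L²-minimal interpolant (a quadratic) and its residual for ẋ = √x on one stage
# [t0, t1] with node values z0 and z1.  Illustrative sketch only.
function quadratic_stage(t0, t1, z0, z1)
    c1 = (z1 - z0 - (t1^2 - t0^2) / 4) / (t1 - t0)
    c2 = z0 - t0^2 / 4 - t0 * c1
    x(t) = t^2 / 4 + c1 * t + c2            # interpolant, eq. (27)
    u(t) = t / 2 + c1 - sqrt(x(t))          # residual  ẋ(t) - √x(t), eq. (28)
    return x, u
end

t0, t1 = 0.3, 0.4
z0, z1 = (1 + t0 / 2)^2, (1 + t1 / 2)^2 + 1e-4       # nodes from the exact solution, perturbed
x, u = quadratic_stage(t0, t1, z0, z1)
@assert isapprox(x(t0), z0) && isapprox(x(t1), z1)   # interpolation conditions
@show maximum(abs.(u.(range(t0, t1; length = 101)))) # pointwise size of the residual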

5.2.2 Stage \(L^\infty\)-minimization of the residual↩︎

Proposition 5 (Stage \(L^\infty\)-residual of 26 ). The residual \(u(t) =: \overline{u}^{[i]}\), a constant for \(t\in[t_{i-1},t_i]\) and all \(i = 1,\ldots,N\), which solves (PLinf) with \(f(x(t),t) = \sqrt{x(t)}\), is given by the implicit equation \[\label{sqrt_res_Linfty} \overline{u}^{[i]}\,\ln\left(\frac{\sqrt{z_i}+\overline{u}^{[i]}}{\sqrt{z_{i-1}}+\overline{u}^{[i]}}\right) = \sqrt{z_i}-\sqrt{z_{i-1}} - \frac{t_i-t_{i-1}}{2}\,.\qquad{(7)}\] A corresponding implicit solution for \(x(t)\), for all \(t\in[t_{i-1},t_i]\), is then \[\label{sqrt_x_Linfty} \sqrt{x(t)} - \overline{u}^{[i]}\,\ln\left(\frac{\sqrt{x(t)} + \overline{u}^{[i]}}{\sqrt{z_{i-1}} + \overline{u}^{[i]}}\right) = \sqrt{z_{i-1}} + \frac{t-t_{i-1}}{2}\,.\qquad{(8)}\]

Proof. The DE ?? for the adjoint variable in Theorem 2 can be written for this example simply as \(\dot{\lambda}(t) = -(1/(2\,\sqrt{x(t)}))\,\lambda(t)\), the solution of which is \[\lambda(t) = c\,e^{-\int 1/(2\,\sqrt{x(t)})\,dt}\,,\] where \(c\) is a real constant. Clearly, \(c\neq0\), as otherwise \(\lambda(t) = 0\) for all \(t\in[t_{i-1},t_i]\) and so the optimal control is singular over \([t_{i-1},t_i]\), which is not possible by Lemma 3. With \(c\neq0\), one has that \(\lambda(t) \neq 0\) and does not change sign; therefore, the optimal control is nonsingular and \(u(t) = -\mathop{\mathrm{sgn}}(\lambda(t))\,\alpha^{[i]} = \overline{u}^{[i]}\), a constant for all \(t\in[t_{i-1},t_i]\). Then the state equation in ?? simply becomes \[\dot{x}(t) - \sqrt{x(t)} = \overline{u}^{[i]}\,,\quad x(t_{i-1}) = z_{i-1}\,,\;\;x(t_i) = z_i\,.\] The DE above can be re-arranged and integrated, and manipulations carried out, to get \[\begin{align} && \int_{t_{i-1}}^t \frac{\dot{x}(\tau)\,d\tau}{\sqrt{x(\tau)} + \overline{u}^{[i]}} = \int_{t_{i-1}}^t d\tau \nonumber \\[2mm] && 2\left[\left(\sqrt{x(t)} - \overline{u}^{[i]}\,\ln\left(\sqrt{x(t)} + \overline{u}^{[i]}\right)\right) - \left(\sqrt{z_{i-1}} - \overline{u}^{[i]}\,\ln\left(\sqrt{z_{i-1}} + \overline{u}^{[i]}\right)\right)\right] = t - t_{i-1} \label{x_expanded} \\ && \sqrt{x(t)} - \sqrt{z_{i-1}} - \overline{u}^{[i]}\,\ln\left(\frac{\sqrt{x(t)} + \overline{u}^{[i]}}{\sqrt{z_{i-1}} + \overline{u}^{[i]}}\right) = \frac{t - t_{i-1}}{2} \nonumber \end{align}\tag{29}\] which yields ?? . Evaluating ?? at \(t=t_i\) and re-arranging gives ?? .
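In practice, the implicit equation ?? is easy to solve numerically for \(\overline{u}^{[i]}\) on each stage, for instance by bisection. The Julia sketch below (the function name and the bracket for the root are ours, for illustration only) does this for one stage and checks the result against the implicit interpolant ?? evaluated at \(t=t_i\).

# Solve eq. (7) for the constant stage residual ū of ẋ = √x by bisection.
# Illustrative sketch only; the bracket [lo, hi] must enclose a sign change.
function stage_residual_sqrt(t0, t1, z0, z1; lo = -0.5, hi = 0.5, tol = 1e-14)
    g(u) = u * log((sqrt(z1) + u) / (sqrt(z0) + u)) -
           (sqrt(z1) - sqrt(z0) - (t1 - t0) / 2)        # left minus right side of eq. (7)
    @assert g(lo) * g(hi) < 0 "bracket must enclose a sign change"
    for _ in 1:200
        mid = (lo + hi) / 2
        g(lo) * g(mid) <= 0 ? (hi = mid) : (lo = mid)
        hi - lo < tol && break
    end
    return (lo + hi) / 2
end

t0, t1 = 0.3, 0.4
z0, z1 = (1 + t0 / 2)^2, (1 + t1 / 2)^2 + 1e-4     # nodes from the exact solution, perturbed
ū = stage_residual_sqrt(t0, t1, z0, z1)

# Consistency check: eq. (8) evaluated at t = t1 with x = z1 reduces to eq. (7),
# so the defect below should vanish up to the bisection tolerance.
defect = sqrt(z1) - ū * log((sqrt(z1) + ū) / (sqrt(z0) + ū)) - (sqrt(z0) + (t1 - t0) / 2)
@assert abs(defect) < 1e-10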

Remark 7 (Lambert \(W\) function). Since \(x(t) > 0\), after manipulations on 29 , the interpolating curve \(x(t)\), implicitly given in ?? , can alternatively be expressed as \[x(t) = \left\{-\overline{u}^{[i]}\,\left(W\!\left(-\frac{1}{\overline{u}^{[i]}\,e}\,\exp\!\left(-\frac{1}{\overline{u}^{[i]}}\left[\frac{t - t_{i-1}}{2} + \sqrt{z_{i-1}} - \overline{u}^{[i]}\,\ln\left(\sqrt{z_{i-1}} + \overline{u}^{[i]}\right)\right]\right)\right) + 1\right)\right\}^2\,,\] for all \(t\in[t_{i-1},t_i]\), \(i=1,\ldots,N\), where \(W(\cdot)\) denotes (the appropriate real branch of) the Lambert \(W\) function. (The Lambert \(W\) function is defined implicitly by \(w = W(a)\), where \(w\,e^w = a\); see [30].)

5.2.3 A graphical illustration of the residuals↩︎

Figure 3 depicts the graphs of the residuals for various criteria. As with Example 1, to obtain the numerical skeleton, the Matlab solvers ode15s, ode45 and ode113 have been used. The number of mesh points corresponding to the solutions produced by each solver turns out to be \(N = 25\), \(12\) and \(22\), respectively. With each numerical skeleton, interpolating curves were generated using Matlab’s deval, as well as through the \(L^2\)- and stage \(L^\infty\)-minimization of the ODE residual, as given in 27 and ?? . The residuals for each case and approach, nine of them altogether, are depicted in Figure 3, where the residuals resulting from \(L^2\)- and stage \(L^\infty\)-minimization are computed using 28 and ?? .

As in Example 1, with ode15s and ode45, both \(\max_{0\le t\le 1}|u_{L^2}(t)|\) and \(\max_{0\le t\le 1}|u_{L^\infty}(t)|\) are less than \(\max_{0\le t\le 1}|u_M(t)|\). Overall, \(L^\infty\)-minimization seems to result in slightly smaller residuals than \(L^2\)-minimization.

All cases in Figure 3 considered, the ode45 skeleton with the stage \(L^\infty\)-minimal interpolant yields the smallest residual, roughly two orders of magnitude smaller than the residual obtained by Matlab’s deval.

Figure 3: Example 2 – (a) interpolant for \dot{z}(t) = \sqrt{z(t)}, 0\le t\le1, z(0)=1, and interpolant residuals computed using (b),(c),(d) Matlab’s deval, (e),(f),(g) the L^2-residual in ?? and (h),(i),(j) the stage L^\infty-residual in ?? . The skeletons in each row were obtained using the Matlab packages ode15s, ode45 and ode113, as indicated on the graphs. Figures (d), (g), and (j) are not very informative, because ode113 actually gets the exact solution to this problem: the minimal residual is zero, apart from roundoff error.

5.3 Example 3: the Van der Pol equation↩︎

Next, we consider the Van der Pol oscillator, a second order ODE used to model electrical circuits. It can be represented as a system of two first-order nonlinear ODEs as follows. \[\label{ode:vdp} \begin{array}{ll} \dot{z}_1(t) = z_2(t)\,, &\; z_1(0) = z_{1,0}\,, \\[2mm] \dot{z}_2(t) = -z_1(t) - (z_1(t)^2 - 1)\,z_2(t)\,, &\; z_2(0) = z_{2,0}\,, \end{array}\tag{30}\] for \(t\in[t_0,t_f]\).

Unlike Examples 1 and 2, we cannot obtain analytical solutions for the interpolating curves that minimize the \(L^2\)- and stage \(L^\infty\)-residuals of equation 30 . As explained at the beginning of Section 5, we can instead solve a direct discretization of the optimization problems (PL2a) and (PLinfb) to get approximate solutions [16].

For discretizing Problem (PL2a), we use the trapezoidal rule with 2000 subintervals in each stage. For discretizing Problem (PLinfb), on the other hand, we employ Euler’s method, since the optimal control (or the optimal residual) is in general of bang–bang type and therefore is not even continuous, making the interpolants continuous but non-differentiable. Although Euler’s method (in theory) requires differentiability of the solution curves, it has previously been successfully implemented to solve problems with bang–bang and singular types of optimal control, in particular in estimating the switching locations (or the “switching times”) of the control function—see e.g. [31][33] and the references therein. With all the numerical skeletons generated by ode15s, ode45 and ode113, we use 2000 subintervals for Euler’s method as well in each stage.
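To make the direct approach concrete, the following Julia sketch shows one plausible transcription of the Euler discretization of the stage \(L^\infty\) problem for 30 on a single stage. This is our own illustration, not the code used for the experiments reported in this section: we use JuMP with the Ipopt solver (our choices here), and we assume that the stage objective bounds both residual components pointwise by a common bound \(\alpha\); the function name and the sample node values are ours.

using JuMP, Ipopt

function stage_linf_vdp(t0, t1, zA, zB; n = 2000)   # 2000 subintervals per stage, as in the text
    h = (t1 - t0) / n
    m = Model(Ipopt.Optimizer)
    set_silent(m)
    @variable(m, x1[0:n]); @variable(m, x2[0:n])      # interpolant values on the Euler grid
    @variable(m, u1[0:n-1]); @variable(m, u2[0:n-1])  # residual (control) values
    @variable(m, α >= 0)
    # Euler discretization of the controlled dynamics  ẋ1 = x2 + u1,  ẋ2 = -x1 - (x1^2 - 1) x2 + u2
    @constraint(m, [k = 0:n-1], x1[k+1] == x1[k] + h * (x2[k] + u1[k]))
    @NLconstraint(m, [k = 0:n-1],
        x2[k+1] == x2[k] + h * (-x1[k] - (x1[k]^2 - 1) * x2[k] + u2[k]))
    # Bound each residual component pointwise by α (our reading of the stage L∞ objective)
    @constraint(m, [k = 0:n-1], u1[k] <= α);  @constraint(m, [k = 0:n-1], -α <= u1[k])
    @constraint(m, [k = 0:n-1], u2[k] <= α);  @constraint(m, [k = 0:n-1], -α <= u2[k])
    # Match the skeleton nodes at both ends of the stage
    @constraint(m, x1[0] == zA[1]);  @constraint(m, x2[0] == zA[2])
    @constraint(m, x1[n] == zB[1]);  @constraint(m, x2[n] == zB[2])
    @objective(m, Min, α)
    optimize!(m)
    return value(α), value.(u1), value.(u2)
end

# Hypothetical usage on one short stage (the node values are placeholders, not from any run):
# αopt, U1, U2 = stage_linf_vdp(0.0, 0.01, (-1.0, -3.0), (-1.03, -2.97); n = 200)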

Figure 4 shows the graphs of the residuals resulting from the interpolants of the numerical skeleton of 110 nodes generated by ode15s. In the graphs, subscripts 1 and 2 are used to denote the two components of the residual vector. All graphs in Figure 4 (a)–(f) considered, \(\max_{0\le t\le 1} \|u_{L^\infty}(t)\|\) appears to be smaller than both \(\max_{0\le t\le 1} \|u_M(t)\|\) and \(\max_{0\le t\le 1} \|u_{L^2}(t)\|\). The optimality condition ?? seems to be verified: Figure 5, especially when zoomed in, shows that \((v_{L^\infty})_i(t) = \mathop{\mathrm{sgn}}(\lambda_i(t))\), \(i = 1,2\), as required by the theory. We provide the Matlab fig files, as well as portable network graphics (png) files, of the graphs involving switchings (Figure 5, and also Figures 7 and 9) at https://github.com/rcorless/OptimalResiduals, so that the reader can zoom in and explore them.

Figure 6 presents the graphs of the residuals resulting from the interpolants of the numerical skeleton of 36 nodes generated by ode45. In this case, \(\max_{0\le t\le 1} \|u_{L^2}(t)\|\) is the smallest, and \(\max_{0\le t\le 1} \|u_{L^\infty}(t)\|\) the largest. We note that, as can be seen in Figure 7, the optimality condition ?? seems to be verified.

In Figure 8, we provide the graphs of the residuals resulting from the interpolants of the numerical skeleton of 56 nodes generated by ode113. As in the case of ode45, \(\max_{0\le t\le 1} \|u_{L^2}(t)\|\) is again the smallest. We also observe in Figure 9 that the graphs of the adjoint variables \(\lambda_1(t)\) and \(\lambda_2(t)\), and of the optimal controls \(v_1\) and \(v_2\), seem to verify the optimality condition ?? .

In each of the cases above, \((v_{L^\infty})_1(t)\) and \((v_{L^\infty})_2(t)\) seem to have at most one switching in each stage.

Figure 4: Example 3 – interpolant residual vector components computed for equation 30 and Matlab’s ode15s, with 0\le t\le1 and x(0)=(-1,-3), using (a)–(b) Matlab’s deval, (c)–(d) L^2-minimization and (e)–(f) stage L^\infty-minimization.
Figure 5: Example 3 – (a)–(b) control variable vector and (c)–(d) costate variable vector components, all for equation 30 , with 0\le t\le1, x(0)=(-1,-3), and Matlab’s ode15s.
Figure 6: Example 3 – interpolant residual vector components computed for equation 30 and Matlab’s ode45, with 0\le t\le1 and x(0)=(-1,-3), using (a)–(b) Matlab’s deval, (c)–(d) L^2-minimization and (e)–(f) stage L^\infty-minimization.
Figure 7: Example 3 – (a)–(b) control variable vector and (c)–(d) costate variable vector components, all for equation 30 , with 0\le t\le1, x(0)=(-1,-3), and Matlab’s ode45.
Figure 8: Example 3 – interpolant residual vector components computed for equation 30 and Matlab’s ode113, with 0\le t\le1 and x(0)=(-1,-3), using (a)–(b) Matlab’s deval, (c)–(d) L^2-minimization and (e)–(f) stage L^\infty-minimization.
Figure 9: Example 3 – (a)–(b) control variable vector and (c)–(d) costate variable vector components, all for equation 30 , with 0\le t\le1, x(0)=(-1,-3), and Matlab’s ode113.

6 Discussion of Results↩︎

In all the examples using ode45, we see that the minimizing residuals (either \(L^2\) or \(L^\infty\)) are never larger than those produced by the polynomial interpolants used by deval, and are in some cases a hundred times smaller. This independent confirmation of the quality of the computed skeleton of the solution to the differential equation provides a useful reassurance that the code has done its job. In the case of ode113 and ode15s, where the polynomial approximations are not continuous, the residuals computed thereby are not reliable; the examples here show that (except when ode113 accidentally got the exact solution) the true residual can be larger than that computed using polynomials.

The fact that the optimal residuals are not continuous is perhaps disconcerting. They come from an interpolant that is continuous but not continuously differentiable. We point out that requiring the interpolant to be continuously differentiable is usually not possible: if the interpolant is continuously differentiable, then it will almost always be possible to find another interpolant with a slightly smaller residual. That is, the solution to the minimization problem will not actually exist, if we insist on too much continuity. This is a familiar fact, discussed in many textbooks, for example [34]. Nonetheless, the modeller may not be expecting discontinuous inputs to the right-hand side of the model differential equations, and may be happier with the larger residuals produced by smoother interpolants.

A further refinement of this idea is possible, as mentioned in [15]: one might compute a “modified equation” to account for the largest part of the residual, and then use the techniques of this present paper to find the best interpolant that fits that modified equation. If we follow that process, the discontinuous residual will typically be an order of magnitude smaller.

For clarity, here is an example. Suppose we are solving \(\dot{y} = F_f(y)\) by using the parameterized 2nd order Runge–Kutta method \[\begin{align} \begin{array}{c|cc} 0\; \\ \frac{1}{2a}\; & \;\frac{1}{2a}\; \\ \hline & 1-a & a \end{array} \end{align}\] This is a 2nd order Runge–Kutta method for any choice of \(a \ne 0\).
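Written out directly from this tableau, one step of the method from \(y_n\) to \(y_{n+1}\) reads \[k_1 = F_f(y_n)\,,\qquad k_2 = F_f\!\left(y_n + \frac{h}{2a}\,k_1\right)\,,\qquad y_{n+1} = y_n + h\left[(1-a)\,k_1 + a\,k_2\right]\,;\] expanding \(k_2\) in a Taylor series shows that the \(h^2\) term of \(y_{n+1}\) is \(\tfrac{1}{2}h^2 F_f'F_f\) irrespective of \(a\), which is why the order is two for every \(a\neq 0\).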

Using the BSeries.jl package we mentioned previously, the modified equation is (to 2nd order) computed as follows.

using BSeries
using Latexify
import SymPyPythonCall; sp = SymPyPythonCall;
a = sp.symbols("a", real = true)
A = [0 0; 1/(2*a) 0]; b = [1-a, a]; c = [0, 1/(2*a)]
coeffs2 = bseries(A, b, c, 3)
meq = modified_equation(coeffs2)

The output of the above code has to be divided by \(h\) (because of the normalization conventions of the package) in order to construct the right-hand side of the modified equation below. Again, we use a common notation for elementary differentials using trees. \[\dot{y} = F_{f}\mathopen{}\left( \rootedtree[.] \right)\mathclose{} + \frac{ - h^{2}}{6} F_{f}\mathopen{}\left( \rootedtree[.[.[.]]] \right)\mathclose{} + h^{2} \left( \frac{-1}{6} + \frac{1}{8 a} \right) F_{f}\mathopen{}\left( \rootedtree[.[.][.]] \right)\mathclose{} + O(h^3)\] If we take the skeleton and minimize the residual not for \(\dot{y} = F_f\) but rather for the above equation, then we would find that the minimal residual would be \(O(h^3)\) in size rather than \(O(h^2)\). That is, the largest part of the residual would be explained by the method of modified equations, and only a minor part would contain discontinuities (and its magnitude, if small, would be a guarantee of fidelity to the model).

One is tempted to choose \(a=3/4\) (this is Ralston’s method) which removes one term from the modified equation above, because that would make the example simpler. But the purpose of this paper is not to improve numerical methods but rather to provide tools from control theory that would explain their success. For that, the choice of parameter makes no difference.

7 Concluding Remarks↩︎

We have made several remarks about so-called “naïve users” of modern codes for solving IVPs for ODEs numerically, saying that such users cannot be expected to understand just what the codes are doing. This is of course an oversimplification; yes, modern users are busy people, focused on what the results of their simulations mean rather than on the intricacies of the basic and well-tested codes they are using in their simulations. Yes, they should know how to use modern codes, which are excellent and efficient. See in particular the Julia package by Rackauckas and Nie [35], which we find especially impressive. But it seems somehow unfair to force users to learn all the details of how the codes work: the codes should just work as the users expect them to.

Unfortunately, the codes do not always work as users expect them to, even though the documentation is usually clear. One issue is, we believe, that local error and its relation to forward error (also known as “global error”) is not as widely understood as it should be. Another and perhaps more significant issue is that the theory of error is true only asymptotically as the (mean) stepsize \(h\) goes to zero, while for efficiency the code tries to make \(h\) as large as possible consistent with the tolerances. This antinomy is managed by heuristics and “magic constants” in many codes. A further issue is that the language used to discuss the polynomial approximations provided by deval sometimes uses the words “interpolant” and sometimes “continuous extension.” These words are sometimes incorrect, strictly speaking, and can be misleading.

In cases where lives or a lot of money are involved, and greater surety of the correctness of the simulation is needed, it is incumbent on the user to be aware of that fact, and to check the results. There are many tools already in existence for doing so. Examples include using more reliable (continuously differentiable) interpolants [11] and computing residuals using those7. An alternative would be to use a different method, one that controls the residual directly [36]. This paper provides even more tools, which may be familiar to people who know some control theory.

The method we advocate here will guarantee that the code has solved a “nearby problem” and quantify just how near. Of course, one has to understand the effects of changing the problem to a nearby one, perhaps by using the Gröbner–Alexeev nonlinear variation of constants formula, or studying the sensitivity of the model to changes in another way, but this has to be done even if one has the exact analytical solution. See [6] for more discussion of this point.

We re-iterate that we used the stage \(L^\infty\)-norm mainly because we were able to obtain analytical expressions for the residual in Examples 1 and 2. With the true \(L^\infty\)-norm over the whole interval \([t_0,t_f]\), this advantage is lost, and the residual can only be obtained numerically. The multi-stage, or multi-process, optimal control setting we have used in the present paper already provides the means to compute such a residual. The (true) \(L^\infty\)-minimization should be expected to result in residuals which are better than those obtained by the stage \(L^\infty\)-minimization.

In our numerical experiments, stage \(L^\infty\)-minimization resulted in at most one switching of the optimal control in each stage. Therefore, another improvement in stage \(L^\infty\)-minimization could be achieved by parametrizing the problem with respect to the switching times and thus computing the residuals more accurately—see [37] and the references therein. This would also allow one to use higher-order discretization schemes in the intervals between consecutive switchings, as in [37].

As a final remark, we see in [38] (a paper from more than thirty years ago) a similar idea in embryo. Only \(L^2\) minimization was considered in that paper, and only sketched for a simple example. But in one respect that paper advocates something that we have not done here, namely to include other terms in the equation which are motivated physically in order to interpret some aspects of the numerical solution. The example used there involved numerical solution of the simple harmonic oscillator \(\ddot y + y = 0\) and in the analysis thereof included an artificial damping term \(2\zeta \dot{y}\) with an unknown damping \(\zeta\). The paper suggested minimizing the residual with an optimal \(\zeta\) in order to interpret the numerics. We note that this can now be done more generally using the methods of this present paper.

Acknowledgements We are grateful to David Ketcheson and Hendrik Ranocha for help with the B-series package. We thank the author of [39] for putting the LaTeX source for that paper on arXiv, which allowed us to avoid the effort of constructing the Butcher tableau for the Dormand–Prince RK (5,4) pair for ourselves; this is in addition to the work of the paper itself, which we found helpful. This work was partially supported by NSERC grant RGPIN-2020-06438.

References↩︎

[1]
Ernst Hairer, Syvert P Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, 1993. URL: https://books.google.ca/books?id=F93u7VcSRyYC.
[2]
Ernst Hairer and Gerhard Wanner. Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Berlin Heidelberg, 1996. URL: http://dx.doi.org/10.1007/978-3-642-05221-7.
[3]
John C. Butcher. Numerical Methods for Ordinary Differential Equations. Wiley, July 2016. URL: http://dx.doi.org/10.1002/9781119121534.
[4]
Uri M. Ascher, Robert M. M. Mattheij, and Robert D. Russell. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations. Society for Industrial and Applied Mathematics, January 1995. URL: http://dx.doi.org/10.1137/1.9781611971231.
[5]
Hans J. Stetter. Analysis of Discretization Methods for Ordinary Differential Equations. Springer Berlin Heidelberg, 1973. URL: http://dx.doi.org/10.1007/978-3-642-65471-8.
[6]
Robert M. Corless and Nicolas Fillion. A Graduate Introduction to Numerical Methods: From the Viewpoint of Backward Error Analysis. Springer New York, 2013. URL: http://dx.doi.org/10.1007/978-1-4614-8453-0.
[7]
Wayne H. Enright. Lecture notes on the numerical solution of ODEs. http://www.cs.toronto.edu/~enright/teaching/CSC2302/IVP.pdf, 2014.
[8]
Lawrence F. Shampine. Some practical Runge–Kutta formulas. Mathematics of Computation, 46(173):135–150, 1986. URL: http://dx.doi.org/10.1090/S0025-5718-1986-0815836-3.
[9]
Lawrence F Shampine and Mark W Reichelt. The Matlab ODE suite. SIAM Journal on Scientific Computing, 18(1):1–22, 1997. URL: https://epubs.siam.org/doi/10.1137/S1064827594276424.
[10]
Lawrence F Shampine. Interpolation for Runge–Kutta methods. SIAM Journal on Numerical Analysis, 22(5):1014–1027, 1985. URL: https://epubs.siam.org/doi/pdf/10.1137/0722060.
[11]
Wayne H. Enright, Kenneth R. Jackson, Syvert P. Nørsett, and Per G. Thomsen. Interpolants for Runge–Kutta formulas. ACM Trans. Math. Softw., 12(3):193–218, September 1986. URL: https://doi.org/10.1145/7921.7923.
[12]
C. Yalçın Kaya and J. Lyle Noakes. Finding interpolating curves minimizing \(L^\infty\) acceleration in the Euclidean space via optimal control theory. SIAM Journal on Control and Optimization, 51(1):442–464, January 2013. URL: http://dx.doi.org/10.1137/12087880X.
[13]
C. Yalçın Kaya. Markov–Dubins interpolating curves. Computational Optimization and Applications, 73(2):647–677, February 2019. URL: http://dx.doi.org/10.1007/s10589-019-00076-y.
[14]
Robert M. Corless, C. Yalçın Kaya, and Robert H. C. Moir. Optimal residuals and the Dahlquist test problem. Numerical Algorithms, 81(4):1253–1274, November 2018. URL: http://dx.doi.org/10.1007/s11075-018-0624-x.
[15]
Robert M. Corless and Julia E. Jankowski. Variations on a theme of Euler. SIAM Review, 58(4):775–792, January 2016. URL: http://dx.doi.org/10.1137/15M1032351.
[16]
John T. Betts. Practical Methods for Optimal Control Using Nonlinear Programming, Third Edition. Society for Industrial and Applied Mathematics, January 2020. URL: http://dx.doi.org/10.1137/1.9781611976199.
[17]
Robert Fourer, David M Gay, and Brian W Kernighan. AMPL: A Modeling Language for Mathematical Programming. https://ampl.com/resources/books/ampl-book/, 2003.
[18]
Richard H. Byrd, Jorge Nocedal, and Richard A. Waltz. Knitro: An Integrated Package for Nonlinear Optimization, page 35–59. Springer US, 2006. URL: http://dx.doi.org/10.1007/0-387-30065-1_4.
[19]
The MathWorks Inc. Matlab version 24.2.0.2712019 (r2024b), 2024. URL: https://www.mathworks.com.
[20]
Robert M. Corless. An elementary solution of a minimax problem arising in algorithms for automatic mesh selection. ACM SIGSAM Bulletin, 34(4):7–15, 2000. URL: https://doi.org/10.1145/377626.377633.
[21]
Robert M. Corless. . Numerical Algorithms, 31(1):115–124, 2002. URL: https://doi.org/10.1023/A:1021108323034.
[22]
Silvana Ilie, Gustaf Söderlind, and Robert M. Corless. Adaptivity and computational complexity in the numerical solution of ODEs. Journal of Complexity, 24(3):341–361, 2008. URL: https://doi.org/10.1016/j.jco.2007.11.004.
[23]
Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer New York, 2006. URL: http://dx.doi.org/10.1007/978-0-387-40065-5.
[24]
Markus Hegland and C. Yalçın Kaya. Probability density estimation via optimal control. https://arxiv.org/abs/2510.00835, 2025.
[25]
Dirk Augustin and Helmut Maurer. Second order sufficient conditions and sensitivity analysis for optimal multiprocess control problems. Control and Cybernetics, 29(1):11–31, 2000. URL: https://bibliotekanauki.pl/articles/206723.
[26]
Frank H. Clarke and Richard B. Vinter. Applications of optimal multiprocesses. SIAM Journal on Control and Optimization, 27(5):1048–1071, September 1989. URL: http://dx.doi.org/10.1137/0327056.
[27]
Andrei V. Dmitruk and Alexander M. Kaganovich. Maximum principle for optimal control problems with intermediate constraints. Computational Mathematics and Modeling, 22(2):180–215, April 2011. URL: http://dx.doi.org/10.1007/s10598-011-9096-8.
[28]
David I Ketcheson and Hendrik Ranocha. Computing with B-series. ACM Transactions on Mathematical Software, 49(2), 06 2023. https://arxiv.org/abs/2111.11680, https://doi.org/10.1145/3573384.
[29]
Hendrik Ranocha and David I Ketcheson. BSeries.jl: Computing with B-series in Julia. https://github.com/ranocha/BSeries.jl, 09 2021. https://doi.org/10.5281/zenodo.5534602.
[30]
Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert \(W\) function. Advances in Computational Mathematics, 5(1):329–359, December 1996. URL: http://dx.doi.org/10.1007/BF02124750.
[31]
Regina S. Burachik, C. Yalçın Kaya, and Walaa M. Moursi. Infeasible and critically feasible optimal control. Journal of Optimization Theory and Applications, 203(2):1219–1245, April 2024. URL: http://dx.doi.org/10.1007/s10957-024-02419-0.
[32]
C. Yalçın Kaya. Observer path planning for maximum information. Optimization, 71(4):1097–1116, December 2022. URL: http://dx.doi.org/10.1080/02331934.2021.2011868.
[33]
C. Yalçın Kaya and Helmut Maurer. A numerical method for nonconvex multi-objective optimal control problems. Computational Optimization and Applications, 57(3):685–702, September 2014. URL: http://dx.doi.org/10.1007/s10589-013-9603-2.
[34]
Laurence Chisholm Young. Lectures on the Calculus of Variations and Optimal Control Theory. W. B. Saunders, 1969. URL: https://bookstore.ams.org/CHEL/304.
[35]
Christopher Rackauckas and Qing Nie. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software, 5(1):15, 2017. URL: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.151.
[36]
Wayne H. Enright. Continuous numerical methods for ODEs with defect control. Journal of Computational and Applied Mathematics, 125(1–2):159–170, December 2000. URL: http://dx.doi.org/10.1016/S0377-0427(00)00466-0.
[37]
C. Yalçın Kaya, Lyle Noakes, and Philip Schrader. Curves of minimax spirality, 2024. URL: https://arxiv.org/abs/2409.08644.
[38]
Robert M. Corless. Error backward. In P. Kloeden and K. Palmer, editors, Proceedings of Chaotic Numerics, Geelong, 1993, volume 172 of AMS Contemporary Mathematics, pages 31–62, 1994. URL: https://hdl.handle.net/20.500.14721/38912.
[39]
Misha Stepanov. Embedded (4,5) pairs of explicit 7-stage Runge–Kutta methods with FSAL property. Calcolo, 59(4), November 2022. URL: http://dx.doi.org/10.1007/s10092-022-00486-1.

  1. Department of Computer Science, University of Western Ontario, London, Ontario. E-mail: rcorless@uwo.ca.↩︎

  2. Mathematics, STEM, University of South Australia, Mawson Lakes, Australia. E-mail: yalcin.kaya@unisa.edu.au.↩︎

  3. Larry Shampine argues in [8] that a “thoughtful consideration of truncation error coefficients provides more detail and more reliable conclusions than does the study of a battery of numerical tests\(\ldots\)” and, while we agree, we point out that most users of modern software simply do not have the expertise to do this consideration themselves. We wish to provide simpler tools, for general use.↩︎

  4. Continuously differentiable interpolants are desirable for several purposes, and sometimes essential, and this is emphasized in [10] and again in [11]. However, in this paper we give up \(C^1\) continuity to gain minimality, which allows us to give sufficient conditions for accuracy. There are yet other criteria for “good” interpolation, such as preserving monotonicity or convexity to ensure a pleasing visual appearance, and indeed a “not misleading” visual appearance, but those considerations are also less important for the purposes of this paper.↩︎

  5. This is a natural usage for the word “stage” in the optimal control literature. Later in this paper we also use the word “stage” in a different sense, one more common in the Runge–Kutta community. This collision of meanings, this polysemy, is unfortunate, and may cause confusion.↩︎

  6. In this section of the paper, we use the word “stage” as the Runge–Kutta community does, to refer to each new function evaluation in each line of the Butcher tableau. The polysemy of the word “stage” is perhaps unfortunate, and might cause confusion; we did warn in an earlier footnote that this would happen.↩︎

  7. Re-reading that paper, we see that the task we set ourselves in this present paper is at least forty years old. Quoting from their section 5, Applications of Interpolation: “Interpolants can also be employed to improve the reliability of a code. One way this can be done is to use an interpolant to produce an estimate of the defect \(z'(x) - f(x, z(x))\) in the numerical solution, where \(z\) interpolates the computed values \(\{y_n\}\). The size of the defect is frequently a good way to characterize the accuracy of the numerical solution. Ideally, we would like to choose \(z\) to minimize some norm of the defect. Although none of the interpolants discussed in this paper satisfy this ideal (as far as we know), the defect associated with the higher-order interpolant \(z_n^p\) is generally not much larger than the minimum.”↩︎