April 02, 2024

Generating simulated training data needed for constructing sufficiently accurate surrogate models to be used for efficient optimization or parameter identification can incur a huge computational effort in the offline phase. We consider a fully adaptive greedy approach to the computational design of experiments problem using gradient-enhanced Gaussian process regression as surrogates. Designs are incrementally defined by solving an optimization problem for accuracy given a certain computational budget. We address not only the choice of evaluation points but also of required simulation accuracy, both of values and gradients of the forward model. Numerical results show a significant reduction of the computational effort compared to just position-adaptive and static designs as well as a clear benefit of including gradient information into the surrogate training.

The response of many parameter-dependent physical models can be described by a functional relation \(y\) mapping parameters \(p\) to observable quantities \(y(p)\). The inverse problem of inferring model parameters \(p\) from measurements \(y^m\) is frequently encountered in various fields including physics, engineering, finance, and biology. Point estimates \(p_*\), e.g., the maximum likelihood estimate \(\mathrm{arg\,min}_p \|y(p)-y^m\|\), are often computed by optimization methods or by sampling the posterior probability distribution of the parameters given the measurement data [1], [2].

Often, \(y\) is only available as a complex numerical procedure, such as the numerical solution of a partial differential equation. Solving an optimization problem for parameter estimation is then computationally expensive. The effort can be too high for online and real-time applications. Fast surrogate models approximating \(y\) are utilized as a replacement for the forward model when solving the inverse problems, in particular when both parameters \(p\) and measurements \(y^m\) are low-dimensional. Various types of surrogates are employed, including polynomials, sparse grids, tensor trains, artificial neural networks, and Gaussian process regression (GPR) [3], [4], on which we focus here.

Surrogate models rely on values \(y(p_i)\) at specific evaluation points \(p_i\) as training data. The quality of the resulting surrogate heavily depends on the number and position of these sample points. Constructing an accurate surrogate model can become computationally expensive when a large number of evaluations is required. Consequently, strategies for selecting near-optimal evaluation points have been proposed, in particular for analytically well-understood GPR [5]. A priori point sets [6], [7] are effectively supplemented by adaptive designs [8]–[11]. Active learning relies on pointwise estimates of the surrogate approximation error for guiding the selection process, usually including the parameter point that maximizes some acquisition function into the training set [12]–[14].

When numerical procedures like finite element (FE) solvers are employed to compute training data, the resulting evaluations of \(y(p_i)\) are affected by discretization and truncation errors. While uniformly high accuracy can be used, this incurs a high computational effort. The trade-off between accuracy and cost has received limited systematic investigation. Besides two-level approaches using a low-fidelity and a high-fidelity models [15], an adaptive choice of evaluation tolerances has been proposed [16]. Significant efficiency gains by joint optimization of evaluation position and tolerance given a limited computational budget have been achieved [17].

In the current paper, we extend [17] by including gradient information. Gradients of FE simulations can often be obtained efficiently by solving tangent or adjoint equations [18]. We make use of gradient enhanced GPR (GEGPR) promising higher accuracy than standard GPR [14], [19], [20]. Due to the high information content of gradient data, the required number of evaluation points is small compared to gradient-free GPR. We devise a greedy-type strategy that simultaneously optimizes the next positions and tolerances for forward model and gradient evaluation. For that, we extend the accuracy and work models used in the optimizer to take gradient computation into account.

For measuring the accuracy of the surrogate model, we adopt a goal-oriented approach [21], assessing the approximation error by its impact on the accuracy of the identified parameter. We aim for a uniform absolute tolerance or, if that is not reachable, at least a uniform bounded deterioration relative to the exact model results. We focus on adaptive FE simulations, where standard a priori error estimates and coarse estimates of the computational work are available.

The remainder of the paper is structured as follows: In Sec. 2 we formalize the parameter identification and surrogate model construction problems, introducing notation and GEGPR. Their adaptive construction is given in Sec. 3 after introducing gradient enhanced work and accuracy models. Numerical experiments are provided in Sec. 4.

First, we briefly define the parameter identification a maximum posterior evaluation in a Bayesian context, before introducing GEGPR surrogate models. For a more detailed discussion, we refer to [17].

Let the forward problem \(y(p) = \mathcal{F}(p):\mathcal{X}\subset \mathbb{R}^d\rightarrow\mathbb{R}^m\) of mapping a parameter vector \(p \in \mathcal{X}\) to a model output \(y(p) \in \mathbb{R}^m\) be Lipschitz-continuously differentiable and the parameter space \(\mathcal{X}\) be bounded and closed. We assume that numerical approximations \(y_{\tau}(p)\) of \(y(p)\) and \(y'_{\tau'}(p)\) of the derivative \(y'(p)\) can be computed, satisfying the error bounds \(\|y_{\tau}(p)-y(p)\| \le \tau\) and \(\|y'_{\tau'}(p)-y'(p)\| \le \tau'\), respectively, for any given tolerances \(\tau,\tau'>0\) in some norms \(\left \| \cdot \right \|\). We consider the simplest Bayesian setting of normally distributed measurement errors with likelihood covariance \(\Sigma_l\in\mathbb{R}^{m\times m}\) and a Gaussian prior with covariance \(\Sigma_p\in\mathbb{R}^{d\times d}\). Given measurements \(y^m\), we are interested in the maximum posterior point estimate \(p(y^m)\), a minimizer of the negative log-posterior \[\label{eq:objective} J(p;y^m) := \frac{1}{2}\| y(p)-y^m \|_{\Sigma_l^{-1}}^2+ \frac{1}{2} \| p-p^0 \|^2_{\Sigma_p^{-1}}\tag{1}\] over \(p\in\mathcal{X}\), where \(\left \| v \right \|_{\Sigma_{}^{-1}} := v^T \Sigma_{}^{-1} v\) for positive symmetric definite \(\Sigma\). By regularity of \(y\) and compactness of \(\mathcal{X}\), \(J\) is Lipschitz-continuously differentiable and a, not necessarily unique, minimizer exists. For the numerical minimization of 1 , we consider a Gauss-Newton (GN) method [22] and assume that the forward model is compatible and the measurement errors sufficiently small such that the residual \(y(p(y^m))-y^m\) is small, such that the GN method converges locally.

We aim at building a surrogate model \(y^*\) for \(y:\mathcal{X} \to \mathbb{R}\) based on a set of evaluations of the model and its derivative with certain tolerances \(\tau,\tau'\). We directly consider GEGPR, see [14] and the references therein, and refer to [17], [23]–[25] for purely value-based GPR surrogate modeling.

We define \(D:=\{\mathcal{D}:\mathcal{X} \to (\mathbb{R}_+\cup \{\infty\})^2 \mid \mathop{\mathrm{card}}X(\mathcal{D})\in\mathbb{N}\}\) as the set of admissible designs, where \(X(\mathcal{D}) := \{ p \in \mathcal{X} \mid \min_{k\in\{1,2\}}\{\mathcal{D}(p)_k\} < \infty\}\) is the evaluation set. For \(p_i\in X(\mathcal{D})\), the design \(\mathcal{D}(p_i) = [\tau_i,\tau_i']\) defines the evaluation tolerances for values and derivatives of the forward model \(y\) forming the training data \(y_i=y_{\tau_i}(p_i)\) and \(y_i'=y'_{\tau'_i}(p_i)\). We assume for simplicity that the component-wise evaluation errors \(e_i = y_i-y(p_i)\) and \(e'_i = y'_i-y'(p_i)\) are independent and normally distributed, i.e. \(e_i \sim \mathcal{N}(0,\tau_i^2I_m)\) and \(e'_i \sim \mathcal{N}(0,(\tau'_i)^2 I_{md})\), where we denote the identity matrix of dimension \(m\) by \(I_m\) and identify \(e'_i\in\mathbb{R}^{m\times d}\) with its column-major vector representation in \(\mathbb{R}^{md}\).

The complete model evaluation at a parameter point \(p_i\) thus comprises a vector \(y_i\in\mathbb{R}^m\) and a matrix \(y_i' \in\mathbb{R}^{m\times d}\). We define \(z_i = (y_i,y_i') \in \mathbb{R}^{m+md}\) as the vector representation of the complete evaluation. The training data is then distributed as \[\label{eq:joint-z-distribution} z_i\sim\mathcal{N}(z(p_i),E_i), \quad E_i = \begin{bmatrix} \tau_i I_m \\ & \tau_i' I_{md} \end{bmatrix}.\tag{2}\]

**Remark 1**. In principle, the accuracies of computing values \(y\) and derivatives \(y'\) can be independent. In practical computation, however, when evaluating
derivatives, obtaining the value at the same position \(p_i\) with the same accuracy \(\tau_i = \tau_i'\) usually comes for free. We therefore restrict the attention to designs
satisfying \(\tau_i \le \tau_i'\) for all \(i\).

Assume that forward models are a Gaussian process, i.e. model evaluations \(z(p_i)\) are jointly normally distributed for any finite evaluation set \(X=\{p_i|i=1,\dots, n\}\), which forms the GPR prior distribution \[\label{eq:EGP} \pi_{\mathrm{prior}}(Z) = \mathcal{N}(\zeta,\tilde{K})\tag{3}\] for joint model responses \(Z = (z(p_i))_{i=1,\dots,n} \in \mathbb{R}^{n(m+md)}\) with given mean \(\zeta\). The structure of \(Z\) induces a nested block form of the covariance matrix \(\tilde{K} = (\hat{K}_{ij})_{ij}\), \(i,j=1,\dots,n\), see [14], [26], with \[\label{eq:kovarianceenhanced} \hat{K}_{ij} = \begin{bmatrix} K_{ij} & K_{ij}' \\ K_{ji}' & K_{ij}'' \end{bmatrix} .\tag{4}\] Here, \(K_{ij}= k(p_i,p_j) K\in\mathbb{R}^{m\times m}\) is the covariance of the model values \(y(p_i)\) and \(y(p_j)\), assumed to be separable in terms of a positive definite kernel \(k\) depending on the evaluation positions \(p_i\) and \(p_j\) only, and a matrix \(K\) describing the covariance of model response components independently of the evaluation position. Most often, \(k\) is assumed to depend on the distance of \(p_i\) and \(p_j\) only, such that \(k(p_i,p_j) = k(\|p_i - p_j\|)\). A common choice is the Gaussian kernel \(k(x) = \exp(-\sigma x^2)\).

The remaining blocks \(K_{ij}'\in\mathbb{R}^{m\times dm}\) and \(K_{ij}''\in\mathbb{R}^{dm\times dm}\) describe the covariance of model values and model derivatives \(y'\), and of the covariance of model derivatives, respectively. Since differentiation is a linear operation, the derivatives are again normally distributed, and the covariances are given in terms of the kernel’s derivatives as \(\mathrm{cov}(y(p),y'(q)) = \nabla_q k\left(p,q\right)\), see [5]. We therefore can write the matrix blocks in terms of the Kronecker product \(\otimes\) as \[\label{eq:gradient-correlation} K'_{ij} = \partial_{p_j}k(p_i,p_j) \otimes K , \quad K'_{ji} = \partial_{p_i}k(p_i,p_j) \otimes K , \quad K''_{ij} = \partial_{p_i}\partial_{p_j}k(p_i,p_j) \otimes K.\tag{5}\]

Given the GPR prior 3 and approximate complete evaluations \(\hat{Z} = (z_i)_{i=1,\dots,n}\in\mathbb{R}^{n(m+md)}\) according to 2 , the GPR likelihood is \(\pi_{\rm like}(\hat{Z}\mid Z) = \mathcal{N}(z|_X,\tilde{E})\), with \(\tilde{E} = \mathrm{diag}(E_1,\dots,E_n)\), and the GPR posterior probability for the complete model response \(Z\) is \(\pi_{\rm post}(Z\mid\hat{Z}) \propto \pi_{\rm like}(\hat{Z}\mid Z) \pi_{\rm prior}(Z)\) by Bayes’ rule. As a product of two Gaussian distributions, \(\pi_{\rm post}\) is again Gaussian, with covariance \(\Gamma = \left(\tilde{K}^{-1} + \tilde{E}^{-1}\right)^{-1}\) and mean \(\bar Z = \Gamma(\tilde{K}^{-1} \zeta + \tilde{E}^{-1}\hat{Z})\). In particular for \(p_n\), the marginal covariance \(\Gamma_{\mathcal{D}}(p_n;\tau_n,\tau_n') \in \mathbb{R}^{m(1+d)\times m(1+d)}\) according to the block structure of \(Z\) and \(\Gamma\) is a monotone function of \(\tau_n\) and \(\tau_n'\), i.e. \(\Gamma_{\mathcal{D}}(p_n,\tau_n+\delta,\tau_n'+\delta') \succeq \Gamma_{\mathcal{D}}(p_n,\tau_n,\tau_n')\) for \(\delta,\delta'\ge 0\).

Regression is performed for a point \(p_n\) with unknown model response, i.e. formally \(\tau_n = \tau'_n = \infty\), by extracting the marginal mean \(z_{\mathcal{D}}(p_n) = \bar Z_n \in \mathbb{R}^{m+md}\) and the marginal covariance \(\Gamma_{\mathcal{D}}(p_n)\). Note that \(z_{\mathcal{D}}(p_n) = (y_{\mathcal{D}}, y'_{\mathcal{D}})(p_n)\) provides an estimate for both, values and derivatives of the model at \(p_n\), and \(\Gamma_{\mathcal{D}}(p_n)\) quantifies the uncertainty of this estimate. Due to the choice 5 for the prior covariance of values and gradients, the GPR model is consistent, i.e. the derivative estimate \(y_{\mathcal{D}}'\) coincides with the parametric derivative of the value estimate \(y_{\mathcal{D}}\).

**Remark 2**. If the components of the model response are assumed to be uncorrelated, the covariance matrix \(K\) is diagonal. With the independence of the evaluation errors postulated in 2 , both \(\tilde{K}\) and \(\tilde{E}\), and therefore \(\Gamma\), decompose into \(m\) independent blocks, one for each component. Then, the GPR can be performed independently for each component using a simpler scalar GPR model. An appropriate covariance model for the components can, however, result in
superior results [27].

Replacing exact model evaluations \(y(p)\) in the objective 1 by a cheaper GPR model \(y_{\mathcal{D}}(p)\) based on a design \(\mathcal{D}\) yields maximum posterior point estimates \[p_{{\mathcal{D}}}(y^m) = \mathop{\mathrm{arg\,min}}_{p\in \mathcal{X}} J_{{\mathcal{D}}}(p;y^m) := \frac{1}{2}\|y_{{\mathcal{D}}}(p)-y^m\|_{\Sigma_l^{-1}}^2 + \frac{1}{2}\|p-p^0\|_{\Sigma_p^{-1}}^2\] and saves computational effort when computing Gauss-Newton steps \(\Delta p_{\mathcal{D}}(p,y^m)\) by solving \[\left( y_{\mathcal{D}}'(p)^T\Sigma_l^{-1}y_{\mathcal{D}}'(p) + \Sigma_p^{-1} \right) \Delta p_{{\mathcal{D}}} = (y_{\mathcal{D}}(p)')^T\Sigma_l^{-1}(y_{\mathcal{D}}(p)-y^m) + \Sigma_p^{-1}(p^0-p).\] It also incurs both some error \(p_{{\mathcal{D}}}(y^m)-p(y^m)\) of the resulting identified parameters and a considerable computational effort for evaluating the model according to the design \({\mathcal{D}}\) beforehand.

If computational resources are limited, the challenge is to determine which parameters \(p_i\) simulations should be run with, and which tolerances \(\tau_i,\tau_i'\) should be used, to achieve the best accuracy. This design of experiments problem for \(\mathcal{D}\) involves balancing the competing objectives of minimizing the expected approximation error \(E(\mathcal{D})\) of the surrogate model and minimizing the computational effort \(W(\mathcal{D})\) required to create the training data. Since little is known about the model derivative \(y'\), and consequently about \(E(\mathcal{D})\), before any simulations have been performed, we follow a sequential design of experiments approach [17] by incrementally spending computational budget of size \(\Delta W\). In each step, we thus have to solve an incremental design problem for \(\tilde{\mathcal{D}}\in D\) refining a given preliminary design \(\mathcal{D}\): \[\label{eq:doe-incremental} \min_{\tilde{\mathcal{D}}\le \mathcal{D}} E(\tilde{\mathcal{D}}) \quad\text{subject to } W(\tilde{\mathcal{D}}|\mathcal{D}) \le \Delta W.\tag{6}\] We will establish quantitative error estimates \(E(\mathcal{D})\) and work models \(W(\mathcal{D})\), and then develop a heuristic for approximately solving problem 6 .

First we need to quantify the parameter reconstruction error \(p_{\mathcal{D}}(y^m)-p(y^m)\) in terms of the measurement error variance \(\Sigma_l\) and the surrogate model approximation quality \(y_{\mathcal{D}}-y\) depending on the design \(\mathcal{D}\). We start by establishing an estimate of the parameter reconstruction error for deterministic functions \(y(p)\) and \(y_{\mathcal{D}}(p)\).

**Theorem 1**. Assume that \(y\) is twice continuously differentiable with uniformly bounded first and second derivatives in a neighborhood \(B\) of a minimizer \(p^*\in\mathcal{X}\) of \(J(p;y^m)\) for some measurement data \(y^m\), such that the residual \(\left\|
y(p^*)-y^m\right\|_{\Sigma_l^{-1}}\) is sufficiently small and \(y'(p^*)\Sigma_l^{-1}y'(p^*) + \Sigma_p^{-1}\) is positive definite.

Then, there are \(\bar\epsilon,\bar\epsilon'>0\) as well as constants \(a,a'\), such that for all \(\epsilon\le \bar\epsilon\) and \(\epsilon' \le \bar\epsilon'\) and surrogate models \(y_\mathcal{D}:\mathbb{R}^d\to\mathbb{R}^m\) with \(\|y_{\mathcal{D}}-y\|_{L^\infty(B)} \le \epsilon\) and \(\|y'_{\mathcal{D}} - y'\|_{L^\infty(B)} \le \epsilon'\) there is a locally unique minimizer \(p_\mathcal{D}(y^m)\) of \(J_\mathcal{D}\) satisfying the error bound \[\label{eq:error-bound} \left\|p_{\mathcal{D}}(y^m)-p^*\right\| \le a\epsilon + a'\epsilon'.\tag{7}\]

A more detailed claim and the proof is given in [17]. Though the constants \(a,a'\) are quantitatively unavailable for concrete problems, Thm. 1 establishes a linear relation between the surrogate model accuracy and the incurred error in the parameter estimate. The factors \(a\) and \(a'\) can be estimated numerically by bounding linearized error transport through a Gauß-Newton iteration.

We expect a minimal absolute reconstruction error, but due to measurement inaccuracies, we cannot avoid some error \(e_0(p)\) and therefore consider a small relative error to be acceptable. The unavoidable error level can be bounded in terms of the problem properties, see [17]. We therefore define the local error quantity \[\label{eq:local-error-model} e_{\mathcal{D}}(p) := \frac{a(p)\epsilon + a(p)'\epsilon'}{1+\alpha e_0(p)} .\tag{8}\]

Since during the construction of the surrogate model \(y_{\mathcal{D}}\) by minimizing 6 the measurement values \(y^m\) and hence the parameter position \(p=p(y^m)\) of interest are unknown, the error quantity 8 needs to be considered over the whole parameter region \(\mathcal{X}\). We therefore define the accuracy model \[\label{eq:error-model} E({\mathcal{D}}) := \|e_{\mathcal{D}}\|_{L^q(\mathcal{X})} \quad \text{for some 1\le q < \infty},\tag{9}\] that is to be minimized by selecting an appropriate design \({\mathcal{D}}\). Choosing \(q\approx 1\) would focus on minimizing the average parameter reconstruction error, while choosing \(q\) very large would focus on the worst case and impose a roughly uniform accuracy. For the numerical experiments in Sec. 4 we have chosen \(q=2\).

Still missing are the surrogate model error bounds \(\epsilon,\epsilon'\) in terms of the design \({\mathcal{D}}\). Unfortunately, for virtually all cases of practical interest, there is little hope for obtaining simultaneously rigorous and quantitatively useful bounds. For GPR surrogate models, the global support of the posterior probability density precludes the existence of a strict bound, though its fast decay provides thresholds that are with high probability not exceeded. The assumed normal distribution of errors \(z(p)-z_{\mathcal{D}}(p) \sim \mathcal{N}(0,\Gamma_{\mathcal{D}}(p))\) implies that both the Euclidean norm \(\|y(p)-y_{\mathcal{D}}(p)\|_2\) and the Frobenius norm \(\|y'(p)-y'_{\mathcal{D}}(p)\|_F\) are generalized-\(\chi^2\)-distributed and hence formally unbounded. Instead of a strict bound, we use a representative statistical quantity for \(\epsilon\) and \(\epsilon'\) based on the marginal covariance \(\Gamma_{y_{\mathcal{D}}} = \Gamma_{\mathcal{D}}(p)_{1:m,1:m}\), such as the mean [28] \[\label{eq:GPR-error-estimate} \epsilon := \mathop{\mathrm{tr}}(\Gamma_{y_\mathcal{D}}).\tag{10}\] Analogously, \(\epsilon'\) can be defined in terms of \(\Gamma_{y'_{\mathrm{D}}}= {\Gamma_{\mathcal{D}}}(p)_{m+1:m(d+1),m+1:m(d+1)}\). Inserting the thus chosen values of \(\epsilon\) and \(\epsilon'\) into 8 completes the accuracy model.

One advantage of integrating derivative information into the GPR surrogate is that there is an explicit variance estimate available for defining \(\epsilon'\). In contrast, GPR surrogates defined only in terms of the values \(y\) as considered in [17] must rely on an empirical relation of \(\epsilon\) and \(\epsilon'\).

The evaluation of the forward model \(y(p)\) usually involves some kind of numerical approximation, resulting in an approximation \(y_\tau(p)\). While in principle any accuracy \(\|y_\tau(p)-y(p)\| \le \tau\) for arbitrary tolerance \(\tau>0\) can be achieved, this requires a computational effort \(W(\tau)\) to be spent on the evaluation. For adaptive finite element computations in \(\mathbb{R}^l\) with ansatz order \(r\) and \(N\) degrees of freedom, we expect a discretization error \(\epsilon = \mathcal{O}(N^{-r/l})\) [29]. Assuming an optimal solver with computational work \(\mathcal{O}(N)\), we obtain a work model \(W(\tau) = \tau^{-2s}\) with \(s=l/(2r)>0\), see [30]. \(W\) is monotone and satisfies the barrier property \(W(\tau)\to\infty\) for \(\tau\to 0\) and the minimum effort property \(W(\tau)\to 0\) for \(\tau\to\infty\).

Including gradient information is of particular interest if derivatives can be computed efficiently, e.g., with adjoint methods, such that the cost of derivative computation is a small multiple \(c\) of the value computation cost, and independent of the number \(d\) of parameters [18]. This leads to the work model \(W(\tau') = c(\tau')^{-2s}\).

The computational effort incurred by a design \(\mathcal{D}\) is then \[\label{eq:work-model} W({\mathcal{D}}) := \sum_{p_i\in X(\mathcal{D})}
\tau_i^{-2s} + c\; (\tau'_i)^{-2s}.\tag{11}\] Being interested in *incremental designs*, we assume \(\mathcal{D}\) to be a design already realized. Evaluating the model on a finer design \(\tilde{\mathcal{D}} \le \mathcal{D}\) can consist of simulating the model for parameters \(p\not\in X(\mathcal{D})\), or improving the accuracy of already performed simulations for \(p\in X(\mathcal{D})\) with \(\tilde{\mathcal{D}}(p)_k< \mathcal{D}(p)_k\) for some \(k\in\{1,2\}\), or both. If already conducted simulations can be continued
instead of started again, the computational effort of obtaining the training data set \(\tilde{\mathcal{D}}\) from \(\mathcal{D}\) is \[\label{eq:fe-work}
W(\tilde{\mathcal{D}}\mid\mathcal{D}) = W(\tilde{\mathcal{D}}) - W(\mathcal{D}).\tag{12}\]

The design problem 6 is combinatorial in nature due to the unknown number \(n\) of evaluation points and the decision between introducing new points or reusing existing ones. It is therefore highly nonlinear and difficult to treat rigorously due to the parameter locations \(p_i\) to be optimized. In particular, relaxing the design to the space of nonnegative regular Borel measures, as used in [31], is not feasible due to the nonlinearity of the work model 12 .

We therefore take a heuristic two-stage approach. We simplify the problem by decoupling the choice of evaluation positions \(p_i\) from the choice of evaluation accuracies \(\tau_i, \tau_i'\). In the first stage, we select few promising additional points for inclusion into the evaluation set \(X\). In the second stage, the evaluation tolerances \(\tau_i\) and \(\tau_i'\) are then optimized for minimal error \(E(\tilde{\mathcal{D}})\) given the computational budget constraint \(W(\tilde{\mathcal{D}}|\mathcal{D}) \le \Delta W\). Then the necessary evaluations of the forward model are performed and the GP surrogate model is updated, yielding improved error estimates for the next incremental design.

A parameter point \(p\) is particularly promising for inclusion, if its predicted impact on the overall error \(E\) is large. The impact can be estimated by the local derivative of \(e_{\mathcal{D}}(p)^q\) with respect to work spent for approximating \(y(p)\). We assume \(\tau' = \beta \tau\) for some \(\beta\in[0,1]\) and consider \[\begin{align} \frac{\partial e_{\mathcal{D}}(p)}{\partial W(p)} (\beta) &= \frac{\partial e_{\mathcal{D}}(p)}{\partial (\epsilon,\epsilon')} \,
\frac{(\partial\epsilon,\epsilon')}{\partial\tau} \left(\frac{d W}{d \tau}\right)^{-1} \\ &= \frac{[a(p),a(p)']}{1+\alpha e_0(p)} \left[ \mathop{\mathrm{tr}}\frac{\partial\Gamma_{y_\mathcal{D}}}{\partial (\tau,\tau')} ,
\mathop{\mathrm{tr}}\frac{\partial\Gamma_{y'_{\mathrm{D}}}}{\partial(\tau,\tau')} \right] \begin{bmatrix}1 \\ \beta\end{bmatrix} \frac{\tau^{2s+1}}{(-2s)(1+c\beta^{-2s})}.
\end{align}\] Neglecting constant factors, which are irrelevant for the *relative* merit of candidate points, and evaluating the derivatives at the current tolerance level \(\mathop{\mathrm{tr}}(\Gamma_{y_\mathcal{D}})\), we define the acquisition function \[\label{eq:utility} \phi(p) = \max_{\beta\in[0,1]} e_{\mathcal{D}}(p)^{q-1}
\frac{\partial e_{\mathcal{D}}(p)}{\partial W(p)} (\beta).\tag{13}\] As it is often only practical to compute derivatives along with the values or not at all, we may restrict the choice of \(\beta\) to the two
extreme cases \(\beta\in\{0,1\}\). Selecting candidate points for inclusion into the valuation set \(X\) can be based on finding local maximizers of the acquisition function \(\phi\), or finding the best points from a random sampling or from low discrepancy sequences.

With the evaluation set \(X\) fixed, the optimization problem 6 reduces to the nonlinear programming problem \(\min_{\tilde{\tau},\tilde{\tau}'}E(\tilde{\tau},\tilde{\tau}')\) for the new evaluation tolerances \(\tilde{\tau}\), \(\tilde{\tau}'\), subject to the pure improvement requirement \(\tilde{\tau}\le \tau\), \(\tilde{\tau}' \le \tau'\) and the computational budget constraint \(W(\tilde{\tau},\tilde{\tau}') \le W(\tau,\tau')+\Delta W\). For convexity reasons, we consider an equivalent reformulation in terms of \(v=\tau^{-2}\) and \(v'=(\tau')^{-2}\): \[\label{eq:doe-incremental-reduced} \min_{\tilde{v},\tilde{v}'\in\mathbb{R}^n_+} E(\tilde{v},\tilde{v}')^q \quad \text{s.t.} \quad \tilde{v} \ge v, \; \tilde{v}' \ge v', \; \tilde{v}' \le \tilde{v}, \; W(\tilde{v},\tilde{v}') \le W(v,v')+\Delta W.\tag{14}\]

**Theorem 2**. The objective \(\tilde{E}(\tilde{v}, \tilde{v}')^q\) is convex.

*Proof.* As in [17], \(\epsilon = \mathop{\mathrm{tr}}(\Gamma_{y_{\mathcal{D}}})\) and \(\epsilon' = \mathop{\mathrm{tr}}(\Gamma_{y'_{\mathcal{D}}})\) are convex in \(v\) and \(v'\) at all \(p\in\mathcal{X}\), and consequently, \(e_{\mathcal{D}}(p)\) is convex as well. Due to convexity and monotonicity, \(e_{\mathcal{D}}(p)^q\), and due to linearity,
also its integral \(E\) are convex. ◻

**Remark 3**. In contrast to purely value-based GPR surrogates, here the objective \(E^q\) is in general not strictly convex, since the covariance kernel containing entries of the form \(\partial_{p_i}k(p_i,p_j)\) is not pointwise positive.

The convexity of the admissible set \(\{\tilde{v}\in\mathbb{R}^{n+j}_+\mid W(\tilde{v})\le \Delta W + W(v)\}\) depends on the exponent \(s\) in the work model 11 . Clearly, for \(s\ge 1\), \(W\) is convex, whereas for \(s<1\) it is in general non-convex, not even quasi-convex. In combination with Thm. 2, we obtain the following result.

**Corollary 1**. For exponents \(s\ge 1\), the tolerance design problem 14 is convex.

Finite elements of order \(r\) in \(l\) dimensions with an optimal solver yield \(s = l/(2r)\). Consequently, for linear finite elements in two or three space dimensions, any minimizer is a global one, most often unique and non-sparse. In contrast, high order finite elements with \(r\ge 2\) lead to \(s<1\) and non-convex admissible sets. Their pronounced corners on the coordinate axes make the sparsity of a minimizer likely, see Fig. 1 right. This agrees with intuition: if increasing the accuracy at a specific sample point is computationally expensive, it is advantageous to distribute the work on a lower accuracy level to several points. If increasing the accuracy is cheap, then it is often better to increase the accuracy of a single point, that, to some extent, shares its increased accuracy in a certain neighborhood. Level lines of the gradient-enhanced objective \(E(v)\) are also shown (green). A smaller error can be achieved with the same amount of computational work. Note that gradient data does not affect the convexity of the problem.

While in the convex case \(s\ge 1\) the optimization is straightforward with any nonlinear programming solver, the non-convex case is more difficult. Fortunately, guaranteed global optimality is in practice not necessary. The expected sparsity structure suggests a particular heuristic approach: For \(i=1,\dots,n+j\) consider \(\tilde{v}=v+ae_i\) with \(a>0\) such that \(W(v)=\Delta W + W(\mathcal{D})\), i.e. the accuracy of only a single point is improved, and select the design \(v\) with smallest objective. If this satisfies the necessary first-order conditions, accept it as solution. Otherwise, perform a local minimization starting from this point.

We present two numerical examples, a low-dimensional one with simple model function \(y\), and a PDE model. We compare the results of the adaptive phase with and without gradient data.

As a model \(y\) we consider the rotated parabolic cylinder, i.e \[y_\phi(p) = \left ( \cos(\phi)(p_1+p_2)+ \sin(\phi)(p_2-p_1) \right )^2 \quad \text{for p \in \mathcal{X} = [0,2]^2}, \; \phi \in \mathbb{R}_{> 0}.\] We acquire \(m=3\) measurements for \(\phi \in \{0,2,4\}\), assume different accuracies of these independent measurements, and a likelihood \(\Sigma_{L} = 10^{-2}\mathrm{diag}(1,0.1,1)\). We start with an initial design of seven evaluation points, see Fig. 2, and an evaluation variance of \(\sigma^2 = 0.1\). We use the work model with \(s=1/2\). Evaluation of the error quantity \(E\) from 9 is performed by Monte Carlo integration. To determine candidate points for inclusion in the evaluation set \(X\), we sample the acquisition function 13 for \(\beta\in\{0,1\}\). Only the point with the largest value is included, and gradient information included by setting \(\tau'=\tau\) if \(\beta=1\) obtained a higher value. The adaptive phase is terminated if the error \(E\) drops below the desired tolerance \(\mathrm{TOL} = 10^{-2}\).

**Resulting designs.** Fig. 2 illustrates the resulting design with (right) and without (left) gradient data. Black dots are evaluation points without gradient data, while red triangles mark points
with gradient information. The size of each marker corresponds to the evaluation accuracy, with larger markers indicating higher accuracy. Red dots represent the initial design, while the color mapping represents the estimated error. The gradient-based
algorithm required 24 points, 11 of them with gradient information, each with different evaluation accuracy. The purely value-based design required 30 points.

**Convergence.** Fig. 3 presents the error against the computational work with different incremental budgets. On the left, we compare designs with gradient data (solid lines) and without (dashed lines). On the
right, we compare adaptive designs with gradient data with a position-adaptive design using uniform tolerances for values and gradients.

Including derivative information is clearly more efficient, by roughly one order of magnitude, but also leads to pronounced non-monotone convergence. This is likely an effect of inexact gradient data affecting hyperparameter optimization. In contrast, value-based designs exhibit a mostly monotonic behavior. Also apparent is that smaller incremental budgets are more efficient, in particular for low desired accuracy. On the right, the behaviour of uniform tolerance designs is shown. For large tolerances (\(\tau=10^{-1}\)), the computational work is small, but the desired accuracy is not reached. Smaller tolerances (\(\tau\le 10^{-2}\)) achieve the desired accuracy, but at a much higher cost, roughly two orders of magnitude and more.

**Reliability of local error estimates.** The error model 9 is coarse and may not capture the actual error in identified parameters correctly. It is affected both by linearization error in estimated error
transport and the GPR error estimate, which in turn depends also on the hyperparameter optimization. We compare the estimated global error \(E(\mathcal{D})\) to the actually obtained errors. For 1600 points \(p_i\), sampled randomly from \(\mathcal{X}\), we compute the local error estimate \(e_i = e_{\mathcal{D}}(p_i)\) from 8 ,
and compare this to the expected actual error, approximated by the sample mean \(\tilde{e}_i := n_k^{-1} \sum_{k=1}^{n_k} \|p(y(p_i)+\delta_{i,k})-p_\mathcal{D}(y(p_i)+\delta_{i,k})\|\) with realizations \(\delta_{i,k}\) of the measurement error distributed as \(\mathcal{N}(0,\Sigma_l)\). We construct histograms of the ratio \({e}_i/\tilde{e}_i\). Ratios less than
1 indicate an underestimation of the error, while values greater than 1 indicate an overestimation.

The results in Fig. 4 suggest that value-based and gradient enhanced GPR surrogates behave similar. The coarse a priori error estimate is not strictly reliable, but appears to be reasonably accurate for steering the adaptive design process.

**Parameter reconstruction.** For an exemplary parameter reconstruction we assume a true parameter \(p = (p_1,p_2) = (1,1.5)\) and exact measurements. Tab. 1 shows the reconstruction results lying well within the prescribed global tolerance, with comparable errors.

\(p_1\) | \(\Delta p_1\) | \(p_2\) | \(\Delta p_2\) | |
---|---|---|---|---|

value | 0.9968 | 0.0032 | 1.496 | 0.0047 |

value+gradient | 0.9978 | 0.0022 | 1.502 | 0.0023 |

Scatterometry is a more complex problem from optical metrology [32]–[34]. The aim is to identify geometric parameters in a nano-structured diffraction pattern from the intensity of reflected monochromatic light of different polarizations and incidence angles \(\phi,\theta\), see Fig. 5 and [35]. A forward model \(y(p)\) was developed using JCMsuite^{1} as a solver for the governing Maxwell’s equations. The model is parametrized by the geometry of the line
grid sample, see Fig. 5.

The numerical model maps model parameters \(p = \left[ \mathrm{cd}, t, r_{\text{top}}, r_{\text{bot}} \right] \in \mathcal{X} \subset \mathbb{R}^{4}\) to \(m=4\) zero’th order scattering intensities \(y(p)\in\mathbb{R}^4\), given by incrementing \(\theta\) from 5 ° to 11 ° in steps of \(\SI{2}{°}\), utilizing \(P\)-polarized light with an angle of incidence \(\phi = \SI{0}{°}\).

Parameter | \(\mathrm{cd}\) | \(t\) | \(r_{\mathrm{top}}\) | \(r_{\mathrm{bot}}\) | \(h\) | \(\mathrm{swa}\) | |
---|---|---|---|---|---|---|---|

range | \([24,28]\,\si{\nano\meter}\) | \([4,6]\, \si{\nano\meter}\) | \([8,10]\, \si{\nano\meter}\) | \([3,7]\,\si{\nano\meter}\) | \(\SI{48.3}{\nano\meter}\) | 87.98 ° |

At the beginning, \(\mathcal{X}\) is covered with \(2^7 = 128\) points from a Sobol sequence. According to the knwon measurement uncertainty, we set \[\Sigma_l
= 10^{-2}\text{diag}(8.57,8.56,8.60,8.63)\] and ask for \(E(\mathcal{D}) \leq \text{TOL} = 10^{-3}\). **Convergence.** In Fig. 6 we plot the estimated
global error against the computational work.

On the left, we a GPR design (dashed) with a GEGPR design (solid) for the same incremental budget \(\Delta W\). As in the analytical example, including gradient information improves the efficiency at the cost of a non-monotone convergence behaviour. The improved efficiency is consistent, though quantitatively varying, over a wide range of incremental work budgets, as shown in Tab. 3.

\(\Delta W\) | \(10^3\) | \(10^4\) | \(10^5\) | \(10^6\) |

\(W/W_g\) | \(10\) | \(500\) | \(10\) | \(50\) |

In Fig. 6 right, we compare the presented approach with position-adaptive designs using fixed finite element grids. The coarse grid has a maximum edge length of \(h_c = \SI{16}{\nano\meter}\), the fine grid has a maximum edge length of \(h_f = \SI{1}{\nano\meter}\), corresponding to FE discretization errors \(\epsilon_{c} = 10^{-2}\) and \(\epsilon_{f} = 10^{-4}\), respectively. When commencing with the coarse grid, the error reduction in the curve is initially gradual. However, it significantly accelerates beyond a certain threshold, identified as \(W \approx 3\cdot 10^7\). Interestingly, beyond this threshold, the presence of pronounced oscillations becomes evident, causing the error to temporarily plateau between \(W=2\cdot 10^8\) and \(W=4\cdot 10^8\). Subsequently, the curve resumes its descent and eventually achieves the desired tolerance level around \(W \approx 2 \cdot 10^9\). Notably, during the analysis, it was observed that within the aforementioned oscillation range, the hyperparameter optimization for a specific parameter component failed to yield satisfactory results. As a result, it was necessary to set this parameter to a default value of \(L = 1\), providing a plausible explanation for the observed behavior.

Surprisingly, we encountered difficulties in achieving convergence for the fine grid. Despite an initial rapid decrease in error, the curve exhibited pronounced oscillations across various settings, necessitating the termination of the procedure. Additional tests involving different constant hyperparameters and varied hyperparameter bounds within the optimization failed to yield any improvements. Consequently, further research is required to comprehensively elucidate and resolve this issue.

Thus, in the direct comparison between fully adaptive and semi-adaptive algorithm, we can see that we can save computational work by a factor of \(\approx 100\).

**Reliability of local error estimator.** As before, we employ \(7^4 = 2401\) parameter points to generate the estimated local errors \(e_{i}\) and the expected true errors
\(\tilde{e}_i\) with and without gradient information. A FE simulation on a fine grid with maximum edge length of \(\SI{1}{\nano\meter}\) is used as exact forward model.

In contrast to the analytical example, the GPR and GEGPR surrogates differ slightly, with GPR consistently overestimating the true errors. This suggests that the GPR model should provide smaller actual errors than aimed at. Again, the GEGPR error estimator is not strictly reliable, but apparently sufficiently robust for steering the adaptive design selection process.

**Parameter reconstruction.** We perform an exemplary parameter reconstruction with true parameters \(p_{\mathrm{true}} = (26.0,5.0,10.5,5.0)\, \si{\nano\meter}\) and simulated measurements with artificial
measurement noise of size \(10^{-3}\). The parameters are recovered within the imposed tolerance, as shown in Tab. 4.

Parameter | \(p_{\rm CD}\) | \(|\Delta p_{\rm CD}|\) | \(p_t\) | \(|\Delta p_t|\) | |
---|---|---|---|---|---|

val | 26.001 | 1.000E-3 | 4.992 | 0.878E-3 | |

val+grad | 25.999 | 0.983E-4 | 4.999 | 0.812E-4 | |

Parameter | \(p_{\rm top}\) | \(|\Delta p_{\rm top}|\) | \(p_{\rm bot}\) | \(|\Delta p_{\rm bot}|\) | |

val | 10.502 | 2.00E-3 | 4.991 | 9.20E-3 | |

val+grad | 10.499 | 0.87E-4 | 4.999 | 1.54E-4 |

The joint adaptive selection of evaluation positions and evaluation tolerances with a greedy heuristic improves the efficiency of building a GPR surrogate from finite element simulation significantly. Including gradient information enhances this further, if derivatives can be computed cheaply. In numerical experiments, improvement factors between 100 and 1000 have been observed compared to methods relying on only selecting evaluation positions. While the error estimator based on the GPR variance is not strictly reliable, it appears to be sufficiently well-suited for steering the adaptive design selection.

**Funding.** This work has been supported by Bundesministerium für Bildung und Forschung – BMBF, project number 05M20ZAA (siMLopt).

[1]

H.W. Engl, M. Hanke, and A. Neubauer. *Regularization of inverse problems*. Kluwer, 1996.

[2]

J. Kaipio and E. Somersalo. *Statistical and Computational Inverse Problems*. Springer, 2005.

[3]

P. Schneider, M. Hammerschmidt, L. Zschiedrich, and S. Burger. . In *Metrology, Inspection, and Process Control for Microlithography XXXIII*, volume 10959. SPIE, 2019.

[4]

A. Zaytsev. Reliable surrogate modeling of engineering data with more than two levels of fidelity. In *2016 7th International Conference on Mechanical and Aerospace Engineering
(ICMAE)*, pages 341–345, 2016.

[5]

C. Rasmussen and C.K.I. Williams. *Gaussian Processes for Machine Learning*. MIT Press, 2006.

[6]

A. Giunta, S. Wojtkiewicz, and M. Eldred. Overview of modern design of experiments methods for computational simulations (invited). In *41st Aerospace Sciences Meeting and Exhibit,
AIAA 2003-649*, pages 1–17, 2003.

[7]

N. Queipo, R. Haftka, W. Shyy, T. Goel, R. Vaidyanathan, and P. Tucker. Surrogate-based analysis and optimization. *Progress in Aerospace Sciences*, 41(1):1–28, 2005.

[8]

K. Crombecq, E. Laermans, and T. Dhaene. Efficient space-filling and non-collapsing sequential design strategies for simulation-based modeling. *European Journal of Operational
Research*, 214:683–696, 2011.

[9]

V. Joseph and Y. Hung. Orthogonal-maximin latin hypercube designs. *Statistica Sinica*, 18:171–186, 2008.

[10]

R. Lehmensiek, P. Meyer, and M. Müller. Adaptive sampling applied to multivariate, multiple output rational interpolation models with application to microwave circuits. *International
Journal of RF and Microwave Computer-Aided Engineering*, 12(4):332–340, 2002.

[11]

M. Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. *Journal of Machine Learning Research*, 7:141––166,
2006.

[12]

P. Hennig and C.J. Schuler. Entropy search for information-efficient global optimization. *Journal of Machine Learning Research*, 13:1809–1837, 2012.

[13]

J. Močkus. On Bayesian methods for seeking the extremum. In *Optimization Techniques IFIP Technical Conference Novosibirsk*, pages 400–404. Springer,
1975.

[14]

J. Wu, M. Poloczek, A.G. Wilson, and P.I. Frazier. Bayesian optimization with gradients. In *Advances in Neural Information Processing Systems*, volume 30, pages 3–6. Curran
Associates, 2017.

[15]

J. Nitzler, J. Biehler, N. Fehn, P.-S. Koutsourelakis, and A. Wall. A generalized probabilistic learning approach for multi-fidelity uncertainty quantification in complex physical
simulations. *Comp. Meth. Appl. Mech. Eng.*, 400:115600, 2022.

[16]

G. Sagnol, H.-C. Hege, and M. Weiser. Using sparse kernels to design computer experiments with tunable precision. In *Proceedings of COMPSTAT 2016*, pages 397–408, 2016.

[17]

P. Semler and M. Weiser. Adaptive Gaussian process regression for efficient building of surrogate models in inverse problems. *Inverse Problems*, 39:125003,
2023.

[18]

A. Griewank and A. Walther. *Evaluating derivatives: principles and techniques of algorithmic differentiation*. SIAM, 2008.

[19]

D. Eriksson, K. Dong, E. Lee, D. Bindel, and A Wilson. Scaling gaussian process regression with derivatives. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, *Advances in Neural Information Processing Systems*, volume 31. Curran Associates, Inc., 2018.

[20]

A. Wu, M. C. Aoi, and J. W. Pillow. Exploiting gradients and hessians in Bayesian optimization and quadrature.

[21]

B. Vexler. Adaptive finite element methods for parameter identification problems. In H.G. Bock, T. Carraro, W. Jäger, S. Körkel, R. Rannacher, and J.P.
Schlöder, editors, *Model Based Parameter Estimation: Theory and Applications*, pages 31–54. Springer, 2013.

[22]

P. Deuflhard. *Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive Algorithms*, volume 35 of *Computational Mathematics*. Springer, 2004.

[23]

D. Duvenau. *Automatic Model Construction with Gaussian Processes*. PhD thesis, University of Cambridge, 2014.

[24]

M. Kuß. *Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning*. PhD thesis, Technische Universität Darmstadt, 2006.

[25]

J. Quiñonero-Candela, C.E. Rasmussen, and C.K.I. Williams. Approximation methods for gaussian process regression. In *Large-Scale Kernel Machines*, Neural Information Processing,
pages 203–223. MIT Press, 2007.

[26]

E. Solak, M. Roderick, W.E. Leithead, D. Leith, and C. Rasmussen. Derivative observations in gaussian process models of dynamic systems. pages 1033–1040, Jan 2002.

[27]

T. Bai, A.L. Teckentrup, and K.C. Zygalakis. Gaussian processes for Bayesian inverse problems associated with linear partial differential equations. arxiv:2307.08343,
2023.

[28]

A.M. Mathai and S.B. Provost. *Quadratic Forms in Random Variables*. Marcel Dekker, 1992.

[29]

P. Deuflhard and M. Weiser. *Adaptive numerical solution of PDEs*. de Gruyter, 2012.

[30]

M. Weiser and S. Ghosh. . *Communications in Applied Mathematics and Computational Science*, 13(1):53–86, 2018.

[31]

I. Neitzel, K. Pieper, B. Vexler, and D. Walter. A sparse control approach to optimal sensor placement in pde-constrained parameter estimation problems. *Numer. Math.*,
143:943–984, 2019.

[32]

M. Hammerschmidt, M. Weiser, X. Garcia Santiago, L. Zschiedrich, B. Bodermann, and S. Burger. . In B. Bodermann, K. Frenner, and R. M. Silver, editors, *Modeling Aspects in Optical
Metrology VI*, volume 10330, page 1033004. International Society for Optics and Photonics, SPIE, 2017.

[33]

P. Schneider, M. Hammerschmidt, L. Zschiedrich, and S. Burger. . In Vladimir A. Ukraintsev and Ofer Adan, editors, *Metrology, Inspection, and Process Control for Microlithography
XXXIII*, volume 10959, page 1095911. International Society for Optics and Photonics, SPIE, 2019.

[34]

N. Farchmin, M. Hammerschmidtm, P.I. Schneider, M. Wurm, B. Bodermann, M. Bär, and S. Heidenreich. . In Bernd Bodermann, Karsten Frenner, and Richard M. Silver, editors,
*Modeling Aspects in Optical Metrology VII*, volume 11057, page 110570J. International Society for Optics and Photonics, SPIE, 2019.

[35]

M. Wurm, S. Bonifer, B. Bodermann, and J. Richter. Deep ultraviolet scatterometer for dimensional characterization of nanostructures: system improvements and test measurements.
*Measurem. Sci. Techn.*, 22(9):094024, 2011.