January 22, 2025
The Active Subspace (AS) method is a widely used technique for identifying the most influential directions in high-dimensional input spaces that affect the output of a computational model. The standard AS algorithm requires a sufficient number of gradient evaluations (samples) of the input-output map to achieve quasi-optimal reconstruction of the active subspace; this can lead to a significant computational cost when the samples are affected by numerical discretization errors that must be kept sufficiently small. To address this issue, we propose a multilevel version of the AS method (MLAS) that utilizes samples computed with different accuracies and yields different active subspaces across the accuracy levels. MLAS can match the accuracy of the single-level AS method at a reduced computational cost, making it suitable for downstream tasks such as function approximation. In particular, we propose to perform the latter via optimally-weighted least-squares polynomial approximation in the different active subspaces, and we present an adaptive algorithm to dynamically choose the dimensions of the active subspaces and polynomial spaces. We demonstrate the practical viability of the MLAS method with polynomial approximation through numerical experiments based on random partial differential equations (PDEs).
The domain of scientific computing has witnessed great advances over the past decades. The field of uncertainty quantification (UQ) and the study of parametric or stochastic models lead to high-dimensional problems which necessitate analysis and interpretation. These models, though offering valuable insights, come with a computational cost that increases drastically with their dimensionality. Reducing the dimensionality while retaining the core information is thus a challenge of paramount importance. Multiple techniques to tackle this problem have flourished. Notably, the Reduced Basis Method (RBM) is a classical approach for reducing the dimensionality of the output space in parametric differential models. It includes, in particular, an offline stage that generates a low-dimensional linear subspace of the solution space using high-fidelity snapshots [1]. Other techniques have emerged, on the other hand, to mitigate the dimensionality of the input space in parametric models, so that uncertainty quantification can be achieved with a reduced complexity. These techniques span nonlinear and linear methods, such as basis adaptation for polynomial chaos expansions [2], [3] and the Active Subspaces (AS) method [4]. The latter, in particular, emerged in the field of uncertainty quantification as a dimensionality reduction technique seeking to identify the most influential directions in the input space, thus greatly helping a variety of downstream tasks on the input-output map such as approximation, integration, and optimization. While the AS method excels in addressing the dimensionality issue, the computational cost associated with achieving higher accuracy levels remains a concern.
A technique to reduce the computational complexity of parametric/stochastic differential models is the multilevel paradigm, which has grown popular in different computational domains. The paradigm first appeared in [5] in the context of Monte Carlo approximation (MLMC). This work provided a framework where different accuracy discretization levels of the underlying differential model could be combined to improve overall efficiency. The fundamental idea is the use of a hierarchy of discretization levels, where each level provides a different approximation of the mathematical model, with coarse discretizations associated with low query costs, and vice versa. The primary objective of the multilevel paradigm is to achieve the same accuracy as the original methodology at a reduced computational cost by taking advantage of the hierarchical structure. The multilevel paradigm has been successfully employed in other settings such as polynomial regression [6], filtering [7], and stochastic collocation for random PDEs [8]. Furthermore, it inspired several multifidelity methods, which exploit multiple models at different accuracies, in various fields, from optimization to uncertainty quantification. The authors of [9] propose to construct an active subspace from a multifidelity Monte Carlo estimation of the gradient of the model output; its complexity, however, is limited by the Monte Carlo error.
In this work we present a multilevel version of the Active Subspace method (MLAS), inspired by [6], that builds a different active subspace for each accuracy level, hence retaining the desirable stability property of the AS method and, in turn, not suffering from the Monte Carlo error. Furthermore, we discuss how to perform the function approximation in the estimated active subspaces by an (adaptive) optimally-weighted least-squares polynomial fit.
In this section, we briefly review the AS method. We begin by introducing some basic objects and notation. Consider a Lipschitz domain \(\Gamma \subseteq {\mathbb{R}}^d\) with \(d \in {\mathbb{N}}\cup \{ \infty \}\) and let \(\mathcal{B}(\Gamma)\) be the Borel \(\sigma\)-algebra on \(\Gamma\). Let further \(\mu\) be a probability measure on \((\Gamma,\mathcal{B}(\Gamma))\). For \(1 \leq p \leq +\infty\) and \(\mathcal{V}\) a Banach space, let us consider the weighted \(L^p_\mu(\Gamma;\mathcal{V})\) Lebesgue-Bochner spaces with the standard definition, which we will mainly use with \(\mathcal{V} = {\mathbb{R}}\) or \(\mathcal{V} = \ell^2(d)\), where \(\ell^2(d)\) is the space of sequences \((v_1,\dots,v_d) \in {\mathbb{R}}^d\) endowed with the \(2\)-norm \(\left\lVert\cdot\right\rVert_{\ell^2(d)}\). Furthermore, we will omit \(\mathcal{V}\) and write \(L^p_\mu(\Gamma)\) when no confusion arises. Finally, let us introduce the weighted Sobolev space \(H^1_\mu(\Gamma)\) of square-integrable functions with square-integrable weak derivative.
We now consider a function \(f \in H^1_\mu(\Gamma)\). The goal is to approximate it by first reducing the dimensionality of the input space from \(d\) (which is large and possibly infinite) to \(r \ll d\). Then, a second approximation step on the reduced \(r\)-dimensional space is considered. The idea behind the AS method is to achieve these two goals by considering a model of the form \[\label{nrt-eq:approx-model} f \approx g \circ V^\ast,\tag{1}\] where \(g: {\mathbb{R}}^r \to {\mathbb{R}}\) is a measurable function. The dimensionality reduction is delegated to the linear map \({\boldsymbol{y}}\mapsto V^\ast{\boldsymbol{y}}\) with \(V \in {\mathbb{R}}^{d \times r}\) orthogonal, \(V^\ast\) being the adjoint of \(V\).
Let \(\mathrm{St}(d,r)\) denote the Stiefel manifold embedded in \({\mathbb{R}}^{d \times r}\), i.e. the set of orthogonal matrices \(V \in {\mathbb{R}}^{d \times r}\) such that \(V^\ast V = I_{r \times r}\). Throughout the paper, slightly abusing notation, we sometimes denote by \(V\) the subspace spanned by the columns of the matrix \(V \in \mathrm{St}(d,r)\). Then, \(\Pi_V\) denotes the orthogonal projector onto this subspace, that is \(\Pi_V {\boldsymbol{y}}= V V^\ast{\boldsymbol{y}}\), and \(\Pi_V^\perp\) the orthogonal projector onto its orthogonal complement. In order to approximate \(f\) with the model 1 , we have to choose both \(V\) and \(g\). A natural benchmark is to consider the best approximation in the \(L^2_\mu\) sense, namely the optimization problem \[\label{nrt-eq:general-AS-obj} \min_{ \substack{g:\Gamma_V \to {\mathbb{R}} \text{ measurable}, \\ V \in \mathrm{St}(d,r)} } \int_\Gamma (f({\boldsymbol{y}}) - g(V^\ast{\boldsymbol{y}}))^2 d\mu({\boldsymbol{y}}).\tag{2}\] Consider \(V \in \mathrm{St}(d,r)\) fixed and \(W \in \mathrm{St}(d,d-r)\) such that \(V V^\ast + W W^\ast = I\), \(I\) being the identity map in \({\mathbb{R}}^d\), so that any \({\boldsymbol{y}}\in \Gamma\) can be written as \[\label{nrt-eq:active-and-inactive-vars} {\boldsymbol{y}}= V {\boldsymbol{x}}+ W {\boldsymbol{z}}, \qquad {\boldsymbol{x}}:= V^\ast{\boldsymbol{y}}\in \Gamma_V, \quad {\boldsymbol{z}}:= W^\ast{\boldsymbol{y}}\in \Gamma_W,\tag{3}\] where \(\Gamma_V := V^\ast\Gamma = \{V^\ast{\boldsymbol{y}}\; : \; {\boldsymbol{y}}\in \Gamma \}\), and similarly for \(\Gamma_W\). The AS method distinguishes between active variables, \({\boldsymbol{x}}\), and inactive ones, \({\boldsymbol{z}}\). The active variables are the key parameters or dimensions that exert a significant influence on the output of the mathematical model represented by the function \(f\). Inactive variables, on the other hand, ideally have minimal impact on the system and can be discarded without significantly affecting the accuracy of the model.
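As a small numerical illustration of the decomposition 3 , consider the following NumPy sketch (our own; the construction of a complement \(W\) via a QR factorization is one of several possible choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2
V, _ = np.linalg.qr(rng.standard_normal((d, r)))           # V in St(d, r)
Q, _ = np.linalg.qr(np.hstack([V, rng.standard_normal((d, d - r))]))
W = Q[:, r:]                                               # complement: V V* + W W* = I

y = rng.standard_normal(d)
x, z = V.T @ y, W.T @ y                                    # active / inactive variables
assert np.allclose(V @ x + W @ z, y)                       # recovers the splitting in 3
```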
Letting \(Y\) be a random vector with distribution \(\mu\), we can define the marginal measures, denoted as \(d\mu_V\) and \(d\mu_W\), of the random vectors \(X := V^\ast Y\) and \(Z := W^\ast Y\), respectively. Define further the conditional measure \(d\mu_{Z \vert X}(\cdot \vert {\boldsymbol{x}})\) for \(Z\) given \(X = {\boldsymbol{x}}\). It is a well-known result [10] that the optimal solution of Problem 2 when performing the minimization over all measurable functions \(g:\Gamma_V \to {\mathbb{R}}\) is given by the conditional expectation \[\label{nrt-eq:best-g} g^\star({\boldsymbol{x}}) = \int_{\Gamma_{{\boldsymbol{x}}}} f(V {\boldsymbol{x}}+ W {\boldsymbol{z}}) d\mu_{Z \vert X}({\boldsymbol{z}}\vert {\boldsymbol{x}}),\tag{4}\] where \(\Gamma_{{\boldsymbol{x}}} := \{ {\boldsymbol{z}}\in {\mathbb{R}}^{d-r} \; \vert \; V {\boldsymbol{x}}+ W {\boldsymbol{z}}\in \Gamma \} \subseteq \Gamma_W\) denotes the support of the measure \(\mu_{Z \vert X}(\cdot \vert {\boldsymbol{x}})\). In other words, given \({\boldsymbol{x}}\in {\mathbb{R}}^r\), \(g^\star({\boldsymbol{x}})\) is the conditional average of \(f\) with respect to \(\mu\) over all points \({\boldsymbol{y}}\in {\mathbb{R}}^d\) which map onto \({\boldsymbol{x}}\) under \(V^\ast\). Note that, since \(f \in L^2_\mu(\Gamma)\), \(g^\star\) is unique as an element of \(L^2_{\mu_V}(\Gamma_V)\), and we may write \[g^\star \circ V^\ast = \Lambda_{L^2_\mu(\Gamma) / V} f,\] where we defined for a given closed subspace \({\mathcal{S}}\subseteq L^2_\mu(\Gamma)\) the orthogonal projector \(\Lambda_{{\mathcal{S}}}: L^2_\mu(\Gamma;{\mathcal{V}}) \to {\mathcal{S}}\otimes {\mathcal{V}}\) by \[\label{nrt-eq:L2-proj} \Lambda_{{\mathcal{S}}} f := \operatorname*{argmin}_{v \in {\mathcal{S}}\otimes {\mathcal{V}}} \left\lVert f - v\right\rVert_{L^2_\mu(\Gamma)},\tag{5}\] and \(L^2_\mu(\Gamma) / V\) denotes the space of square-integrable functions which depend only on \(V^\ast {\boldsymbol{y}}\), i.e., which are constant with respect to directions orthogonal to \(V\). We can hence reduce 2 to the problem \[\label{nrt-eq:approx-err} \min_{ V \in \mathrm{St}(d,r) } \left\lVert f - g^\star \circ V^\ast\right\rVert_{L^2_\mu(\Gamma)}^2 = \int_{\Gamma} (f({\boldsymbol{y}}) - g^\star(V^\ast{\boldsymbol{y}}))^2 d\mu({\boldsymbol{y}}).\tag{6}\] Interestingly, the error in this construction can be bounded assuming a Poincaré-type inequality.
Assumption 1. (Sliced Poincaré inequality) There exists a constant \(C_P > 0\) such that, for any \({\boldsymbol{x}}\in \Gamma_V\) and \(h \in H^1_{\mu_{Z \vert X}(\cdot \vert {\boldsymbol{x}})}(\Gamma_{{\boldsymbol{x}}})\) with zero mean, it holds \[\left( \int_{\Gamma_{{\boldsymbol{x}}}} h({\boldsymbol{z}})^2 d\mu_{Z \vert X}({\boldsymbol{z}}\vert {\boldsymbol{x}}) \right)^{\frac{1}{2}} \leq C_P \left( \int_{\Gamma_{{\boldsymbol{x}}}} \left\lVert\nabla h({\boldsymbol{z}})\right\rVert^2_{\ell^2(d-r)} d\mu_{Z \vert X}({\boldsymbol{z}}\vert {\boldsymbol{x}}) \right)^{\frac{1}{2}}.\]
The Poincaré inequality is inherently tied to the domain \(\Gamma\), the measure \(\mu\), and the subspace \(V\). The setting we are most interested in is one where the \(y_i\)’s denote independent and identically distributed random variables, hence we focus on the case where the domain \(\Gamma\) is expressed as a Cartesian product space and the measure \(\mu\) is of tensor product type, namely \(\Gamma = \times_{i=1}^d \Gamma_i\) and \(\mu = \otimes_{i=1}^d \mu_i\), where each \(\Gamma_i \subseteq \mathbb{R}\) and \(\mu_i\) is a univariate measure on \(\Gamma_i\). The most common settings are the ones where \(\mu_i\) is the standard Gaussian measure on \(\Gamma_i = \mathbb{R}\), or the uniform measure with \(\Gamma_i = [-1,1]\). One can show [11], [12] that in the former case \(C_P = 1\), while in the latter \(C_P\) scales with \(\sqrt{d}\). These results suggest that the AS method is better suited for the Gaussian case.
Theorem 1. [4] Let Assumption 1 hold; then, for a fixed \(V \in \mathrm{St}(d,r)\), the \(L^2_\mu\) error of the approximation \(g^\star \circ V^\ast\), where \(g^\star\) is defined in 4 , satisfies \[\label{nrt-eq:err-bound} \left\lVert f - g^\star \circ V^\ast\right\rVert_{L^2_\mu(\Gamma)} \leq C_P \left\lVert(I-\Pi_{V}) \nabla f\right\rVert_{L^2_\mu(\Gamma)}.\qquad{(1)}\]
This result states that an upper bound for the best reconstruction error of \(f\) from the \(r\)-dimensional subspace \(V\) is given by the error of projecting the gradient of \(f\) onto the subspace \(V\). Instead of minimizing the reconstruction error 6 over \(\mathrm{St}(d,r)\), owing to Theorem 1 one may minimize the upper bound in ?? . This problem, indeed, turns out to have a simple solution given by the subspace spanned by the first \(r\) dominant eigenvectors of \[\label{nrt-eq:cov} C := \int_\Gamma \nabla f({\boldsymbol{y}}) \otimes \nabla f({\boldsymbol{y}}) d \mu({\boldsymbol{y}}),\tag{7}\] where \(\otimes\) denotes the outer product in \({\mathbb{R}}^d\). This quantity represents the second moment of \(\nabla f\), yet, in accordance with the prevailing terminology in the AS literature, we hereafter refer to it as the covariance. In practice, in order to compute the active subspace, the integral in 7 must be discretized, which can be done, e.g., using a Monte Carlo estimator based on a random sample \(\{{\boldsymbol{y}}_i\}_{i =1}^M \stackrel{\text{iid}}{\sim} \mu\) of size \(M\). This naturally leads to the straightforward Algorithm 1 as the core of the AS methodology. There, \(\mathrm{SVD}(C,r)\) denotes the truncated SVD of the matrix \(C\) at rank \(r\).
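To fix ideas, a minimal NumPy sketch of this construction (Monte Carlo estimate of 7 followed by a truncated SVD) could read as follows; the routine names grad_f and sample_mu and the toy ridge function are our own illustrative choices, not part of the method's specification:

```python
import numpy as np

def active_subspace(grad_f, sample_mu, r, M, rng=np.random.default_rng(0)):
    """Estimate a rank-r active subspace from M Monte Carlo gradient samples.

    grad_f    : callable y -> gradient of f at y (array of length d)
    sample_mu : callable rng -> one draw y from the input measure mu
    Returns (U_hat, s): U_hat in St(d, r) and the singular values of the
    empirical covariance, useful to inspect the spectral gap.
    """
    grads = np.array([grad_f(sample_mu(rng)) for _ in range(M)])   # M x d
    C_hat = grads.T @ grads / M              # Monte Carlo estimate of C in 7
    U, s, _ = np.linalg.svd(C_hat)           # symmetric, so SVD = eigendecomposition
    return U[:, :r], s                       # truncation at rank r

# toy ridge function f(y) = sin(a . y) with mu = N(0, I_d): one active direction
d = 20
a = np.ones(d) / np.sqrt(d)
U_hat, s = active_subspace(grad_f=lambda y: np.cos(a @ y) * a,
                           sample_mu=lambda rng: rng.standard_normal(d),
                           r=1, M=50)
print("leading singular values:", s[:3])
print("alignment with the true direction:", abs(U_hat[:, 0] @ a))
```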
Note that, given a rank \(r\), to run Algorithm 1 one should choose a sample size such that a good reconstruction \(\hat{U}_r\) of the optimal subspace \(U_r\) is achieved, that is, \(M\) should be chosen large enough so that \(\left\lVert(\Pi_{U_r}-\Pi_{\hat{U}_r}) \nabla f\right\rVert_{L^2_\mu(\Gamma)}\) is small enough. Following [4], we make the following assumption.
Assumption 2. There exists \(\eta > 0\) such that, provided \(M \geq {\mathcal{O}}(\kappa r^\eta \log(r))\) for some \(\kappa > 1\), \[\label{nrt-eq:} \left\lVert(I-\Pi_{\hat{U}_r}) \nabla f\right\rVert_{L^2_\mu(\Gamma)} \lesssim \left\lVert(I-\Pi_{U_r}) \nabla f\right\rVert_{L^\infty_\mu(\Gamma)}\qquad{(2)}\] holds with probability \(1 - {\mathcal{O}}(M^{-\kappa})\).
In scientific computing applications we often face the impossibility of evaluating the function \(f\) and its gradient exactly. This can be, for instance, due to the fact that the relevant computational model involves the discretization of an ODE or a PDE. Hence, in practice we have access only to approximations \(f_l \approx f\) and, consequently, \(\nabla f_l \approx \nabla f\), \(l \in {\mathbb{N}}\) being the level of approximation with \(f_\infty := f\). For instance, in a PDE setting, the level \(l\) could be related to a mesh discretization with mesh size \(h_l\). This entails that the larger \(l\) is, the higher is the computational cost needed to evaluate \(f_l\) and \(\nabla f_l\).
A single-level AS (SLAS) approximation consists in choosing a level \(L \in {\mathbb{N}}\) and running Algorithm 1 with input parameters \((f_L, r_L, M_L)\). Denoting the orthogonal matrix computed by the SLAS algorithm by \(\hat{U}_L\), let \[\mathcal{S}^{\mathrm{SL},\star}_L(f_\infty) := g_L^\star \circ \hat{U}_L^\ast = \Lambda_{L^2_\mu(\Gamma)/\hat{U}_{L}} f_L,\] be the idealized single-level estimator, which satisfies the single-level error decomposition \[\label{nrt-eq:SL-err-dec} \left\lVert f_\infty - \mathcal{S}^{\mathrm{SL},\star}_L(f_\infty)\right\rVert_{L^2_\mu(\Gamma)} \leq \left\lVert f_\infty - f_L\right\rVert_{L^2_\mu(\Gamma)} + C_P \left\lVert(I - \hat{\Pi}_L) \nabla f_L\right\rVert_{L^2_\mu(\Gamma)},\tag{8}\] where we defined the projector \(\hat{\Pi}_L := \hat{U}_L(\hat{U}_L)^\ast\). Note that the above only involves the function at level \(L\). If a sufficiently small error tolerance is requested, \(L\) should be chosen large enough to control the first error contribution in 8 and many evaluations of \(\nabla f_L\) must be utilized for the approximation of the covariance, making the AS method prohibitive in terms of computational cost. The key observation is that the single-level AS method does not exploit the flexibility we have of querying \(f\) at different levels. Hence, in this section, we aim at designing an improved method which can exploit the whole hierarchy \(f_0,f_1,\dots\) of discretizations available.
Our idea is to introduce the multilevel paradigm in the AS method. The multilevel paradigm is a powerful computational strategy revolving around the use of hierarchies of discretization levels to reduce the complexity of computational routines. A prime example of this approach is the multilevel Monte Carlo (MLMC) method [5], where the objective is to compute an expectation of the form \(\mathbb{E}\left[f_L\right]\) with reduced complexity. MLMC is based on the observation that \(\mathbb{E}\left[\cdot\right]\) is a linear operator, so that one can decompose the original expectation with the telescoping sum \[\mathbb{E}\left[f_L\right] = \mathbb{E}\left[f_0\right] + \sum_{l=1}^L \mathbb{E}\left[f_l - f_{l-1}\right].\] Then, the idea is to approximate the expectations \(\mathbb{E}\left[f_l - f_{l-1}\right]\) separately with a sample size which is large at coarse levels of discretization and small at finer levels, where the differences \(\Delta_l := f_l - f_{l-1}\) are small but expensive to evaluate. In our setting, however, the operator is given by the truncated SVD of the gradient, which is nonlinear, hence the telescoping argument cannot be straightforwardly applied. We take inspiration from [6], which deals with a similar issue. Given a sequence of ranks \(r_0 < \dots < r_L\), our idea is to first build an approximation \(f_0 \approx g_0^\star \circ \hat{U}_{L,0}^\ast\) using \(M_{L}\) gradient evaluations, that is, running Algorithm 1 with input parameters \((f_0,r_L,M_L)\). Then, we correct it with an approximation of \(\Delta_1 \approx g_1^\star \circ \hat{U}_{L-1,1}^\ast\) using \(M_{{L-1}}\) evaluations, running Algorithm 1 with input parameters \((\Delta_1,r_{L-1},M_{L-1})\), and iterate in this manner, computing approximations \(\Delta_l \approx g_l^\star \circ \hat{U}_{L-l,l}^\ast\) until the finest level \(L\) is reached. We then define the Multilevel Active Subspaces (MLAS) method \[\mathcal{S}_L^{\mathrm{ML},\star}(f_\infty) := \sum_{l =0}^L g_l^\star \circ \hat{U}_{L-l,l}^\ast,\] where we used the auxiliary definition \(f_{-1} := 0\). Since the sample sizes \(M_l\) are linked to the subspace dimensions \(r_l\) by Assumption 2, it holds that \(M_{0} < \dots < M_L\), hence this procedure uses only a few gradient evaluations for the differences at finer levels and progressively enlarges the subspace dimension while taking more gradient evaluations at coarser and cheaper levels. The strategy is summarized in Algorithm 2.
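A hedged sketch of this idealized multilevel loop is given below (our own code; the ideal reconstructions \(g_l^\star\) are not computed, only the per-level subspaces, and the caller is assumed to supply the gradients of the \(f_l\) as well as the sequences of ranks and sample sizes):

```python
import numpy as np

def mlas_subspaces(grad_f_levels, sample_mu, ranks, samples,
                   rng=np.random.default_rng(0)):
    """Estimate one active subspace per term of the multilevel decomposition.

    grad_f_levels : list of callables, grad_f_levels[l](y) = gradient of f_l at y
                    (the difference Delta_l = f_l - f_{l-1} uses f_{-1} := 0)
    ranks         : r_0 < ... < r_L   (used in reversed order across levels)
    samples       : M_0 < ... < M_L   (idem: many cheap samples at coarse levels)
    Returns the list [U_{L,0}, U_{L-1,1}, ..., U_{0,L}].
    """
    L = len(grad_f_levels) - 1
    subspaces = []
    for l in range(L + 1):
        r_l, M_l = ranks[L - l], samples[L - l]
        def grad_delta(y, l=l):              # gradient of Delta_l
            g = grad_f_levels[l](y)
            return g if l == 0 else g - grad_f_levels[l - 1](y)
        grads = np.array([grad_delta(sample_mu(rng)) for _ in range(M_l)])
        C_hat = grads.T @ grads / M_l        # empirical covariance of grad Delta_l
        U, _, _ = np.linalg.svd(C_hat)
        subspaces.append(U[:, :r_l])         # rank r_{L-l} subspace U_{L-l,l}
    return subspaces
```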
Defining the projectors \(\hat{\Pi}_{L-l,l} := \hat{U}_{L-l,l} (\hat{U}_{L-l,l})^\ast\), we have the multilevel error decomposition \[\label{nrt-eq:ml-err-decomp} \left\lVert f_\infty - \mathcal{S}_L^{\mathrm{ML},\star}(f_\infty)\right\rVert_{L^2_\mu(\Gamma)} \leq \left\lVert f_\infty - f_L\right\rVert_{L^2_\mu(\Gamma)} + C_P \sum_{l = 0}^L \left\lVert(I-\hat{\Pi}_{L-l,l}) \nabla \Delta_l\right\rVert_{L^2_\mu(\Gamma)},\tag{9}\] which reveals a spatial approximation component at level \(L\), as in the single-level method, plus a subspace approximation component for the differences at levels \(l=0,\ldots,L\). The latter are referred to as mixed differences in the literature since they couple spatial and subspace approximation.
In order to analyze the complexity (cost versus accuracy) of the MLAS method, we make some assumptions on the decay of the mixed differences. We introduce a function class \({\mathcal{F}}\) of continuous functions on \(\Gamma\) endowed with a seminorm \(\left\lvert\cdot\right\rvert_{{\mathcal{F}}}\) and define the subspace approximation error at rank \(r\) in the function class \({\mathcal{F}}\), \(r \leq d\), as follows \[e_{r} ({\mathcal{F}}) := \sup_{f \in {\mathcal{F}}} \frac{\left\lVert(I - \Pi_{U(f)}) \nabla f\right\rVert_{L^\infty_\mu(\Gamma)}}{\left\lvert f\right\rvert_{{\mathcal{F}}}},\] where \(U(f) \in \mathrm{St}(d,r)\) denotes the optimal rank-\(r\) subspace for \(\nabla f\) in the \(L^2_\mu(\Gamma)\) sense. This enables us to make an assumption on the rate at which this error decays.
Assumption 3. (Strong subspace approximability and convergence of approximations) It holds \[e_{r}({{\mathcal{F}}}) \lesssim r^{-{\alpha}}\] for some \(\alpha > 0\). Furthermore, given \(f_\infty \in {{\mathcal{F}}}\), there exist approximations \(f_l \in {{\mathcal{F}}}\) such that \[\left\lVert\Delta_l\right\rVert_{L^2_\mu(\Gamma)} \lesssim h_l^{\beta_w} \quad \mathrm{and} \quad \left\lvert\Delta_l\right\rvert_{{\mathcal{F}}} \lesssim h_l^{\beta_s}\] for some \(0 < \beta_s \leq \beta_w\).
As anticipated, since the approximation level \(l\) is connected to the granularity \(h_l\) of the discretization of the computational model, we consider the setting where the smaller \(h_l\) is (i.e., the larger \(l\) is) the costlier it is to evaluate \(\nabla f_l\). We call “\(\mathrm{Work}\)” the computational cost of all gradient evaluations (we assume that the cost of the \(\mathrm{SVD}\)’s is negligible with respect to the cost of collecting all gradient evaluations). The following assumption makes this intuition precise.
Assumption 4. (Sample work) One evaluation of \(\nabla f_{l}\) requires a computational effort which is \(\mathrm{Work}(\nabla f_{l}) \lesssim h_l^{-\gamma}\), for some \(\gamma > 0\).
Given these assumptions, we can state the following complexity result.
Theorem 2. (Idealized multilevel – complexity) Denote by \[\mathrm{Work}(\mathcal{S}_L^{\mathrm{ML},\star}(f_\infty)) := \sum_{l=0}^L M_{L-l} (\mathrm{Work}(\nabla f_l) + \mathrm{Work}(\nabla f_{l-1}))\] the work that \(\mathcal{S}_L^{\mathrm{ML},\star}(f_\infty)\) requires for evaluations of the functions \(\nabla f_l\), \(l = 0,\ldots,L\). Define \[\rho_{\mathrm{ML}} = \begin{cases} \frac{\eta}{\alpha}, & \text{if } \frac{\gamma}{\beta_s} \leq \frac{\eta}{\alpha}, \\ \theta \frac{\gamma}{\beta_s} + (1-\theta) \frac{\eta}{\alpha}, & \text{otherwise}, \end{cases}\] where \(\theta := \frac{\beta_s}{\beta_w}\), and \[t := \begin{cases} 2 & \text{if } \frac{\gamma}{\beta_s} < \frac{\eta}{\alpha}, \\ 3 + \frac{1}{\alpha} & \text{if } \frac{\gamma}{\beta_s} = \frac{\eta}{\alpha}, \\ 1 & \text{if } \frac{\gamma}{\beta_s} > \frac{\eta}{\alpha} \text{ and } \beta_w = \beta_s, \\ 2 & \text{if } \frac{\gamma}{\beta_s} > \frac{\eta}{\alpha} \text{ and } \beta_w > \beta_s. \end{cases}\] Let \(0<\varepsilon\lesssim 1\). Then, there exist \(L = L(\varepsilon) \in {\mathbb{N}}\), \(r_0,\dots,r_L\), and \(M_0,\dots, M_L\) such that \[\mathrm{Work}(\mathcal{S}^{\mathrm{ML}, \star}_L( f_\infty)) \lesssim \left\lvert\log(\varepsilon)\right\rvert^{-t}\log(\left\lvert\log(\varepsilon)\right\rvert) \varepsilon^{-\rho_{\mathrm{ML}}}\] and such that in an event \(E\) with \(\mathbb{P}\left(E^c\right) \lesssim \varepsilon^{\log \left\lvert\log(\varepsilon)\right\rvert}\) the multilevel approximation satisfies \[\left\lVert f_\infty - \mathcal{S}^{\mathrm{ML},\star}_L(f_\infty)\right\rVert_{L^2_\mu(\Gamma)} \lesssim \varepsilon.\]
Proof. This proof closely follows that of [6]. The only adjustment needed is to address the fact that the (polynomial) projection in [6] requires \({\mathcal{O}}(r \log(r))\) (function) evaluations, whereas our (active subspace) projection requires \({\mathcal{O}}(r^\eta \log(r))\) (gradient) evaluations. ◻
In this section we aim at developing a fully discrete version of the MLAS method for function approximation, giving a practical recipe for the approximation of the idealized reconstruction function 4 . In particular, we focus on discrete least-squares fitting based on random evaluations with optimal sampling as proposed in [13]. Recall the definition of the \(L^2_\mu(\Gamma)\)-orthogonal projector in 5 and let us fix an \(m\)-dimensional space \({\mathcal{S}}_m \subseteq L^2_\mu(\Gamma;{\mathbb{R}})\) of functions defined everywhere on \(\Gamma\). Furthermore, assume that for any \({\boldsymbol{y}}\in \Gamma\), there exists \(v \in {\mathcal{S}}_m\) such that \(v({\boldsymbol{y}}) \neq 0\). Let \(\nu\) be a probability measure on \(\Gamma\) such that \(\mu \ll \nu\), and consider an i.i.d. random sample \(\{ {\boldsymbol{y}}_j \}_{j=1}^N\) from \(\nu\). Then, the least-squares estimator is given by \[\hat{\Lambda}_{{\mathcal{S}}_m} g = \mathrm{argmin}_{v \in {\mathcal{S}}_m \otimes {\mathcal{V}}} \frac{1}{N} \sum_{i=1}^N w({\boldsymbol{y}}_i) \left\lVert v({\boldsymbol{y}}_i)-g({\boldsymbol{y}}_i)\right\rVert_{\mathcal{V}}^2,\] where \(w := \frac{d\mu}{d\nu}\). The solution to this least-squares problem is provided by the well-known normal equations, which are described next. We first introduce the empirical semi-norm for \({\mathcal{V}}\)-valued functions \(v: \Gamma \to {\mathcal{V}}\) \[\left\lVert v\right\rVert_{N,{\mathcal{V}}}^2 = \frac{1}{N} \sum_{i=1}^N w({\boldsymbol{y}}_i) \left\lVert v({\boldsymbol{y}}_i)\right\rVert_{{\mathcal{V}}}^2,\] which is a Monte Carlo approximation of the squared \(L^2_\mu(\Gamma;{\mathcal{V}})\)-norm. Assume now \(\{ \phi_1,\dots,\phi_m \}\) is an \(L^2_\mu(\Gamma)\)-orthonormal basis of \({\mathcal{S}}_m\) and define the matrices \(G \in {\mathbb{R}}^{m \times m}\) and \(J \in {\mathcal{V}}^m\) \[G_{ij} = \frac{1}{N} \sum_{k=1}^{N} w({\boldsymbol{y}}_k) \phi_i({\boldsymbol{y}}_k)\phi_j({\boldsymbol{y}}_k) = \left \langle \phi_i, \phi_j \right \rangle_{N}, \quad J_i := \left \langle g, \phi_i \right \rangle_{N}.\] Then, the normal equations read \[G c = J,\] where \(c \in {\mathcal{V}}^m\) is such that \(\hat{\Lambda}_{{\mathcal{S}}_m} g = \sum_{i=1}^m c_i \phi_i\). It turns out that, whenever \(G\) is well conditioned, the reconstruction error is quasi-optimal, namely it is bounded by the best approximation error \[\label{nrt-eq:poly-best-err} e_m(g) = \min_{v \in {\mathcal{S}}_m \otimes {\mathcal{V}}} \left\lVert g-v\right\rVert_{L^2_\mu(\Gamma;{\mathcal{V}})}\tag{10}\] up to a constant factor.
Then, the idea is to choose the sample size \(N\) big enough so that \(\left\lVert G-I\right\rVert < \delta\) with high probability. Let us introduce the function \[{\boldsymbol{y}}\mapsto k_{m,w}({\boldsymbol{y}}) = \sum_{j=1}^m w({\boldsymbol{y}}) \phi_j({\boldsymbol{y}})^2,\] whose reciprocal is known as the Christoffel function [14], and define \[K_{m,w}:= \left\lVert k_{m,w}\right\rVert_{L^\infty(\Gamma)},\] which trivially satisfies \(K_{m,w} \geq m\) since \(\int_\Gamma k_{m,w} d \mu = m\). We can now state the following theorem.
Theorem 3. (From [13]) For any \(t>0\), if \(m\) and \(N\) are such that \[K_{m,w} \leq \kappa \frac{N}{\log(N)}, \quad \mathrm{with} \quad \kappa =\frac{1- \log(2)}{2+2t},\] then the weighted least-squares estimator satisfies \[\mathbb{E}\left[\left\lVert g-\hat{\Lambda}_{{\mathcal{S}}_m} g\right\rVert^2_{L^2_\mu(\Gamma;{\mathcal{V}})}\right] \leq (1 + \epsilon(N)) e_m(g)^2,\] where \(e_m(g)\) is defined in 10 and \(\epsilon(N) \to 0\) as \(N \to \infty\), with probability at least \(1-N^{-t}\).
The above theorem suggests that, to reduce the sample size as much as possible, one should choose the sampling distribution \(\nu\) that minimizes \(K_{m,w}\). In this sense, the optimal sampling measure \(\nu^\star\) is given by \[\frac{1}{w^\star} := \frac{d \nu^\star}{d \mu}({\boldsymbol{y}}) = \frac{1}{m} \sum_{j=1}^m \phi_j({\boldsymbol{y}})^2,\] which yields the optimal value \(K_{m,w^\star} = m\).
Example 1. Let \(\Gamma={\mathbb{R}}\) with Gaussian measure \(\mu\), \({\mathcal{V}}= {\mathbb{R}}\), and consider the space of polynomials of degree smaller than \(m\), that is \({\mathcal{S}}_m = \mathbb{P}_{m-1}\); then we may take \(\phi_j = H_{j-1}\), \(H_j\) being the \(j\)-th normalized Hermite polynomial. When the sampling measure is \(d\nu = d\mu\) and \(w \equiv 1\), \(K_{m,w}\) is unbounded. However, if we consider the optimal sampling measure with density \(\frac{1}{m} \sum_{j=0}^{m-1} H_j(y)^2\) with respect to \(\mu\), drawing \({\mathcal{O}}(m \log(m))\) samples is sufficient to achieve a quasi-optimal reconstruction error.
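As an illustration of this example, the following sketch (our own; the grid-based inverse-transform sampler is a pragmatic stand-in for the exact sampling schemes discussed in [13], and the target function is arbitrary) draws from the optimal measure and assembles the normal equations of the previous paragraph:

```python
import numpy as np

def hermite_basis(y, m):
    """Orthonormal (probabilists') Hermite polynomials H_0, ..., H_{m-1} at points y."""
    y = np.atleast_1d(y)
    H = np.zeros((m, y.size))
    H[0] = 1.0
    if m > 1:
        H[1] = y
    for n in range(1, m - 1):
        H[n + 1] = (y * H[n] - np.sqrt(n) * H[n - 1]) / np.sqrt(n + 1)
    return H

def sample_optimal_measure(m, N, rng, grid=np.linspace(-12, 12, 20001)):
    """Draw N samples from the density (1/m) sum_j H_j(y)^2 dmu(y) by numerical
    inverse-transform sampling on a fine grid (an approximate, pragmatic sampler)."""
    gauss = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
    dens = (hermite_basis(grid, m) ** 2).mean(axis=0) * gauss
    cdf = np.cumsum(dens)
    return np.interp(rng.random(N), cdf / cdf[-1], grid)

def weighted_ls_fit(g, m, N, rng=np.random.default_rng(0)):
    """Optimally weighted least-squares fit of g onto polynomials of degree < m."""
    y = sample_optimal_measure(m, N, rng)
    Phi = hermite_basis(y, m).T                    # N x m design matrix
    w = m / (Phi ** 2).sum(axis=1)                 # w = dmu/dnu* = m / sum_j H_j(y)^2
    G = (Phi * w[:, None]).T @ Phi / N             # Gram matrix, close to the identity
    J = (Phi * w[:, None]).T @ g(y) / N
    return np.linalg.solve(G, J)                   # coefficients in the Hermite basis

# fit g(y) = exp(y/2) with m = 10 and N of the order of m log(m)
coeffs = weighted_ls_fit(lambda y: np.exp(y / 2), m=10, N=200)
```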
We now turn to the combination of the AS method and weighted least-squares for the full approximation of the target function \(f:\Gamma \to {\mathbb{R}}\). For simplicity, here we consider the case where \(\mu\) is a tensorized Gaussian, so that Assumption 1 holds with \(C_P = 1\), the projected measure \(\mu_V\) on the active variables is a tensorized Gaussian for any \(V \in \mathrm{St}(d,r)\), and the conditional measure \(\mu_{Z \vert X}(\cdot \vert {\boldsymbol{x}})\) is a tensorized Gaussian, independent of \({\boldsymbol{x}}\in \Gamma_V\), which coincides with the projected measure \(\mu_W\) on the orthogonal complement of the active subspace, where we defined \(W \in \mathrm{St}(d,d-r)\) such that \(W W^\ast= (I - V V^\ast)\). Note that, in the following, we use \(W\) repeatedly to denote the subspace corresponding to the inactive variables. Introduce now the set of multi-indices \({\mathcal{L}}:= \{ {\boldsymbol{\nu}}\in {\mathbb{N}}^d \,:\, \sup_{j \,:\, \nu_j \neq 0} j < \infty \}\), \({\mathcal{L}}_r := \{ {\boldsymbol{\nu}}\in {\mathcal{L}}\,:\, \nu_j = 0 \, \forall \, j>r \}\), and \({\mathcal{L}}_{>r} := \{ {\boldsymbol{\nu}}\in {\mathcal{L}}\,:\, \exists \, j>r \, \mathrm{s.t.}\, \nu_j \neq 0\}\). We can then define the tensorized Hermite polynomial in the rotated variables \(({\boldsymbol{x}}, {\boldsymbol{z}})\) associated to a multi-index \({\boldsymbol{\nu}}\in {\mathcal{L}}_r\) as \[H_{\boldsymbol{\nu}}({\boldsymbol{x}},{\boldsymbol{z}}) := \prod_{j = 1}^r H_{\nu_j}(x_j),\] \(H_{\nu}\) being the one-dimensional normalized Hermite polynomial of degree \(\nu \in {\mathbb{N}}\). Let us introduce a polynomial space \(P_{\Xi_{m,r}}\) of dimension \(m\) spanned by polynomials constant in the inactive variables, that is \(P_{\Xi_{m,r}} := \mathrm{span}\{ H_{{\boldsymbol{\nu}}} \,:\, {\boldsymbol{\nu}}\in \Xi_{m,r} \}\), where \(\Xi_{m,r} \subset {\mathcal{L}}_r\) has cardinality \(m\). Note that for such a polynomial space the optimal sampling measure is given by \[\frac{d \nu^\star}{d \mu}({\boldsymbol{x}}, {\boldsymbol{z}}) = \frac{d \nu_V^\star}{d \mu_V}({\boldsymbol{x}})\] since the polynomials are constant with respect to the inactive variables. Hence, once the active subspace \(\hat{U}_L\) has been computed as in Algorithm 1, we further draw a pair of i.i.d. samples \(\{ {\boldsymbol{x}}_j \}_{j=1}^{N_L}\) and \(\{ {\boldsymbol{z}}_j \}_{j=1}^{N_L}\) from the optimal measure \(\nu_V^\star\) and the Gaussian measure \(\mu_W\), respectively. Then, defining \(\hat{W}_L \in \mathrm{St}(d,d-r)\) such that \(\hat{W}_L \hat{W}_L^\ast= (I - \hat{U}_L \hat{U}_L^\ast)\), the single-level estimator is defined by \[\begin{align} {\mathcal{S}}_L^{\mathrm{SL}}(f) := & \left( \mathrm{argmin}_{v \in P_{\Xi_{m,r}}} \frac{1}{N_L} \sum_{i=1}^{N_L} w({\boldsymbol{x}}_i) \left(v({\boldsymbol{x}}_i)-f(\hat{U}_L {\boldsymbol{x}}_i + \hat{W}_L {\boldsymbol{z}}_i)\right)^2 \right) \circ \hat{U}_L^\ast\\ = & \left( \hat{\Lambda}_{P_{\Xi_{m,r}}} f \right) \circ \hat{U}_L^\ast.
\end{align}\] The single-level error decomposition \[\label{nrt-eq:SLASPA-err-dec} \begin{align} \left\lVert f_\infty - \mathcal{S}^{\mathrm{SL}}_L(f_\infty)\right\rVert_{L^2_\mu(\Gamma)} & \leq \left\lVert f_\infty - f_L\right\rVert_{L^2_\mu(\Gamma)} + C_P \left\lVert(I - \hat{\Pi}_L) \nabla f_L\right\rVert_{L^2_\mu(\Gamma)} \\ & + \left\lVert g_L^\star - \hat{\Lambda}_{P_{\Xi_{m,r}}} f_L\right\rVert_{L^2_{\mu_{\hat{U}_L}}(\Gamma_{\hat{U}_L})} \end{align}\tag{11}\] reveals the fact that three errors should be balanced: the spatial approximation error, the SVD error of the active subspace, and the polynomial approximation error in the active subspace.
For the multilevel strategy, we simply repeat the procedure just described at each level of approximation. More precisely, let us consider a sequence of ranks \(r_0 < r_1 < \dots<r_L\) and a sequence of polynomial spaces of sizes \(m_0 < m_1 < \dots < m_L\). Then, once the active subspace \(\hat{U}_{L-l,l}\) at level \(l\) of size \(r_{L-l}\) has been computed as in Algorithm 1, we further draw a pair of \(N_{L-l} = {\mathcal{O}}(m_{L-l} \log(m_{L-l}))\) i.i.d. samples \(\{ {\boldsymbol{x}}_j \}_{j=1}^{N_{L-l}}\) and \(\{ {\boldsymbol{z}}_j \}_{j=1}^{N_{L-l}}\) from the optimal measure associated with the polynomial space on \(\hat{U}_{L-l,l}\) and from the Gaussian measure on its orthogonal complement, respectively. This defines the empirical projector \(\hat{\Lambda}_{L-l,L-l} := \hat{\Lambda}_{P_{\Xi_{m_{L-l},r_{L-l}}}}\). Then, we define the multilevel estimator as \[\mathcal{S}_L^{\mathrm{ML}}(f_\infty) := \sum_{l =0}^L \left( \hat{\Lambda}_{L-l,L-l} \Delta_{l} \right) \circ \hat{U}_{L-l,l}^\ast,\] with \(f_{-1} := 0\). The full procedure is detailed in Algorithm 3. The multilevel error decomposition then reads \[\label{nrt-eq:mlaspa-err-decomp} \begin{align} \left\lVert f_\infty - \mathcal{S}_L^{\mathrm{ML}}(f_\infty)\right\rVert_{L^2_\mu(\Gamma)} & \leq \left\lVert f_\infty - f_L\right\rVert_{L^2_\mu(\Gamma)} + C_P \sum_{l = 0}^L \left\lVert(I-\hat{\Pi}_{L-l,l}) \nabla \Delta_l\right\rVert_{L^2_\mu(\Gamma)} \\ & + \sum_{l=0}^L \left\lVert g_l^\star - \hat{\Lambda}_{{L-l,L-l}} \Delta_l\right\rVert_{L^2_{\mu_{\hat{U}_{L-l,l}}}(\Gamma_{\hat{U}_{L-l,l}})}. \end{align}\tag{12}\]
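Structurally, evaluating the resulting multilevel surrogate only requires storing one orthogonal matrix and one low-dimensional polynomial per level, e.g. (our own minimal sketch, with the fitted polynomials passed in as callables):

```python
import numpy as np

class MLASEstimator:
    """Evaluates S_L^ML(y) = sum_l p_l(U_l^T y), where U_l is the active subspace
    of level l and p_l the polynomial approximation of Delta_l fitted on it."""

    def __init__(self, subspaces, polys):
        self.subspaces = subspaces   # list of d x r_{L-l} orthogonal matrices
        self.polys = polys           # list of callables on the active variables

    def __call__(self, y):
        return sum(p(U.T @ y) for U, p in zip(self.subspaces, self.polys))
```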
In order to choose \(L\), \((r_0,\dots,r_L)\), \((M_0,\dots,M_L)\), and \((\Xi_{m_0, r_0}, \dots,\Xi_{m_L,r_L})\) to run Algorithm 3 – that is, in order to balance all error contributions involved – we need to know a priori estimates for the strong subspace approximation error, the spatial discretization error associated with \(f_{\infty} \approx f_l\), and the cost per evaluation of \(f_l\) and \(\nabla f_l\), and we need to be able to design an adequate sequence of polynomial spaces to use in each subspace \(\hat{U}_{L-l,l}\). This is not always possible in practice; hence, in this section we introduce an adaptive Multilevel Active Subspaces algorithm with polynomial approximation (AMLASPA), which does not require knowledge of the aforementioned information.
As far as the polynomial approximation is concerned, we assume we have a routine \(\texttt{Polyapprox}(h,V,\mathrm{tol})\) which, given a function \(h \in L^2_\mu(\Gamma)\), an orthogonal matrix \(V \in \mathrm{St}(d,r)\), and a tolerance \(\mathrm{tol}\), adaptively builds and returns a polynomial approximant of \(\Lambda_{L^2_\mu(\Gamma)/V} h\) on the \(r\) variables defined by \({\boldsymbol{x}}= V^\ast{\boldsymbol{y}}\) whose \(L^2_{\mu_V}(\Gamma_V)\)-error is below the prescribed tolerance \(\mathrm{tol}\), along with the work needed for its construction (e.g., proportional to the number of evaluations of \(h\) used to fit the polynomial). Furthermore, we assume that \(\texttt{Polyapprox}\) can take a warm start, that is, we can start building the polynomial approximant from an existing index set. See [6] for more details on how such a function may be constructed. Then, when we use this function in the AMLASPA, if we have already built a polynomial approximant based on parameters \((\Delta_l,V,\mathrm{tol})\) at an earlier iteration, and we wish to build a polynomial approximant with parameters \((\Delta_l,V',\mathrm{tol}')\) with \(\mathrm{dim}(V')>\mathrm{dim}(V)\) and \(\mathrm{tol}' < \mathrm{tol}\), we assume we build the new polynomial approximant starting from the old one as initial condition, so that we do not waste any computational effort.
To construct the sequence of active subspaces in an adaptive fashion, the algorithm constructs a (finite) downward-closed index set \({\mathcal{I}}\subset {\mathbb{N}}^2\). Given such a set and a function \(r:{\mathbb{N}}\to {\mathbb{N}}\) chosen a priori, we let \(\hat{V}_{k,l}\) be the subspace spanned by the eigenvectors of the empirical covariance of \(\nabla \Delta_l\) from position \(r(k-1)+1\) to \(r(k)\), and \(\hat{U}_{k,l} := \cup_{i=0}^k \hat{V}_{i,l}\). Let further \[\mathcal{A}(k,l) := \begin{cases} \{(k+1,l)\} & \text{if } (k,l) \neq (0,l_{\mathrm{max}}), \\ \{(k+1,l), (0,l_{\mathrm{max}}+1)\} & \text{if } (k,l) = (0,l_{\mathrm{max}}), \end{cases}\] denote the set of admissible multi-indices neighboring \((k,l) \in {\mathcal{I}}\), where \(l_{\mathrm{max}} := \max \{l \,:\, (0,l) \in {\mathcal{I}}\}\). For each multi-index \((k,l) \in {\mathcal{I}}\), we estimate the norm of the projection of \(\nabla \Delta_l\) onto \(\hat{V}_{k,l}\) via \(\sqrt{\sum_{j=r(k-1)+1}^{r(k)} \hat{\sigma}^2_{l,j}}\), which we also use to update an estimate of the \(\mathrm{SVD}\) error at level \(l\). This norm represents the gain that was made by adding \((k,l)\) to \({\mathcal{I}}\). Furthermore, we estimate the \(\mathrm{SVD}\) work that adding this multi-index incurred. The work, which we denote through a function \(\mathrm{Work}: {\mathbb{N}}^2 \to {\mathbb{R}}\), could be estimated directly using a timing function, or it can be based on a work model, e.g., the product of the work per gradient evaluation times the number of evaluations needed. Then, we call \(\texttt{Polyapprox}\) to perform polynomial approximation at level \(l\) on the variables defined by \(\hat{U}_{k,l}\), with tolerance matching the SVD error. Note that in the work associated to \((k,l)\) we add the extra work needed to fit the polynomial (we say extra to allow for a warm start of the polynomial space). Furthermore, we shall choose a function for the number of samples \(M:{\mathbb{N}}\to {\mathbb{R}}\) by, e.g., linking it to the rank function. We suggest, as in [4], to use \(M(l) = C r(l) \log(r(l)+1)\) for some constant \(C>1\). With these ingredients, we can construct \({\mathcal{I}}\) similarly to [6]. We simply start with \({\mathcal{I}}= \{(0,0)\}\); then, at every iteration of our algorithm, we find the index \((k,l) \in {\mathcal{I}}\) which has a non-empty set \({\mathcal{A}}(k,l)\) of admissible neighbors and which maximizes the ratio between the gain and work estimates. Finally, we add those neighbors to the set \({\mathcal{I}}\) and repeat. A sketch of this greedy construction is given below.
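The following sketch illustrates the greedy construction of \({\mathcal{I}}\) just described (our own code; the callables gain and work, as well as the budget-based stopping rule and the cap on the number of levels, are stand-ins for the estimates and termination criteria of the full AMLASPA algorithm):

```python
def adaptive_index_set(gain, work, budget, max_level=10):
    """Greedy construction of the downward-closed set I of multi-indices (k, l).

    gain(k, l) : estimated gain obtained when (k, l) was added to I
    work(k, l) : estimated work incurred by adding (k, l)
    At each iteration, the index in I with the best gain/work ratio (among those
    with admissible neighbours) is expanded, and its neighbours A(k, l) are added.
    """
    I = {(0, 0)}
    spent = work(0, 0)

    def neighbours(k, l):
        l_max = max(ll for (kk, ll) in I if kk == 0)
        nb = [(k + 1, l)]
        if (k, l) == (0, l_max) and l < max_level:
            nb.append((0, l + 1))
        return [idx for idx in nb if idx not in I]

    while spent < budget:
        candidates = [(k, l) for (k, l) in I if neighbours(k, l)]
        if not candidates:
            break
        k, l = max(candidates, key=lambda kl: gain(*kl) / work(*kl))
        for idx in neighbours(k, l):
            I.add(idx)
            spent += work(*idx)
    return I

# toy profit models: gains decay in both indices, work grows with the level
I = adaptive_index_set(gain=lambda k, l: 4.0 ** (-k - l),
                       work=lambda k, l: 2.0 ** l * (k + 1),
                       budget=50.0)
print(sorted(I))
```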
Let us consider the following linear elliptic second order PDE \[\label{nrt-eq:model-prob} \begin{align} -\mathrm{div}(a \nabla u) & = h & & \text{on } D \subset {\mathbb{R}}^n, \\ u & = 0 & & \text{on } \partial D, \end{align}\tag{13}\] where \(a:D \times \Gamma \to {\mathbb{R}}\) is the diffusion coefficient. We assume that the diffusion coefficient \(a\) is a random function defined as \(a = \exp(b)\), where \(b\) is a centered Gaussian process defined on \(D\). In particular, we focus on the case where \(\Gamma = {\mathbb{R}}^d\), \(a = a({\boldsymbol{y}}) = \exp(b({\boldsymbol{y}}))\) and \(b({\boldsymbol{y}})\) is expanded in parametric form as \[\label{nrt-eq:diffusion-aff-exp} b({\boldsymbol{y}}) = \bar{b} + \sum_{j=1}^d y_j \psi_j,\tag{14}\] where the \(y_j\)’s are standard normal random variables and the functions \(\psi_j: D \to {\mathbb{R}}\) are given. Let us remark that here \(d \in {\mathbb{N}}\cup \{\infty\}\), that is, the number of parameters \({\boldsymbol{y}}\) in the model may in theory be infinite. For any given \({\boldsymbol{y}}\in \Gamma\) such that \(b({\boldsymbol{y}})\in L^\infty(D)\), Lax-Milgram theory allows us to define the solution \(u({\boldsymbol{y}})\) in the space \(V := H^1_0(D)\) through the variational formulation \[\int_D a({\boldsymbol{y}}) \nabla u \cdot \nabla v = \int_D h v, \qquad v \in V.\] The solution map is only defined on the set \[\Gamma_0 := \{{\boldsymbol{y}}\in \Gamma \; : \; b({\boldsymbol{y}})\in L^\infty(D) \},\] which is equal to \(\Gamma\) in the case of finitely many variables, that is, when \(\psi _j=0\) for \(j>J\) for some \(J \in {\mathbb{N}}\). However, in both the finitely and countably many variable cases, the solution map is not uniformly bounded.
Theorem 4. [15] Assume that there exists a sequence \((\rho _j)_{j\geq 1}\) of positive numbers such that \[\sup _{x\in D}\sum _{j\geq 1} \rho _j |\psi _j(x)| =K <\infty,\] holds and \(\sum _{j\geq 1} \exp (-\rho _j^2)<\infty\). Then, \(\Gamma_0\) has full measure and \(\mathbb{E}(\exp (k\|b\|_{L^\infty }))<\infty\) for all \(k<\infty\). In turn, the map \({\boldsymbol{y}}\mapsto u({\boldsymbol{y}})\) belongs to the Bochner space \(L^k_\mu(\Gamma;V)\) for all \(k<\infty\).
Introduce now a smooth quantity of interest \(\mathrm{QoI}: V \to {\mathbb{R}}\) and denote by \(\mathrm{QoI}'(u)\) its Fréchet differential at \(u \in V\). The function of interest is then \[f({\boldsymbol{y}}) := \mathrm{QoI}(u({\boldsymbol{y}})).\] Note that it is possible to show that the gradient of \(f\) at \({\boldsymbol{y}}\in \Gamma\) is in \(L^2_\mu(\Gamma;\ell^2(d))\) [15].
We ran our experiments on the model 13 with the expansion 14 , where \(h=0\), \(D = [0,1]^2\), \(\bar{b} = -4.6\), \(\psi_{2j} = \frac{1}{j^\alpha} \cos(j \pi x_1)\), \(\psi_{2j+1} = \frac{1}{j^\alpha} \sin(j \pi x_2)\), and \(\mathrm{QoI}\) is the spatial average of the solution.
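For reference, one realization of the log-normal coefficient on a grid of \(D\) can be generated as follows (our own sketch; the indexing of the sine and cosine modes follows our reading of the definitions above, and no PDE solve is included):

```python
import numpy as np

def lognormal_coefficient(y, psis, x1, x2, b_bar=-4.6):
    """Evaluate a(y) = exp(b_bar + sum_j y_j psi_j(x1, x2)) on a grid of D = [0, 1]^2."""
    b = np.full_like(x1, b_bar, dtype=float)
    for y_j, psi_j in zip(y, psis):
        b += y_j * psi_j(x1, x2)
    return np.exp(b)

# modes psi_{2j} = j^{-alpha} cos(j pi x1) and psi_{2j+1} = j^{-alpha} sin(j pi x2)
alpha, d = 2.0, 100
psis = []
for j in range(1, d // 2 + 1):
    psis.append(lambda x1, x2, j=j: np.cos(j * np.pi * x1) / j ** alpha)
    psis.append(lambda x1, x2, j=j: np.sin(j * np.pi * x2) / j ** alpha)

rng = np.random.default_rng(0)
x1, x2 = np.meshgrid(np.linspace(0, 1, 65), np.linspace(0, 1, 65))
a = lognormal_coefficient(rng.standard_normal(d), psis[:d], x1, x2)
```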
In the first pair of plots in Figure 4, we show a decay in the projection error, indicating that this example provides fertile ground for applying the AS technique. In the right plot, which illustrates the projection error of the gradient of the differences, we see that increasing the level of approximation leads to a decrease in error. However, the rate of decay with respect to the rank remains consistent across levels, meaning that \(\nabla f\) features mixed regularity in physical and parameter space. Interestingly, the rate of decay with respect to the rank deteriorates when transitioning from functions to differences, indicating that the differences feature less regularity than the functions themselves.
In the second pair of plots (Figure 5), the left plot demonstrates the computational complexity of SLAS and MLAS under ideal reconstruction (Algorithms 1 and 2, respectively), estimated with a model where the computational work is proportional to the number of evaluations of \(\nabla \Delta_l\) and \(\Delta_l\) used for the computation of the active subspaces and the polynomial approximations, respectively. The complexity of MLAS aligns with the predictions of Theorem 2, showing it to be smaller than that of SLAS. The right plot presents the complexity of the adaptive polynomial approximation algorithms. Again, the multilevel strategy outperforms the single-level approach, as well as the multilevel polynomial approximation algorithm from [6].


Figure 4: Projection error for \(d=100\) and \(\alpha = 2\) across different levels for \(\nabla f_l\) (left) and \(\nabla \Delta_l\) (right).


Figure 5: Left: complexity of Algorithms 1 and 2. Right: complexities of single-level AS with polynomial approximation (ASLASPA), the AMLASPA algorithm described in Section 4.3, and the adaptive multilevel polynomial approximation algorithm (AMLPA) proposed in [6].
In this work, we introduced the Multilevel Active Subspaces (MLAS) method, a novel framework that combines the dimensionality reduction capabilities of the Active Subspace (AS) approach with the computational efficiency of the multilevel paradigm. The MLAS method is designed for scenarios where approximations \(f_l \approx f\) of a target map \(f\) are available at varying discretization levels \(l\), with the cost of evaluating \(f_l\) and its gradient \(\nabla f_l\) increasing with \(l\). By hierarchically distributing computational resources across discretization levels, MLAS achieves efficiency through a level-dependent reduction in active subspace dimension: at finer levels, where evaluation costs are highest, smaller active subspaces are constructed to minimize the number of gradient evaluations required. We conducted a theoretical analysis of the MLAS algorithm with ideal reconstruction functions in the active subspaces, demonstrating that it reduces computational complexity compared to single-level AS methods, provided that the maps \(f_l\) satisfy certain smoothness and regularity assumptions. We also proposed a practical algorithm, which constructs polynomial approximations of the ideal reconstruction functions on the active subspaces. This approach employs optimally weighted discrete least-squares, yielding a fully implementable algorithm for multilevel function approximation. An adaptive version of this algorithm was also proposed, enabling the simultaneous construction of active subspaces and polynomial approximants at each level without requiring explicit knowledge of model parameters. Numerical experiments validated the effectiveness of MLAS on a linear elliptic PDE with log-normal random diffusion coefficients. These experiments demonstrated the method’s ability to significantly reduce computational costs while maintaining accuracy, outperforming single-level methods. The results confirm the applicability of MLAS to parametric PDEs with sufficiently smooth solution maps, showcasing its potential as a tool for high-dimensional approximation problems.