Cauchy Random Features for Operator Learning in Sobolev Space


Abstract

Operator learning refers to the application of machine learning approaches to approximate operators between infinite-dimensional Banach spaces. While most progress in this area has been driven by variants of deep neural networks such as the Deep Operator Network (DeepONet) and the Fourier Neural Operator (FNO), the theoretical guarantees are often in the form of a universal approximation property. However, the existence theorems do not guarantee that an accurate operator network is obtainable in practice. Motivated by the encoder-decoder framework, we propose a random feature operator learning framework. Moreover, we provide generalization error bounds for our proposed method that depend on the number of random features. Our experiments on several operator learning benchmarks demonstrate that our model obtains similar or better test errors compared with state-of-the-art operator learning models while significantly reducing the computational complexity.

1 Introduction↩︎

Approximating the solutions of partial differential equations (PDEs) with numerical algorithms is a fundamental problem in scientific and engineering applications. Traditional methods include finite difference, finite element, and spectral methods, which require access to the governing equation. In recent years, the use of machine learning for solving PDEs has attracted attention, with several directions for addressing the approximation problems. The physics-informed neural network (PINN) [1] trains a network to fit a single PDE task using a least squares fit on the data and on the PDE via a collocation method. However, any change to the original problem requires retraining the neural network, which is computationally expensive. Operator learning instead trains a network to approximate the solution operator between the input and output function spaces directly. For example, one may approximate the operator mapping a boundary condition to the corresponding PDE solution. The approximation of operators using neural networks was first introduced in [2]. With the development of deep neural networks, recent results include various neural operators such as Deep Operator Nets [3], Fourier Neural Operators [4] and several others [5], [6]. Neural operators have been used in several scientific applications, for example, solving spatio-temporal dynamics [3], [7][9], dynamical systems with control [2], reduced order modeling (ROM) [10], climate predictions [11] and uncertainty quantification [12]. While these neural network based operator learning methods are often used in practice, the theoretical analysis relies on the universal approximation property of deep neural networks, which shows the existence of a network of a requisite size achieving a certain error rate. However, such existence results do not guarantee that the network can be obtained in practice [2], [13], [14]. Some prior results go beyond universal approximation by providing theoretical guarantees for training algorithms and sample complexity, see [15][18] and references therein.

To address the limited theoretical results in operator learning, a kernel or Gaussian Process (GP) based framework was proposed in [19], [20] along with a priori error estimates and convergence guarantees. In several benchmark tests, it was shown that the kernel-based approach matches or outperforms neural operator methods in terms of test accuracy and computational costs. Kernel methods operate on the kernel matrix (Gram matrix) of the data with size \(m\times m\), where \(m\) represents the number of samples in the dataset. This leads to poor scaling with the size of the training dataset, i.e., large training sets incur significant computational and storage costs.

The class of random features is one of the most popular techniques to speed up kernel methods in large-scale problems. Rather than forming and computing solutions using the kernel matrix directly, the random feature method (RFM) [21], [22] maps data into a relatively low-dimensional randomized feature space, which significantly reduces the computation needed for training. Additionally, the random feature model can be viewed as a randomized two-layer neural network whose weights connecting the input layer and the single hidden layer are randomly generated from a known distribution rather than trained. Several approaches utilized RFMs for approximating solutions to PDEs [23][25]. In [23], the authors utilized the random features as a randomized Galerkin method for solving the PDE. In [24] and a subsequent review paper [25], the authors took the operator learning perspective to approximate the solutions of the PDEs. Their analysis relies on the theory of vector-valued reproducing kernels. As a consequence of [26], a high probability non-asymptotic error bound was derived. However, this quantitative result depends on the assumption that the target operator belongs to a reproducing kernel Hilbert space corresponding to an operator-valued kernel, which may not hold in practice. Moreover, the random feature maps were carefully designed and adapted to each problem, and training also requires samples in the frequency domain, which may not be available for the original problem.

In this paper, we propose a novel random features based operator learning method. We use the random Fourier features [21] and randomly generate features from a known probability distribution to approximate the solution operator. We provide the theoretical guarantees and experimental validation of our proposed method. Our contributions are summarized below.

  • Random Feature Method for Operator Learning. We follow the existing operator learning framework in [18], [19], [27]. Rather than using neural networks or kernel methods, we introduce the random features method for the operator learning problem addressing one of the open problems in [19].

  • Error Analysis. We derive an error bound for the proposed approach in Section 4. Our error analysis relies on the generalization error of the random feature model for function approximation (Section 3), which has been of recent interest as well, see [28][38]. We consider the overparametrized setting (the number of random features \(N\) is larger than the number of training samples \(m\)) and the min-norm interpolation problem (or the ridgeless limit as the regularization parameter \(\lambda\to0^+\)). Our analysis depends on the condition number of the random feature matrix, which is similar to the theoretical results in [31]. To better match the expected smoothness of the solutions, we propose Cauchy random features, which lead to Sobolev spaces of mixed smoothness.

  • Improved Numerical Performance. In Section 5, we compare our method with kernel based method and neural operator methods empirically. Numerical experiments on benchmark PDE problems show several advantages of our method: 1) our method is easy to implement, 2) expensive computational resources are not required, 3) the training involves solving several convex optimization problems (in parallel) which significantly reduces the training time, and 4) it is competitive in terms of test-accuracy.

The paper is organized as follows. In Section 2, we introduce the proposed random feature operator learning method. Then, we prove the generalization error bounds of the random feature method for function learning in Section 3. Using the error bounds obtained in Section 3, we derive generalization error bounds for random feature operator learning in Section 4. Lastly, Section 5 presents the numerical results and comparisons.

Notations. We let \(\mathbb{R}\) be the set of all real numbers and \(\mathbb{C}\) be the set of all complex numbers, where \(i=\sqrt{-1}\) denotes the imaginary unit. We define the set \([N]\) to be the set of all natural numbers smaller than or equal to \(N\), i.e., \(\{1,2,\dots,N\}\). We use bold letters to denote vectors or matrices, and denote the identity matrix of size \(n\times n\) by \(\mathbf{I}_n\). For any two vectors \(\mathbf{x},\mathbf{y}\in\mathbb{C}^d\), the inner product is denoted by \(\langle \mathbf{x},\mathbf{y}\rangle = \sum_{j=1}^d x_j\Bar{y}_j\), where \(\mathbf{x}= [x_1,\dots,x_d]^\top\) and \(\mathbf{y}= [y_1,\dots,y_d]^\top\). For a vector \(\mathbf{x}\in\mathbb{C}^d\), we denote by \(\|\mathbf{x}\|_p\) the \(\ell_p\)-norm of \(\mathbf{x}\) and for a matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) the (induced) \(p\)-norm is written as \(\|\mathbf{A}\|_p\). The conjugate transpose of a matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) is denoted by \(\mathbf{A}^*\). We use \(\mathcal{U}\) to denote the input function set with domain \(D_\mathcal{U}\), and \(\mathcal{V}\) to denote the output function set with domain \(D_\mathcal{V}\). The operator mapping functions in \(\mathcal{U}\) to functions in \(\mathcal{V}\) is denoted by \(G\). The \(L^2\) norm of a function over domain \(D_\mathcal{U}\) is defined as \(\|u\|_{L^2(D_\mathcal{U})} = \sqrt{\int_{D_\mathcal{U}} |u(\mathbf{x})|^2 d\mathbf{x}}\). We write \(B_\delta(\mathcal{U})\) for the ball of radius \(\delta>0\) in a normed space \(\mathcal{U}\) equipped with the \(L^2\) norm.

2 Problem Statement↩︎

The goal is to learn an unknown Lipschitz operator \(G:\mathcal{U}\to\mathcal{V}\) between two function sets \(\mathcal{U}\) and \(\mathcal{V}\) from training samples \(\{(u_j, v_j)\}_{j\in[M]} \subset \mathcal{U}\times\mathcal{V}\) such that \(G(u_j)=v_j\) for all \(j\in[M]\). We consider Lipschitz operators in the following sense.

Assumption 1. Let \(\mathcal{U}\) and \(\mathcal{V}\) be vector spaces equipped with \(L^2\) norm. Let \(D_\mathcal{U}\) and \(D_\mathcal{V}\) be the domain of functions in \(\mathcal{U}\) and \(\mathcal{V}\), respectively. We assume that \(G:\mathcal{U}\to\mathcal{V}\) is a Lipschitz operator, i.e. there exists a constant \(L_G>0\) such that \[\|G(u_1) - G(u_2)\|_{L^2(D_\mathcal{V})} \leq L_G \|u_1 - u_2\|_{L^2(D_\mathcal{U})}\] for any \(u_1,u_2\in\mathcal{U}\).

For our problem, we consider the setting where the input/output functions are only partially observed through a finite collection of function values at given collocation points. Denote collocation points in \(D_\mathcal{U}\) and \(D_\mathcal{V}\) by \(\{\mathbf{x}_j\}_{j\in[n]} \subset D_\mathcal{U}\) and by \(\{\mathbf{y}_j\}_{j\in[m]} \subset D_\mathcal{V}\), respectively. Then, we define sampling operators \(S_\mathcal{U}:\mathcal{U}\to\mathbb{R}^n\) and \(S_\mathcal{V}:\mathcal{V}\to\mathbb{R}^m\) as \[S_\mathcal{U}(u) = [u(\mathbf{x}_1),\dots,u(\mathbf{x}_n)]^\top\in\mathbb{R}^n \quad\text{and}\quad S_\mathcal{V}(v) = [v(\mathbf{y}_1),\dots,v(\mathbf{y}_m)]^\top\in\mathbb{R}^m.\] These sampling operators are clearly linear, and we further assume throughout this paper that \(S_\mathcal{U}\) and \(S_\mathcal{V}\) are bounded. We could also generalize to the situation where functions are accessible through arbitrary linear measurements. For the sake of simplicity, we will focus on point evaluations in this paper. We formalize the problem statement as follows:

Problem 1. We aim to learn unknown Lipschitz operator \(G:\mathcal{U}\to\mathcal{V}\) from training data \(\left\{( S_\mathcal{U}(u_j), S_\mathcal{V}(v_j))\right\}_{j\in[M]}\), where \(S_\mathcal{U}:\mathcal{U}\to\mathbb{R}^n\) and \(S_\mathcal{V}:\mathcal{V}\to\mathbb{R}^m\) are bounded linear sampling operators and functions \(u_j\in\mathcal{U}, v_j\in\mathcal{V}\) satisfy \(G(u_j)=v_j\) for all \(j\in[M]\).

2.1 Operator learning framework↩︎

Our problem statement gives rise to a diagram for operator learning, which is depicted in Figure 1. Here, the map \(f:\mathbb{R}^n\to\mathbb{R}^m\) is defined explicitly as \[\label{operator95f} f = S_\mathcal{V}\circ G\circ R_\mathcal{U},\tag{1}\] where \(R_\mathcal{U}:\mathbb{R}^n\to\mathcal{U}\) is a recovery map. Notice that the mapping \(f:\mathbb{R}^n\to\mathbb{R}^m\) is between finite-dimensional spaces, which is more amenable to numerical approximation. We make some assumptions on \(f\) throughout this paper.

Assumption 2. We assume that the map \(f:\mathbb{R}^n\to\mathbb{R}^m\) takes the form \(f(\mathbf{u}) = [f_1(\mathbf{u}),\dots, f_m(\mathbf{u})]^\top\) for any \(\mathbf{u}\in\mathbb{R}^n\), and hence each component \(f_j:\mathbb{R}^n\to\mathbb{R}\) can be approximated separately. Moreover, we assume that each component \(f_j:\mathbb{R}^n\to\mathbb{R}\) of \(f\) is Lipschitz in the sense that there exists \(L_j>0\) such that \(|f_j(\mathbf{u}_1) - f_j(\mathbf{u}_2)| \leq L_j\|\mathbf{u}_1 - \mathbf{u}_2\|_2\) for any \(\mathbf{u}_1,\mathbf{u}_2\in\mathbb{R}^n\).

After constructing the recovery map \(R_\mathcal{V}:\mathbb{R}^m\to\mathcal{V}\) and the approximation \(\hat{f}:\mathbb{R}^n\to\mathbb{R}^m\), we define the operator \(\Hat{G}\) approximating the target operator \(G\) as \[\Hat{G} = R_\mathcal{V}\circ \Hat{f} \circ S_\mathcal{U}.\] In [19], the authors proposed the same operator learning framework, where they used the optimal recovery maps in reproducing kernel Hilbert spaces (RKHS) for \(R_\mathcal{V}\), \(R_\mathcal{U}\) and \(\hat{f}\). Precisely, the map \(\hat{f}:\mathbb{R}^n\to\mathbb{R}^m\), which is viewed as an approximation of \(f\), is the optimal recovery map in a vector-valued RKHS. The optimal recovery map \(R_\mathcal{V}:\mathbb{R}^m\to\mathcal{V}\) takes function values as inputs and returns a function in \(\mathcal{V}\). As the theory of optimal recovery suggests, the optimal recovery map returns the kernel interpolation [39]. Other function approximation/reconstruction techniques can be applied as well. For example, in [40], the authors used neural networks to approximate \(f:\mathbb{R}^n\to\mathbb{R}^m\).

Figure 1: Commutative diagram of the operator learning setup. This diagram can be generalized to the mesh invariant setting where the measurement functionals are different at test time. We refer readers to [19] for more details.

2.2 Random Feature Operator Learning↩︎

The key idea of the random feature method is to draw \(N\) random frequencies \(\boldsymbol{\omega}_k\in\mathbb{R}^d\) from a density \(\rho(\boldsymbol{\omega})\), and then to construct an approximation as \[\label{RF95model} f(\cdot,\mathbf{c}^\sharp) = \sum_{k=1}^N c_k^\sharp \exp \left(i\langle \boldsymbol{\omega}_k,\cdot\rangle\right).\tag{2}\] The random feature model can be viewed as a two-layer neural network (with a single hidden layer) in which the weights of the first layer are drawn from a known distribution and only the weights of the second layer are trainable parameters. Instead of using the exponential function as the activation, one could also use the rectified linear unit (ReLU) activation [41] or trigonometric functions [30]. Originally, the random feature method was proposed to speed up large-scale kernel machines, see [21], [42] for details.

We define the recovery map \(R_\mathcal{U}:\mathbf{u}\mapsto R_\mathcal{U}(\mathbf{u}):=f(\cdot, \mathbf{c}^\sharp(\mathbf{u}))\), where \[\label{opt95Ru} \mathbf{c}^\sharp(\mathbf{u}) \in \mathop{\mathrm{argmin}}_{\mathbf{c}\in\mathbb{R}^N} \|\mathbf{c}\|_2 \qquad \text{subject to} \quad S_\mathcal{U}(f(\cdot,\mathbf{c})) = \mathbf{u}.\tag{3}\] The coefficient vector \(\mathbf{c}^\sharp\in\mathbb{R}^N\) is thus trained by solving a min-norm interpolation problem. Recall that \(S_\mathcal{U}:\mathcal{U}\to\mathbb{R}^n\) is the point evaluation operator; the constraint then reads \(R_\mathcal{U}(\mathbf{u})(\mathbf{x}_i) = u_i\) for \(i\in[n]\). Therefore, we can rewrite the min-norm interpolation problem, and the optimal coefficient vector \(\mathbf{c}^\sharp\in\mathbb{R}^N\) is obtained by solving \[\mathbf{c}^\sharp(\mathbf{u}) \in \mathop{\mathrm{argmin}}_{\mathbf{A}_x \mathbf{c}=\mathbf{u}} \|\mathbf{c}\|_2,\] where the matrix \(\mathbf{A}_x\) is defined (component-wise) by \((\mathbf{A}_x)_{j,k} = \exp(i\langle \boldsymbol{\omega}_k, \mathbf{x}_j\rangle)\). The solution is given by \(\mathbf{c}^\sharp = \mathbf{A}_x^*(\mathbf{A}_x\mathbf{A}_x^*)^{-1}\mathbf{u}= \mathbf{A}_x^\dagger\mathbf{u}\), where \(\mathbf{A}_x^\dagger\) is the pseudo-inverse of matrix \(\mathbf{A}_x\). We will prove that the matrix \(\mathbf{A}_x\mathbf{A}_x^*\) is invertible with high probability in Section 3, and hence the pseudo-inverse \(\mathbf{A}_x^\dagger\) is well-defined. Following the same arguments, we define the recovery map \(R_\mathcal{V}:\mathbb{R}^m\to\mathcal{V}\). The optimal coefficient vector \(\mathbf{c}^\sharp\in\mathbb{R}^N\) is trained by solving the following min-norm interpolation problem \[\mathbf{c}^\sharp \in \mathop{\mathrm{argmin}}_{\mathbf{A}_y \mathbf{c}=\mathbf{v}} \|\mathbf{c}\|_2,\] where the matrix \(\mathbf{A}_y\) is defined (component-wise) by \((\mathbf{A}_y)_{j,k} = \exp(i\langle \boldsymbol{\omega}_k,\mathbf{y}_j\rangle)\). The optimal coefficient vector is computed by \(\mathbf{c}^\sharp = \mathbf{A}_y^*(\mathbf{A}_y\mathbf{A}_y^*)^{-1}\mathbf{v}= \mathbf{A}_y^\dagger \mathbf{v}\), where \(\mathbf{A}_y^\dagger\) is the pseudo-inverse of matrix \(\mathbf{A}_y\).
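As a concrete illustration, the following minimal Python sketch builds the recovery map \(R_\mathcal{U}\) from Cauchy random features by computing the min-norm coefficients \(\mathbf{c}^\sharp = \mathbf{A}_x^\dagger\mathbf{u}\) with a pseudo-inverse. It only assumes numpy; the helper names are ours and not taken from any released code.

```python
import numpy as np

def sample_cauchy_features(N, d, gamma, rng):
    # i.i.d. draws from the tensor-product Cauchy distribution with scaling parameter gamma
    return gamma * rng.standard_cauchy(size=(N, d))

def feature_matrix(points, omegas):
    # (A)_{j,k} = exp(i <omega_k, x_j>), complex-valued
    return np.exp(1j * points @ omegas.T)

def recovery_map(points, omegas, u_vals):
    """Return R_U(u): the min-norm random feature interpolant of the values u_vals at `points`."""
    A = feature_matrix(points, omegas)      # n x N feature matrix A_x
    c = np.linalg.pinv(A) @ u_vals          # c^sharp = A_x^dagger u (min-norm solution)
    # the interpolant x -> sum_k c_k exp(i <omega_k, x>); we return its real part for real-valued data
    return lambda x: np.real(np.exp(1j * np.atleast_2d(x) @ omegas.T) @ c)
```

For instance, `recovery_map(X, omegas, S_U_u)` returns a callable that can be evaluated at arbitrary points of \(D_\mathcal{U}\).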

Suppose that Assumption 2 holds. We propose to approximate each \(f_j\) separately using a random feature model, and hence the vector-valued random feature map \(\Hat{f}:\mathbb{R}^n\to\mathbb{R}^m\) is defined as \[\label{operator95rf} \Hat{f}(\mathbf{u}) = \left[ \Hat{f}_1(\mathbf{u}), \dots, \Hat{f}_m(\mathbf{u})\right]^\top\in\mathbb{R}^m.\tag{4}\] Each \(\Hat{f}_j:\mathbb{R}^n\to\mathbb{R}\) is a random feature approximation of \(f_j:\mathbb{R}^n\to\mathbb{R}\) and takes the form \[\label{hat95f} \Hat{f}_j(\mathbf{u})= \sum_{k=1}^{N} c_k^{(j)} \exp(i\langle \boldsymbol{\omega}_k,\mathbf{u}\rangle).\tag{5}\]

To compute the coefficient vectors \(\mathbf{c}^{(j)}\in\mathbb{R}^N\) for all \(j\in[m]\), we solve the following min-norm interpolation problem \[\min_{\mathbf{c}^{(1)},\dots,\mathbf{c}^{(m)}\in\mathbb{R}^N} \sum_{j=1}^{m}\|\mathbf{c}^{(j)}\|_2 \qquad \text{s.t.} \quad \Hat{f}(\mathbf{u}_\ell) = \mathbf{v}_\ell \quad \text{for all } \ell\in[M],\] where \(\mathbf{u}_\ell\in\mathbb{R}^n\) and \(\mathbf{v}_\ell\in\mathbb{R}^m\) contain the function values at collocation points of the \(\ell\)-th data pair \((u_\ell,v_\ell)\in \mathcal{U}\times \mathcal{V}\), i.e., \[\mathbf{u}_\ell = S_\mathcal{U}(u_\ell) = [u_\ell(\mathbf{x}_1), \dots, u_\ell(\mathbf{x}_n)] \in\mathbb{R}^n \quad\text{and}\quad \mathbf{v}_\ell = S_\mathcal{V}(v_\ell) = [v_\ell(\mathbf{y}_1), \dots, v_\ell(\mathbf{y}_m)] \in\mathbb{R}^m.\] Due to Assumption 2, the training can be carried out for all components of \(f\) simultaneously. Therefore, we consider a sequence of parallel min-norm interpolation problems \[\label{parallel95c} \min_{\mathbf{c}^{(j)}\in\mathbb{R}^N} \|\mathbf{c}^{(j)}\|_2 \quad \text{s.t.} \quad \mathbf{A}\mathbf{c}^{(j)} = \mathbf{v}^{(j)},\tag{6}\] where the matrix \(\mathbf{A}\in\mathbb{C}^{M\times N}\) is defined (component-wise) by \(\mathbf{A}_{\ell,k}=\exp(i\langle \boldsymbol{\omega}_k,\mathbf{u}_\ell\rangle)\) and \(\mathbf{v}^{(j)} = [v_1(\mathbf{y}_j),\dots,v_M(\mathbf{y}_j)]\in\mathbb{R}^M\). The solution to the min-norm interpolation problem is given by \(\mathbf{c}^{(j)} = \mathbf{A}^*(\mathbf{A}\mathbf{A}^*)^{-1}\mathbf{v}^{(j)} = \mathbf{A}^\dagger \mathbf{v}^{(j)}\) for each \(j\in[m]\), where \(\mathbf{A}^\dagger\) is the pseudo-inverse of \(\mathbf{A}\).
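Since the parallel problems in 6 share the same feature matrix \(\mathbf{A}\), all coefficient vectors can be computed with a single pseudo-inverse. A hedged numpy sketch with illustrative names (not the authors' released implementation) is given below.

```python
import numpy as np

def train_operator_rf(U_train, V_train, omegas):
    """Solve the parallel min-norm problems (6): column j of C is c^(j) = A^dagger v^(j).

    U_train: (M, n) array with rows S_U(u_ell); V_train: (M, m) array with rows S_V(v_ell);
    omegas:  (N, n) Cauchy random features acting on the input vectors in R^n.
    """
    A = np.exp(1j * U_train @ omegas.T)     # (M, N) with A_{l,k} = exp(i <omega_k, u_l>)
    return np.linalg.pinv(A) @ V_train      # (N, m) coefficient matrix

def apply_rf_map(U_new, omegas, C):
    # evaluate the vector-valued random feature map \hat f at the rows of U_new
    return np.real(np.exp(1j * U_new @ omegas.T) @ C)
```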

Figure 2: Random Feature Operator Learning - Training
Figure 3: Random Feature Operator Learning - Inference
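To make the training and inference procedures of Figures 2 and 3 concrete, the following self-contained toy example runs the full pipeline on the antiderivative operator \(G(u)(x) = \int_0^x u(t)\,dt\). The operator, the sine-type input functions, and all parameter values are our own illustrative choices (not a benchmark from this paper), and the scales are not tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 64; N = 256; M = 200; gamma = 0.1

x = np.linspace(0.0, 1.0, n)                          # shared collocation points in [0, 1]
freqs = rng.uniform(0.5, 5.0, size=(M, 1))
U = np.sin(2 * np.pi * freqs * x)                     # rows: S_U(u_l) with u_l(x) = sin(2*pi*f_l*x)
V = (1 - np.cos(2 * np.pi * freqs * x)) / (2 * np.pi * freqs)   # rows: S_V(G(u_l))

# training (Figure 2): Cauchy random features on R^n and parallel min-norm solutions
omega = gamma * rng.standard_cauchy((N, n))
A = np.exp(1j * U @ omega.T)                          # (M, N) training feature matrix
C = np.linalg.pinv(A) @ V                             # (N, m) coefficients

# inference (Figure 3): apply \hat f to S_U(u) for a new input function
f_new = 3.3
u_new = np.sin(2 * np.pi * f_new * x)
v_pred = np.real(np.exp(1j * u_new @ omega.T) @ C)
v_true = (1 - np.cos(2 * np.pi * f_new * x)) / (2 * np.pi * f_new)
print("relative error:", np.linalg.norm(v_pred - v_true) / np.linalg.norm(v_true))
```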

3 Random Feature Generalization Error↩︎

In this section, we provide bounds on the generalization error of approximating a function using a random feature model.

3.1 Set-up and Notations↩︎

For a probability distribution \(\rho\) associated with the random features \(\boldsymbol{\omega}\) in \(\mathbb{R}^d\), we define the function space \[\label{target95space} \mathcal{F}(\rho) := \left\{ f(\mathbf{x}) = \int_{\mathbb{R}^d} \hat{f}(\boldsymbol{\omega})\exp(i\langle \boldsymbol{\omega}, \mathbf{x}\rangle)d\boldsymbol{\omega}: \|f\|^2_\rho = \mathbb{E}_{\boldsymbol{\omega}} \left|\frac{\hat{f}(\boldsymbol{\omega})}{\rho(\boldsymbol{\omega})} \right|^2 <\infty \right\},\tag{7}\] where \(\hat{f}(\boldsymbol{\omega})\) is the Fourier transform of function \(f\). Let \(D\subset\mathbb{R}^d\) be a compact domain with finite volume, i.e., \(\mathop{\mathrm{vol}}(D)<\infty\). A function \(f:D\to\mathbb{C}\) has finite \(\rho\)-norm if it belongs to \(\mathcal{F}(\rho)\). By Proposition 4.1 in [38], the function space \(\mathcal{F}(\rho)\) is a reproducing kernel Hilbert space (RKHS) with associated kernel function \[\label{kernel} k(\mathbf{x},\mathbf{y}) = \int_{\mathbb{R}^d} \exp(i\langle \boldsymbol{\omega}, \mathbf{x}\rangle)\exp(-i\langle \boldsymbol{\omega}, \mathbf{y}\rangle)d\rho(\boldsymbol{\omega}).\tag{8}\] We consider Cauchy random features, which are drawn from the tensor-product Cauchy distribution with scaling parameter \(\gamma>0\), whose density function is \[\rho(\boldsymbol{\omega}) := \prod_{j=1}^d \frac{1}{\pi\gamma(1+\omega_j^2/\gamma^2) }.\] Here, \(\omega_j\) is the \(j\)-th entry of \(\boldsymbol{\omega}\in\mathbb{R}^d\). Using the tensor-product Cauchy distribution, the kernel function defined in 8 is indeed the Laplace kernel, i.e., \(k(\mathbf{x},\mathbf{y}) = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|_1)\). Hence, the function space \(\mathcal{F}(\rho)\) defined in 7 is the reproducing kernel Hilbert space corresponding to the Laplace kernel. The function space \(\mathcal{F}(\rho)\) also relates to the well-known Sobolev space of mixed smoothness, which is an appropriate function space for studying PDEs [43]. Precisely, for \(p>0\), we define the Sobolev space of mixed smoothness [44] by \[H^p_{\text{mix}}(\mathbb{R}^d):= \left\{f:\mathbb{R}^d\rightarrow \mathbb{C}\,\, \Big\rvert \,\, \|f\|_{H^p_{\text{mix}}(\mathbb{R}^d)}< \infty \right\}\] where the associated norm is defined by: \[\|f\|^2_{H^p_{\text{mix}}(\mathbb{R}^d)}:=\int_{\mathbb{R}^d} \, |\hat{f}(\boldsymbol{\omega})|^2 \prod_{i=1}^d \left(1+|\omega_i|^2\right)^p d\boldsymbol{\omega}.\] If the scaling parameter of the tensor-product Cauchy distribution is \(\gamma=1\), then \(\|f\|^2_\rho = \pi^d\,\|f\|^2_{H^1_{\text{mix}}(\mathbb{R}^d)}\), so the two norms are equivalent. For arbitrary \(\gamma>0\), one can show that \[\|f\|_\rho^2 \leq \|f\|^2_{H^1_{\text{mix}}(\mathbb{R}^d)} \max\left(\frac{\pi}{\gamma}, \pi\gamma\right)^d.\]
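The identification of the kernel 8 with the Laplace kernel can be checked numerically: a Monte Carlo average of \(\exp(i\langle\boldsymbol{\omega},\mathbf{x}-\mathbf{y}\rangle)\) over tensor-product Cauchy draws should be close to \(\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|_1)\). The short sketch below (our own illustration, assuming numpy) performs this check.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, gamma = 3, 200_000, 0.7

omega = gamma * rng.standard_cauchy((N, d))            # tensor-product Cauchy with scale gamma
x, y = rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)

mc_kernel = np.mean(np.exp(1j * omega @ (x - y)))      # Monte Carlo estimate of the kernel (8)
laplace_kernel = np.exp(-gamma * np.linalg.norm(x - y, 1))
print(mc_kernel.real, laplace_kernel)                  # the two values should nearly agree
```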

We consider the regression problem of finding an approximation of \(f\in\mathcal{F}(\rho)\). Given a set of random weights \(\{ \boldsymbol{\omega}_k\}_{k\in[N]}\) which are i.i.d. samples from the tensor-product Cauchy distribution \(\rho(\boldsymbol{\omega})\), we train a random feature model \[\label{rf95train} f^\sharp(\mathbf{x}) = \sum_{k=1}^N c_k^\sharp \exp(i\langle \boldsymbol{\omega}_k,\mathbf{x}\rangle)\tag{9}\] using samples \((\mathbf{x}_j, f(\mathbf{x}_j))\) for \(j\in[m]\). Let \(\mathbf{A}\in\mathbb{C}^{m\times N}\) be the random feature matrix with \(\mathbf{A}_{j,k}=\exp(i\langle \boldsymbol{\omega}_k, \mathbf{x}_j\rangle)\) for \(j\in[m]\) and \(k\in[N]\). The coefficient vector \(\mathbf{c}^\sharp\in\mathbb{R}^N\) is trained by solving the min-norm interpolation problem: \[\mathbf{c}^\sharp \in \mathop{\mathrm{argmin}}_{\mathbf{A}\mathbf{c}= \mathbf{y}} \|\mathbf{c}\|_2,\] where \(\mathbf{y}= [f(\mathbf{x}_1), \dots,f(\mathbf{x}_m)]\in\mathbb{R}^m\). The main goal of this section is to bound the generalization error \(\|f-f^\sharp\|_{L^2(D)}\), which is defined as \[\|f-f^\sharp\|_{L^2(D)} = \sqrt{ \int_{D} |f(\mathbf{x})-f^\sharp(\mathbf{x})|^2 d\mathbf{x}}.\] We decompose the generalization error into two parts \[\|f-f^\sharp\|_{L^2(D)} \leq \underbrace{ \| f - f^* \|_{L^2(D)} }_{ \text{Approximation Error} } + \underbrace{ \|f^* - f^\sharp\|_{L^2(D)} }_{ \text{Estimation Error} }\] by the triangle inequality. The approximation \(f^\sharp\) is the random feature model defined in 9 and \(f^*\) is the best random feature model that will be defined later. We first notice that we can write the target function \(f(\mathbf{x})\) as \[\label{target95f} f(\mathbf{x}) = \int_{\mathbb{R}^d} \hat{f}(\boldsymbol{\omega})\exp(i\langle \boldsymbol{\omega}, \mathbf{x}\rangle)d\boldsymbol{\omega}= \mathbb{E}_{\boldsymbol{\omega}\sim\rho} \left[ \frac{\hat{f}(\boldsymbol{\omega})}{\rho(\boldsymbol{\omega})}\exp(i\langle \boldsymbol{\omega}, \mathbf{x}\rangle) \right].\tag{10}\] To simplify the notation, we let \(\alpha(\boldsymbol{\omega}) = \frac{\hat{f}(\boldsymbol{\omega})}{\rho(\boldsymbol{\omega})}\) (defined for \(\boldsymbol{\omega}\in \text{supp}(\rho)\) and set to zero otherwise) and then define \(\alpha_{\leq T}(\boldsymbol{\omega}) = \alpha(\boldsymbol{\omega})\mathbb{1}_{\left|\alpha(\boldsymbol{\omega})\right|\leq T}\). Therefore, we write \(\alpha_{>T}(\boldsymbol{\omega}) = \alpha(\boldsymbol{\omega}) - \alpha_{\leq T}(\boldsymbol{\omega})\), i.e., \[\alpha_{\leq T}(\boldsymbol{\omega}) = \begin{cases} \alpha(\boldsymbol{\omega}) \quad &\text{if } \left|\alpha(\boldsymbol{\omega}) \right| \leq T \\ 0 \quad &\text{otherwise}\end{cases} \quad\text{and}\quad \alpha_{> T}(\boldsymbol{\omega}) = \begin{cases} \alpha(\boldsymbol{\omega}) \quad &\text{if } \left|\alpha(\boldsymbol{\omega}) \right| > T \\ 0 \quad &\text{otherwise} . \end{cases}\] We define \(f^*(\mathbf{x})\) as \[\label{rf95best} f^*(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^N \alpha_{\leq T}(\boldsymbol{\omega}_k) \exp(i\langle \boldsymbol{\omega}_k,\mathbf{x}\rangle)\tag{11}\] where \(\boldsymbol{\omega}_k\) are i.i.d. samples from the distribution \(\rho(\boldsymbol{\omega})\). Note that the expectation is given by: \[\mathbb{E}f^*(\mathbf{x}) = \mathbb{E}_{\boldsymbol{\omega}} \left[ \alpha_{\leq T}(\boldsymbol{\omega})\exp(i\langle \boldsymbol{\omega},\mathbf{x}\rangle) \right].\]
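For concreteness, the regression setup above can be exercised on a simple one-dimensional example: draw Cauchy features, form the random feature matrix, solve the min-norm problem, and estimate \(\|f-f^\sharp\|_{L^2(D)}\) on a fine grid. The target function and the parameter values below are illustrative choices of ours and need not satisfy the conditions of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, gamma = 40, 400, 2.0

f = lambda x: np.exp(-np.abs(x)) * np.cos(3 * x)       # illustrative target on D = [-1, 1]

x_train = np.linspace(-1.0, 1.0, m)[:, None]           # fixed (non-random) collocation points
omega = gamma * rng.standard_cauchy((N, 1))
A = np.exp(1j * x_train @ omega.T)                     # (m, N) random feature matrix
c = np.linalg.pinv(A) @ f(x_train[:, 0])               # min-norm interpolation coefficients

x_test = np.linspace(-1.0, 1.0, 2000)[:, None]
f_sharp = np.real(np.exp(1j * x_test @ omega.T) @ c)
l2_err = np.sqrt(np.mean((f(x_test[:, 0]) - f_sharp) ** 2) * 2.0)   # vol(D) = 2
print("approximate generalization error:", l2_err)
```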

3.2 Main Results↩︎

Before presenting the main results of this section, we first introduce the fill-in distance of the training samples and state our assumptions on the samples and random features. Specifically, the samples are well-separated and the random features are generated from the tensor-product Cauchy distribution. These assumptions are natural for PDE problems, where Sobolev spaces are the appropriate setting and the collocation points are usually non-random.

Definition 1 (Fill-in Distance). Let \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\) be a collection of points in \(D\). We define the fill-in distance of \(X\) as \[h_X = \max_{\mathbf{x}\in D} \min_{\mathbf{x}'\in X} \|\mathbf{x}-\mathbf{x}'\|_2.\]
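In practice the fill-in distance can only be estimated, since the maximum runs over the continuous domain \(D\); a dense probe grid yields an approximation. A small sketch is shown below (our own helper, assuming numpy and scipy are available).

```python
import numpy as np
from scipy.spatial import cKDTree

def fill_in_distance(X, probe):
    """Approximate h_X = max_{x in D} min_{x' in X} ||x - x'||_2 using probe points covering D."""
    dists, _ = cKDTree(X).query(probe)   # nearest-neighbor distance from each probe point to X
    return dists.max()

# example: m = 100 random points in D = [0, 1]^2, probed on a 200 x 200 grid
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(100, 2))
g = np.linspace(0.0, 1.0, 200)
probe = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
print("estimated fill-in distance:", fill_in_distance(X, probe))
```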

Assumption 3. We assume that \(f^*-f^\sharp\) is \(L_f\)-Lipschitz continuous in the sense that there exists \(L_f>0\) such that \[\left| \left(f^*(\mathbf{x}_1)-f^\sharp(\mathbf{x}_1)\right) - \left(f^*(\mathbf{x}_2)-f^\sharp(\mathbf{x}_2)\right) \right| \leq L_f \|\mathbf{x}_1 - \mathbf{x}_2\|_2\] for any \(\mathbf{x}_1, \mathbf{x}_2\in D\).

Assumption 4. The random feature matrix \(\mathbf{A}\) is defined (component-wise) by \(\mathbf{A}_{j,k} = \exp(i\langle \boldsymbol{\omega}_k, \mathbf{x}_j\rangle)\), the feature weights \(\{ \boldsymbol{\omega}_k\}_{k\in[N]} \subset\mathbb{R}^d\) are sampled from the tensor-product Cauchy distribution with scaling parameter \(\gamma>0\), and for the data points \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\) there is a constant \(K>0\) such that \(\|\mathbf{x}_j-\mathbf{x}_{j'}\|_2\geq K\) for all \(j,j'\in[m]\) with \(j\neq j'\).

We first state the main result of Section 3, which provides a generalization error bound for the random feature model.

Theorem 1 (Generalization Error Bound). Consider a compact domain \(D\subset\mathbb{R}^d\). Assume the data \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\), the random features \(\{ \boldsymbol{\omega}_k\}_{k\in[N]} \subset\mathbb{R}^d\), and the random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) satisfy Assumption 4. If the following conditions hold: \[\begin{align} & N \geq C\eta^{-2}m\log\left(\frac{2m}{\delta}\right) \\ & \gamma \geq \frac{1}{K}\log\left(\frac{m}{\eta}\right) \end{align}\] for some \(\delta,\eta\in(0,1)\), where \(C>0\) is a universal constant, then with probability at least \(1-5\delta\), we have \[\|f - f^\sharp\|^2_{L^2(D)} \leq 2\mathop{\mathrm{vol}}(D) I_1 + 4L_f^2h_X^2\mathop{\mathrm{vol}}(D) + 2Ch_X^d(1+2\eta)\left( \frac{2m}{1-2\eta}I_1 + 2I_2 \right)\] where \[\begin{align} &I_1 := \frac{2\|f\|_\rho^2}{N} + \frac{32\|f\|_\rho^2\log^2(2/\delta)}{N} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N} \leq \frac{96\|f\|_\rho^2\log^2(2/\delta)}{N}, \\ &I_2 := \|f\|_\rho^2 + 4\|f\|_\rho^2\log(2/\delta) + \sqrt{2\log(2/\delta)}\|f\|_\rho^2 \leq 12\log(2/\delta)\|f\|_\rho^2. \end{align}\]

Remark 1. In summary, we have \[\|f - f^\sharp\|^2_{L^2(D)} \leq \mathcal{O}\left(\frac{1}{N}+h_X^2 + h_X^d\right)\|f\|^2_\rho.\] Compared with the generalization error bound in [31], a key difference is that we do not consider random samples \(\{\mathbf{x}_j\}_{j\in[m]}\) since our problem is motivated by solving PDEs, where the collocation points are fixed. This results in the appearance of the fill-in distance \(h_X\) in the generalization error bound.

Remark 2. There is a direct connection between the fill-in distance \(h_X\) and the cardinality of \(X\) when the points are quasi-uniformly distributed; in that case \(h_X \asymp m^{-1/d}\).

Remark 3. For equidistant points \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\), the pairwise separation constant in Assumption 4 scales as \(K \asymp m^{-1/d}\).

3.3 Approximation Error Bound↩︎

Theorem 2 provides an upper bound of the approximation error, which is the error between target function \(f\) and best random feature approximation \(f^*\).

Theorem 2 (Bound on Approximation Error). Let \(f\) and \(f^*\) be defined in 10 and 11 , respectively. Then for all \(T>0\) and \(\mathbf{x}\in D\), we have \[|f(\mathbf{x}) - f^*(\mathbf{x})|^2 \leq \frac{2\|f\|_\rho^4}{T^2} + \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N}\] with probability at least \(1-\delta\). Furthermore, with probability at least \(1-\delta\), \[\label{first95term} \| f - f^* \|_{L^2(D)}^2 \leq \left( \frac{2\|f\|_\rho^4}{T^2} + \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N} \right) \mathop{\mathrm{vol}}(D).\tag{12}\]

The proof of Theorem 2 is based on the following two lemmas.

Lemma 1. Let \(f\) and \(f^*\) be defined as 10 and 11 , respectively. Then for all \(T>0\), we have \[\left| f(\mathbf{x}) - \mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x}) \right|^2 \leq \frac{\left(\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^2]\right)^2}{T^2} = \frac{\|f\|_{\rho}^4}{T^2}.\]

Proof. The equality is easy to check by the definitions of \(\alpha(\boldsymbol{\omega})\) and \(\|f\|_\rho\). The inequality follows from \[\begin{align} \left| f(\mathbf{x}) - \mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x}) \right|^2 =& \left| \mathbb{E}_{\boldsymbol{\omega}} \left[ \alpha_{>T}(\boldsymbol{\omega})\exp(i\langle \boldsymbol{\omega},\mathbf{x}\rangle) \right] \right|^2 \\ \leq& \mathbb{E}_{\boldsymbol{\omega}} \left[|\alpha(\boldsymbol{\omega})|^2\right] \, \mathbb{E}_{\boldsymbol{\omega}} \left[ \mathbb{1}_{\left|\alpha(\boldsymbol{\omega})\right|> T}\left|\exp(i\langle \boldsymbol{\omega},\mathbf{x}\rangle)\right|^2\right] \\ =& \mathbb{E}_{\boldsymbol{\omega}} \left[|\alpha(\boldsymbol{\omega})|^2\right] \, \mathbb{P}(\alpha(\boldsymbol{\omega})^2>T^2) \\ \leq& \frac{\left(\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^2]\right)^2}{T^2}, \end{align}\] where we use the Cauchy-Schwarz inequality in the second line and Markov’s inequality in the last line. ◻

Lemma 2. Let \(f^*\) be defined as 11 . For all \(T>0\), the following inequality holds with probability at least \(1-\delta\), \[\left|f^*(\mathbf{x}) - \mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x}) \right|^2 \leq \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N}.\]

Proof. For each \(\mathbf{x}\in D\), we define random variable \(Z(\boldsymbol{\omega}) = \alpha_{\leq T}(\boldsymbol{\omega})\exp(i\langle \boldsymbol{\omega},\mathbf{x}\rangle)\) and let \(Z_1, \dots, Z_N\) be \(N\) i.i.d copies of \(Z\) defined by \(Z_k = Z(\boldsymbol{\omega}_k)\) for each \(k\in[N]\). By boundedness of \(\alpha_{\leq T}(\boldsymbol{\omega})\), we have an upper bound \(|Z_k| \leq T\) for any \(k\in[N]\). The variance of \(Z\) is bounded above as \[\sigma^2 := \mathbb{E}_{\boldsymbol{\omega}} |Z-\mathbb{E}_{\boldsymbol{\omega}} Z|^2 \leq \mathbb{E}_{\boldsymbol{\omega}} |Z|^2 \leq \mathbb{E}_{\boldsymbol{\omega}} [\alpha(\boldsymbol{\omega})^2] = \|f\|^2_\rho.\] By Lemma A.2 and Theorem A.1 in [26], it holds that, with probability at least \(1-\delta\), \[\left|f^*(\mathbf{x}) - \mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x}) \right| = \left|\frac{1}{N}\sum_{k=1}^N Z_k - \mathbb{E}_{\boldsymbol{\omega}} Z \right| \leq \frac{4T\log(2/\delta)}{N} + \sqrt{\frac{2\|f\|^2_\rho\log(2/\delta)}{N}}.\] Taking the square for both sides and using the inequality \((a+b)^2\leq 2a^2+2b^2\) give the desired result. ◻

Proof of Theorem 2. For each \(\mathbf{x}\in D\), we decompose \(|f(\mathbf{x}) - f^*(\mathbf{x})|^2\) into two parts as \[|f(\mathbf{x}) - f^*(\mathbf{x})|^2 \leq 2 \underbrace{|f(\mathbf{x}) - \mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x})|^2}_{\mathrm{I}} + 2 \underbrace{|\mathbb{E}_{\boldsymbol{\omega}} f^*(\mathbf{x}) - f^*(\mathbf{x})|^2}_{\mathrm{II}}.\] Bounds on terms \(\mathrm{I}\) and \(\mathrm{II}\) are given by Lemma 1 and Lemma 2, respectively. Adding the bounds together gives the desired result. Integrating the bound over the domain \(D\) leads to inequality 12 . ◻

3.4 Estimation Error Bound↩︎

In this section, we aim to bound the estimation error \(\|f^*-f^\sharp\|_{L^2(D)}\). The error bound depends on the condition number of random feature matrix \(\mathbf{A}\), which can be bounded by using the concentration properties of random feature matrix. We refer readers to [31], [35] for more details about the concentration properties of random feature matrix. We first state the concentration results, and then derive an upper bound for the estimation error.

Theorem 3 (Concentration Property of Random Feature Matrix). Assume the data \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\), the random features \(\{ \boldsymbol{\omega}_k\}_{k\in [N]} \subset\mathbb{R}^d\), and the random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) satisfy Assumption 4. If the following conditions hold \[\begin{align} \tag{13} & N \geq C\eta^{-2}m\log\left(\frac{2m}{\delta}\right) \\ \tag{14} & \gamma \geq \frac{1}{K}\log\left(\frac{m}{\eta}\right) \end{align}\] for some \(\delta,\eta\in(0,1)\), where \(C>0\) is a universal constant, then with probability at least \(1-\delta\), we have \[\label{concentration} \left\|\frac{1}{N}\mathbf{A}\mathbf{A}^* - \mathbf{I}_m\right\|_2 \leq 2\eta.\tag{15}\]

The direct consequence of this result is that each eigenvalue of \(\frac{1}{N}\mathbf{A}\mathbf{A}^*\) is close to 1. Specifically, if the conditions in Theorem 3 are satisfied, then \[\left| \lambda_k\left(\frac{1}{N}\mathbf{A}\mathbf{A}^*\right) -1 \right| \leq 2\eta\] with probability at least \(1-\delta\) for all \(k\in[m]\). Here, \(\lambda_k(\frac{1}{N}\mathbf{A}\mathbf{A}^*)\) is the \(k\)-th eigenvalue of matrix \(\frac{1}{N}\mathbf{A}\mathbf{A}^*\). Therefore, the matrix \(\mathbf{A}\mathbf{A}^*\) is invertible with high probability and hence the pseudo-inverse \(\mathbf{A}^\dagger\) is well-defined.
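The concentration property can also be checked empirically by forming \(\frac{1}{N}\mathbf{A}\mathbf{A}^*\) for well-separated points and Cauchy features with \(\gamma\) chosen according to condition 14. The sketch below is our own illustration; the value of \(N\) used here is smaller than the conservative requirement of condition 13, so the printed deviation is only an empirical sanity check rather than a verification of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
m, N, K, eta = 50, 20_000, 0.2, 0.2

# a 10 x 5 grid in R^2 with pairwise separation at least K
X = np.stack([(np.arange(m) % 10) * K, (np.arange(m) // 10) * K], axis=1)
gamma = np.log(m / eta) / K                      # condition (14)

omega = gamma * rng.standard_cauchy((N, 2))      # tensor-product Cauchy features
A = np.exp(1j * X @ omega.T)                     # (m, N) random feature matrix
dev = np.linalg.norm(A @ A.conj().T / N - np.eye(m), 2)
print(f"|| (1/N) A A* - I ||_2 = {dev:.3f}, Theorem 3 bound 2*eta = {2 * eta}")
```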

Proof of Theorem 3. The main idea in the proof is to bound the difference in 15 by \[\left\|\frac{1}{N}\mathbf{A}\mathbf{A}^* - \mathbf{I}_m\right\|_2 \leq \left\| \frac{1}{N}\mathbf{A}\mathbf{A}^* - \mathbb{E}_{\boldsymbol{\omega}} \left[ \frac{1}{N}\mathbf{A}\mathbf{A}^* \right] \right\|_2 + \left\| \mathbb{E}_{\boldsymbol{\omega}} \left[ \frac{1}{N}\mathbf{A}\mathbf{A}^* \right] - \mathbf{I}_m \right\|_2.\] To bound the first term, we let \(\mathbf{W}_\ell\) be the \(\ell\)-th column of \(\mathbf{A}\). Define the random matrices \(\{ \mathbf{Y}_\ell\}_{\ell\in[N]}\) as \[\mathbf{Y}_\ell = \mathbf{W}_\ell \mathbf{W}_\ell^* - \mathbb{E}_{\boldsymbol{\omega}}[\mathbf{W}_\ell \mathbf{W}_\ell^*].\] Then \((\mathbf{Y}_\ell)_{j,j} = 0\) and, for \(j,k\in[m]\) with \(j\neq k\), \[\begin{align} (\mathbf{Y}_\ell)_{j,k} =& \exp(i\langle \boldsymbol{\omega}_\ell, \mathbf{x}_j-\mathbf{x}_k\rangle) - \mathbb{E}_{\boldsymbol{\omega}} [\exp(i\langle \boldsymbol{\omega}, \mathbf{x}_j-\mathbf{x}_k\rangle)] \\ =& \exp(i\langle \boldsymbol{\omega}_\ell, \mathbf{x}_j-\mathbf{x}_k\rangle) - \exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_1), \end{align}\] where we use the characteristic function of the tensor-product Cauchy distribution in the second equality. Note that \(\mathbf{Y}_\ell\) is self-adjoint and its induced \(\ell_2\) norm is bounded by its largest eigenvalue. By Gershgorin’s disk theorem, the separation \(\|\mathbf{x}_j-\mathbf{x}_k\|_2\geq K\) for \(j\neq k\), and condition 14 , \[\begin{align} \|\mathbf{Y}_\ell\|_2 \leq& \max_{j\in[m]} \sum_{k\neq j} |\exp(i\langle \boldsymbol{\omega}_\ell, \mathbf{x}_j-\mathbf{x}_k\rangle) - \exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_1)| \\ \leq& \max_{j\in[m]} \sum_{k\neq j} \left( 1+\exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_1) \right) \\ \leq& \max_{j\in[m]} \sum_{k\neq j} \left( 1+\exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_2) \right) \\ \leq& m(1+\exp(-\gamma K)) \\ \leq& m+\eta. \end{align}\] The variance term is bounded by \[\begin{align} \left\|\sum_{\ell=1}^N \mathbb{E}_{\boldsymbol{\omega}} [\mathbf{Y}_\ell^2] \right\|_2 \leq& \sum_{\ell=1}^N \left\|\mathbb{E}_{\boldsymbol{\omega}} [\mathbf{Y}_\ell^2]\right\|_2 = \sum_{\ell=1}^N \left\|\mathbb{E}_{\boldsymbol{\omega}} \left[\mathbf{W}_\ell \mathbf{W}_\ell^* \mathbf{W}_\ell \mathbf{W}_\ell^*\right] - \left( \mathbb{E}_{\boldsymbol{\omega}} \left[\mathbf{W}_\ell \mathbf{W}_\ell^*\right]\right)^2\right\|_2\\ =& \sum_{\ell=1}^N \left\| m\mathbb{E}_{\boldsymbol{\omega}} \mathbf{W}_\ell \mathbf{W}_\ell^* - \left( \mathbb{E}_{\boldsymbol{\omega}} \left[\mathbf{W}_\ell \mathbf{W}_\ell^*\right]\right)^2 \right\|_2 \\ \leq& N\left(m(1+\eta)+(1+\eta)^2\right). \end{align}\] Here we use the fact that \(\mathbf{W}_\ell\) is a vector with \(\|\mathbf{W}_\ell\|_2 = \sqrt{m}\), which implies \(\mathbf{W}_\ell \mathbf{W}_\ell^* \mathbf{W}_\ell \mathbf{W}_\ell^* = m \mathbf{W}_\ell \mathbf{W}_\ell^*\), and that \(\mathbb{E}_{\boldsymbol{\omega}} \mathbf{W}_\ell \mathbf{W}_\ell^*\) is self-adjoint with \(\ell_2\) norm bounded by \[\left\| \mathbb{E}_{\boldsymbol{\omega}} \mathbf{W}_\ell \mathbf{W}_\ell^* \right\|_2 \leq 1 + \max_{j\in[m]} \sum_{k\neq j} |\exp(-\gamma \|\mathbf{x}_k-\mathbf{x}_j\|_1)| \leq 1 + m\exp(-\gamma K) \leq 1+\eta\] by Gershgorin’s disk theorem.
Since \(\{\mathbf{Y}_\ell\}_{\ell\in[N]}\) are independent mean-zero self-adjoint matrices, applying matrix Bernstein’s inequality gives \[\begin{align} \mathbb{P}\left( \left\| \frac{1}{N}\mathbf{A}\mathbf{A}^* - \mathbb{E}_{\boldsymbol{\omega}} \left[\frac{1}{N}\mathbf{A}\mathbf{A}^*\right] \right\|_2 \geq \eta \right) =& \mathbb{P}\left( \left\|\sum_{\ell=1}^N \mathbf{Y}_\ell \right\|_2 \geq N\eta \right)\\ \leq & 2m\exp\left( - \frac{N\eta^2/2}{m(1+\eta)+(1+\eta)^2 + (m+\eta)\eta/3} \right) \\ \leq& 2m\exp\left( - \frac{N\eta^2}{5m+9} \right). \end{align}\] The left-hand side is at most \(\delta\) provided condition 13 is satisfied with \(C=6\), assuming that \(m\geq 9\) and \(\eta<1\).

For the second term, denote by \(\mathbf{B}\) the matrix \(\mathbb{E}_{\boldsymbol{\omega}} [\frac{1}{N}\mathbf{A}\mathbf{A}^*] - \mathbf{I}_m\). Then \(\mathbf{B}\) is symmetric with \(\mathbf{B}_{j,j}=0\) and \(\mathbf{B}_{j,k} = \mathbb{E}_{\boldsymbol{\omega}} \exp(i\langle \boldsymbol{\omega}, \mathbf{x}_j-\mathbf{x}_k\rangle) = \exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_1)\) for all \(j,k\in[m]\) with \(j\neq k\). Since \(\mathbf{B}\) is self-adjoint, by Gershgorin’s disk theorem and condition 14 , the induced \(\ell_2\) norm is bounded by \[\|\mathbf{B}\|_2 = \max_{j\in[m]}\sum_{k\neq j} |\exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_1)| \leq \max_{j\in[m]}\sum_{k\neq j}|\exp(-\gamma\|\mathbf{x}_j-\mathbf{x}_k\|_2)| \leq m\exp(-\gamma K) \leq \eta.\] ◻

Utilizing the concentration property of random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\), we derive an upper bound for the estimation error \(\|f^* - f^\sharp\|_{L_2(D)}\), which is summarized in Theorem 4.

Theorem 4 (Bound on Estimation Error). Assume the data \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\), the random features \(\{ \boldsymbol{\omega}_k\}_{k\in[N]} \subset\mathbb{R}^d\), and the random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) satisfy Assumption 4. Let \(f^*\) and \(f^\sharp\) be defined as 11 and 9 , respectively, and satisfy Assumption 3. If for some \(\eta,\delta\in(0,1)\) the following conditions hold, \[\begin{align} & N \geq C\eta^{-2}m\log\left(\frac{2m}{\delta}\right) \\ & \gamma \geq \frac{1}{K}\log\left(\frac{m}{\eta}\right) \end{align}\] where \(C>0\) is a universal constant independent of the dimension \(d\), then the following bound holds with probability at least \(1-4\delta\) \[\|f^* - f^\sharp\|^2_{L^2(D)} \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d(1+2\eta)\left( \frac{2m}{1-2\eta}I_1 + 2I_2 \right),\] where the quantities \(I_1\) and \(I_2\) are \[\begin{align} &I_1 := \frac{2\|f\|_\rho^4}{T^2} + \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N}, \\ &I_2 := \|f\|_\rho^2 + \frac{4T^2\log(2/\delta)}{N} + \sqrt{\frac{2T^2\|f\|_\rho^2\log(2/\delta)}{N}}. \end{align}\]

Lemma 3. Assume the data \(X = \{\mathbf{x}_j\}_{j\in[m]} \subset D\), the random features \(\{ \boldsymbol{\omega}_k\}_{k\in[N]} \subset\mathbb{R}^d\), and the random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) satisfy Assumption 4. Let \(f^*\) and \(f^\sharp\) be defined as 11 and 9 , respectively, and satisfy Assumption 3. Then we have \[\| f^* - f^\sharp\|_{L^2(D)}^2 \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d\|\mathbf{A}\|_2^2 \| \mathbf{c}^* - \mathbf{c}^\sharp\|_2^2.\]

Proof. Consider a partition \(\{ D_j \}_{j\in[m]}\) of the domain \(D\), where \[D_j = \{\mathbf{x}\in D: \|\mathbf{x}-\mathbf{x}_j\| \leq \|\mathbf{x}-\mathbf{x}_k\| \text{ for all } k\neq j \}.\] For each \(\mathbf{x}\in D\), we use Assumption 3 and apply the triangle inequality to obtain \[\begin{align} |f^*(\mathbf{x}) - f^\sharp(\mathbf{x})| &\leq \left| \left(f^*(\mathbf{x}) - f^\sharp(\mathbf{x})\right) - \left(f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right)\right| + \left|f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right| \\ & \leq L_f\|\mathbf{x}-\mathbf{x}_j\|_2 + \left|f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right|. \end{align}\] Then, the estimation error is bounded as \[\label{riemann95error} \begin{align} \|f^* - f^\sharp\|^2_{L^2(D)} &= \int_{D} \left|f^*(\mathbf{x})-f^\sharp(\mathbf{x})\right|^2 d\mathbf{x}= \sum_{j=1}^m \int_{D_j} \left|f^*(\mathbf{x})-f^\sharp(\mathbf{x})\right|^2 d\mathbf{x}\\ & \leq \sum_{j=1}^m \int_{D_j} 2L_f^2\|\mathbf{x}-\mathbf{x}_j\|_2^2 + 2\left|f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right|^2 d\mathbf{x}\\ & \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + \sum_{j=1}^m 2\mathop{\mathrm{vol}}(D_j)\left|f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right|^2 \\ & \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d\sum_{j=1}^m \left|f^*(\mathbf{x}_j) - f^\sharp(\mathbf{x}_j)\right|^2 \\ & \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d\|\mathbf{A}\mathbf{c}^* - \mathbf{A}\mathbf{c}^\sharp\|_2^2 \\ & \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d\|\mathbf{A}\|_2^2 \| \mathbf{c}^* - \mathbf{c}^\sharp\|_2^2 \end{align}\tag{16}\] The third line holds since each \(\mathbf{x}\in D_j\) satisfies \[\|\mathbf{x}- \mathbf{x}_j \|_2 = \min_{\mathbf{x}'\in X} \|\mathbf{x}-\mathbf{x}'\|_2 \leq h_X,\] and the fourth line holds since the volume of each \(D_j\) is bounded by \(Ch_X^d\) for some constant \(C>0\). Although the constant \(C\) depends on the dimension \(d\), for the problems considered here, \(d\leq3\). Specifically, the constant \(C=2\) when \(d=1\), \(C=\pi\) when \(d=2\), and \(C=\frac{4\pi}{3}\) if \(d=3\). ◻

Lemma 4 (Decay Rate of \(\|\mathbf{c}^*\|_2^2\)). Let \(f^*\) be defined as 11 and \(\mathbf{c}^*\) be the corresponding coefficient vector. For some \(\delta\in(0,1)\), it holds with probability at least \(1-\delta\) that \[\|\mathbf{c}^*\|_2^2 \leq \frac{1}{N} \left( \|f\|_\rho^2 + \frac{4T^2\log(2/\delta)}{N} + \sqrt{\frac{2T^2\|f\|_\rho^2\log(2/\delta)}{N}} \right).\]

Proof. By the definition of \(c_k^*\), we have \[\|\mathbf{c}^*\|_2^2 = \sum_{k=1}^N |c_k^*|^2 = \frac{1}{N^2} \sum_{k=1}^N |\alpha_{\leq T}(\boldsymbol{\omega}_k)|^2.\] Define random variable \(Z(\boldsymbol{\omega}) = |\alpha_{\leq T}(\omega)|^2\) and let \(Z_1,\dots, Z_N\) be \(N\) i.i.d copies of \(Z\) defined as \(Z_k = |\alpha_{\leq T}(\boldsymbol{\omega}_k)|^2\) for each \(k\in[N]\). By the boundedness of \(\alpha_{\leq T}(\boldsymbol{\omega})\), we have an upper bound \(|Z_k|\leq T^2\) for each \(k\in[N]\). The variance of \(Z\) is bounded above as \[\sigma^2 := \mathbb{E}_{\boldsymbol{\omega}} |Z-\mathbb{E}_{\boldsymbol{\omega}} Z|^2 \leq \mathbb{E}_{\boldsymbol{\omega}} |Z|^2 \leq T^2\mathbb{E}_{\boldsymbol{\omega}} \left[ |\alpha(\boldsymbol{\omega})|^2 \right] = T^2\|f\|^2_\rho.\] By Lemma A.2 and Theorem A.1 in [26], it holds with probability at least \(1-\delta\) that \[\left| \frac{1}{N} \sum_{k=1}^N Z_k - \mathbb{E}_{\boldsymbol{\omega}} Z \right| \leq \frac{4T^2\log(2/\delta)}{N} + \sqrt{\frac{2T^2\|f\|_\rho^2\log(2/\delta)}{N}}.\] Then, we have \[\|\mathbf{c}^*\|_2^2 = \frac{1}{N} \left( \frac{1}{N}\sum_{k=1}^N |\alpha_{\leq T}(\boldsymbol{\omega}_k)|^2 \right) \leq \frac{1}{N} \left( \|f\|_\rho^2 + \frac{4T^2\log(2/\delta)}{N} + \sqrt{\frac{2T^2\|f\|_\rho^2\log(2/\delta)}{N}} \right).\] ◻

Now, we are ready to prove Theorem 4, which states an upper bound for the estimation error. Recalling the result of Lemma 3, it remains to bound \(\|\mathbf{A}\|_2^2\) and \(\|\mathbf{c}^*-\mathbf{c}^\sharp\|_2^2\).

Proof of Theorem 4. Using Theorem 3, the squared (induced) 2-norm of the random feature matrix \(\mathbf{A}\) is bounded by \[\|\mathbf{A}\|_2^2 = \lambda_{\max}(\mathbf{A}\mathbf{A}^*) = N \lambda_{\max}\left(\frac{1}{N}\mathbf{A}\mathbf{A}^*\right) \leq N(1+2\eta)\] with probability at least \(1-\delta\), and thus \[\| f^* - f^\sharp \|_{L^2(D)}^2 \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d N(1+2\eta) \| \mathbf{c}^* - \mathbf{c}^\sharp\|_2^2\] with probability at least \(1-\delta\).

To estimate \(\|\mathbf{c}^*-\mathbf{c}^\sharp\|^2_2\), we use the pseudo-inverse \(\mathbf{A}^\dagger = \mathbf{A}^*(\mathbf{A}\mathbf{A}^*)^{-1}\) of \(\mathbf{A}\in\mathbb{C}^{m\times N}\). Since the vector \(\mathbf{c}^\sharp = \mathbf{A}^\dagger \mathbf{y}\), we apply the triangle inequality and the inequality \((a+b)^2 \leq 2a^2 + 2b^2\) to obtain \[\begin{align} \| \mathbf{c}^* - \mathbf{c}^\sharp\|_2^2 \leq & 2\| \mathbf{A}^\dagger \mathbf{A}\mathbf{c}^* - \mathbf{A}^\dagger \mathbf{y}\|_2^2 + 2\|\mathbf{A}^\dagger \mathbf{A}\mathbf{c}^* - \mathbf{c}^*\|_2^2 \\ \leq & 2\| \mathbf{A}^\dagger \|_2^2 \|\mathbf{A}\mathbf{c}^* - \mathbf{y}\|_2^2 + 2\|\mathbf{A}^\dagger \mathbf{A}- \mathbf{I}_N\|_2^2\|\mathbf{c}^*\|_2^2. \end{align}\] Using Theorem 3, the squared (induced) 2-norm of the pseudo-inverse \(\mathbf{A}^\dagger\) is bounded by \[\|\mathbf{A}^\dagger\|_2^2 = \frac{1}{\lambda_{\min}(\mathbf{A}\mathbf{A}^*)} = \frac{1}{N\lambda_{\min}\left(\frac{1}{N}\mathbf{A}\mathbf{A}^*\right)} \leq \frac{1}{N(1-2\eta)}\] with probability at least \(1-\delta\). Note that \(\mathbf{I}_N - \mathbf{A}^\dagger \mathbf{A}\) is an orthogonal projection (onto the null space of \(\mathbf{A}\)), and hence \(\|\mathbf{A}^\dagger \mathbf{A}- \mathbf{I}_N\|_2^2 \leq 1\). Moreover, we can show that \[\begin{align} \|\mathbf{A}\mathbf{c}^* - \mathbf{y}\|_2^2 =& \sum_{j=1}^m |f(\mathbf{x}_j) - f^*(\mathbf{x}_j)|^2 \leq m \sup_{\mathbf{x}\in D} |f(\mathbf{x}) - f^*(\mathbf{x})|^2 \\ \leq & m \left( \frac{2\|f\|_\rho^4}{T^2} + \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N} \right) \end{align}\] where we use Theorem 2 to obtain the last inequality. Putting Lemma 3, Lemma 4 and all inequalities together gives, with probability at least \(1-4\delta\), \[\|f^* - f^\sharp\|_{L^2(D)}^2 \leq 2L_f^2h_X^2\mathop{\mathrm{vol}}(D) + Ch_X^d(1+2\eta)\left( \frac{2m}{1-2\eta}I_1 + 2I_2 \right)\] where \[\begin{align} &I_1 := \frac{2\|f\|_\rho^4}{T^2} + \frac{32T^2\log^2(2/\delta)}{N^2} + \frac{4\|f\|^2_\rho\log(2/\delta)}{N}, \\ &I_2 := \|f\|_\rho^2 + \frac{4T^2\log(2/\delta)}{N} + \sqrt{\frac{2T^2\|f\|_\rho^2\log(2/\delta)}{N}}. \end{align}\] ◻

Proof of Theorem 1. Combining the results of Theorem 2 and Theorem 4 leads to an upper bound for the generalization error. Selecting \(T=\|f\|_\rho \sqrt{N}\) yields the desired result. ◻

4 Error Analysis for Random Feature Operator Learning↩︎

In this section, we present an error bound for our proposed random feature operator learning method. Let \(G\) be the target operator; the generalization error of the estimator \(\Hat{G}\) is defined as \(\|G(u) - \Hat{G}(u)\|_{L^2(D_\mathcal{V})}\) for any \(u\in\mathcal{U}\), where \(\mathcal{U}\) is chosen to be the function space \(\mathcal{F}(\rho)\) defined in 7 . Using the triangle inequality, we decompose the generalization error of our proposed estimator \(\Hat{G} = R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}\) into \[\begin{align} & \|G(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \\ & \leq \underbrace{\|G(u) - R_\mathcal{V}\circ f\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})}}_{\text{Approximation Error}} + \underbrace{ \|R_\mathcal{V}\circ f\circ S_\mathcal{U}(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} }_{\text{Estimation Error}}. \end{align}\]

Lemma 5 (Bound on Approximation Error). Assume that \(G\) satisfies Assumption 1 and the map \(f:\mathbb{R}^n\to\mathbb{R}^m\) is defined by \(f=S_\mathcal{V}\circ G\circ R_\mathcal{U}\). Then we have \[\|G(u) - R_\mathcal{V}\circ f\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \leq L_G \| u - R_\mathcal{U}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{U})} + \| v - R_\mathcal{V}\circ S_\mathcal{V}(v)\|_{L^2(D_\mathcal{V})},\] where \(v = G\circ R_\mathcal{U}\circ S_\mathcal{U}(u)\).

Proof. Recall the definition of the map \(f=S_\mathcal{V}\circ G\circ R_\mathcal{U}\) and apply the triangle inequality to get \[\begin{align} & \|G(u) - R_\mathcal{V}\circ f\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \\ \leq & \|G(u) - G\circ R_\mathcal{U}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} + \|G\circ R_\mathcal{U}\circ S_\mathcal{U}(u) - R_\mathcal{V}\circ S_\mathcal{V}\circ G\circ R_\mathcal{U}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \\ \leq & L_G \| u - R_\mathcal{U}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{U})} + \| v - R_\mathcal{V}\circ S_\mathcal{V}(v)\|_{L^2(D_\mathcal{V})}. \end{align}\] The last inequality holds since we assume \(G:\mathcal{U}\to \mathcal{V}\) is Lipschitz continuous with Lipschitz constant \(L_G\), and we denote \(v=G\circ R_\mathcal{U}\circ S_\mathcal{U}(u)\). ◻

To bound the estimation error, we need to introduce the following vector-valued random feature map \(\Bar{f}:\mathbb{R}^n\to\mathbb{R}^m\) where each \(\Bar{f}_j:\mathbb{R}^n\to\mathbb{R}\) is a random feature approximation taking the form \[\label{operator95temp} \Bar{f}_j(\mathbf{u}) = \sum_{k=1}^N \Bar{c}_k^{(j)} \exp(i\langle \boldsymbol{\omega}_k,\mathbf{u}\rangle),\tag{17}\] where the coefficient vector \(\Bar{\mathbf{c}}^{(j)}\in\mathbb{R}^N\) for all \(j\in[m]\) is trained by solving the min-norm interpolation problem \[\min_{\mathbf{c}^{(j)}\in\mathbb{R}^N} \|\mathbf{c}^{(j)}\|_2 \quad \text{s.t.} \quad \Bar{f}_j(\mathbf{u}_\ell) = f_j(\mathbf{u}_\ell) \quad \text{for all } \ell\in[M].\] Notice that we can never compute \(\Bar{f}\) in practice since \(f:\mathbb{R}^n\to\mathbb{R}^m\) is not known to us.

Assumption 5. We assume that each component \(\Bar{f}_j:\mathbb{R}^n\to\mathbb{R}\) is Lipschitz continuous in the sense that there exists \(\Bar{L}_j>0\) such that \(|\Bar{f}_j(\mathbf{u}_1) - \Bar{f}_j(\mathbf{u}_2)| \leq \Bar{L}_j\|\mathbf{u}_1 - \mathbf{u}_2\|_2\) for any \(\mathbf{u}_1,\mathbf{u}_2\in\mathbb{R}^n\).

Recall that \(R_\mathcal{V}:\mathbb{R}^m\to\mathcal{V}\) is a random feature recovery map, and denote \(f\circ S_\mathcal{U}(u)\), \(\Hat{f}\circ S_\mathcal{U}(u)\), and \(\Bar{f}\circ S_\mathcal{U}(u)\) by the vectors \(\mathbf{z}\), \(\Hat{\mathbf{z}}\), and \(\Bar{\mathbf{z}}\in\mathbb{R}^m\), respectively. Then, we can write \[\begin{align} R_\mathcal{V}(\mathbf{z}) = \sum_{k=1}^N c_k \exp(i\langle \boldsymbol{\omega}_k,\mathbf{y}\rangle), \quad R_\mathcal{V}(\Hat{\mathbf{z}}) = \sum_{k=1}^N \Hat{c}_k \exp(i\langle \boldsymbol{\omega}_k,\mathbf{y}\rangle), \quad R_\mathcal{V}(\Bar{\mathbf{z}}) = \sum_{k=1}^N \Bar{c}_k \exp(i\langle \boldsymbol{\omega}_k,\mathbf{y}\rangle), \end{align}\] where \(\mathbf{c}= \mathbf{A}^\dagger_y \mathbf{z}\), \(\Hat{\mathbf{c}} = \mathbf{A}^\dagger_y\Hat{\mathbf{z}}\), \(\Bar{\mathbf{c}} = \mathbf{A}^\dagger_y \Bar{\mathbf{z}}\in\mathbb{R}^N\), and the matrix \(\mathbf{A}_y\) is defined (component-wise) by \((\mathbf{A}_y)_{j,k} = \exp(i\langle \boldsymbol{\omega}_k, \mathbf{y}_j\rangle)\) with \(\mathbf{A}^\dagger_y=\mathbf{A}_y^*(\mathbf{A}_y\mathbf{A}_y^*)^{-1}\) being its pseudo-inverse.

Let \(\mathcal{U}\) and \(\mathcal{V}\) be sets of functions defined on compact domains \(D_\mathcal{U}\) and \(D_\mathcal{V}\), respectively. Distinct collocation points are denoted by \(X=\{\mathbf{x}_j\}_{j\in[n]}\subset D_\mathcal{U}\) and \(Y=\{\mathbf{y}_j\}_{j\in[m]}\subset D_\mathcal{V}\). The finite training data \(\{ (u_j,v_j) \}_{j\in[M]} \subset \mathcal{U}\times \mathcal{V}\) satisfies \(G(u_j) = v_j\) for all \(j\in[M]\) and is accessible via the sampling operators \(S_\mathcal{U}\) and \(S_\mathcal{V}\), i.e., we denote the point evaluations of the functions \(u_j\) and \(v_j\) by \[\mathbf{u}_j = S_\mathcal{U}(u_j) = [u_j(\mathbf{x}_1), \dots, u_j(\mathbf{x}_n)]^\top \in\mathbb{R}^n, \quad \mathbf{v}_j = S_\mathcal{V}(v_j) = [v_j(\mathbf{y}_1), \dots, v_j(\mathbf{y}_m)]^\top \in\mathbb{R}^m.\] Let \(U\subset \mathbb{R}^n\) be a compact set with Lipschitz boundary and \(\{ \mathbf{u}_j\}_{j\in[M]}\) be the set of all point evaluation vectors, with fill-in distance \[h_\mathbf{u}:= \max_{\mathbf{u}'\in U}\min_{1\leq j\leq M} \|\mathbf{u}_j - \mathbf{u}'\|_2.\]

Lemma 6 (Bound on Estimation Error). Suppose that the target operator \(G\) satisfies Assumption 1. Let \(f\), \(\hat{f}\), and \(\Bar{f}\) be defined by 1 , 4 , and 17 , respectively. Suppose that \(f\) and \(\Bar{f}\) satisfy Assumptions 2 and 5, respectively. Then, for some parameter \(\eta\in(0,1)\), we have \[\begin{align} & \|R_\mathcal{V}\circ f \circ S_\mathcal{U}(u) - R_\mathcal{V}\circ \Hat{f} \circ S_\mathcal{U}(u) \|_{L_2(D_\mathcal{V})} \\ \leq & \sqrt{\frac{\mathop{\mathrm{vol}}(D_\mathcal{V})}{1-\eta}} \left( 2\sqrt{m}h_\mathbf{u}\sup_{j\in[m]}(L_j+\Bar{L}_j) + \frac{\sqrt{2}\|S_\mathcal{V}\| L_G\sqrt{M}}{\sqrt{1-\eta}} \sup_{\ell\in[M]}\| u_\ell - R_\mathcal{U}\circ S_\mathcal{U}(u_\ell)\|_{L_2(D_\mathcal{U})} \right) \end{align}\] with probability at least \(1-\delta\).

Proof. Following the notation above for the random feature functions, we can bound the error by \[\label{boundI2} \begin{align} & \left\|R_\mathcal{V}\circ f\circ S_\mathcal{U}(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\right\|^2_{L_2(D_\mathcal{V})} \leq \int_{D_\mathcal{V}} \left|\sum_{k=1}^N \left(c_k - \Hat{c}_k\right) \exp(i\langle \boldsymbol{\omega}_k,\mathbf{y}\rangle) \right|^2 d\mathbf{y}\\ \leq & \int_{D_\mathcal{V}} N \|\mathbf{c}-\Hat{\mathbf{c}}\|^2_2 d\mathbf{y} \leq N\mathop{\mathrm{vol}}(D_\mathcal{V})\|\mathbf{A}_y^\dagger\|_2^2 \|\mathbf{z}-\Hat{\mathbf{z}}\|_2^2 \leq \frac{\mathop{\mathrm{vol}}(D_\mathcal{V})}{1-\eta}\|\mathbf{z}-\Hat{\mathbf{z}}\|_2^2, \end{align}\tag{18}\] where we use the bound on the (induced) 2-norm of the pseudo-inverse \(\mathbf{A}_y^\dagger\). It remains to bound \(\|\mathbf{z}-\Hat{\mathbf{z}}\|_2\). By definitions, we write its square as \[\|\mathbf{z}-\Hat{\mathbf{z}}\|_2^2 = \sum_{j=1}^m |f_j(\mathbf{u}) - \Hat{f}_j(\mathbf{u})|^2 \leq 2\sum_{j=1}^m |f_j(\mathbf{u}) - \Bar{f}_j(\mathbf{u})|^2 + 2\sum_{j=1}^m |\Bar{f}_j(\mathbf{u}) - \Hat{f}_j(\mathbf{u})|^2.\] Since \(f_j\) and \(\Bar{f}_j\) are Lipschitz continuous with Lipschitz constants \(L_j\) and \(\Bar{L}_j\) for each \(j\in[m]\), and \(\Bar{f}_j(\mathbf{u}_\ell) = f_j(\mathbf{u}_\ell)\) for each training sample \(\mathbf{u}_\ell\), we obtain the bound \[\begin{align} & |f_j(\mathbf{u}) - \Bar{f}_j(\mathbf{u})| \leq |f_j(\mathbf{u}) - f_j(\mathbf{u}_\ell)| + |f_j(\mathbf{u}_\ell) - \Bar{f}_j(\mathbf{u}_\ell)| + |\Bar{f}_j(\mathbf{u}_\ell) - \Bar{f}_j(\mathbf{u})| \\ & \leq L_j \|\mathbf{u}-\mathbf{u}_\ell\|_2 + \Bar{L}_j\|\mathbf{u}-\mathbf{u}_\ell\|_2 \\ & \leq (L_j + \Bar{L}_j) h_\mathbf{u}, \end{align}\] where \(\mathbf{u}_\ell\) denotes the training sample closest to \(\mathbf{u}\), and hence the Euclidean distance \(\|\mathbf{u}-\mathbf{u}_\ell\|_2\) is at most the fill-in distance \(h_\mathbf{u}\), i.e., \(\|\mathbf{u}-\mathbf{u}_\ell\|_2 \leq h_\mathbf{u}\). Recalling the constructions of \(\Bar{f}\) and \(\Hat{f}\), we have \[\begin{align} & \sum_{j=1}^m \left|\Bar{f}_j(\mathbf{u}) - \Hat{f}_j(\mathbf{u})\right|^2 \leq \sum_{j=1}^m \left( \sum_{k=1}^N \left|\Bar{\mathbf{c}}_k^{(j)} - \Hat{\mathbf{c}}_k^{(j)}\right|\right)^2 \leq \sum_{j=1}^m N\|\Bar{\mathbf{c}}^{(j)} - \Hat{\mathbf{c}}^{(j)}\|_2^2 \\ \leq & N\|\mathbf{A}^\dagger\|_2^2 \sum_{j=1}^m \| \Bar{\mathbf{v}}^{(j)} - \Hat{\mathbf{v}}^{(j)}\|^2_2 \\ \leq & \frac{1}{1-\eta} \sum_{\ell=1}^M \|S_\mathcal{V}\circ G \circ R_\mathcal{U}\circ S_\mathcal{U}(u_\ell) - S_\mathcal{V}\circ G(u_\ell)\|^2_2 \\ \leq & \frac{1}{1-\eta} \|S_\mathcal{V}\|^2 L_G^2 M \sup_{\ell\in[M]}\| u_\ell - R_\mathcal{U}\circ S_\mathcal{U}(u_\ell)\|^2_{L_2(D_\mathcal{U})}. \end{align}\] Therefore, we can bound \(\|\mathbf{z}-\Hat{\mathbf{z}}\|_2^2\) by \[\label{bound95z} \| \mathbf{z}- \Hat{\mathbf{z}} \|_2^2 \leq 4mh_\mathbf{u}^2\sup_{j\in[m]}(L_j^2+\Bar{L}_j^2) + \frac{2\|S_\mathcal{V}\|^2 L^2_G M}{1-\eta} \sup_{\ell\in[M]}\| u_\ell - R_\mathcal{U}\circ S_\mathcal{U}(u_\ell)\|^2_{L_2(D_\mathcal{U})}.\tag{19}\] Substituting 19 into 18 , and then taking the square root, leads to the desired error bound. ◻

Theorem 5 (Generalization Error Bound for Operator Learning). Suppose that the target operator \(G\) satisfies Assumption 1. Let \(f\), \(\hat{f}\), and \(\Bar{f}\) be defined by 1, 4, and 17, respectively. Suppose that \(f\) and \(\Bar{f}\) satisfy Assumptions 2 and 5, respectively. Then, for any \(u\in B_1(\mathcal{F}(\rho))\), it holds that \[\begin{align} & \|G(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \leq L_G \| u - R_\mathcal{U}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{U})} + \| v - R_\mathcal{V}\circ S_\mathcal{V}(v)\|_{L_2(D_\mathcal{V})} + \\ & \sqrt{\frac{\mathop{\mathrm{vol}}(D_\mathcal{V})}{1-\eta}} \left( 2\sqrt{m}h_\mathbf{u}\sup_{j\in[m]}(L_j+\Bar{L}_j) + \frac{\sqrt{2}\|S_\mathcal{V}\| L_G\sqrt{M}}{\sqrt{1-\eta}} \sup_{k\in[M]}\| u_k - R_\mathcal{U}\circ S_\mathcal{U}(u_k)\|_{L_2(D_\mathcal{U})} \right) \end{align}\] with probability at least \(1-\delta\). Furthermore, let the training inputs satisfy \(u_j\in B_1(\mathcal{F}(\rho))\) for all \(j\in[M]\). Suppose that the collocation points \(X=\{\mathbf{x}_j\}_{j\in[n]}\) and \(Y=\{\mathbf{y}_j\}_{j\in[m]}\), the random features \(\{\boldsymbol{\omega}_k\}_{k\in[N]}\), and the random feature matrix \(\mathbf{A}\in\mathbb{C}^{m\times N}\) satisfy Assumption 4, and that the conditions in Theorem 1 hold. Then, for any \(u\in B_1(\mathcal{F}(\rho))\), it holds that \[\|G(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \leq \mathcal{O}\left(\frac{1}{\sqrt{N}} + h_X + h_X^{d/2} + h_Y + h_Y^{d/2} + h_\mathbf{u}\right)\] with probability at least \(1-6\delta\).

Proof. We first decompose the generalization error into two terms \[\begin{align} & \|G(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} \\ \leq& \underbrace{\|G(u) - R_\mathcal{V}\circ f\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})}}_{\text{Approximation Error}} + \underbrace{ \|R_\mathcal{V}\circ f\circ S_\mathcal{U}(u) - R_\mathcal{V}\circ\Hat{f}\circ S_\mathcal{U}(u)\|_{L_2(D_\mathcal{V})} }_{\text{Estimation Error}}. \end{align}\] Bounds on the approximation error and the estimation error are given by Lemma 5 and Lemma 6, respectively. Applying the generalization error bound for the random feature model in Theorem 1 to \(u\), \(G\circ R_\mathcal{U}\circ S_\mathcal{U}(u)\), and \(u_k\) leads to the desired result. ◻

5 Numerical Experiments↩︎

Numerical experiments are used to benchmark our proposed random feature operator learning method against the kernel-based method of [19] and neural operator methods, including DeepONet [3] and FNO [4]. We measure model performance by the empirical relative error on a test set \(U_{test}\): \[\mathrm{Error}(\Hat{G}) = \frac{1}{|U_{test}|} \sum_{u_k\in U_{test}} \frac{\|G(u_k)-\hat{G}(u_k)\|_{L^2(D_\mathcal{V})}}{\|G(u_k)\|_{L^2(D_\mathcal{V})}},\] where \(G:\mathcal{U}\to\mathcal{V}\) is the true operator. For any \(v\in\mathcal{V}\), we take \(\|v\|_{L^2(D_\mathcal{V})} = \sqrt{ \int_{D_\mathcal{V}} |v(x)|^2 dx}\), which in turn is estimated from the function values at the collocation points. We outline the setup of each problem in the following subsections. In all problems, \(\mathcal{U}\) and \(\mathcal{V}\) are spaces of real-valued functions defined on domains \(D_\mathcal{U}, D_\mathcal{V}\subset\mathbb{R}^d\) for \(d=1,2\). All experiments are conducted in Python and our source code is available online3.
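For completeness, the following is a minimal sketch of how this metric can be computed from arrays of sampled function values; the array names and shapes are our own convention (not from the released code), and uniform quadrature weights are assumed, which cancel in the ratio.

```python
import numpy as np

def relative_test_error(V_true, V_pred):
    """Empirical relative L2 error averaged over the test set.
    V_true, V_pred: arrays of shape (num_test_samples, num_collocation_points)."""
    num = np.linalg.norm(V_true - V_pred, axis=1)  # per-sample L2 error (up to a grid factor)
    den = np.linalg.norm(V_true, axis=1)           # per-sample L2 norm of the truth
    return np.mean(num / den)                      # the grid factor cancels in the ratio
```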

For the kernel method, we consider the RBF kernel and the Matérn kernel. Specifically, the RBF kernel with parameter \(\gamma>0\) is \(K(\mathbf{x},\mathbf{x}') = \exp(-\gamma\|\mathbf{x}- \mathbf{x}'\|_2^2)\). The Matérn kernel with scaling parameter \(\sigma>0\) and smoothness parameter \(\nu>0\) is defined by \[K_{\nu, \sigma}(\mathbf{x},\mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{x}'\|_2}{\sigma} \right)^\nu K_\nu\left( \frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{x}'\|_2}{\sigma} \right),\] where \(\Gamma\) is the gamma function and \(K_\nu\) is the modified Bessel function of the second kind of order \(\nu\). The Gaussian random feature model with parameter \(\gamma>0\) means that each coordinate of the random features \(\boldsymbol{\omega}\) is drawn i.i.d. from the normal distribution \(\mathcal{N}(0,2\gamma)\), which approximates the RBF kernel with parameter \(\gamma\). The Cauchy random feature model with parameter \(\gamma>0\) means that the random features \(\boldsymbol{\omega}\) are drawn from the tensor-product Cauchy distribution with scaling parameter \(\gamma\).
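As an illustration, the only difference between the two random feature models is the sampling distribution of the frequencies. The sketch below (function name and arguments are placeholders) draws Gaussian frequencies approximating the RBF kernel with parameter \(\gamma\), or tensor-product Cauchy frequencies with scaling parameter \(\gamma\).

```python
import numpy as np

def draw_random_features(N, d, gamma, family="cauchy", seed=0):
    """Draw N random frequency vectors in R^d.
    'gaussian': each coordinate ~ N(0, 2*gamma), approximating the RBF kernel.
    'cauchy'  : tensor-product Cauchy distribution with scaling parameter gamma."""
    rng = np.random.default_rng(seed)
    if family == "gaussian":
        return np.sqrt(2.0 * gamma) * rng.standard_normal((N, d))
    return gamma * rng.standard_cauchy((N, d))

# Complex exponential feature matrix at points Y of shape (m, d):
# A = np.exp(1j * Y @ omegas.T)
```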

5.1 Advection Equations↩︎

Consider the one-dimensional advection equation: \[\begin{align} \frac{\partial v}{\partial t} + \frac{\partial v}{\partial x} &= 0 \quad && x\in(0,1), t\in(0,1] \\ v(x,0) &= u(x) \quad && x\in(0,1). \end{align}\] We aim to learn the operator \(G\) mapping the initial condition \(u(x)=v(x, 0)\) to the solution at time \(t=0.5\), denoted by \(v(\cdot, 0.5)\). The domains of the function spaces \(\mathcal{U}\) and \(\mathcal{V}\) are \(D_\mathcal{U}= D_\mathcal{V}= (0,1)\). This problem was considered in [45], [46] with different distributions \(\mu\) for the initial condition. In the first advection example, the initial condition is a square wave centered at \(x = c\) of width \(b\) and height \(h\): \[u(x) = h\,1_{ \{c - b/2 \,\leq\, x \,\leq\, c + b/2\} }.\] The parameters \((c, b, h)\) are randomly generated from the uniform distribution on \([0.3, 0.7] \times [0.3, 0.6] \times [1, 2]\). The spatial grid has resolution 40, and we use 1000 instances for training and 200 instances for testing. Figure 4 shows an example of a training input, a training output, a true test example and its Cauchy random feature approximation, and the corresponding pointwise error.
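A minimal data-generation sketch for this example, assuming the square wave occupies the interval \([c-b/2,\, c+b/2]\) (our reading of the width convention) and a uniform grid on \((0,1)\):

```python
import numpy as np

def sample_square_wave(n_grid=40, seed=None):
    """Draw one square-wave initial condition u(x) = h * 1_{[c - b/2, c + b/2]}(x)
    with (c, b, h) uniform on [0.3, 0.7] x [0.3, 0.6] x [1, 2]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    c, b, h = rng.uniform(0.3, 0.7), rng.uniform(0.3, 0.6), rng.uniform(1.0, 2.0)
    return x, h * ((x >= c - b / 2) & (x <= c + b / 2)).astype(float)
```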

Figure 4: An example of training input, training output, test function and its Cauchy random feature approximation, and pointwise error for the first advection example.

The second advection example uses a more complicated initial condition, \[u(x) = h_1\, 1_{ \{c_1 - w \,\leq\, x \,\leq\, c_1 + w\} } + \sqrt{ \max(h_2^2 - a_2^2(x - c_2)^2, 0)}.\] The spatial resolution is 40, and we use 1000 samples to train the random feature models and another 1000 samples to test the performance. Similarly, we show an example of a training input, a training output, a true test example, its Cauchy random feature approximation, and the corresponding pointwise error in Figure 5.

Figure 5: An example of training input, training output, test function and its Cauchy random feature approximation, and pointwise error for the second advection example.

A different initial condition was considered in [46]; we refer to it as the third advection example. Suppose that \(\tilde{u}_0\) is generated from a Gaussian process \(GP(0,(-\Delta+3^2\mathbf{I})^{-2})\), where \(\Delta\) and \(\mathbf{I}\) represent the Laplacian and the identity, respectively. The initial condition is defined as \[u=-1+2\cdot 1_{ \{\tilde{u}_0 \geq 0\} }.\] The spatial resolution is 200, and we use 1000 samples for training and another 1000 instances for testing. In Figure 6, we show an example of a training input and output, a true test function, its Cauchy random feature approximation, and the corresponding pointwise error.

Figure 6: An example of training input, training output, test function, its Cauchy random feature approximation, and pointwise error for the third advection example.

In Table 1, we compare the relative test errors of the RBF kernel model, the Gaussian random feature model, and the Cauchy random feature model; the scaling parameter \(\gamma\) for each model and the number of features \(N\) for the random feature models are reported as well. We observe that our proposed random feature method is reliable in terms of accuracy and even outperforms the vanilla kernel method. Moreover, the training time is reduced significantly. As discussed in [19], since the advection initial conditions contain jumps, the prediction task poses challenges around the discontinuities for smooth models such as ours and the kernel models.

Table 1: Summary of numerical results for the advection equations: we report relative test errors for the proposed random feature methods and compare with the kernel method proposed in [19]. We also report the number of features \(N\) and the scaling parameter \(\gamma\) for each experiment.
Example | Model | \(N\) | \(\gamma\) | Relative test error
Advection (first example) | RBF Kernel | n/a | 0.5 | \(2.08\times 10^{-5}\) %
Advection (first example) | Random Feature (Gaussian) | 5000 | \(10^{-5}\) | \(4.70 \times 10^{-6}\) %
Advection (first example) | Random Feature (Cauchy) | 5000 | \(10^{-5}\) | \(1.26 \times 10^{-6}\) %
Advection (second example) | RBF Kernel | n/a | 0.5 | \(4.20\times 10^{-5}\) %
Advection (second example) | Random Feature (Gaussian) | 5000 | \(10^{-5}\) | \(6.26 \times 10^{-6}\) %
Advection (second example) | Random Feature (Cauchy) | 5000 | \(10^{-5}\) | \(2.16\times10^{-6}\) %
Advection (third example) | RBF Kernel | n/a | 0.5 | 0.17 %
Advection (third example) | Random Feature (Gaussian) | 5000 | 0.01 | 0.22 %
Advection (third example) | Random Feature (Cauchy) | 3000 | \(10^{-6}\) | 0.14 %

5.2 Burgers’ Equation↩︎

Consider the one-dimensional Burgers’ equation: \[\begin{align} \frac{\partial w}{\partial t} + w\frac{\partial w}{\partial x} = \nu \frac{\partial^2 w}{\partial x^2},& \quad && (x,t)\in(0,1)\times (0,1] \\ w(x,0) = u(x),& \quad && x \in (0,1). \end{align}\] We set the viscosity parameter \(\nu=0.1\). Our goal is to learn the operator \(G:u(x)\mapsto w(x,1)\), which maps the initial condition \(u(x) = w(x,0)\) to the solution \(w(x,t)\) at time \(t=1\). The data are generated by sampling the initial condition \(u\) from a Gaussian process \(GP(0, 625(-\Delta + 25I)^{-2})\), where \(\Delta\) and \(I\) represent the Laplacian and the identity, respectively. As in [19], [45], we use a spatial resolution with 128 grid points to represent the input and output functions, and use 1848 instances for training and 200 instances for testing. Figure 7 shows an example of a training input, a training output, a true test function and its approximation using the Cauchy random feature model, along with the pointwise error.
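For concreteness, the sketch below draws approximate samples from this Gaussian process via a truncated Karhunen-Loève expansion, assuming periodic boundary conditions on \((0,1)\); the exact discretization and boundary convention of the benchmark data may differ.

```python
import numpy as np

def sample_burgers_initial_condition(n_grid=128, n_modes=64, seed=None):
    """Draw one approximate sample u ~ GP(0, 625(-Laplacian + 25 I)^{-2}) on (0,1)
    via a truncated Karhunen-Loeve expansion, assuming periodic boundary conditions."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    k = np.arange(1, n_modes + 1)
    lam = 625.0 / ((2.0 * np.pi * k) ** 2 + 25.0) ** 2        # covariance eigenvalues
    xi, eta = rng.standard_normal(n_modes), rng.standard_normal(n_modes)
    u = np.sqrt(625.0 / 25.0 ** 2) * rng.standard_normal()    # constant (zero-frequency) mode
    u = u + np.sqrt(2.0) * (np.sqrt(lam) * xi) @ np.cos(2.0 * np.pi * np.outer(k, x))
    u = u + np.sqrt(2.0) * (np.sqrt(lam) * eta) @ np.sin(2.0 * np.pi * np.outer(k, x))
    return x, u
```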

Figure 7: Burgers’ equation: an example of (a) training input, (b) training output, (c) test function and its prediction using our random feature method, and (d) pointwise error. We randomly generate \(N=10^5\) i.i.d. samples from the tensor-product Cauchy distribution with scaling parameter \(\gamma=0.01\) as random features.

5.3 Darcy Problem↩︎

Consider the two-dimensional Darcy flow problem: \[\begin{align} -\mathop{\mathrm{div}}\left( e^u \nabla v\right) &= w, &&\text{in } D, \\ v &= 0, &&\text{on } \partial D, \end{align}\] with \(D=(0,1)^2\) and zero Dirichlet boundary condition. We are interested in learning the operator \(G:u\mapsto v\), where \(u\) is the log-permeability (so that \(e^u\) is the permeability field) and \(v\) is the solution. Hence, the domains of the functions \(u\) and \(v\) are the same, i.e., \(D_\mathcal{U}= D_\mathcal{V}= (0,1)^2\). The source term \(w\) is assumed to be fixed. Here, we consider a piecewise constant permeability field. Specifically, the coefficient \(u\) is sampled as \(u \sim \log \circ h (\mu)\), where \(\mu\) is sampled from the Gaussian process \(GP(0, (-\Delta+9I)^{-2})\) and \(h\) is a binary function mapping positive inputs to 12 and negative inputs to 3. Therefore, the permeability field \(e^u\) is piecewise constant; see an example of a permeability field (training input) and the corresponding solution (training output) in Figure 8. The grid resolution is \(29\times29\), and we use 1000 samples for training and 200 samples for testing. For our random feature model, we randomly generate \(N=10^4\) random features from the tensor-product Cauchy distribution with scaling parameter \(\gamma=2\times10^{-4}\). Figure 8 shows one example of a true test solution \(v\), its Cauchy random feature approximation, and the pointwise error.
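For clarity, the map from a Gaussian process draw to the Darcy input can be written in one line; mu_vals below is an assumed array of GP values on the grid, and the treatment of the measure-zero event \(\mu=0\) is an arbitrary convention.

```python
import numpy as np

def darcy_coefficient(mu_vals):
    """u = log(h(mu)), where h maps positive values to 12 and the rest to 3.
    The permeability field exp(u) is then piecewise constant with values 12 and 3."""
    return np.log(np.where(mu_vals > 0, 12.0, 3.0))
```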

Figure 8: An example of training input, training output, true test solution and its Cauchy random feature approximation, and the pointwise error for the Darcy problem in a rectangular domain with piecewise constant permeability field.

5.4 Helmholtz’s Equation↩︎

We consider the Helmholtz PDE \[\label{Helmholtz} \begin{align} & \left( - \Delta - \frac{\omega^2}{u^2(x)} \right)v = 0, \quad x\in(0,1)^2 \\ & \frac{\partial v}{\partial n} = 0, \quad x\in\{0,1\}\times[0,1]\cup[0,1]\times\{0\}, \quad\text{and}\quad \frac{\partial v}{\partial n} = v_N, \quad x\in[0,1]\times\{1\}, \end{align}\tag{20}\] where the frequency \(\omega\) and the wave speed field \(u: D_\mathcal{U}\to \mathbb{R}\) are assumed to be fixed and given. The excitation field \(v: D_\mathcal{V}\to \mathbb{R}\) solves equation 20. The domains of the functions \(u\) and \(v\) are \(D_\mathcal{U}= D_\mathcal{V}= (0,1)^2\) in this example. In the numerical experiments, we take \(\omega=10^3\) and \(v_N=1_{\{0.35\leq x\leq 0.65\}}\). Our goal is to learn the operator \(G:u\mapsto v\) from the wave speed field \(u\) to the excitation field \(v\). The wave speed field \(u\) takes the form \(u(x) = 20+\tanh(\tilde{u}(x))\), where \(\tilde{u}\) is drawn from the Gaussian process \(GP(0,(-\Delta+9I)^{-2})\). Here, \(\Delta\) denotes the Laplacian on \(D_\mathcal{U}\) subject to homogeneous Neumann boundary conditions on the space of spatial-mean-zero functions. As described in [19], [46], samples are generated by solving equation 20 using a finite element method on a \(26\times26\) discretization grid of the unit square. We use 1000 functions for training and 1000 functions for testing. In Figure 9, we show an example of a training input, a training output, a true test function, its Cauchy random feature approximation, and the corresponding pointwise error.

Figure 9: Helmholtz’s equation: an example of training input, training output, true test, predicted test using Cauchy random features, and pointwise error. We randomly generate \(N=5000\) i.i.d. samples from the tensor-product Cauchy distribution with scaling parameter \(\gamma = 10^{-4}\).

5.5 Structural Mechanics↩︎

The system that governs the displacement vector \(w\) in an elastic solid undergoing infinitesimal deformations is \[\begin{align} & \nabla\cdot \sigma = 0 \quad &&\text{in } D, \\ & w=\Bar{w} \quad &&\text{on } \Gamma_w, \\ & \sigma\cdot n = u \quad && \text{on } \Gamma_u, \end{align}\] where \(\sigma\) is the (Cauchy) stress tensor and \(n\) is the outward unit normal vector. The computational domain is denoted by \(D=(0,1)^2\), and its boundary \(\partial D\) is split into \(\Gamma_u = [0,1]\times \{1\}\) and its complement \(\Gamma_w\). The prescribed displacement \(\Bar{w}\) and the surface traction \(u\) are imposed on the boundary portions \(\Gamma_w\) and \(\Gamma_u\), respectively. We aim to learn the operator that maps the one-dimensional load \(u\) on \(\Gamma_u\) to the two-dimensional von Mises stress field \(v\) on \(D_\mathcal{V}=(0,1)^2\). The load \(u\) is drawn from the Gaussian process \(GP(100, 400^2(-\Delta+\tau^2\mathrm{\mathop{Id}})^{-d})\), with \(\Delta\) being the Laplacian subject to homogeneous Neumann boundary conditions on the space of spatial-mean-zero functions. The inverse length scale of the random field is taken to be \(\tau=3\), and \(d = 1\) determines its regularity (up to \(1/2\) of a fractional derivative for samples from this measure). The load \(u\) is interpolated on a grid with 41 points and extruded in the \(y\) direction, and the stress field is interpolated on a \(21\times 21\) grid via radial basis function interpolation. We use 1000 functions for training and 1000 functions for testing. Figure 10 shows an example of training data, a true test function and its approximation using Cauchy random features, along with the corresponding pointwise error.

Figure 10: Structural mechanics: an example of training data, true test function, predicted test using Cauchy random features, and pointwise error.

5.6 Navier-Stokes Equation↩︎

We consider the vorticity-stream function \((w,\psi)\) formulation of the incompressible Navier-Stokes equations: \[\begin{align} & \frac{\partial w}{\partial t} + (c\cdot \nabla)w - \nu\Delta w = u \\ & w=-\Delta\psi, \qquad \int_{D}\psi=0, \\ & c = \left(\frac{\partial\psi}{\partial x_2}, -\frac{\partial\psi}{\partial x_1}\right), \end{align}\] where the computational domain is \(D=[0,2\pi]^2\) and periodic boundary conditions are imposed. We are interested in learning the operator \(G:u\mapsto w(\cdot,T)\), where \(u\) is the forcing term and \(w(\cdot,T)\) is the vorticity field at a given time \(t=T\). The forcing term \(u\) is sampled from the Gaussian process \(GP(0, (-\Delta+\tau^2\mathrm{\mathop{Id}})^{-d})\). As in previous examples, \(\Delta\) denotes the Laplacian on \(D\) subject to periodic boundary conditions on the space of spatial-mean-zero functions, \(\tau=3\) denotes the inverse length scale of the random field, and \(d = 4\) determines its regularity; this choice of \(d\) leads to up to 3 fractional derivatives for samples from this measure. We fix the initial condition \(w(\cdot,0)\), which is generated from the same distribution. Given the constant viscosity \(\nu=0.025\), the equation is solved on a \(16\times16\) grid with a pseudo-spectral method and Crank-Nicolson time integration. The training dataset has 2000 samples and the test dataset has 2000 samples. Figure 11 shows an example of training data, a true test function, its approximation using Cauchy random features, and the corresponding pointwise error.

Figure 11: Example of training data, true test and predicted test using the Cauchy random feature method, and pointwise error for the Navier-Stokes equation.

5.7 Results↩︎

In Table 2, we report the relative test errors of the random feature methods and of the kernel methods. Even though random feature methods can be viewed as approximations of kernel methods, we observe that the random feature method achieves similar relative test errors, and even improves the accuracy over the kernel method in some tests. Both the random feature method and the kernel method are easy to implement in practice. However, the implementation of the kernel methods requires a high-performance numerical computing and machine learning Python library such as JAX, see [19]. Our proposed random feature method can be implemented using the scikit-learn library, and GPU computing is not required. Therefore, our proposed algorithm achieves similar accuracy even when computational resources are limited. As concluded in [19], the kernel method either matches the performance of neural network methods or outperforms them in several benchmarks; therefore, the random feature method can match or outperform neural network methods as well.
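To illustrate how lightweight the implementation can be, the following sketch fits a vanilla random feature regression from discretized input functions to discretized output functions with Cauchy-distributed frequencies and a single least-squares solve. It omits the encoder/decoder interpolation step of the full method, so it should be read as a minimal approximation rather than the exact pipeline; the default values of \(N\) and \(\gamma\) are placeholders.

```python
import numpy as np

def fit_random_feature_operator(U_train, V_train, N=5000, gamma=1e-5, seed=0):
    """Fit a random feature regression with Cauchy-distributed frequencies.
    U_train: (M, n) inputs on n collocation points; V_train: (M, m) outputs."""
    rng = np.random.default_rng(seed)
    omegas = gamma * rng.standard_cauchy((N, U_train.shape[1]))  # tensor-product Cauchy
    A = np.exp(1j * U_train @ omegas.T)                          # (M, N) random feature matrix
    C, *_ = np.linalg.lstsq(A, V_train, rcond=None)              # min-norm least-squares fit
    return omegas, C

def predict_operator(omegas, C, U_test):
    """Evaluate the fitted model on new discretized inputs (real part for real-valued outputs)."""
    return np.real(np.exp(1j * U_test @ omegas.T) @ C)
```

Equivalently, one can stack the real features \(\cos\) and \(\sin\) and replace the least-squares solve with, e.g., sklearn.linear_model.Ridge using a small regularization parameter; training then reduces to a single linear solve of size governed by \(M\) and \(N\).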

Table 2: Summary of relative test errors (%) of random feature methods and kernel methods.
Example | RBF Kernel | Matérn Kernel | RF(Cauchy) | RF(Gaussian)
Burgers’ | 3.76 | 2.03 | 3.82 | 2.70
Darcy | 4.93 | 4.47 | 3.08 | 3.74
Helmholtz | 5.05 | 3.76 | 6.66 | 3.63
Structural Mechanics | 10.31 | 7.73 | 7.67 | 8.71
Navier-Stokes | 1.09 | 0.91 | 2.54 | 1.06

We further compare the performance of the kernel method and the random feature method when each model is trained for a similar amount of time. Since the training time of the kernel method depends on the number of training samples, we subsample a smaller dataset to train the kernel methods. In the table below, we report the relative test errors and training times of the kernel methods and of the random feature methods. We observe that our proposed random feature method outperforms the kernel method across all benchmarks when the same amount of training time is allocated to each model.

Metric | RBF Kernel | Matérn Kernel | RF(Cauchy) | RF(Gaussian)
Relative error (%) | 6.87 | 5.33 | 3.85 | 4.10
Training time (seconds) | 3.9 | 4.8 | 3.1 | 3.2
Relative error (%) | 3.01 | 3.05 | 2.52 | 1.06
Training time (seconds) | 2.4 | 3.1 | 3.0 | 1.9

We also compare our proposed random feature method with DeepONet and FNO directly in terms of accuracy; see the summary of relative test errors in the table below, where we also report the training times of the random feature methods. The relative errors of DeepONet and FNO are cited from [19], [45], [46]. We observe that the random feature methods can be trained quickly even when the problem is complicated. For example, there are 20000 training samples in the Navier-Stokes problem and each training function is interpolated on a \(41\times 41\) grid. The random feature method matches the performance of DeepONet and FNO in the structural mechanics example, outperforms DeepONet in the Helmholtz example, and outperforms both DeepONet and FNO in the Navier-Stokes example.

Metric | DeepONet | FNO | RF(Cauchy) | RF(Gaussian)
Relative error (%) | 5.88 | 1.86 | 2.54 | 2.92
Training time (seconds) | - | - | 21.8 | 12.5
Relative error (%) | 5.20 | 4.76 | 6.08 | 6.25
Training time (seconds) | - | - | 4.2 | 5.8
Relative error (%) | 3.63 | 0.26 | 0.73 | 0.11
Training time (seconds) | - | - | 24.9 | 23.9

The choice of discretization is important in operator learning. A true operator learning method should exhibit invariance of the expected relative test error with respect to the mesh resolution used for training and testing. We empirically verify that our proposed method is stable to discretization changes, and hence is discretization-invariant; see also [47] for a discussion of discretization-invariant operator networks. We present the relative test errors at different resolutions for the Darcy problem and the Helmholtz problem in Tables 3 and 4, respectively. For each problem, we use 1000 functions for training and 1000 functions for testing.
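One way to see the resolution flexibility on the output side is that the decoder is a finite Fourier-feature expansion \(\sum_{k=1}^N c_k \exp(i\langle\boldsymbol{\omega}_k,\mathbf{y}\rangle)\) (cf. equation 18), which can be evaluated on any query grid once the coefficients are computed. A minimal sketch, assuming real-valued outputs are recovered by taking the real part:

```python
import numpy as np

def decode_on_grid(coeffs, omegas, Y):
    """Evaluate v(y) = sum_k c_k exp(i <omega_k, y>) at arbitrary query points.
    coeffs: (N,) complex coefficients, omegas: (N, d) frequencies, Y: (num_points, d)."""
    return np.real(np.exp(1j * Y @ omegas.T) @ coeffs)

# The same coefficients can be decoded on, e.g., a 26x26 grid or a 101x101 grid.
```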

Table 3: Darcy problem: relative test error at different resolutions.
Resolution | \(13\times13\) | \(26\times26\) | \(51\times51\) | \(101\times101\)
Relative Test Error (%) | 8.18 | 8.35 | 8.30 | 8.53
Table 4: Helmholtz problem: relative test error at different resolutions.
Resolution | \(29\times29\) | \(31\times31\) | \(36\times36\) | \(43\times43\)
Relative Test Error (%) | 4.06 | 3.95 | 3.88 | 3.98

In Figure 12, we empirically verify the error bound obtained in Theorem 5 on several benchmarks. As the number of random features \(N\) increases, we observe that the test errors decay at a rate of approximately \(1/\sqrt{N}\). We also observe that the growth rate of the training time is between \(\sqrt{N}\) and \(N\).
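The observed rate can be checked by a least-squares fit of the slope of \(\log(\text{error})\) against \(\log N\); the arrays below are assumed to hold the recorded feature counts and the corresponding averaged test errors.

```python
import numpy as np

def empirical_decay_rate(N_values, avg_errors):
    """Fit log(error) = slope * log(N) + const; a slope close to -0.5 is
    consistent with the O(1/sqrt(N)) bound of Theorem 5."""
    slope, _ = np.polyfit(np.log(np.asarray(N_values, dtype=float)),
                          np.log(np.asarray(avg_errors, dtype=float)), deg=1)
    return slope
```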

Figure 12: Empirical verification of (a) the decay rate of the generalization error and (b) the growth rate of the training time. We repeat each experiment for 10 trials and report the average.

6 Conclusion↩︎

We propose a random feature method for learning nonlinear Lipschitz operators related to PDEs. We provide a detailed error analysis for our proposed method with Cauchy random features. The theory suggests that the generalization error decays at a rate of \(1/\sqrt{N}\), where \(N\) is the number of random features. The results hold for both Gaussian and Cauchy distributions. Numerical experiments show that the random feature method not only enjoys an easy and fast implementation, but also matches or outperforms the kernel method and neural network methods in benchmark 2D PDE examples. In addition, the theoretical and computational analysis shows the benefits of randomization. The use of Cauchy features supports the application to PDEs. These methods may also be applicable to recent multi-operator learning approaches, for example, transformer based models [48]–[52] or DeepONet based approaches [53]. Future work could explore other heavy-tailed random features and noisy data.

Acknowledgments↩︎

HS was partially supported by NSF DMS 2331033. DN was partially supported by NSF DMS 2408912.

References↩︎

[1]
M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[2]
Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
[3]
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
[4]
Zong-Yi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. ArXiv, abs/2010.08895, 2020.
[5]
Zecheng Zhang, Wing Tat Leung, and Hayden Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.
[6]
Elizabeth Qian, Ionuţ-Gabriel Farcaş, and Karen Willcox. Reduced operator inference for nonlinear partial differential equations. SIAM Journal on Scientific Computing, 44(4):A1934–A1959, 2022.
[7]
Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed deeponets. Science Advances, 7(40):eabi8605, 2021.
[8]
Shunyuan Mao, Ruobing Dong, Lu Lu, Kwang Moo Yi, Sifan Wang, and Paris Perdikaris. Ppdonet: Deep operator networks for fast prediction of steady-state solutions in disk–planet systems. The Astrophysical Journal Letters, 950(2):L12, jun 2023.
[9]
Min Zhu, Handi Zhang, Anran Jiao, George Em Karniadakis, and Lu Lu. Reliable extrapolation of deep neural operators informed by physics or sparse observations. Computer Methods in Applied Mechanics and Engineering, 412:116064, 2023.
[10]
David J. Lucia, Philip S. Beran, and Walter A. Silva. Reduced-order modeling: new approaches for computational physics. Progress in Aerospace Sciences, 40(1):51–117, 2004.
[11]
Thorsten Kurth, Shashank Subramanian, Peter Harrington, Jaideep Pathak, Morteza Mardani, David Hall, Andrea Miele, Karthik Kashinath, and Anima Anandkumar. Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, PASC ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[12]
Yibo Yang, Georgios Kissas, and Paris Perdikaris. Scalable uncertainty quantification for deep operator networks using randomized priors. Computer Methods in Applied Mechanics and Engineering, 399:115399, 2022.
[13]
Beichuan Deng, Yeonjong Shin, Lu Lu, Zhongqiang Zhang, and George Em Karniadakis. Approximation rates of deeponets for learning operators arising from advection–diffusion equations. Neural Networks, 153:411–426, 2022.
[14]
Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error bounds for fourier neural operators. The Journal of Machine Learning Research, 22(1), jan 2021.
[15]
Ben Adcock, Simone Brugiapaglia, Nick Dexter, and Sebastian Morage. Deep neural networks are effective at learning high-dimensional hilbert-valued functions from limited data. In Joan Bruna, Jan Hesthaven, and Lenka Zdeborova, editors, Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 of Proceedings of Machine Learning Research, pages 1–36. PMLR, 16–19 Aug 2022.
[16]
Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, and Wenjing Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
[17]
Nicola Rares Franco and Simone Brugiapaglia. A practical existence theorem for reduced order models based on convolutional autoencoders. Foundations of Data Science, 7(1):72–98, 2025.
[18]
Hao Liu, Biraj Dahal, Rongjie Lai, and Wenjing Liao. Generalization error guaranteed auto-encoder-based nonlinear model reduction for operator learning. Applied and Computational Harmonic Analysis, 74:101717, 2025.
[19]
Pau Batlle, Matthieu Darcy, Bamdad Hosseini, and Houman Owhadi. Kernel methods are competitive for operator learning. Journal of Computational Physics, 496:112549, 2024.
[20]
Carlos Mora, Amin Yousefpour, Shirin Hosseinmardi, Houman Owhadi, and Ramin Bostanabad. Operator learning with gaussian processes, 2024.
[21]
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
[22]
Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.
[23]
Jingrun Chen, Xurong Chi, Weinan E, and Zhouwang Yang. Bridging traditional and machine learning-based algorithms for solving pdes: The random feature method. Journal of Machine Learning, 1(3):268–298, 2022.
[24]
Nicholas H. Nelsen and Andrew M. Stuart. The random feature model for input-output maps between banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
[25]
Nicholas H. Nelsen and Andrew M. Stuart. Operator learning using random features: A tool for scientific computing. SIAM Review, 66(3):535–571, 2024.
[26]
Samuel Lanthaler and Nicholas H. Nelsen. Error bounds for learning with vector-valued random features. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[27]
H.N. Mhaskar. Local approximation of operators. Applied and Computational Harmonic Analysis, 64:194–228, 2023.
[28]
Zhu Li, Jean-François Ton, Dino Oglic, and D. Sejdinovic. Towards a unified analysis of random fourier features. Journal of Machine Learning Research, 22:108:1–108:51, 2018.
[29]
Nicholas Richardson, Hayden Schaeffer, and Giang Tran. . Communications on Applied Mathematics and Computation, 6(2):879–906, 2024.
[30]
Abolfazl Hashemi, Hayden Schaeffer, Robert Shi, Ufuk Topcu, Giang Tran, and Rachel Ward. Generalization bounds for sparse random feature expansions. Applied and Computational Harmonic Analysis, 62:310–330, 2023.
[31]
Zhijun Chen and Hayden Schaeffer. . Information and Inference: A Journal of the IMA, 13(2):iaad054, 04 2024.
[32]
Yuxuan Liu, Scott G McCalla, and Hayden Schaeffer. . Proceedings of the Royal Society A, 479(2275):20220835, 2023.
[33]
Esha Saha, Hayden Schaeffer, and Giang Tran. . Sampling Theory, Signal Processing, and Data Analysis, 21(2):27, 2023.
[34]
Yuege Xie, Robert Shi, Hayden Schaeffer, and Rachel Ward. . In Mathematical and Scientific Machine Learning, pages 303–318. PMLR, 2022.
[35]
Zhijun Chen, Hayden Schaeffer, and Rachel Ward. . In Mathematical and Scientific Machine Learning, pages 287–302. PMLR, 2022.
[36]
Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 3218–3228, Red Hook, NY, USA, 2017. Curran Associates Inc.
[37]
Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3–84, 2022. Special Issue on Harmonic Analysis and Machine Learning.
[38]
Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.
[39]
Simon Foucart, Chunyang Liao, Shahin Shahrampour, and Yinsong Wang. Learning from non-random data in hilbert spaces: an optimal recovery perspective. Sampling Theory, Signal Processing, and Data Analysis, 20(5), 2022.
[40]
Francesca Bartolucci, Emmanuel de Bezenac, Bogdan Raonic, Roberto Molinaro, Siddhartha Mishra, and Rima Alaifari. Representation equivalent neural operators: a framework for alias-free operator learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[41]
Daniel Hsu, Clayton H Sanford, Rocco Servedio, and Emmanouil Vasileios Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random relus. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 2423–2461. PMLR, 15–19 Aug 2021.
[42]
Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128–7148, 2022.
[43]
Haim Brézis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011.
[44]
Daniel Potts and Laura Weidensager. Anova-boosting for random fourier features, 2024.
[45]
Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on fair data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022.
[46]
Maarten V. de Hoop, Daniel Zhengyu Huang, Elizabeth Qian, and Andrew M. Stuart. The cost-accuracy trade-off in operator learning with neural networks, 2022.
[47]
Zecheng Zhang, Wing Tat Leung, and Hayden Schaeffer. A discretization-invariant extension and analysis of some deep operator networks. Journal of Computational and Applied Mathematics, 456:116226, 2025.
[48]
Yuxuan Liu, Zecheng Zhang, and Hayden Schaeffer. PROSE: Predicting multiple operators and symbolic expressions using multimodal transformers. Neural Networks, 180:106707, 2024.
[49]
Jingmin Sun, Yuxuan Liu, Zecheng Zhang, and Hayden Schaeffer. . arXiv preprint arXiv:2404.12355, 2024.
[50]
Yuxuan Liu, Jingmin Sun, Xinjie He, Griffin Pinney, Zecheng Zhang, and Hayden Schaeffer. . arXiv preprint arXiv:2409.09811, 2024.
[51]
Yuxuan Liu, Jingmin Sun, and Hayden Schaeffer. . arXiv preprint arXiv:2501.18972, 2025.
[52]
Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, and Stanley Osher. Vicon: Vision in-context operator networks for multi-physics fluid dynamics prediction. arXiv preprint arXiv:2411.16063, 2024.
[53]
Zecheng Zhang, Christian Moya, Lu Lu, Guang Lin, and Hayden Schaeffer. . arXiv preprint arXiv:2411.07239, 2024.

  1. To simplify the notation, we use the same number of random features \(N\).↩︎

  2. If \(\mathbf{x}\in D\) belongs to more than one \(D_j\), we randomly assign it to one of them to make \(\{D_j\}_{j\in[m]}\) a partition.↩︎

  3. https://github.com/liaochunyang/RandomFeatureOperatorLearning↩︎