January 14, 2025
We introduce the Tensorized-and-Restricted Krylov (TReK) method, a simple and efficient algorithm for estimating covariance tensors when the number of observations is large. TReK extends the Krylov subspace method to incorporate range restrictions, enabling its use in a variety of covariance smoothing applications. By leveraging tensor-matrix operations, it achieves significant gains in both computational speed and memory cost, improving over existing methods by an order of magnitude. TReK ensures finite-step convergence in the absence of rounding errors and converges quickly in practice, making it well-suited for large-scale problems. The algorithm is element-free and highly flexible, supporting a wide range of forward and projection tensors.
This work introduces the Tensorized-and-Restricted Krylov (TReK) method for scalable covariance smoothing in Functional Data Analysis (FDA). Rather than propose a new estimation scheme, we develop a class of algorithms that remain computationally efficient at problem scales where conventional numerical approaches become impractical. Our algorithm can be used with both commonly used “pool-then-smooth” estimation schemes, namely spline-based and RKHS-based methods. Underlying our proposal is a careful analysis of the algebraic structure of the problem. Our contributions are threefold:
A unified block tensor-matrix formulation of least-squares problems in various pool-then-smooth covariance estimation methods,
An efficient implementation of Krylov subspace methods tailored to this formulation,
An adaptation of the Lanczos algorithm for Functional Principal Component Analysis (FPCA).
Our methodology is both theoretically grounded and practically versatile: it supports sparse or dense designs, regular or irregular sampling schemes, and – as mentioned – allows for estimation using either B-splines or reproducing kernels. In sparse settings, TReK yields near-instantaneous solutions (fast); in large-scale problems where existing methods struggle, it remains computationally tractable (cheap), as shown in 1. While our focus is on FDA, the core principles of TReK are more broadly applicable to problems where multivariate interactions and structural constraints must be preserved, such as in multi-dimensional ANOVA models [1], [2], digital antenna array design [3]–[5], space-time MIMO communication systems [6]–[8], and heterogeneous tomography [9], [10].
In FDA, the goal (or at least starting point) is to estimate the mean or covariance function of i.i.d. random functions, which are observed discretely with noise. Mean estimation using \(P\) basis functions from \(N\) observations leads to a standard matrix-vector least-squares problem: \[\min_{\mathbf{a}} \|\mathbf{F} \mathbf{a} - \mathbf{y} \|^{2} \quad \Longleftrightarrow \quad (\mathbf{F}^{\top} \mathbf{F}) \mathbf{a} = \mathbf{F}^{\top} \mathbf{y},\] where \(\mathbf{F} \in \mathbb{R}^{N \times P}\) is the forward matrix, \(\mathbf{y} \in \mathbb{R}^{N}\) is the observed data vector, and \(\mathbf{a} \in \mathbb{R}^{P}\) contains the coefficients to be estimated in the chosen basis. Crucially, the formulation should be invariant under reindexing or basis changes – a property that holds mathematically, but not computationally when using direct factorization-based solvers (e.g., SVD, Cholesky or LDLT), which could be sensitive to sparsity patterns. Even minor permutations can introduce significant fill-in and degrade numerical stability, thus necessitating pivoting [11]–[13].
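To fix ideas, the following minimal sketch (with a synthetic forward matrix and data, purely illustrative) solves the mean-estimation problem both through the normal equations and through a generic least-squares routine, confirming that the two routes agree on a well-conditioned toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 500, 20                       # toy sizes: N observations, P basis functions
F = rng.standard_normal((N, P))      # synthetic stand-in for the forward matrix F
a_true = rng.standard_normal(P)
y = F @ a_true + 0.1 * rng.standard_normal(N)

# Route 1: solve the normal equations (F^T F) a = F^T y directly.
a_normal = np.linalg.solve(F.T @ F, F.T @ y)

# Route 2: a generic least-squares solver for min ||F a - y||^2.
a_lstsq, *_ = np.linalg.lstsq(F, y, rcond=None)

print(np.allclose(a_normal, a_lstsq))   # True on this well-conditioned example
```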
Covariance estimation via pooling presents additional complexity. The target function has two arguments, so the resulting regression problem now becomes a tensor–matrix equation: \[\label{eq:intro:cov:tens} \min_{\mathbf{A}} \|\mathscr{F} \mathbf{A} - \mathbf{Y} \|^{2} \quad \Longleftrightarrow \quad (\mathscr{F}^{*} \mathscr{F}) \mathbf{A} = \mathscr{F}^{*} \mathbf{Y},\tag{1}\] where \(\mathscr{F}\) is a structured forward operator incorporating:
a block-wise tensor product (Khatri–Rao product) [14]–[16] of the mean-design matrix \(\mathbf{F}\) that encapsulates the second-order dependence;
an elimination operator [17] that discards diagonal entries of \(\mathbf{Y}\) contaminated by additive noise [18], [19].
A common strategy in the literature is to flatten the block tensor-matrix equation 1 via vectorization, reducing it to a matrix–vector problem: \[\label{eq:intro:cov:flat} \min_{\mathbf{a}_{\otimes}} \|\mathbf{F}_{\otimes} \mathbf{a}_{\otimes} - \mathbf{y}_{\otimes} \|^{2} \quad \Longleftrightarrow \quad (\mathbf{F}_{\otimes}^{\top} \mathbf{F}_{\otimes}) \mathbf{a}_{\otimes} = \mathbf{F}_{\otimes}^{\top} \mathbf{y}_{\otimes},\tag{2}\] after which direct solvers are applied [2], [20]–[29]. However, once the operator \(\mathscr{F}\) is explicitly represented as the matrix \(\mathbf{F}_{\otimes}\), its underlying algebraic structure is lost when handed off to generic direct solvers. More critically, this approach demands the explicit construction of either \(\mathbf{F}_{\otimes}\), its Gram matrix \(\mathbf{F}_{\otimes}^{\top} \mathbf{F}_{\otimes}\), or its pseudoinverse \((\mathbf{F}_{\otimes})^{\dagger}\), which can incur considerable computational and memory overhead [13], [30]–[34]. These challenges have popularized methods based on B-splines, where a tall-and-skinny sparse matrix \(\mathbf{F}_{\otimes}\) can be efficiently evaluated through the de Boor recursion [26], [35]–[37]. While efficient, such approaches lack adaptability to data-driven geometries, including irregular grids or manifold domains. By contrast, RKHS-based methods [9], [38]–[41] offer greater data-adaptability but rely on the dense semi-positive definite (s.p.d.) kernel Gram matrix \(\mathbf{F}_{\otimes}\), which scales poorly with increasing data size and thus requires aggressive truncation.
To address these limitations, we propose solving the system iteratively using Krylov subspace methods (e.g., CGLS, GMRES, MINRES, LSQR, or LSMR) [42]–[49], selected based on the properties of the forward matrix \(\mathbf{F}_{\otimes}\). These methods work by projecting the solution onto a growing sequence of low-dimensional Krylov subspaces, generated by only one matrix–vector product per iteration with \(\mathbf{F}_{\otimes}\) (and with \(\mathbf{F}_{\otimes}^{\top}\) in the case of asymmetry), unlike SVD-based methods [13], [50], [51]. Consequently, these methods are inherently element-free: they rely only on the action of \(\mathbf{F}_{\otimes}\) rather than its explicit components, and are thus well-suited for large-scale problems in FDA, as vectorization tricks can be applied to its action. Furthermore, unlike conventional approaches that rely on flattening the problem as in 2 , Krylov methods can operate directly on the native inner-product structure of 1 , provided this structure is preserved via an inner-product isomorphism. This raises a key question: what is the natural structure governing the action of \(\mathbf{F}\), and how can it be exploited to avoid redundant computation?
Given that data pairs within each random function reflect the symmetric nature of covariance functions, we model the data \(\mathbf{Y}\) as a symmetric block diagonal matrix. Instead of half-vectorization [23], we apply the operator directly to these symmetric matrix blocks, excluding noise-contaminated entries. This yields a block tensor–matrix equation that can be solved iteratively – avoiding the boilerplate code for vectorization and streamlining the FPCA pipeline through the Lanczos algorithm [50], [52]. Compared to existing approaches, our method naturally promotes sparsity, avoids ad hoc flattening schemes, and adheres closely to theoretical insights. Consequently, it is readily extensible to broader applications, including signal processing [3]–[8] and functional inverse problems [9], [10], [53].
Whether using the spline or RKHS estimation scheme, regularization is essential – both from a theoretical and practical standpoint – to prevent overfitting and ensure numerical stability [30]–[32]. While Tikhonov regularization is commonly employed in direct methods for its computational convenience, it becomes impractical in large-scale problems due to the costly process of tuning the penalty parameter [33]. In contrast, the Krylov methods we propose provide implicit regularization via early stopping (while still allowing additional penalty functionals to be incorporated, if desired). The iteration count itself serves as a surrogate regularization parameter, balancing data fidelity and numerical stability by controlling how closely the projected subproblem approximates the original ill-posed system [30]–[32].
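To illustrate both the element-free principle and early stopping, the sketch below (a dense synthetic stand-in for \(\mathbf{F}_{\otimes}\), not the implementation developed later) exposes only the forward and adjoint actions to SciPy's LSQR and caps the iteration count, which then plays the role of the regularization parameter.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(1)
N, P = 2000, 100                     # toy sizes
F = rng.standard_normal((N, P))      # only its *action* is exposed below
y = rng.standard_normal(N)

# Element-free interface: LSQR only ever calls matvec (F v) and rmatvec (F^T u).
A = LinearOperator((N, P),
                   matvec=lambda v: F @ v,
                   rmatvec=lambda u: F.T @ u)

# iter_lim caps the number of Krylov iterations; stopping early acts as
# implicit regularization, with the iteration count as the tuning parameter.
a_hat, istop, itn, *rest = lsqr(A, y, iter_lim=50)
print(a_hat.shape, istop, itn)       # coefficients, stopping flag, iterations used
```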
We refer to this conceptual class of structured, element-free approaches as the TReK method. By circumventing explicit construction of large-scale operators, TReK offers substantial gains in both speed and memory usage. This dramatic reduction highlights TReK’s potential to efficiently process large datasets, making it a scalable and economical alternative to conventional covariance smoothers.
Throughout the paper, bold uppercase symbols denote matrices, while bold lowercase symbols represent vectors. For an s.p.d. matrix \(\mathbf{P}\), we denote the weighted \(\mathbf{P}\)-inner product by \({\left\langle {\mathbf{c}},{\mathbf{d}}\right\rangle}_{\mathbf{P}} := \mathbf{c}^{\top} \mathbf{P} \mathbf{d}\), and the \(\mathbf{P}\)-norm by \(\|\mathbf{c}\|_{\mathbf{P}} := (\mathbf{c}^{\top} \mathbf{P} \mathbf{c})^{1/2}\), which is a semi-norm in case of rank-deficiency. Script-style symbols refer to linear operators (e.g., \(\mathscr{F}\)) acting on matrices, and we denote the corresponding Moore–Penrose pseudoinverse by \(\mathscr{F}^{\dagger}\), and the range by \(\mathcal{R}(\mathscr{F})\) [13], [53], [54]. Calligraphic-style symbols denote spaces or sets. Scalar quantities are written in standard (non-bold) font, with key dimensions as follows:
\(n\): the number of random functions,
\(r_{i}\): the number of measurement locations for each random function,
\(g\): the number of grid points to evaluate the mean function; the covariance function is evaluated over a \(g \times g\) grid.
Definition 1. Let \(\mathbb{H}\) be a Hilbert space of real-valued functions defined on a set \(\mathcal{T}\). A bivariate function \(K:\mathcal{T} \times \mathcal{T}\rightarrow \mathbb{R}\) is called a reproducing kernel for \(\mathbb{H}\) if
For any \(t \in \mathcal{T}\), the feature map \(\mathrm{k}_{t}(\cdot):= K(\cdot, t)\) at \(t\) belongs to \(\mathbb{H}\).
For any \(f \in \mathbb{H}\), the point evaluation at \(t \in \mathcal{T}\) is given by \(f(t)=\langle f, \mathrm{k}_{t} \rangle\).
A Hilbert space equipped with a reproducing kernel is called a Reproducing Kernel Hilbert Space (RKHS).
By the Moore–Aronszajn theorem [55], any reproducing kernel is an s.p.d. function. Conversely, any s.p.d. function induces a unique RKHS, justifying the notation \(\mathbb{H}=\mathbb{H}(K)\) [54], [56]. When the elementary tensor between \(f_{1}, f_{2} \in \mathbb{H}\) is understood as a bivariate function, the reproducing property yields \[\label{eq:tens:repro:prop} (f_{1} \otimes f_{2})(t_{1}, t_{2}) = {\left\langle {(f_{1} \otimes f_{2})},{(\mathrm{k}_{t_{1}} \otimes \mathrm{k}_{t_{2}})}\right\rangle} = {\left\langle {f_{1}},{\mathrm{k}_{t_{1}}}\right\rangle} {\left\langle {f_{2}},{\mathrm{k}_{t_{2}}}\right\rangle} = f_{1}(t_{1}) f_{2}(t_{2}).\tag{3}\]
Consider a sample of i.i.d. second-order random functions \(X_{1}, \cdots, X_{n}\) on \(\mathcal{T}\), where \(\mathbb{E}[X_{i}(t)^{2}] < \infty\) for \(i=1, \dots, n\) and \(t \in \mathcal{T}\), that are mean-square continuous and jointly measurable [54], [57]. Let \(\mu: \mathcal{T} \to \mathbb{R}\) denote the mean function, and \(\Gamma, \Sigma: \mathcal{T} \times \mathcal{T} \to \mathbb{R}\) the second moment and covariance functions, respectively: \[\label{eq:def:moments} \mu(t) = \mathbb{E}[X_{1}(t)], \quad \Gamma(t_{1}, t_{2}) = \mathbb{E}[X_{1}(t_{1}) X_{1}(t_{2})], \quad \Sigma(t_{1}, t_{2}) = \Gamma(t_{1}, t_{2}) - \mu(t_{1}) \mu(t_{2}), \quad t, t_{1}, t_{2} \in \mathcal{T}.\tag{4}\] In practice, these random functions are observed at (random) locations \(t_{ij} \in \mathcal{T}\), subject to i.i.d. noise \(\varepsilon_{ij} \in \mathbb{R}\) with variance \(0<\sigma^{2} <\infty\), \[y_{ij} = X_{i} (t_{ij}) + \varepsilon_{ij}, \quad 1 \le i \le n, 1 \le j \le r_{i},\] where \(X_{i}\), \(t_{ij}\), and \(\varepsilon_{ij}\) are mutually independent. To estimate the mean function \(\mu\), we minimize the following unregularized empirical risk functional: \[\label{eq:loss:1st:intro} \mathcal{L}(\mu) = \sum_{i=1}^{n} \sum_{j=1}^{r_{i}} (y_{ij} - \mu(t_{ij}))^{2},\tag{5}\] which pools all observed data, thus effectively treating the observations as samples from a single process.
For covariance estimation, the structure is more nuanced. While cross-subject pairs (\(i_1 \ne i_2\)) are uninformative due to independence, within-subject off-diagonal pairs \[\mathcal{O}_{i} := \{(j_{1}, j_{2}): 1 \le j_{1} \neq j_{2} \le r_{i}\}, \quad 1 \le i \le n,\] capture the second-order structure [18], [19]. This is because the conditional covariance of any two observations satisfies: \[\begin{align} \text{Cov}[y_{i_{1}j_{1}} , y_{i_{2}j_{2}} \vert t_{i_{1}j_{1}}, t_{i_{2}j_{2}} ] = \begin{cases} 0, &i_{1} \neq i_{2}, \\ \Sigma(t_{i_{1}j_{1}}, t_{i_{2}j_{2}}) + \sigma^{2} \delta_{j_{1}j_{2}}, &i_{1} = i_{2} \end{cases}, \end{align}\] which motivates the exclusion of diagonal products to estimate \(\Gamma\) and obtain a plug-in estimator for \(\Sigma\) as \(\hat{\Sigma} = \hat{\Gamma} - \hat{\mu} \otimes \hat{\mu}\) [39]. To reformulate this as a least-squares problem, we define the elimination operator with respect to \(\mathcal{O}_{i}\): \[\mathscr{O}_{i}: \mathbb{R}^{r_{i} \times r_{i}} \to \mathbb{R}^{r_{i} \times r_{i}}, \quad (\mathscr{O}_{i} \mathbf{C}_{i})[j_{1}, j_{2}] = \begin{cases} \mathbf{C}_{i}[j_{1}, j_{2}] &, \quad (j_{1}, j_{2}) \in \mathcal{O}_{i} \\ 0 &, \quad (j_{1}, j_{2}) \notin \mathcal{O}_{i} \end{cases}, \qquad 1 \le i \le n.\] Consequently, the associated unregularized loss becomes: \[\begin{align} \label{eq:loss:2nd:intro} \mathcal{L}_{\otimes}(\Gamma) = \sum_{i=1}^{n} \sum_{(j_{1}, j_{2}) \in \mathcal{O}_{i}} \left(y_{ij_{1}} y_{ij_{2}} - \Gamma(t_{ij_{1}}, t_{ij_{2}}) \right)^{2}. \end{align}\tag{6}\]
Remark 1. We remark that several works [21], [23], [29], [38], [58] have proposed directly estimating the covariance \(\Sigma\) using centered observations \(\tilde{y}_{ij}:= y_{ij} - \hat{\mu}(t_{ij})\), which yields the following empirical risk functional: \[\begin{align} \tilde{\mathcal{L}}_{\otimes}(\Sigma) &:= \sum_{i=1}^{n} \sum_{(j_{1}, j_{2}) \in \mathcal{O}_{i}} (\tilde{y}_{ij_{1}} \tilde{y}_{ij_{2}} - \Sigma(t_{ij_{1}}, t_{ij_{2}}) )^{2}, \end{align}\] when the penalty functional is disregarded, and the noise level \(\sigma^{2}\) is known or treated as a nuisance parameter in their formulations. Whether we estimate \(\Gamma\) first and subsequently compute \(\hat{\Sigma} = \hat{\Gamma} - \hat{\mu} \otimes \hat{\mu}\), or instead incorporate \(\hat{\mu}\) into the risk functional, the resulting linear equation remains effectively the same. Consequently, the numerical complexity remains unchanged, so we focus on the plug-in approach via 5 and 6 .
Technical derivations in this section are deferred to [sec:cov:est] and [sec:cplxty]. To enable a unified comparison across different estimation methods, we adopt a generic notation: \(\mathcal{D}\), \(\mathcal{S}\), and \(\mathcal{C}\) respectively denote the data space, the search space, and the coefficient space. Their dimensions are denoted by \(N = \dim(\mathcal{D})\) and \(P = \dim(\mathcal{S}) = \dim(\mathcal{C})\). When explicitly distinguishing between first- and second-order estimation problems, we use a subscript \(\otimes\) to indicate quantities associated with covariance estimation (e.g., \(\mathcal{D}_{\otimes}, N_{\otimes}\)), in contrast to their mean estimation counterparts (e.g., \(\mathcal{D}, N\)). The underlying estimation framework – such as dictionary-based or RKHS-based – is indicated via superscripts (e.g., \(\mathcal{C}_{\otimes}^{\mathop{\mathrm{dict}}}\) vs. \(\mathcal{C}_{\otimes}^{\mathop{\mathrm{RK}}}\)).
Let \(\mathbf{r} := [r_{1}, \dots, r_{n}] \in \mathbb{N}^{n}\) denote the sequence of the numbers of observed locations, with \(r_{i}\) corresponding to the \(i\)-th sample path, and denote \(|\mathbf{r}| := \sum_{i=1}^{n} r_{i}\), \(|\mathbf{r}^{2}| := \sum_{i=1}^{n} r_{i}^{2}\). By the Cauchy–Schwarz inequality, we have \(n |\mathbf{r}^{2}| \ge |\mathbf{r}|^{2}\), with equality if and only if \(r_{i} \equiv r\), so that \(|\mathbf{r}| = nr\) and \(|\mathbf{r}^{2}| = nr^{2}\). For notational consistency with the existing literature in this section, we adopt a flattened matrix-vector representation of the second-order equations as in 2 , although this is by no means how our iterative algorithm is implemented. Accordingly, the elimination operator is expressed in its matrix form \[\mathbf{O}^{\flat} := \mathop{\mathrm{diag}}[\mathbf{O}_{i}^{\flat}] = \begin{pmatrix} \mathbf{O}_{1}^{\flat} & \cdots & \mathbf{0} \\ \vdots & \ddots & \vdots \\ \mathbf{0} & \cdots & \mathbf{O}_{n}^{\flat} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|}, \quad \mathbf{O}_{i}^{\flat} \in \mathbb{R}^{r_{i}^{2} \times r_{i}^{2}} : \mathop{\mathrm{vec}}(\mathbf{C}_{i}) \mapsto \mathop{\mathrm{vec}}(\mathscr{O}_{i} \mathbf{C}_{i}),\] where the flat symbol \(\flat\) indicates that a matrix-level operation has been flattened to the vector level. First, we systematically organize the observations \(y_{ij}\) and \(y_{ij_{1}} y_{ij_{2}}\) in 5 and 6 as \[\label{eq:data:intro} \mathbf{y} = \begin{pmatrix} \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \vdots \\ \mathbf{y}_{n} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}|} =: \mathcal{D}, \quad \mathbf{y}_{\otimes} = \begin{pmatrix} \mathbf{y}_{1} \otimes_{\flat} \mathbf{y}_{1} \\ \mathbf{y}_{2} \otimes_{\flat} \mathbf{y}_{2} \\ \vdots \\ \mathbf{y}_{n} \otimes_{\flat} \mathbf{y}_{n} \end{pmatrix} = \begin{pmatrix} \mathop{\mathrm{vec}}[\mathbf{y}_{1} \mathbf{y}_{1}^{\top}] \\ \mathop{\mathrm{vec}}[\mathbf{y}_{2} \mathbf{y}_{2}^{\top}] \\ \vdots \\ \mathop{\mathrm{vec}}[\mathbf{y}_{n} \mathbf{y}_{n}^{\top}] \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}|} =: \mathcal{D}_{\otimes},\tag{7}\] where \(\otimes_{\flat}\) represents the Kronecker product as it flattens the bilinear form \(\otimes\), leading to \(N = |\mathbf{r}|\) and \(N_{\otimes} = |\mathbf{r}^{2}|\). Notably, the data space depends solely on the order of the moment being estimated, not on the estimation methods.
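The hypothetical snippet below illustrates how the data vectors in 7 are assembled from per-subject observation blocks \(\mathbf{y}_{i}\) of varying lengths \(r_{i}\); since each \(\mathbf{y}_{i} \mathbf{y}_{i}^{\top}\) is symmetric, the ordering convention behind \(\mathop{\mathrm{vec}}\) is immaterial here.

```python
import numpy as np

rng = np.random.default_rng(2)
r = [3, 5, 4]                                   # r_i: observations per sample path (toy)
ys = [rng.standard_normal(ri) for ri in r]      # blocks y_i

# First-order data vector y in R^{|r|}.
y = np.concatenate(ys)

# Second-order data vector y_otimes in R^{|r^2|}: stacked vec(y_i y_i^T).
y_otimes = np.concatenate([np.outer(yi, yi).ravel() for yi in ys])

print(y.shape, y_otimes.shape)                  # (|r|,) = (12,) and (|r^2|,) = (50,)
```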
Since \(\mu\) and \(\Gamma\) reside in infinite-dimensional function spaces, the loss functionals in 5 and 6 admit infinitely many minimizers without further structural assumptions. While a comprehensive review of the extensive literature is beyond our scope, a common remedy is to constrain the search space to a finite-dimensional subspace that reflects a desired level of smoothness. Under such a basis expansion, empirical risk minimization reduces to solving a least-squares problem in the corresponding coefficient space. Observe that solving 6 highlights the pool-then-smooth strategy, in contrast to the more traditional smooth-then-pool approach [59]–[61]. Although many alternative methods for covariance smoothing have been proposed – such as FACE, which relies on Nadaraya–Watson kernel regression [62]–[64], or CovNet, which leverages neural networks [65] – our focus is on two widely used approaches:
Dictionary Methods: The function space is predefined via a fixed basis, most often a spline basis. For the mean, we assume \[\mu \in \mathcal{S}^{\mathop{\mathrm{dict}}} = \mathop{\mathrm{span}}\{\phi_{l}: \mathcal{T} \to \mathbb{R}, 1 \le l \le p \}, \quad \mathcal{C}^{\mathop{\mathrm{dict}}} = \mathbb{R}^{p}, \quad P^{\mathop{\mathrm{dict}}} = p.\] For second-order estimation, the search space becomes the tensor product: \[\Gamma \in \mathcal{S}_{\otimes}^{\mathop{\mathrm{dict}}} = \mathop{\mathrm{span}}\{\phi_{l_{1}} \otimes \phi_{l_{2}}: \mathcal{T} \times \mathcal{T} \to \mathbb{R} : 1 \le l_{1}, l_{2} \le p\}, \quad \mathcal{C}_{\otimes}^{\mathop{\mathrm{dict}}} = \mathbb{R}^{p^{2}}, \quad P_{\otimes}^{\mathop{\mathrm{dict}}} = p^{2},\] where \((\phi \otimes \phi') (t, t') := \phi(t) \phi'(t')\) for \((\phi \otimes \phi') \in \mathcal{L}_2(\mathcal{T}) \otimes \mathcal{L}_2(\mathcal{T}) \cong \mathcal{L}_2(\mathcal{T} \times \mathcal{T})\). This bivariate spline framework underlies the majority of existing covariance smoothers [2], [20]–[24], [29]. While other choices of \(\phi_l\) exist – such as known eigenfunctions of the covariance operator [66], or truncated orthonormal bases in \(\mathcal{L}_2(\mathcal{T})\) [18], [19] – B-splines are most commonly chosen [25]–[27], [36], due to their compact support, which induces sparsity in the linear system.
RKHS Methods: Alternatively, smoothness can be encoded via an s.p.d. kernel \(K\), modeling each \(X_i\) as a random element in an RKHS \(\mathbb{H}(K)\) [9], [10], [38]–[41]. The representer theorem [67], [68] ensures that the minimizer of 5 lies in the span of feature maps at observed locations: \[\hat{\mu} \in \mathcal{S}^{\mathop{\mathrm{RK}}} = \mathop{\mathrm{span}}\{\mathrm{k}_{ij} := K(\cdot, t_{ij}): 1 \le i \le n, 1 \le j \le r_{i} \}, \quad \mathcal{C}^{\mathop{\mathrm{RK}}} = \mathcal{D} = \mathbb{R}^{|\mathbf{r}|}, \quad P= N = |\mathbf{r}|.\] Unlike dictionary methods, the basis functions \(\mathrm{k}_{ij} \in \mathbb{H}(K)\) are data-adaptive. However, this flexibility comes at the cost of non-parsimony – the number of basis functions equals the total sample size. For the same reason, the minimizer of 6 lies in the span of tensorized feature maps \(\{\mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}} : 1 \le i \le n, (j_{1}, j_{2}) \in \mathcal{O}_{i} \}\) between within-subject off-diagonal pairs. An equivalent but more convenient formulation is to consider the larger search space over a block square grid \[\hat{\Gamma} \in \mathcal{S}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathop{\mathrm{span}}\{\mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}} : 1 \le i \le n, 1 \le j_{1}, j_{2} \le r_{i} \}, \quad \mathcal{C}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathcal{D}_{\otimes} = \mathbb{R}^{|\mathbf{r}^{2}|}, \quad P_{\otimes}^{\mathop{\mathrm{RK}}} = N_{\otimes} = |\mathbf{r}^{2}|,\] and instead impose a constraint on the coefficient vector: \[\hat{\mathbf{a}}_{\otimes}^{\mathop{\mathrm{RK}}} = \begin{pmatrix} \hat{\mathbf{a}}_{1}^{\mathop{\mathrm{RK}}} \\ \vdots \\ \hat{\mathbf{a}}_{n}^{\mathop{\mathrm{RK}}} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}|}, \quad \hat{\mathbf{a}}_{i}^{\mathop{\mathrm{RK}}} \in \mathcal{R}(\mathbf{O}_{i}^{\flat}) \subset \mathbb{R}^{r_{i}^{2}}, \quad 1 \le i \le n.\]
We remark that, in both methods, the coefficient matrix \(\hat{\mathbf{A}}\) satisfying \(\hat{\mathbf{a}}_{\otimes} = \mathop{\mathrm{vec}}(\hat{\mathbf{A}}) \in \mathcal{C}_{\otimes}\) is symmetric, allowing for further dimensionality reduction via half-vectorization [21], [23], [29].
In each approach, minimizing the loss over the function space \(\mathcal{S}\) corresponds to solving a least-squares problem with the corresponding coefficient space \(\mathcal{C}\), which takes the form: \[\min_{\mathbf{a} \in \mathcal{C}} \left\lVert \mathbf{F} \mathbf{a} - \mathbf{y} \right\rVert_{\mathcal{D}}^{2},\] where \(\mathbf{F}:\mathcal{C} \to \mathcal{D}\) is a fixed forward matrix derived from the loss, \(\mathbf{y} \in \mathcal{D}\) is the observed data vector, and \(\mathbf{a} \in \mathcal{C}\) is the coefficient vector to be solved for. These components are summarized in 1 and 2, with the proof given in 6. Notably, unlike the data space \(\mathcal{D}\), the coefficient space \(\mathcal{C}\) and the corresponding forward matrix \(\mathbf{F}\) vary depending on the estimation method. We first review the first-order forward matrix:
In dictionary methods, the evaluation matrix is given by \[\mathbf{F}^{\mathop{\mathrm{dict}}} = \mathbf{\Phi} = \begin{pmatrix} \mathbf{\Phi}_{1} \\ \vdots \\ \mathbf{\Phi}_{n} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}| \times p}, \quad \mathbf{\Phi}_{i}[j, l] := \phi_{l}(t_{ij}), \quad 1 \le i \le n, \, 1 \le j \le r_{i}, \, 1 \le l \le p,\] representing evaluations of basis functions at observed locations. Note that this evaluation matrix is sparse for B-splines and can be obtained by the de Boor recursion [35]. Since the function space \(\mathcal{S}^{\mathop{\mathrm{dict}}}\) is predefined, the block structure of \(\mathbf{F}^{\mathop{\mathrm{dict}}}\) is partitioned only row-wise.
In contrast, the RKHS method defines the first-order forward matrix \[\mathbf{F}^{\mathop{\mathrm{RK}}} = \mathbf{K} = \begin{pmatrix} \mathbf{K}_{11} & \cdots & \mathbf{K}_{1n} \\ \vdots & \ddots & \vdots \\ \mathbf{K}_{n1} & \cdots & \mathbf{K}_{nn} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}| \times |\mathbf{r}|}, \quad \mathbf{K}_{i_{1}i_{2}}[j_{1}, j_{2}] := K(t_{i_{1}j_{1}}, t_{i_{2}j_{2}}), \quad \quad 1 \le i_{1}, i_{2} \le n,\] called the kernel Gram matrix, representing inner products between feature maps. Here, the block partitioning is both row- and column-wise, reflecting the fact that the function space \(\mathcal{S}^{\mathop{\mathrm{RK}}} = \mathcal{D}\) is selected a posteriori and tailored to the data. An illustrative construction of both first-order forward matrices is sketched after the table below.
Method | Dictionary-based | RKHS-based |
---|---|---|
Data (\(\mathbf{y} \in \mathcal{D}\)) | \(\mathbf{y} \in \mathbb{R}^{|\mathbf{r}|}\) | \(\mathbf{y} \in \mathbb{R}^{|\mathbf{r}|}\) |
Coefficients (\(\mathbf{a} \in \mathcal{C}\)) | \(\mathbf{a}^{\mathop{\mathrm{dict}}} \in \mathbb{R}^{p}\) | \(\mathbf{a}^{\mathop{\mathrm{RK}}} \in \mathbb{R}^{|\mathbf{r}|}\) |
Forward Matrix (\(\mathbf{F}\)) | \(\mathbf{\Phi} \in \mathbb{R}^{|\mathbf{r}| \times p}\) | \(\mathbf{K} \in \mathbb{R}^{|\mathbf{r}| \times |\mathbf{r}|}\) |
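To make the two first-order forward matrices concrete, here is a hedged sketch (toy locations; cubic B-splines and a Gaussian kernel are illustrative choices, not prescriptions) assembling a dictionary evaluation matrix \(\mathbf{\Phi}\) and a kernel Gram matrix \(\mathbf{K}\) from the same pooled observation locations.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
r = [4, 6, 5]                                      # toy r_i
t = [np.sort(rng.uniform(0, 1, ri)) for ri in r]   # observation locations t_ij
t_all = np.concatenate(t)                          # pooled locations, length |r|

# Dictionary forward matrix: Phi[j, l] = phi_l(t_j) for p cubic B-splines on [0, 1].
degree, p = 3, 10
interior = np.linspace(0, 1, p - degree + 1)
knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])
Phi = np.column_stack([BSpline(knots, np.eye(p)[l], degree)(t_all) for l in range(p)])

# RKHS forward matrix: Gram matrix K[j1, j2] = K(t_{j1}, t_{j2}) for a Gaussian kernel.
bw = 0.2
K = np.exp(-(t_all[:, None] - t_all[None, :]) ** 2 / (2 * bw ** 2))

print(Phi.shape, K.shape)                          # (|r|, p) and (|r|, |r|)
```

The compact support of the B-splines makes \(\mathbf{\Phi}\) sparse, whereas \(\mathbf{K}\) is dense, mirroring the trade-off discussed above.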
Now, we proceed to the second-order forward matrix. Given a basis for the search space \(\mathcal{S}_{\otimes}\), the first-order forward matrix acts on the coefficient matrix \(\mathbf{A}\) from both the left and the right to evaluate the term \(\Gamma(t_{ij_{1}}, t_{ij_{2}})\) in the loss functional 6 . When \(\mathbf{A}\) is vectorized, this bilinear form can be represented via the Kronecker product \(\otimes_{\flat}\). However, the Kronecker product is performed in a block-wise manner, commonly referred to as the Khatri-Rao product [14]–[16], denoted by \(\odot_{\flat}\). Subsequently, the orthogonal projection \(\mathbf{O}^{\flat}\) is applied to restrict evaluation to the index pairs \(\mathcal{O}_{i}\) in 6 . These observations yield the following flattened structure for the second-order forward matrix:
In dictionary methods, the second-order forward matrix is given by \[\mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}} = \mathbf{O}^{\flat} (\mathbf{\Phi} \odot_{\flat} \mathbf{\Phi}) = \begin{pmatrix} \mathbf{O}_{1}^{\flat} (\mathbf{\Phi}_{1} \otimes_{\flat} \mathbf{\Phi}_{1}) \\ \vdots \\ \mathbf{O}_{n}^{\flat} (\mathbf{\Phi}_{n} \otimes_{\flat} \mathbf{\Phi}_{n}) \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}| \times p^{2}}.\] Here, there is no partitioning in the column direction. In this case, the row-wise Khatri-Rao product is also referred to as the face-splitting product [3]–[8], which appears naturally when the search space is predefined.
In RKHS methods, the second-order forward matrix applies tensorization in both the row and column directions: \[\begin{align} \mathbf{F}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathbf{O}^{\flat} (\mathbf{K} \odot_{\flat} \mathbf{K}) \mathbf{O}^{\flat} = \begin{pmatrix} \mathbf{O}_{1}^{\flat} (\mathbf{K}_{11} \otimes_{\flat} \mathbf{K}_{11}) \mathbf{O}_{1}^{\flat} & \dots & \mathbf{O}_{1}^{\flat} (\mathbf{K}_{1n} \otimes_{\flat} \mathbf{K}_{1n}) \mathbf{O}_{n}^{\flat} \\ \vdots & \ddots & \vdots \\ \mathbf{O}_{n}^{\flat} (\mathbf{K}_{n1} \otimes_{\flat} \mathbf{K}_{n1}) \mathbf{O}_{1}^{\flat} & \dots & \mathbf{O}_{n}^{\flat} (\mathbf{K}_{nn} \otimes_{\flat} \mathbf{K}_{nn}) \mathbf{O}_{n}^{\flat} \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|}, \end{align}\] which arises naturally when the search space is data-adaptive. Here, we attach \(\mathbf{O}^{\flat}\) on both sides so that the sandwich form \(\mathbf{F}_{\otimes}^{\mathop{\mathrm{RK}}}\) is s.p.d., which facilitates numerical development. Additionally, this projection increases the sparsity of the system matrix, thereby improving computational efficiency, although the gain is negligible.
Method | Dictionary-based | RKHS-based |
---|---|---|
Data (\(\mathbf{y}_{\otimes} \in \mathcal{D}_{\otimes}\)) | \(\mathbf{y}_{\otimes} \in \mathbb{R}^{|\mathbf{r}^{2}|}\) | \(\mathbf{y}_{\otimes} \in \mathbb{R}^{|\mathbf{r}^{2}|}\) |
Coefficients (\(\mathbf{a}_{\otimes} \in \mathcal{C}_{\otimes}\)) | \(\mathbf{a}_{\otimes}^{\mathop{\mathrm{dict}}} \in \mathbb{R}^{p^{2}}\) | \(\mathbf{a}_{\otimes}^{\mathop{\mathrm{RK}}} \in \mathbb{R}^{|\mathbf{r}^{2}|}\) |
Forward Matrix (\(\mathbf{F}_{\otimes}\)) | \(\mathbf{O}^{\flat} (\mathbf{\Phi} \odot_{\flat} \mathbf{\Phi}) \in \mathbb{R}^{|\mathbf{r}^{2}| \times p^{2}}\) | \(\mathbf{O}^{\flat} (\mathbf{K} \odot_{\flat} \mathbf{K}) \mathbf{O}^{\flat} \in \mathbb{R}^{|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|}\) |
Complexity of Element-free Action | \(O(|\mathbf{r}| p^{2} + |\mathbf{r}^{2}|p)\) | \(O(|\mathbf{r}| |\mathbf{r}^{2}|)\) |
Complexity of Explicit Evaluation | \(O(|\mathbf{r}^{2}| p^{2})\) | \(O(|\mathbf{r}^{2}|^{2})\) |
To avoid overfitting, it is common to introduce an s.p.d. penalty matrix \(\mathbf{P}:\mathcal{C} \to \mathcal{C}\), selected based on the desired level of smoothness in estimation:
In dictionary methods, suppose we aim to penalize total curvature via the squared \(L^2\) norm of the second derivative for mean estimation, i.e., \(\|\mu\|_{\mathbf{P}}^{2} = \int_{\mathcal{T}} |\mu^{(2)}(t)|^{2} dt\). This leads to a penalty matrix: \[\mathbf{P}^{\mathop{\mathrm{dict}}} \in \mathbb{R}^{p \times p}, \quad \mathbf{P}^{\mathop{\mathrm{dict}}}[l_{1}, l_{2}] := \int_{\mathcal{T}} \phi_{l_{1}}^{(2)}(t) \phi_{l_{2}}^{(2)}(t) dt, \quad 1 \le l_{1}, l_{2} \le p.\] The term P-spline is often abused to refer to the penalized B-spline [24], [26], where this integral is discretized in practice to yield a penalty of the form \[\mathbf{P}^{\mathop{\mathrm{dict}}} = (\mathbf{D}^{\mathop{\mathrm{dict}}})^{\top} \mathbf{D}^{\mathop{\mathrm{dict}}}, \quad \mathbf{D}^{\mathop{\mathrm{dict}}} \in \mathbb{R}^{(p-2) \times p},\] with \(\mathbf{D}^{\mathop{\mathrm{dict}}}\) being a sparse matrix encoding second-order finite differences of the coefficients. Higher-order differencing schemes are also possible but are beyond the scope of this paper.
For second-order estimation, we consider penalizing smoothness using the Laplacian \(\Delta = \partial_{1}^{2} + \partial_{2}^{2}\): \[\|\Gamma\|_{\mathbf{P}_{\otimes}}^{2} = \frac{1}{2} \int_{\mathcal{T} \times \mathcal{T}} |\partial_{1}^{2} \Gamma(t_{1}, t_{2})|^{2} + |\partial_{2}^{2} \Gamma(t_{1}, t_{2})|^{2} \, dt_{1} dt_{2}.\] Unlike the thin-plate spline penalty involving \((\partial_{1} + \partial_{2})^{2}\) [69], the absence of a cross-term allows a clean Kronecker-structured discretization: \[\begin{align} \mathbf{P}_{\otimes}^{\mathop{\mathrm{dict}}} &= \frac{(\mathbf{P}^{\mathop{\mathrm{dict}}} \otimes_{\flat} \mathbf{I} + \mathbf{I} \otimes_{\flat} \mathbf{P}^{\mathop{\mathrm{dict}}})}{2} = \frac{(\mathbf{D}_{\otimes}^{\mathop{\mathrm{dict}}})^{\top} \mathbf{D}_{\otimes}^{\mathop{\mathrm{dict}}}}{2} \in \mathbb{R}^{p^{2} \times p^{2}}, \end{align}\] where \[\begin{align} \mathbf{D}_{\otimes}^{\mathop{\mathrm{dict}}} = \begin{pmatrix} \mathbf{D}^{\mathop{\mathrm{dict}}} \otimes_{\flat} \mathbf{I} \\ \mathbf{I} \otimes_{\flat} \mathbf{D}^{\mathop{\mathrm{dict}}} \end{pmatrix} \in \mathbb{R}^{2(p-2)p \times p^2}. \end{align}\]
In RKHS methods, the natural regularization is induced by the RKHS norm. For mean estimation, this is \(\|\mu\|_{\mathbf{P}}^{2} = \|\mu\|_{\mathbb{H}(K)}^{2}\), leading to \(\mathbf{P}^{\mathop{\mathrm{RK}}} = \mathbf{F}^{\mathop{\mathrm{RK}}} = \mathbf{K}\). For second-order estimation, the penalty becomes the squared tensor-product norm \(\|\Gamma\|_{\mathbf{P}_{\otimes}}^{2}= \|\Gamma\|_{\mathbb{H}(K) \otimes \mathbb{H}(K)}^{2}\), resulting in \(\mathbf{P}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathbf{F}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathbf{O}^{\flat} (\mathbf{K} \odot_{\flat} \mathbf{K}) \mathbf{O}^{\flat}\).
With a penalty parameter \(\eta \ge 0\), where \(\eta = 0\) corresponds to the unregularized problem, the regularized least-squares problem becomes: \[\min_{\mathbf{a} \in \mathcal{C}} \left\lVert \mathbf{F} \mathbf{a} - \mathbf{y} \right\rVert_{\mathcal{D}}^{2} + \eta {\left\langle {\mathbf{a}},{\mathbf{P} \mathbf{a}}\right\rangle}_{\mathcal{C}},\] which yields the normal equations: \[\label{eq:normal:gen:intro} \underbrace{(\mathbf{F}^{*} \mathbf{F} + \eta \mathbf{P})}_{=:\mathbf{S}(\eta)} \hat{\mathbf{a}}(\eta) = \mathbf{F}^{*} \mathbf{y} \, \in \mathcal{C}.\tag{8}\] We call \(\mathbf{S}(\eta): \mathcal{C} \to \mathcal{C}\) the regularized Gram matrix, and \(\mathbf{F}^{\dagger}(\eta) := [\mathbf{S}(\eta)]^{\dagger} \mathbf{F}^{*} : \mathcal{D} \to \mathcal{C}\) the regularized inverse. For the unregularized case, we omit \(\eta=0\) since \(\mathbf{F}^{\dagger} = (\mathbf{F}^{*} \mathbf{F})^{\dagger} \mathbf{F}^{*} = \mathbf{S}^{\dagger} \mathbf{F}^{*}\).
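In this generic notation, the regularized Gram matrix \(\mathbf{S}(\eta)\) in 8 need never be formed explicitly: the sketch below (toy \(\mathbf{F}\), a difference-based penalty, and an arbitrary \(\eta\), all illustrative) passes only its action to a conjugate-gradient solver.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
N, p, eta = 800, 60, 1e-2            # toy sizes and penalty parameter
F = rng.standard_normal((N, p))
D = np.diff(np.eye(p), n=2, axis=0)  # second-order difference matrix (toy penalty root)
P = D.T @ D                          # penalty matrix
y = rng.standard_normal(N)

# Action of the regularized Gram matrix S(eta) = F^T F + eta P, without forming F^T F.
S = LinearOperator((p, p), matvec=lambda a: F.T @ (F @ a) + eta * (P @ a))
rhs = F.T @ y

a_hat, info = cg(S, rhs)             # info == 0 signals convergence
print(info, np.linalg.norm(S.matvec(a_hat) - rhs))
```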
The primary bottleneck in the pool-then-smooth framework is the computational cost of the second-order problem, which far exceeds that of the first-order problem. The second-order sample size \(N_{\otimes} = |\mathbf{r}^{2}|\) scales linearly with \(n\) and quadratically with \(r_{i}\), leading to acute numerical challenges when \(r_{i}\) is large – a common occurrence in complex scenarios such as spatio-temporal data or higher-dimensional settings, where capturing second-order dependencies requires large \(r_{i}\) [65], [70], [71]. Consequently, when \(N_{\otimes}\) becomes prohibitively large, various strategies have been proposed to alleviate computational costs by enforcing sparsity – trading off estimation fidelity for tractability:
In dictionary methods, the second-order normal equation for 8 becomes: \[\label{eq:normal:dict:intro} \underbrace{[(\mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}})^{\top} \mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}} + \eta (\mathbf{D}_{\otimes}^{\mathop{\mathrm{dict}}})^{\top} \mathbf{D}_{\otimes}^{\mathop{\mathrm{dict}}}]}_{=\mathbf{S}_{\otimes}^{\mathop{\mathrm{dict}}}(\eta) \in \mathbb{R}^{p^{2} \times p^{2}}} \hat{\mathbf{a}}_{\otimes}^{\mathop{\mathrm{dict}}}(\eta) = (\mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}})^{\top} \mathbf{y}_{\otimes} \in \mathbb{R}^{p^{2}}.\tag{9}\] To explicitly store the regularized Gram operator \(\mathbf{S}_{\otimes}^{\mathop{\mathrm{dict}}}(\eta)\) or the regularized inverse \([\mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}}]^{\dagger}(\eta)\) – referred to as sandwich smoothing – the number of basis functions \(p\) is chosen to be small in advance [21], [23], [28], [29], [58]. In such cases, \(\mathbf{F}_{\otimes}^{\mathop{\mathrm{dict}}} \in \mathbb{R}^{|\mathbf{r}^{2}| \times p^{2}}\) becomes a tall-and-skinny sparse matrix. This approach implicitly assumes low-rank separability in the covariance function, often justified by a truncated Mercer decomposition [72], [73].
RKHS-based methods take a different route due to their data-adaptive nature: \(\mathcal{C}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathcal{D}_{\otimes} = \mathbb{R}^{|\mathbf{r}^{2}|}\). However, as the forward and penalty matrices coincide, the second-order normal equation reduces to \[\label{eq:normal:rkhs:intro} (\mathbf{F}_{\otimes}^{\mathop{\mathrm{RK}}} + \eta \mathbf{I}) \hat{\mathbf{a}}_{\otimes}^{\mathop{\mathrm{RK}}}(\eta) = \mathbf{O}^{\flat} \mathbf{y}_{\otimes} \in \mathbb{R}^{|\mathbf{r}^{2}|}.\tag{10}\] As direct low-rank sparsity with small \(p\) is infeasible, [39] instead enforces sparsity directly in the second-order forward matrix \(\mathbf{F}_{\otimes}^{\mathop{\mathrm{RK}}} = \mathbf{O}^{\flat} (\mathbf{K} \odot_{\flat} \mathbf{K}) \mathbf{O}^{\flat} \in \mathbb{R}^{|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|}\) via aggressive component truncation of the kernel Gram matrix \(\mathbf{K} \in \mathbb{R}^{|\mathbf{r}| \times |\mathbf{r}|}\). Otherwise, explicitly storing a \(|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|\) matrix during preprocessing would require approximately \(300\) [GB] of memory when \(n=20\) and \(r_{i} \equiv 100\). Theoretically, such truncation corresponds to kernel compression via Schur complements, which defines a reproducing kernel for a subspace of the original RKHS \(\mathbb{H}(K)\) [56]. Low-rank approximation approaches, such as the Nyström method [74] or incomplete matrix factorizations [75], [76], can also be applied.
While these sparsity-inducing strategies are often effective in practice, they are primarily motivated by computational necessity rather than structural belief. In many applications, the assumption of low-rank separability often fails [70], [77], [78], and aggressive component truncation – effectively reducing the kernel to a highly localized version – contradicts the core virtue of smoothing. Although numerical considerations may make dictionary methods always appear appealing – particularly when \(p = o(|\mathbf{r}|)\) – they are not suitable in settings where the underlying random functions are not directly observed, as in functional inverse problems [9], [10].
We conclude this section by highlighting the key limitations of existing covariance smoothers. First, as shown in 2, evaluating \(\mathbf{F}_{\otimes}\) explicitly can be computationally expensive. Moreover, due to the partitioned nature of the Khatri–Rao product, standard factorizations of the first-order forward matrix \(\mathbf{F}\) – such as eigen, Cholesky, or LDLT decompositions – do not carry over to \(\mathbf{F}_{\otimes}\) [16]. While incomplete block factorizations of \(\mathbf{F}_{\otimes}\) can be applied, they typically rely on iterative methods anyway. More importantly, such preprocessing disrupts the structural relationship between \(\mathbf{F}_{\otimes}\) and \(\mathbf{F}\), leading to inefficiencies.
Second, efficient Kronecker product operations are implemented at the matrix level, hence flattening the problem to vector form introduces unnecessary obfuscation. This is especially problematic when evaluating \(\mathbf{F}_{\otimes}\) explicitly, as the involvement of the elimination matrix \(\mathbf{O}_{i}^{\flat}\) depends heavily on the specific observation pattern and indexing scheme for vectorization. While observational schemes \(\mathcal{O}_{i}\) may vary in different setups, it is usually straightforward to adjust for this variation, and thus the action of \(\mathscr{O}_{i}\) on matrix inputs is simple to adapt. However, its explicit matrix representation depends intricately on the specific reordering of the double array \((j_1, j_2) \notin \mathcal{O}_{i}\), making general implementation more cumbersome. Additionally, while exploiting symmetry (e.g., through half-vectorization) can reduce the computational cost of the linear system by a constant factor, the most substantial performance improvements – by an order of magnitude – stem from avoiding explicit evaluation altogether and instead leveraging efficient block tensor-matrix operations, see 6.
In the following section, we demonstrate how the forward actions can be efficiently implemented directly at the matrix level, which is sufficient for using Krylov subspace methods. Moreover, once the linear system is solved, the estimated surface of \(\hat{\Gamma}\) or \(\hat{\Sigma}\) is typically evaluated over a square grid, and FPCA is conducted [52]. These post-processing steps represent the ultimate goals of the estimation procedure. Crucially, the relevant information for these tasks is encoded in the coefficient matrix itself – not in its vectorized form – further reinforcing the advantages of a block tensor-matrix formulation.
Given matrices \(\mathbf{A} \in \mathbb{R}^{a_{1} \times a_{2}}\) and \(\mathbf{B} \in \mathbb{R}^{b_{1} \times b_{2}}\), their outer product \(\mathbf{A} \otimes \mathbf{B}\) defines the linear operator: \[(\mathbf{A} \otimes \mathbf{B}) : \mathbb{R}^{b_{2} \times a_{2}} \to \mathbb{R}^{b_{1} \times a_{1}}, \quad \mathbf{C} \mapsto \mathbf{B} \mathbf{C} \mathbf{A}^{\top},\] which lifts the Kronecker product to the matrix-level. When \(a_{2} = b_{2} = 1\), this bilinear form entails \(\mathbf{a} \otimes \mathbf{b} = \mathbf{b} \mathbf{a}^{\top} \in \mathbb{R}^{b_{1} \times a_{1}}\) for vectors \(\mathbf{a} \in \mathbb{R}^{a_{1}}\) and \(\mathbf{b} \in \mathbb{R}^{b_{1}}\). The following identities are trivial: \[(\mathbf{A} \otimes \mathbf{B})^{*} = \mathbf{A}^{\top} \otimes \mathbf{B}^{\top}, \quad (\mathbf{A} \otimes \mathbf{B})^{\dagger} = \mathbf{A}^{\dagger} \otimes \mathbf{B}^{\dagger}, \quad (\mathbf{A}_{1} \otimes \mathbf{B}_{1})(\mathbf{A}_{2} \otimes \mathbf{B}_{2}) = (\mathbf{A}_{1} \mathbf{A}_{2} \otimes \mathbf{B}_{1} \mathbf{B}_{2}).\]
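A quick numerical check (toy matrices) of the identity behind this lifting: applying \(\mathbf{A} \otimes \mathbf{B}\) at the matrix level as \(\mathbf{B} \mathbf{C} \mathbf{A}^{\top}\) agrees with the flattened Kronecker product acting on the column-stacked \(\mathop{\mathrm{vec}}(\mathbf{C})\).

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))    # a1 x a2
B = rng.standard_normal((5, 2))    # b1 x b2
C = rng.standard_normal((2, 4))    # b2 x a2

lifted = B @ C @ A.T                               # matrix-level action, b1 x a1
flattened = np.kron(A, B) @ C.ravel(order="F")     # Kronecker product acting on vec(C)
print(np.allclose(lifted.ravel(order="F"), flattened))   # True: vec(B C A^T) = (A ⊗ B) vec(C)
```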
Recall that \(\mathbf{r} = [r_{1}, \dots, r_{n}] \in \mathbb{N}^{n}\) denotes the numbers of observed locations per sample path. To highlight the block structure of the data, we define the following direct sums: \[\mathbb{R}^{\mathbf{r}} := \bigoplus_{i=1}^{n} \mathbb{R}^{r_{i}}, \quad \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) := \bigoplus_{i=1}^{n} \mathbb{R}^{r_{i} \times r_{i}}.\] As each space \(\mathbb{R}^{r_{i}}\) and \(\mathbb{R}^{r_{i} \times r_{i}}\) carries a natural Euclidean and Frobenius inner product, their direct sums inherit the natural inner products: \[\label{eq:innpr:direct} {\left\langle {\mathbf{c}},{\mathbf{d}}\right\rangle}_{\mathbb{R}^{\mathbf{r}}} = \sum_{i=1}^{n} \mathbf{c}_{i}^{\top} \mathbf{d}_{i}, \quad {\left\langle {\mathbf{C}},{\mathbf{D}}\right\rangle}_{\mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})} = \sum_{i=1}^{n} \mathop{\mathrm{tr}}(\mathbf{C}_{i}^{\top} \mathbf{D}_{i}).\tag{11}\] Under canonical inner-product isomorphisms, the vectorization of \(\mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})\) becomes: \[\mathbf{C} = \mathop{\mathrm{diag}}[\mathbf{C}_{i}] = \left( \begin{array}{c|c|c|c} \mathbf{C}_{1} & \mathbf{0} & \dots & \mathbf{0} \\ \hline \mathbf{0} & \mathbf{C}_{2} & \dots & \mathbf{0} \\ \hline \vdots & \vdots & \ddots & \vdots \\ \hline \mathbf{0} & \mathbf{0} & \dots & \mathbf{C}_{n} \end{array} \right) \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) \quad \cong \quad \begin{pmatrix} \mathop{\mathrm{vec}}(\mathbf{C}_{1}) \\ \mathop{\mathrm{vec}}(\mathbf{C}_{2}) \\ \vdots \\ \mathop{\mathrm{vec}}(\mathbf{C}_{n}) \end{pmatrix} \in \mathbb{R}^{|\mathbf{r}^{2}|}.\] The data layout of the direct sum can be thought of as a (1D) vector of square (2D) matrices of varying sizes. Due to this heterogeneity in dimensions (1D vs. 2D), existing approaches flatten these matrix blocks into (1D) vectors to fit conventional matrix–vector formulations. In contrast, we show that considering this structure as a block-diagonal (2D) matrix delivers both conceptual and practical advantages.
To this end, we define the evaluation matrix in dictionary-based methods as a block matrix: \[\mathbf{\Phi} = \left( \begin{array}{c} \mathbf{\Phi}_{1} \\ \hline \vdots \\ \hline \mathbf{\Phi}_{n} \end{array} \right) \in \mathbb{R}^{\mathbf{r} \times p}: \quad \mathbf{c} \in \mathbb{R}^{p} \mapsto \left( \begin{array}{c} \mathbf{\Phi}_{1} \mathbf{c} \\ \hline \vdots \\ \hline \mathbf{\Phi}_{n} \mathbf{c} \end{array} \right) \in \mathbb{R}^{\mathbf{r}}.\] Using the inner product in 11 , the adjoint of \(\mathbf{\Phi}\) is given by: \[\mathbf{\Phi}^{*} = \left( \begin{array}{c|c|c} \mathbf{\Phi}_{1}^{\top} & \cdots & \mathbf{\Phi}_{n}^{\top} \end{array} \right) \in \mathbb{R}^{p \times \mathbf{r}}: \quad \left( \begin{array}{c} \mathbf{c}_{1} \\ \hline \vdots \\ \hline \mathbf{c}_{n} \end{array} \right) \in \mathbb{R}^{\mathbf{r}} \mapsto \sum_{i=1}^{n} \mathbf{\Phi}_{i}^{\top} \mathbf{c}_{i} \in \mathbb{R}^{p}.\] Similarly, the kernel Gram matrix in RKHS-based methods is expressed with block structure as: \[\begin{align} \mathbf{K} = \left( \begin{array}{c} \mathbf{K}_{1\cdot} \\ \hline \vdots \\ \hline \mathbf{K}_{n\cdot} \end{array} \right) = \left( \begin{array}{c|c|c} \mathbf{K}_{11} & \cdots & \mathbf{K}_{1n} \\ \hline \vdots & \ddots & \vdots \\ \hline \mathbf{K}_{n1} & \cdots & \mathbf{K}_{nn} \end{array} \right) \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}: \quad \mathbf{c} = \left( \begin{array}{c} \mathbf{c}_{1} \\ \hline \vdots \\ \hline \mathbf{c}_{n} \end{array} \right) \in \mathbb{R}^{\mathbf{r}} \mapsto \left( \begin{array}{c} \mathbf{K}_{1\cdot} \mathbf{c} \\ \hline \vdots \\ \hline \mathbf{K}_{n\cdot} \mathbf{c} \end{array} \right) \in \mathbb{R}^{\mathbf{r}}. \end{align}\]
We now define the Khatri–Rao product at the matrix level. First, the row-wise outer product is given by \[\mathbf{\Phi} \odot \mathbf{\Phi} : \quad \mathbf{C} \in \mathbb{R}^{p \times p} \mapsto \mathop{\mathrm{diag}}[\mathbf{\Phi}_{i} \mathbf{C} \mathbf{\Phi}_{i}^{\top}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}).\] In the special case \(p = 1\), this corresponds to the observation vector \(\mathbf{y}_{\otimes}\) in 7 upon vectorization: \[\mathbf{y} = \left( \begin{array}{c} \mathbf{y}_{1} \\ \hline \vdots \\ \hline \mathbf{y}_{n} \end{array} \right) \in \mathbb{R}^{\mathbf{r}} \quad \Longrightarrow \quad \mathbf{y} \odot \mathbf{y} = \mathop{\mathrm{diag}}[\mathbf{y}_{i} \mathbf{y}_{i}^{\top}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}).\] The adjoint yields the column-wise outer product: \[(\mathbf{\Phi} \odot \mathbf{\Phi})^{*} = \mathbf{\Phi}^{*} \odot \mathbf{\Phi}^{*} : \quad \mathop{\mathrm{diag}}[\mathbf{C}_{i}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) \mapsto \sum_{i=1}^{n} \mathbf{\Phi}_{i}^{\top} \mathbf{C}_{i} \mathbf{\Phi}_{i} \in \mathbb{R}^{p \times p}.\] Finally, the block-wise outer product of the kernel Gram matrix \(\mathbf{K} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\) is defined by: \[\mathbf{K} \odot \mathbf{K} : \quad \mathbf{C} = \mathop{\mathrm{diag}}[\mathbf{C}_{i}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) \mapsto \mathop{\mathrm{diag}}[\mathbf{K}_{i \cdot} \mathbf{C} \mathbf{K}_{\cdot i}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}),\] where \(\mathbf{K}_{i \cdot} \in \mathbb{R}^{r_{i} \times \mathbf{r}}\) and \(\mathbf{K}_{\cdot i} := \mathbf{K}_{i \cdot}^{*} \in \mathbb{R}^{\mathbf{r} \times r_{i}}\) for \(1 \le i \le n\). Since \(\mathbf{K} \succeq \mathbf{0}\), it is straightforward that \((\mathbf{K}_{i\cdot} \otimes \mathbf{K}_{i\cdot}) \succeq \mathbf{0}\) for each \(1 \le i \le n\), leading to \(\mathbf{K} \odot \mathbf{K} \succeq \mathbf{0}\). As the row- and column-wise outer products for dictionary methods are even more straightforward to implement, we present the pseudocode of the block-wise outer product for RKHS methods in 2.
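Without reproducing the referenced pseudocode, the following is a minimal sketch (list-of-blocks data layout, toy sizes, hypothetical naming) of the three matrix-level actions just defined: the row-wise product \(\mathbf{\Phi} \odot \mathbf{\Phi}\), its adjoint, and the block-wise product \(\mathbf{K} \odot \mathbf{K}\).

```python
import numpy as np

rng = np.random.default_rng(7)
r, p = [3, 5, 4], 6
Phi = [rng.standard_normal((ri, p)) for ri in r]            # blocks Phi_i
M = rng.standard_normal((sum(r), sum(r)))
K = M @ M.T                                                 # s.p.d. Gram matrix stand-in
off = np.cumsum([0] + r)                                    # block boundaries in the pooled index

def phi_row_outer(C):
    """(Phi ⊙ Phi) C = diag[Phi_i C Phi_i^T] for C in R^{p x p}."""
    return [Pi @ C @ Pi.T for Pi in Phi]

def phi_row_outer_adj(blocks):
    """(Phi^* ⊙ Phi^*) diag[C_i] = sum_i Phi_i^T C_i Phi_i."""
    return sum(Pi.T @ Ci @ Pi for Pi, Ci in zip(Phi, blocks))

def k_block_outer(blocks):
    """(K ⊙ K) diag[C_i] = diag[K_{i.} diag[C_i] K_{.i}], block by block."""
    out = []
    for i in range(len(r)):
        acc = np.zeros((r[i], r[i]))
        for j in range(len(r)):
            Kij = K[off[i]:off[i + 1], off[j]:off[j + 1]]    # block K_{ij}
            acc += Kij @ blocks[j] @ Kij.T                   # uses K_{ji} = K_{ij}^T
        out.append(acc)
    return out

C = rng.standard_normal((p, p))
blocks = [rng.standard_normal((ri, ri)) for ri in r]
print([b.shape for b in phi_row_outer(C)], phi_row_outer_adj(blocks).shape)
print([b.shape for b in k_block_outer(blocks)])
```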
The elimination operator acts block-wise as follows, see 3 for pseudocode: \[\mathscr{O} = \mathop{\mathrm{diag}}[\mathscr{O}_{i}] : \quad \mathop{\mathrm{diag}}[\mathbf{C}_{i}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) \mapsto \mathop{\mathrm{diag}}[\mathscr{O}_{i} \mathbf{C}_{i}] \in \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}),\] which is an orthogonal projection operator. Note that, if the input is a symmetric (partitioned) matrix, then the output of any of the operators – \(\mathbf{\Phi} \odot \mathbf{\Phi}\), \(\mathbf{\Phi}^{*} \odot \mathbf{\Phi}^{*}\), \(\mathbf{K} \odot \mathbf{K}\), or \(\mathscr{O}\) – also preserves symmetry. Consequently, one can exploit symmetry to reduce memory usage by storing only the upper (or lower) triangular portion of the block diagonal matrices. While this reduction is theoretically appealing, it does not constitute a major computational bottleneck and does not necessarily yield the expected speedup, see 6.
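A correspondingly small sketch of the elimination operator, which simply zeroes the diagonal of each square block (an illustrative stand-in for the referenced pseudocode):

```python
import numpy as np

def eliminate(blocks):
    """Apply O = diag[O_i]: zero the diagonal of each square block C_i."""
    out = []
    for Ci in blocks:
        Oi = Ci.copy()
        np.fill_diagonal(Oi, 0.0)    # discard the noise-contaminated diagonal entries
        out.append(Oi)
    return out

blocks = [np.arange(9.0).reshape(3, 3), np.ones((2, 2))]
print(eliminate(blocks))             # applying it twice gives the same result (projection)
```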
The Tikhonov loss for the mean function \(\mu\) is defined as \[\label{eq:loss:1st} \mathcal{L}^{\eta}(\mu) = \sum_{i=1}^{n} \sum_{j=1}^{r_{i}} (y_{ij} - \mu(t_{ij}))^{2} + \eta \|\mu\|_{\mathbf{P}}^{2}.\tag{12}\]
When \(\mu \in \mathcal{S}\) is represented by a coefficient vector \(\mathbf{a}\), we also write \(\mathcal{L}^{\eta}(\mathbf{a})\). The minimizers are denoted by \(\hat{\mu}(\eta)\) and \(\hat{\mathbf{a}}(\eta)\), where we omit the dependency on \(\eta\) when \(\eta = 0\) (unregularized case). We denote by \(\mathcal{G} := \{t_{m} : 1 \le m \le g\} \subset \mathcal{T}\) the grid over which \(\hat{\mu}(\eta): \mathcal{T} \to \mathbb{R}\) is evaluated. Similarly, the Tikhonov loss for the second-moment function \(\Gamma\) is given by \[\begin{align} \label{eq:loss:2nd} \mathcal{L}_{\otimes}^{\eta}(\Gamma) = \sum_{i=1}^{n} \sum_{(j_{1}, j_{2}) \in \mathcal{O}_{i}} \left(y_{ij_{1}} y_{ij_{2}} - \Gamma(t_{ij_{1}}, t_{ij_{2}}) \right)^{2} + \eta \|\Gamma\|_{\mathscr{P}}^{2}, \end{align}\tag{13}\] If \(\Gamma \in \mathcal{S}_{\otimes}\) is represented by a coefficient matrix \(\mathbf{A}\), we equivalently write \(\mathcal{L}_{\otimes}^{\eta}(\mathbf{A})\). The corresponding minimizers are denoted by \(\hat{\Gamma}(\eta)\) and \(\hat{\mathbf{A}}(\eta)\), again omitting \(\eta\) when zero. The estimator \(\hat{\Gamma}(\eta): \mathcal{T} \times \mathcal{T} \to \mathbb{R}\) is evaluated over the square grid \(\mathcal{G} \times \mathcal{G}\).
From this point onward, in the context of block tensor-matrix formulations for second-order estimation, we adopt the following script-style notation: the identity operator is denoted by \(\mathscr{I}\), the forward operator by \(\mathscr{F}\), and the penalty operator by \(\mathscr{P}\) where \(\left\lVert \Gamma \right\rVert_{\mathscr{P}}^{2} = \left\lVert \mathbf{A} \right\rVert_{\mathscr{P}}^{2} = {\left\langle {\mathbf{A}},{\mathscr{P}\mathbf{A}}\right\rangle}\). In analogy to 8 , we denote the regularized Gram operator by \(\mathscr{S}(\eta) = \mathscr{F}^{*} \mathscr{F} + \eta \mathscr{P}\), and the regularized inverse by \(\mathscr{F}^{\dagger}(\eta) = [\mathscr{S}(\eta)]^{\dagger} \mathscr{F}^{*}\). For the unregularized case, we omit the dependence on \(\eta=0\).
Dictionary Methods
Define the frame matrix \(\mathbf{E} \in \mathbb{R}^{g \times p}\) with respect to the evaluation grid \(\mathcal{G}\) by \(\mathbf{E}[m, l]= \phi_{l}(t_{m})\) [79]. Then, the evaluation of \[\mu = \sum_{l=1}^{p} a_{l} \phi_{l}, \quad \mathbf{a} = [a_{1}, \cdots, a_{p}]^{\top} \in \mathbb{R}^{p},\] is given by \(\boldsymbol{\mu} := [\mu(t_{m})]_{1 \le m \le g} = \mathbf{E} \mathbf{a} \in \mathbb{R}^{g}\). Also, the loss functional in 12 becomes \[\begin{align} \mathcal{L}^{\eta}(\mathbf{a}) = \sum_{i=1}^{n} \left\lVert \mathbf{y}_{i} - \mathbf{\Phi}_{i} \mathbf{a} \right\rVert^{2} + \eta \left\lVert \mathbf{a} \right\rVert_{\mathbf{P}}^{2} = \left\lVert \mathbf{y} - \mathbf{\Phi} \mathbf{a} \right\rVert^{2} + \eta \left\lVert \mathbf{a} \right\rVert_{\mathbf{P}}^{2}, \end{align}\] which leads to the normal equation: \[(\mathbf{\Phi}^{*} \mathbf{\Phi} +\eta \mathbf{P}) \hat{\mathbf{a}}(\eta) = \mathbf{\Phi}^{*} \mathbf{y} \quad \Longleftrightarrow \quad \left( \sum_{i=1}^{n} \mathbf{\Phi}_{i}^{*} \mathbf{\Phi}_{i} +\eta \mathbf{P} \right) \hat{\mathbf{a}}(\eta) = \sum_{i=1}^{n} \mathbf{\Phi}_{i}^{*} \mathbf{y}_{i}.\] In the rank-deficient case, the solution of minimum norm is given by \(\hat{\mathbf{a}}(\eta) = \mathbf{\Phi}^{\dagger}(\eta) \mathbf{y} \in \mathbb{R}^{p}\), where \(\mathbf{\Phi}^{\dagger}(\eta)\) is the regularized inverse defined in 8 .
For second-moment estimation, we consider \[\Gamma = \sum_{l_{1}, l_{2} = 1}^{p} a_{l_{1} l_{2}} \phi_{l_{1}} \otimes \phi_{l_{2}}, \quad \mathbf{A} \in \mathbb{R}^{p \times p},\] where its evaluation over the square grid \(\mathcal{G} \times \mathcal{G}\) is given by \[\boldsymbol{\Gamma} := [\Gamma(t_{m_{1}}, t_{m_{2}})]_{1 \le m_{1}, m_{2} \le g} = (\mathbf{E} \otimes \mathbf{E}) \mathbf{A} = \mathbf{E} \mathbf{A} \mathbf{E}^{\top} \in \mathbb{R}^{g \times g}.\] It readily follows that the evaluation of the covariance \(\Sigma\) over the square grid \(\mathcal{G} \times \mathcal{G}\) is given by \[\boldsymbol{\Sigma} := [\Sigma(t_{m_{1}}, t_{m_{2}})]_{1 \le m_{1}, m_{2} \le g} = (\mathbf{E} \otimes \mathbf{E}) (\mathbf{A} - \mathbf{a} \otimes \mathbf{a}) = \mathbf{E} (\mathbf{A} - \mathbf{a} \mathbf{a}^{\top}) \mathbf{E}^{\top} \in \mathbb{R}^{g \times g}.\]
Theorem 2. Denote the identity operator by \(\mathscr{I} : \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}) \to \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})\), the forward operator by \(\mathscr{F} := \mathscr{O} (\mathbf{\Phi} \odot \mathbf{\Phi}) : \mathbb{R}^{p \times p} \to \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})\) and the s.p.d. penalty operator by \(\mathscr{P}: \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}\). Then, the loss functional in 13 becomes \[\begin{align} \mathcal{L}_{\otimes}^{\eta}(\mathbf{A}) = \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{F} \mathbf{A} \right\rVert^{2} + \eta \left\lVert \mathbf{A} \right\rVert_{\mathscr{P}}^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2}. \end{align}\] The last term does not depend on \(\mathbf{A}\), and the normal equation is given by \[\begin{align} \label{eq:dict:2nd:normal} (\mathscr{F}^{*} \mathscr{F} + \eta \mathscr{P}) \hat{\mathbf{A}}(\eta) = \mathscr{F}^{*} (\mathbf{y} \odot \mathbf{y}). \end{align}\qquad{(1)}\]
Again, in the rank-deficient case, the solution of minimum Frobenius norm is given by \(\hat{\mathbf{A}}(\eta) = \mathscr{F}^{\dagger}(\eta) (\mathbf{y} \odot \mathbf{y}) \in \mathbb{R}^{p \times p}\), which is symmetric due to the convexity of \(\mathcal{L}_{\otimes}^{\eta}\) for any \(\eta \ge 0\).
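Putting the dictionary pieces together, the hedged sketch below solves the normal equation (1) with a hand-rolled conjugate-gradient loop operating directly on \(p \times p\) coefficient matrices under the Frobenius inner product (toy data; the penalty is taken to be the identity purely for brevity).

```python
import numpy as np

rng = np.random.default_rng(8)
r, p, eta = [6, 8, 7], 5, 1e-3
Phi = [rng.standard_normal((ri, p)) for ri in r]
ys = [rng.standard_normal(ri) for ri in r]

def forward(A):
    """F A = O (Phi ⊙ Phi) A: block-diagonal output with zeroed diagonals."""
    out = []
    for Pi in Phi:
        Bi = Pi @ A @ Pi.T
        np.fill_diagonal(Bi, 0.0)
        out.append(Bi)
    return out

def adjoint(blocks):
    """F^* diag[C_i] = sum_i Phi_i^T (O_i C_i) Phi_i."""
    acc = np.zeros((p, p))
    for Pi, Ci in zip(Phi, blocks):
        Ci = Ci.copy()
        np.fill_diagonal(Ci, 0.0)
        acc += Pi.T @ Ci @ Pi
    return acc

def gram(A):
    """S(eta) A = F^* F A + eta A (identity penalty, for brevity)."""
    return adjoint(forward(A)) + eta * A

rhs = adjoint([np.outer(yi, yi) for yi in ys])       # F^* (y ⊙ y)

# Conjugate gradients in the Frobenius inner product on R^{p x p}.
A_hat = np.zeros((p, p))
R = rhs - gram(A_hat)
D = R.copy()
rho = rho0 = np.sum(R * R)
for _ in range(200):
    Q = gram(D)
    alpha = rho / np.sum(D * Q)
    A_hat += alpha * D
    R -= alpha * Q
    rho_new = np.sum(R * R)
    if np.sqrt(rho_new) < 1e-10 * np.sqrt(rho0):
        break
    D = R + (rho_new / rho) * D
    rho = rho_new

print(np.linalg.norm(gram(A_hat) - rhs) / np.sqrt(rho0))   # relative residual
```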
RKHS Methods
In the RKHS-based approach [9], [10], [38]–[40], [80], we consider the \(n\) i.i.d. copies \(\{X_{i}: i=1, \dots, n \}\) to be second-order random elements in \(\mathbb{H}\). The mean \(\mu\), second moment tensor \(\Gamma\), and the covariance tensor \(\Sigma\) of \(X_{1}\) are defined by the Bochner integral [54]: \[\begin{align} \label{eq:tens61ftns} \mu := \mathbb{E} X_{1} \in \mathbb{H}, \quad \Gamma := \mathbb{E} [X_{1} \otimes X_{1}] \in \mathbb{H} \otimes \mathbb{H}, \quad \Sigma := \Gamma - \mu \otimes \mu \in \mathbb{H} \otimes \mathbb{H}, \end{align}\tag{14}\] which agree with the functions in 4 due to the reproducing property 3 .
The representer theorem [67], [68] asserts that the search space becomes \(\mathcal{S} = \mathop{\mathrm{span}}\{\mathrm{k}_{ij} : 1 \le i \le n, 1 \le j \le r_{i} \}\). Unlike dictionary-based methods, the representer theorem yields a data-driven basis a posteriori, tailored to the observed input locations. In essence, any finite-dimensional inner product space is an RKHS [56], so the RKHS methods can be seen as spline methods, where all the observed locations effectively serve as knots. In this regard, we consider the coefficient vector to be a direct sum: \[\mu = \sum_{i=1}^{n} \sum_{j=1}^{r_{i}} a_{ij} \mathrm{k}_{ij} \in \mathcal{S}, \quad \mathbf{a} = \left( \begin{array}{c} \mathbf{a}_{1} \\ \hline \vdots \\ \hline \mathbf{a}_{n} \end{array} \right) \in \mathbb{R}^{\mathbf{r}},\] which leads to the normal equation \((\mathbf{K} + \eta \mathbf{I}) \hat{\mathbf{a}}(\eta) = \mathbf{y}\). Note that the solution is unique when \(\eta > 0\) due to the strict convexity of \(\mathcal{L}^{\eta}\) in 12 , regardless of rank-deficiency of the kernel Gram matrix \(\mathbf{K} \succeq \mathbf{0}\). In the case where \(\eta = 0\), we can impose the uniqueness by choosing \(\hat{\mu}\) of minimum norm, which is given by \(\hat{\mathbf{a}} = \mathbf{K}^{\dagger} \mathbf{y}\) [56].
Define the frame block matrix \(\mathbf{E} = ( \mathbf{E}_{1} \vert \cdots \vert \mathbf{E}_{n} ) \in \mathbb{R}^{g \times \mathbf{r}}\), where the \(i\)-th block \(\mathbf{E}_{i} \in \mathbb{R}^{g \times r_{i}}\) is given by \(\mathbf{E}_{i}[m, j]= K(t_{m}, t_{ij})\). Then, the evaluation of \(\mu \in \mathbb{H}\) over \(\mathcal{G}\) is given by \[\boldsymbol{\mu} = \mathbf{E} \mathbf{a} = \sum_{i=1}^{n} \mathbf{E}_{i} \mathbf{a}_{i} \in \mathbb{R}^{g}.\]
Similarly, the search space for second-moment estimation guided by the representer theorem becomes \(\mathop{\mathrm{span}}\{\mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}} : 1 \le i \le n, (j_{1}, j_{2}) \in \mathcal{O}_{i} \}\). However, it is rather convenient to impose a restriction on the coefficients: \[\Gamma = \sum_{i=1}^{n} \sum_{j_{1}, j_{2} =1}^{r_{i}} a_{ij_{1}j_{2}} \mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}}, \quad \mathbf{A} = \mathop{\mathrm{diag}}[\mathbf{A}_{i}] \in \mathcal{R}(\mathscr{O}^{\oplus}) \subset \mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}}).\] Consequently, the evaluation of \(\Gamma\) and \(\Sigma \in \mathbb{H} \otimes \mathbb{H}\) over the square grid \(\mathcal{G} \times \mathcal{G}\) are given by the column-wise outer product: \[\begin{align} &\boldsymbol{\Gamma} = \mathbf{E} \mathbf{A} \mathbf{E}^{*} = \sum_{i=1}^{n} \mathbf{E}_{i} \mathbf{A}_{i} \mathbf{E}_{i}^{\top} \in \mathbb{R}^{g \times g}, \quad \boldsymbol{\Sigma} = \mathbf{E} (\mathop{\mathrm{diag}}[\mathbf{A}_{i}] - \mathbf{a} \mathbf{a}^{*}) \mathbf{E}^{*} \in \mathbb{R}^{g \times g}. \end{align}\]
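Similarly, a sketch of the grid evaluation of \(\hat{\Gamma}\) and \(\hat{\Sigma}\) from the blocks \(\mathbf{E}_{i}\), \(\mathbf{A}_{i}\), and \(\hat{\mathbf{a}}_{i}\) (illustrative code, not the package implementation):

```julia
# Sketch of the grid evaluation Γ̂ = Σ_i E_i A_i E_iᵀ and Σ̂ = Γ̂ − (E â)(E â)ᵀ,
# never materializing the full block matrices.
function evaluate_cov(E::Vector{<:AbstractMatrix},    # frame blocks E_i ∈ ℝ^{g × r_i}
                      A::Vector{<:AbstractMatrix},    # coefficient blocks A_i ∈ ℝ^{r_i × r_i}
                      a::Vector{<:AbstractVector})    # mean coefficient blocks â_i ∈ ℝ^{r_i}
    g = size(first(E), 1)
    Γ = zeros(g, g); μ = zeros(g)
    for i in eachindex(E)
        Γ .+= E[i] * A[i] * transpose(E[i])
        μ .+= E[i] * a[i]
    end
    return Γ, Γ .- μ * transpose(μ)                   # (Γ̂, Σ̂)
end
```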
Theorem 3. Denote the forward operator on \(\mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})\) by \(\mathscr{F} := \mathscr{O} (\mathbf{K} \odot \mathbf{K}) \mathscr{O}\). Then, the loss functional in 13 becomes \[\begin{align} \mathcal{L}_{\otimes}^{\eta}(\mathbf{A}) = \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{F} \mathbf{A} \right\rVert^{2} + \eta \left\lVert \mathbf{A} \right\rVert_{\mathscr{P}}^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2}. \end{align}\] The last term does not depend on \(\mathbf{A}\), and the normal equation is given by \[\begin{align} \label{eq:rkhs:2nd:normal} \mathscr{F}^{*} (\mathscr{F} + \eta \mathscr{I}) \hat{\mathbf{A}}(\eta) = \mathscr{F}^{*} (\mathbf{y} \odot \mathbf{y}). \end{align}\qquad{(2)}\]
It is important to observe that \[\label{eq:rk:deficient} \mathrm{rank} (\mathscr{O}) = \mathop{\mathrm{tr}}(\mathscr{O}) = \sum_{i=1}^{n} r_{i} (r_{i}-1) < |\mathbf{r}^{2}| = \dim [\mathop{\mathrm{diag}}(\mathbb{R}^{\mathbf{r}} \otimes \mathbb{R}^{\mathbf{r}})],\tag{15}\] which implies that the forward operator \(\mathscr{F} := \mathscr{O} (\mathbf{K} \odot \mathbf{K}) \mathscr{O} \succeq \mathbf{0}\) is always rank-deficient. Consequently, the solution to this ill-posed problem (2) is not unique in the unregularized case. Among the infinitely many solutions, one may select the minimum-norm solution (both in \(\hat{\Gamma}\) and \(\hat{\mathbf{A}}\)), given by \(\hat{\mathbf{A}} = \mathscr{F}^{\dagger} (\mathbf{y} \odot \mathbf{y})\) [56]. However, solving this equation directly or iteratively using the Conjugate Gradient (CG) method can lead to numerical instability in the presence of rank deficiency [12], [32]. In such cases, a more suitable Krylov subspace method, such as MINRES, should be employed [46]. In contrast, when \(\eta > 0\), the solution to (2) is still not unique at the level of the coefficient matrix \(\hat{\mathbf{A}}(\eta)\). However, all such solutions correspond to the same function \(\hat{\Gamma}(\eta)\), owing to the strict convexity of the Tikhonov loss \(\mathcal{L}_{\otimes}^{\eta}\) defined in 13. Therefore, we may equivalently solve the regularized system \[\begin{align} \label{eq:rkhs:2nd:normal:simp} (\mathscr{F} + \eta \mathscr{I}) \hat{\mathbf{A}}(\eta) = \mathscr{O} (\mathbf{y} \odot \mathbf{y}), \quad \eta \ge 0. \end{align}\tag{16}\]
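To illustrate how 16 can be solved without ever forming \(\mathscr{F}\), the sketch below wraps the block-wise forward action as a matrix-free operator and passes it to MINRES from Krylov.jl. It assumes that Krylov.jl accepts any operator defining `mul!`, `size`, and `eltype` (a dense matrix could be substituted otherwise); the toy sizes, kernel, and flattened block-diagonal storage are purely illustrative and do not reflect the package implementation.

```julia
# Sketch (toy sizes): solve (F + ηI) Â = O(y ⊙ y) with MINRES, applying the action of
# F = O(K ⊙ K)O block-wise and on the fly instead of storing it.
using LinearAlgebra, Krylov

r   = [4, 5, 3]                                        # toy numbers of observations r_i
pts = [sort(rand(ri)) for ri in r]                     # toy observation locations
k(s, t) = exp(-abs(s - t))                             # Laplacian kernel, γ = 1
Kblk(i, j) = [k(s, t) for s in pts[i], t in pts[j]]    # kernel block K_{i i'}
zero_diag!(M) = (M[diagind(M)] .= 0; M)                # elimination operator O_i

struct RegularizedForward                              # matrix-free (F + ηI)
    r::Vector{Int}
    η::Float64
end
Base.size(F::RegularizedForward) = (sum(abs2, F.r), sum(abs2, F.r))
Base.size(F::RegularizedForward, d::Int) = sum(abs2, F.r)
Base.eltype(::RegularizedForward) = Float64

function LinearAlgebra.mul!(out::AbstractVector, F::RegularizedForward, v::AbstractVector)
    n    = length(F.r)
    offs = cumsum([0; F.r .^ 2])
    A    = [zero_diag!(reshape(v[offs[i]+1:offs[i+1]], F.r[i], F.r[i])) for i in 1:n]
    for i in 1:n
        Di = zero_diag!(sum(Kblk(i, j) * A[j] * Kblk(j, i) for j in 1:n))
        out[offs[i]+1:offs[i+1]] .= vec(Di) .+ F.η .* v[offs[i]+1:offs[i+1]]
    end
    return out
end

y   = [randn(ri) for ri in r]                              # toy (centered) observations
rhs = reduce(vcat, vec(zero_diag!(yi * yi')) for yi in y)  # O(y ⊙ y), flattened
Â, stats = minres(RegularizedForward(r, 0.0), rhs; itmax = 50)
```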
Given an estimator of covariance function \(\Sigma\), one can associate a corresponding covariance operator on a Hilbert space, and FPCA involves extracting its top eigenvalues and eigenfunctions. These spectral components are encoded in the coefficient matrix \(\mathbf{A}\) with respect to the inner-product structure determined by the basis of the search space. However, since this basis is typically non-orthonormal, the problem naturally leads to a generalized eigenvalue problem.
Dictionary Methods
The canonical approach associates the covariance function \(\Sigma\) with the integral operator \[\mathscr{T}_{\Sigma}: \mathcal{L}_2(\mathcal{T}) \to \mathcal{L}_2(\mathcal{T}), \quad \mathscr{T}_{\Sigma} f (s) = \int_{\mathcal{T}} \Sigma(s, t) f(t) dt,\] which is a trace-class s.p.d. operator, and its spectral decomposition yields the celebrated Mercer expansion [54], [56]. When restricting to the search space \(\mathcal{S}^{\mathop{\mathrm{dict}}} = \mathop{\mathrm{span}}\{\phi_{l}: \mathcal{T} \to \mathbb{R}, 1 \le l \le p \}\), the spectral information is encoded in the generalized eigenvalues and eigenvectors of the centered coefficient matrix \(\tilde{\mathbf{A}} := \mathbf{A} - \mathbf{a} \mathbf{a}^{\top} \in \mathbb{R}^{p \times p}\) [21].
Proposition 4. Let \(\mathbf{G} = [g_{l_{1} l_{2}}]_{1 \le l_{1}, l_{2} \le p} \in \mathbb{R}^{p \times p}\) where \(g_{l_{1} l_{2}} = {\left\langle {\phi_{l_{1}}},{\phi_{l_{2}}}\right\rangle}_{\mathcal{L}_2(\mathcal{T})}\), and consider the covariance function of the form \[\begin{align} \mathscr{T}_{\Sigma} = \sum_{l_{1}, l_{2}=1}^{p} \tilde{a}_{l_{1} l_{2}} \phi_{l_{1}} \otimes_{\mathcal{L}_{2}} \phi_{l_{2}} \quad \Rightarrow \quad \Sigma(s, t) = \sum_{l_{1}, l_{2}=1}^{p} \tilde{a}_{l_{1} l_{2}} \phi_{l_{1}}(s) \phi_{l_{2}}(t), \quad \tilde{\mathbf{A}} \in \mathbb{R}^{p \times p}. \end{align}\] Let \(\lambda_{k}\) and \(f_{k} = \sum_{l=1}^{p} v_{lk} \phi_{l}\) denote the \(k\)-th eigenvalue and eigenfunction of \(\Sigma\), i.e. \(\Sigma = \sum_{k=1}^{p} \lambda_{k} f_{k} \otimes f_{k}\). If we define \[\mathbf{\Lambda} = \mathop{\mathrm{diag}}[\lambda_{k}]_{1 \le k \le p} \in \mathbb{R}^{p \times p}, \quad \mathbf{V} = [v_{lk}]_{1 \le l, k \le p} = [ \mathbf{v}_{1} \vert \cdots \vert \mathbf{v}_{p} ] \in \mathbb{R}^{p \times p},\] then the diagonal matrix \(\mathbf{\Lambda}\) and \(\mathbf{G}\)-orthogonal matrix \(\mathbf{V}\) satisfy \[\label{eq:fpca:rkhs} \tilde{\mathbf{A}} \mathbf{G} \mathbf{V} = \mathbf{V} \mathbf{\Lambda}, \quad \mathbf{V}^{\top} \mathbf{G} \mathbf{V} = \mathbf{I}.\qquad{(3)}\] Also, when \(\mathbf{G}\) is rank-deficient, the FPCA does not depend on the choice of \(\mathbf{V}\).
RKHS Methods
In the RKHS setting, while a Mercer expansion in 4 is still valid, computing \(\mathcal{L}_2(\mathcal{T})\)-inner products between feature maps is typically intractable unless the Sobolev-type kernel is constructed through a concrete differential operator [38], [39]. Instead, we reinterpret \(\Sigma\) as a tensor in \(\mathbb{H} \otimes \mathbb{H}\) (as noted in 14), where the FPCA is then performed via the centered coefficient matrix \(\tilde{\mathbf{A}} := \mathbf{A} - \mathbf{a} \mathbf{a}^{*} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\). Note that \(\mathbf{a} \mathbf{a}^{*}\) is generically full-block unless identically zero, and hence so is \(\tilde{\mathbf{A}}\). Furthermore, since \(\mathbf{A} \in \mathcal{R}(\mathscr{O})\) is trace-free, it cannot be s.p.d. unless zero. This degeneracy, not explicitly noted in the literature, is closely related to the rank-deficiency discussed in 15, and it highlights the necessity of performing FPCA to truncate the negative tail of the spectrum. Finally, the following PCA is a decomposition in the RKHS, often called kernel PCA, rather than a Mercer expansion in \(\mathcal{L}_{2}(\mathcal{T})\).
Proposition 5. Let \(\mathbf{K} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\) be the mean Gram matrix, and consider the covariance tensor of the form \[\begin{align} \Sigma = \sum_{i_{1}, i_{2} = 1}^{n} \sum_{j_{1}=1}^{r_{i_{1}}} \sum_{j_{2}=1}^{r_{i_{2}}} \tilde{a}_{i_{1} j_{1}, i_{2} j_{2}} \mathrm{k}_{i_{1} j_{1}} \otimes_{\mathbb{H}} \mathrm{k}_{i_{2} j_{2}} \quad \Rightarrow \quad \Sigma(s, t) = \sum_{i_{1}, i_{2} = 1}^{n} \sum_{j_{1}=1}^{r_{i_{1}}} \sum_{j_{2}=1}^{r_{i_{2}}} \tilde{a}_{i_{1} j_{1}, i_{2} j_{2}} \mathrm{k}_{i_{1} j_{1}}(s) \mathrm{k}_{i_{2} j_{2}}(t), \end{align}\] with \(\tilde{\mathbf{A}} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\) symmetric. Let \(\lambda_k\) and \(f_k = \sum_{i=1}^{n} \sum_{j=1}^{r_i} v_{ij, k} \mathrm{k}_{ij}\) denote the \(k\)-th eigenvalue and eigenfunction of \(\Sigma\) in \(\mathbb{H}\), where \(1 \le k \le |\mathbf{r}|\). Then defining the diagonal matrix \(\mathbf{\Lambda} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\) and \(\mathbf{K}\)-orthogonal matrix \(\mathbf{V} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\) as in 4, we have \[\tilde{\mathbf{A}} \mathbf{K} \mathbf{V} = \mathbf{V} \mathbf{\Lambda}, \quad \mathbf{V}^{*} \mathbf{K} \mathbf{V} = \mathbf{I}.\] In the case where \(\mathbf{K}\) is rank-deficient, the FPCA does not depend on the choice of \(\mathbf{V}\).
Both 4 and 5 lead to the same formulation, so we discuss (3) for notational convenience. Let \(\mathbf{L}\) be any square matrix satisfying \(\mathbf{G} = \mathbf{L} \mathbf{L}^{\top}\). A common two-step factorization-based method [28], [39], [81] solves (3) by first computing the standard eigendecomposition \(\mathbf{L}^{\top} \tilde{\mathbf{A}} \mathbf{L} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^{\top}\), and then recovering the generalized eigenvectors \(\mathbf{V}\) via \(\mathbf{U} = \mathbf{L}^{\top} \mathbf{V}\). While the final solution is independent of the specific choice (e.g., Cholesky) of \(\mathbf{L}\) since all such \(\mathbf{L}^{\top} \tilde{\mathbf{A}} \mathbf{L}\) are unitarily equivalent [82], a prevalent choice in the literature is \(\mathbf{L} = \mathbf{G}^{1/2}\). However, computing the square root requires an additional eigendecomposition of \(\mathbf{G}\), which can be computationally intensive and numerically unstable when the spectrum has small gaps [11], [13]. Moreover, while this method is tractable for moderate matrix sizes, e.g., \(\tilde{\mathbf{A}} \in \mathbb{R}^{p \times p}\) in dictionary-based approaches, it becomes impractical in RKHS-based methods, where the kernel matrix \(\mathbf{K}\) is large and often ill-conditioned.
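For concreteness, a minimal Julia sketch of this two-step route, assuming \(\mathbf{G} \succ \mathbf{0}\) so that a plain Cholesky factor exists (the rank-deficient case would require pivoting or a pseudo-inverse):

```julia
# Two-step solve of Ã G V = V Λ with Vᵀ G V = I via a Cholesky factor G = L Lᵀ
# (any factor L yields the same spectrum, as noted above).
using LinearAlgebra

function generalized_fpca(Ã::AbstractMatrix, G::AbstractMatrix)
    L = cholesky(Symmetric(G)).L                # requires G ≻ 0
    Λ, U = eigen(Symmetric(L' * Ã * L))         # standard symmetric eigenproblem
    V = L' \ U                                  # recover V from U = Lᵀ V
    return reverse(Λ), reverse(V, dims = 2)     # leading eigenpairs first
end
```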
From a theoretical perspective, this two-step procedure amounts to orthogonalizing the basis of the search space, performing PCA in that new artificial function space, and then mapping the results back to the original basis. However, this detour may obscure the conceptual elegance of the Hilbert space formulation. To maintain abstraction and computational efficiency, we next describe how to compute an incomplete generalized eigendecomposition to approximate the leading principal components iteratively.
Lanczos Tridiagonalization
The Lanczos algorithm [50], [83] is a cornerstone of Krylov subspace methods for symmetric matrices \(\tilde{\mathbf{A}}\), where the original least-squares problem is projected onto lower-dimensional subspaces involving a tridiagonal matrix \(\mathbf{T}_{k}\). We showcase how this algorithm can be tailored to our problem (3), where \(\mathbf{G} = \mathbf{I}\) corresponds to the usual algorithm. Once again, especially for the RKHS methods, we do not need to explicitly evaluate the full block matrix \(\tilde{\mathbf{A}} = \mathbf{A} - \mathbf{a} \mathbf{a}^{*} \in \mathbb{R}^{\mathbf{r} \times \mathbf{r}}\), but rather implement its action on \(\mathbb{R}^{\mathbf{r}}\).
Starting with an initial vector \(\mathbf{b}_{0} \in \mathbb{R}^{p}\), we construct the Krylov subspace for the matrix \(\tilde{\mathbf{A}} \mathbf{G}\): \[\mathcal{K}_{k}(\tilde{\mathbf{A}} \mathbf{G}, \mathbf{b}_{0}) = \mathop{\mathrm{span}}\{ \mathbf{b}_{0}, (\tilde{\mathbf{A}} \mathbf{G}) \mathbf{b}_{0}, \cdots, (\tilde{\mathbf{A}} \mathbf{G})^{k-1} \mathbf{b}_{0} \}.\] Note that \(\tilde{\mathbf{A}} \mathbf{G}\) is \(\mathbf{G}\)-symmetric since \(\langle \mathbf{c}, \tilde{\mathbf{A}} \mathbf{G} \mathbf{d} \rangle_{\mathbf{G}} = \langle \tilde{\mathbf{A}} \mathbf{G} \mathbf{c}, \mathbf{d} \rangle_{\mathbf{G}}\). The \(\mathbf{G}\)-Lanczos procedure generates a basis for this subspace using a set of vectors that are orthogonal with respect to the \(\mathbf{G}\)-inner product. After \(k\) iterations, as outlined in 4, this process yields a \(\mathbf{G}\)-orthonormal matrix \(\mathbf{U}_{k} \in \mathbb{R}^{p \times k}\) that spans \(\mathcal{K}_{k}(\tilde{\mathbf{A}} \mathbf{G}, \mathbf{b}_{0})\), satisfying the three-term recurrence [11]: \[\tilde{\mathbf{A}} \mathbf{G} \mathbf{U}_{k} = \mathbf{U}_{k} \mathbf{T}_{k} + \beta_{k} \mathbf{u}_{k+1} \mathbf{e}_{k}^{\top}, \quad \mathbf{T}_{k} = \begin{pmatrix} \alpha_{1} & \beta_{1} & 0 & \cdots & 0 \\ \beta_{1} & \alpha_{2} & \beta_{2} & \ddots & 0 \\ 0 &\beta_{2} & \alpha_{3} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & \beta_{k-1} & \alpha_{k} \end{pmatrix} \in \mathbb{R}^{k \times k}, \quad \mathbf{e}_{k} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} \in \mathbb{R}^{k},\] where the second term is the residual. Left-multiplying by \(\mathbf{U}_{k}^{\top} \mathbf{G}\), we obtain the projected system: \[\mathbf{U}_{k}^{\top} [\mathbf{G} \tilde{\mathbf{A}} \mathbf{G}] \mathbf{U}_{k} = [\mathbf{U}_{k}^{\top} \mathbf{G} \mathbf{U}_{k}] \mathbf{T}_{k} + \beta_{k} [\mathbf{U}_{k}^{\top} \mathbf{G} \mathbf{u}_{k+1}] \mathbf{e}_{k}^{\top} = \mathbf{T}_{k}.\] The eigenpairs \((\mathbf{\Lambda}_k, \mathbf{S}_k)\) of this small projected system \(\mathbf{T}_{k} = \mathbf{S}_{k} \mathbf{\Lambda}_{k} \mathbf{S}_{k}^{\top}\) are known as the Ritz values and Ritz vectors, respectively. We can then form approximations to the generalized eigenvectors of the original problem by mapping the Ritz vectors back to the high-dimensional space by letting \(\mathbf{V}_{k} = \mathbf{U}_{k} \mathbf{S}_{k} \in \mathbb{R}^{p \times k}\). These Ritz vectors \(\mathbf{V}_{k} = [\mathbf{v}_{1} \vert \mathbf{v}_{2} \vert \cdots \vert \mathbf{v}_{k}]\) are \(\mathbf{G}\)-orthonormal by construction: \[\mathbf{V}_{k}^{\top} \mathbf{G} \mathbf{V}_{k} = \mathbf{S}_{k}^{\top} (\mathbf{U}_{k}^{\top} \mathbf{G} \mathbf{U}_{k}) \mathbf{S}_{k} = \mathbf{S}_{k}^{\top} \mathbf{I}_{k} \mathbf{S}_{k} = \mathbf{I}_{k}.\] In exact arithmetic, at most \(\mathrm{rank}( \tilde{\mathbf{A}} \mathbf{G})\) iterations yield the exact solutions. The quality of the approximation, which is assessed by the \(\mathbf{G}\)-norm of the \(j\)-th residual vector \[\left\lVert \mathbf{r}_{j} \right\rVert_{\mathbf{G}} = \| \tilde{\mathbf{A}} \mathbf{G} \mathbf{v}_{j} - \lambda_{k} \mathbf{v}_{j} \|_{\mathbf{G}} = \left\lVert \beta_{k} \mathbf{u}_{k+1} (\mathbf{e}_{k}^{\top} \mathbf{s}_{j}) \right\rVert_{\mathbf{G}} = \beta_{k} |s_{kj}|,\] converges rapidly to zero as \(k\) increases for leading eigenvalues, hence the full iteration is not only unnecessary but may also degrade numerical stability [84]. 
Consequently, if we aim to find the Ritz pairs \((\mathbf{\Lambda}_k, \mathbf{V}_k)\) corresponding to the largest eigenvalues, we only need to compute the first few eigenpairs of \(\mathbf{T}_{k}\) to obtain accurate approximations after early termination [33].
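To make the procedure concrete, the following Julia sketch implements the plain \(\mathbf{G}\)-Lanczos recurrence (no reorthogonalization, and a simple breakdown check instead of a residual-based stopping rule); names are illustrative and this is not the package implementation.

```julia
# G-Lanczos sketch: build a G-orthonormal basis of K_k(Ã G, b0), the tridiagonal
# T_k, and the Ritz pairs (Λ_k, V_k = U_k S_k); residual bounds are β_k |s_{kj}|.
using LinearAlgebra

function g_lanczos(Ã::AbstractMatrix, G::AbstractMatrix, b0::AbstractVector, k::Int)
    p = length(b0)
    U = zeros(p, k + 1)
    α = zeros(k); β = zeros(k)
    U[:, 1] = b0 / sqrt(dot(b0, G * b0))         # G-normalize the starting vector
    for j in 1:k
        w = Ã * (G * U[:, j])                    # one action of Ã G per iteration
        α[j] = dot(U[:, j], G * w)               # G-inner product ⟨u_j, w⟩_G
        w -= α[j] * U[:, j]
        j > 1 && (w -= β[j-1] * U[:, j-1])
        β[j] = sqrt(dot(w, G * w))
        β[j] < eps() && (k = j; break)           # invariant subspace: stop early
        U[:, j+1] = w / β[j]
    end
    T = SymTridiagonal(α[1:k], β[1:k-1])
    Λ, S = eigen(T)                              # Ritz values and vectors
    V = U[:, 1:k] * S                            # map Ritz vectors back: V_k = U_k S_k
    return Λ, V, β[k] .* abs.(S[end, :])         # eigenvalues, G-orthonormal V, residuals
end
```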
This section presents the numerical results; additional details are provided in 8. Code for reproducing the results is available in the open-source Julia package, with accompanying documentation.
All simulations consider zero-mean Gaussian processes over the interval \(\mathcal{T} = [0, 1]\). This centering avoids discrepancies between the two estimation formulations discussed in 1, ensuring that \(\tilde{\mathbf{A}} = \mathbf{A}\). For the TReK
method, to minimize tuning complexity, we apply early termination as a form of
regularization [30]–[32], setting the
Tikhonov parameter \(\eta = 0\) in 2 and 3
when solving \[\min_{\tilde{\mathbf{A}}} \|\mathscr{O} (\mathbf{y} \odot \mathbf{y}) - \mathscr{F} \tilde{\mathbf{A}}\|^{2}.\] Since the resulting linear system is often numerically inconsistent – meaning \(\mathscr{O} (\mathbf{y} \odot \mathbf{y}) \notin \mathcal{R}(\mathscr{F})\) – we replace CGLS with LSQR for the dictionary methods and with MINRES for the RKHS methods, each with a maximum iteration count of 50. In practice, convergence
typically occurs well before reaching this limit, as illustrated in 4.1 and the animation.
To assess estimation accuracy, we use the relative Mean Integrated Squared Error (MISE): \[\text{Rel.MISE}(\Sigma) := \frac{\|\hat{\Sigma} - \Sigma\|_{\mathcal{L}_2(\mathcal{T})}^{2}}{\|\Sigma\|_{\mathcal{L}_2(\mathcal{T})}^{2}} = \frac{\int_{0}^{1} \int_{0}^{1} (\hat{\Sigma}(s, t) - \Sigma(s, t))^{2} ds dt}{\int_{0}^{1} \int_{0}^{1} \Sigma(s, t)^{2} ds dt}\] For the evaluation of Mercer FPCA in the dictionary setting, we report both eigenfunction and eigenvalue errors for the top three components: \[\begin{align} \mathrm{MISE}(f_{k}) := \|\hat{f}_{k} - f_{k} \|_{\mathcal{L}_2(\mathcal{T})}^{2} = \int_{0}^{1} (\hat{f}_{k}(t) - f_{k}(t))^{2} dt, \quad \text{Rel.MSE}(\lambda_{k}) := \frac{|\hat{\lambda}_{k} - \lambda_{k}|^{2}}{\lambda_{k}^{2}}, \quad k=1, 2, 3. \end{align}\]
This section is organized as follows. 4.1 discusses strategies for selecting key estimation parameters in TReK
, including the iteration count and the choice of search spaces. 4.2 illustrates the practical advantages of element-free Krylov methods over direct solvers in terms of computational efficiency. Finally, 4.3 evaluates the estimation accuracy using the
tuned parameters, benchmarking against fdapace
and fpca.sc
across various experimental setups.
While the Morozov discrepancy principle [85], [86] is widely used to prevent overfitting in unregularized problems, it is not applicable here due to the aforementioned numerical inconsistency of the linear system. Instead, we employ the L-curve as the stopping criterion [87], [88], which seeks a corner in the log-log plot of the solution norm versus the gradient-residual norm over iterations, analogous to the elbow in a PCA scree plot. In our context, the solution norm of the covariance function \(\Sigma\) (with matrix representation \(\tilde{\mathbf{A}}\)) is given by \(\mathop{\mathrm{tr}}(\mathbf{G} \tilde{\mathbf{A}} \mathbf{G} \tilde{\mathbf{A}})\) for dictionary methods and \(\mathop{\mathrm{tr}}(\mathbf{K} \tilde{\mathbf{A}} \mathbf{K} \tilde{\mathbf{A}})\) for RKHS methods, where \(\mathbf{G}\) is the Galerkin matrix from 4 and \(\mathbf{K}\) is the mean kernel Gram matrix from 5.
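As a rough illustration only – the criterion implemented in the package may differ in detail – the tracked solution norm and a basic maximum-curvature corner pick can be sketched as:

```julia
# Sketch: solution norm tr(G Ã G Ã) (replace G by K for the RKHS methods) and a
# crude corner pick on the log-log L-curve (maximum of an unnormalized curvature).
using LinearAlgebra

sol_norm(Ã::AbstractMatrix, G::AbstractMatrix) = tr(G * Ã * G * Ã)

function lcurve_corner(residual_norms::Vector, solution_norms::Vector)
    x, y = log.(residual_norms), log.(solution_norms)
    κ = [abs((x[i+1] - x[i-1]) * (y[i+1] - 2y[i] + y[i-1]) -
             (y[i+1] - y[i-1]) * (x[i+1] - 2x[i] + x[i-1])) for i in 2:length(x)-1]
    return argmax(κ) + 1      # iteration index closest to the corner
end
```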
5 demonstrates this process using sample paths generated from Brownian Motion (BM), whose Mercer-expansion eigenvalues and eigenfunctions are given by: \[\lambda_{k} = [(k-1/2)\pi]^{-2}, \, f_{k}(t) = \sqrt{2} \sin \left( (k-1/2)\pi t \right), \quad k \in \mathbb{N}, \, t \in \mathcal{T}=[0, 1].\] The figure illustrates the phenomenon of semi-convergence, where initial iterations reduce reconstruction error, but later iterations begin to overfit the noise in the data [33].
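For reference, these eigenpairs reconstruct the BM covariance \(\min(s, t)\); a truncated check in Julia (the truncation level is an arbitrary illustrative choice):

```julia
# Brownian-Motion Mercer eigenpairs and a truncated reconstruction of min(s, t).
λ_bm(k) = ((k - 0.5) * π)^-2
f_bm(k, t) = sqrt(2) * sin((k - 0.5) * π * t)
Σ_bm(s, t; kmax = 200) = sum(λ_bm(k) * f_bm(k, s) * f_bm(k, t) for k in 1:kmax)
# Σ_bm(0.3, 0.7) ≈ 0.3
```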
The choice of function space for smoothing is another critical hyperparameter. To compare the performance of different bases, we generate \(n=100\) sample paths from a Brownian Bridge (BB) process, whose covariance surface is more sharply concentrated along the diagonal than that of Brownian Motion, with the number of observations per path drawn from \(r_{i} \stackrel{iid}{\sim} \mathrm{Unif}(10:12)\), and with observation noise \(\sigma = 0.1\). We then apply the L-curve early stopping criterion to estimates derived from three types of bases, each with two hyperparameter configurations:
Cubic B-splines: \(p=10\) and \(p=20\).
Gaussian Kernel: \(K(s, t) = \exp (-\gamma (s-t)^2)\) with \(\gamma = 2\) and \(\gamma = 10\).
Laplacian Kernel: \(K(s, t) = \exp (-\gamma |s-t|)\) with \(\gamma = 1\) and \(\gamma = 3\).
The results, presented in 6, indicate that the cubic B-spline basis with a smaller number of basis functions (\(p=10\)) is the most effective choice for the smoothing task on 1D longitudinal data. This approach is not only computationally less complex than the RKHS methods but also converges in significantly fewer iterations while achieving a lower relative MISE. Furthermore, the performance of the RKHS methods proved highly sensitive to the locality parameter \(\gamma\), with estimation quality degrading as \(\gamma\) increases. This sensitivity is notable, as RKHS methods in other applications, such as computerized tomography, often require large \(\gamma\) values to capture fine local details [9], [10]. Based on these findings, we exclusively use cubic B-splines with \(p=10\) in the subsequent analyses of 4.2 and 4.3. In practical scenarios where the true covariance is unknown, hyperparameters like \(p\) or \(\gamma\) can be effectively chosen using (\(k\)-fold) cross-validation [38], [39].
We benchmark the numerical efficiency of the proposed TReK
algorithm against two widely used R implementations: fdapace
and fpca.sc
. While precise comparisons across different high-level programming languages are
inherently complex (e.g., compilation and garbage collection in Julia vs. R), the results in 7 suffice to reveal that TReK
achieves superior scalability, reducing both runtime and memory usage by orders of
magnitude in our tests, regardless of the configuration of \((n, r)\). A fair comparison with fpca.sc
is particularly relevant, as TReK
’s B-spline approach solves a linear system of the same size
as the second estimation method in fpca.sc
, since both employ a “pool-then-smooth” strategy. To ensure a conservative benchmark that favors the competing methods, TReK
was run for a fixed 50 iterations, even though in practice it
converges in substantially fewer iterations, as demonstrated in 4.1. It should be noted that the memory usage of TReK
is independent of the iteration count, and 7 shows that the
memory usage matches the theoretical space complexity, which grows proportionally to \(nr^{2}\).
This efficiency has direct practical implications. As shown in 1, TReK
successfully processes a large dataset of \(n = 1000\) curves with \(r_i
\equiv r = 400\) observations each on a standard laptop, completing estimation in a few seconds. Although this scenario is extremely dense within the domain \(\mathcal{T} = [0, 1]\), it would be considered sparse in
high-dimensional domains, e.g. \(\mathcal{T} = [0, 1]^{3}\). This scalability is particularly advantageous for high-dimensional applications such as fMRI [65]. For these problems, the computational complexity can be reduced even further by leveraging the outer product structure in 3.1 across dimensions. A
full exploration of such high-dimensional applications is beyond the scope of this paper and remains a direction for future work.
We evaluate the accuracy of TReK
through a Monte Carlo simulation study. We consider four distinct data-generating scenarios by crossing two sizes of sample paths (\(n \in \{100, 400\}\)) with two noise
levels (\(\sigma \in \{0.01, 0.1\}\)). For each scenario, we compare the performance of two estimation strategies: the fully iterated solution (50 iterations) and a regularized solution obtained via early stopping based on
the L-curve criterion. The ground-truth covariance structure for all simulations is generated from the following Mercer expansion: \[\begin{align} \Sigma(s, t) = \sum_{k=1}^{3} \lambda_{k} f_{k}(s) f_{k}(t), \quad \lambda_{k} =
2^{1-k}, \, f_{k}(t) = \begin{cases} \sqrt{2} \sin(\pi t) &, \quad k=1, \\ \sqrt{2} \cos(2 \pi t) &, \quad k=2, \\ \sqrt{2} \sin(3 \pi t) &, \quad k=3. \end{cases}
\end{align}\] In each of the 200 Monte Carlo repetitions per setting, the number of observation points for each curve is drawn independently from \(r_{i} \stackrel{iid}{\sim} \mathrm{Unif}(5, 15)\). Our
TReK
estimator employs a cubic B-spline basis with \(p=10\) coefficients, defined by 8 equidistant knots on the domain \(\mathcal{T} = [0, 1]\). The entire simulation,
comprising \(4 \times 2 \times 200 = 1600\) distinct estimations, completes in under one minute on 6 threads of an Apple M1 Pro.
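For completeness, a sketch of this data-generating process follows, reading \(\mathrm{Unif}(5, 15)\) as a discrete uniform draw on \(\{5, \dots, 15\}\); names are illustrative and the code is not taken from the package.

```julia
# Sketch of the simulation design: Karhunen–Loève sample paths from the Mercer
# expansion above, observed at r_i random locations with i.i.d. N(0, σ²) noise.
using Random

λ = [2.0^(1 - k) for k in 1:3]
f = (t -> sqrt(2) * sin(π * t), t -> sqrt(2) * cos(2π * t), t -> sqrt(2) * sin(3π * t))

function generate_data(n::Int, σ::Float64; rng = Random.default_rng())
    t = [sort(rand(rng, rand(rng, 5:15))) for _ in 1:n]     # observation locations per curve
    ξ = [sqrt.(λ) .* randn(rng, 3) for _ in 1:n]            # principal component scores
    y = [[sum(ξ[i][k] * f[k](s) for k in 1:3) + σ * randn(rng) for s in t[i]] for i in 1:n]
    return t, y
end

t, y = generate_data(100, 0.1)   # e.g. the n = 100, σ = 0.1 scenario
```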
The results are summarized in 8, which clearly demonstrate the effectiveness of early stopping as a regularization strategy; it consistently reduces estimation error and variance across all metrics, with the most significant gains in the high-noise (\(\sigma=0.1\)) scenarios. As expected, performance improves with a larger sample size (\(n=400\)) and lower noise level (\(\sigma=0.01\)).
The central message of this paper is to emphasize how abstraction in Hilbert space theory should inform and guide numerical methods in FDA. Unlike multivariate statistics – where a standard orthonormal basis is readily available – functional settings lack such a canonical coordinate system, and concepts like vectorization become contrived and ill-suited. This is precisely the aim of FPCA: to extract an intrinsic basis that reflects the geometry of the covariance operator. The coefficient matrix serves as a finite-dimensional proxy for the projection of this operator onto a chosen function subspace. Just as a matrix is fully characterized by its action on vectors, the same principle applies to operators acting on functions. Krylov-based methods exploit this perspective, forming approximations using only repeated forward actions, without requiring full construction [89]. This shift not only adheres to a genuinely functional perspective but also leads to substantial practical benefits.
Despite the centrality of linear operators in FDA, Krylov subspace methods and their variants remain underutilized in this context. This paper demonstrates how these methods provide an economical and efficient means of estimating covariance structures. By bridging operator-theoretic insights with numerical strategies, this work offers a principled and scalable framework for FPCA – and opens the door to broader applications of Krylov-based methods in FDA.
We thank the associate editor and the three referees for their insightful comments and constructive suggestions, which have significantly improved the manuscript. We are also grateful to Almond Stöcker and Jake Grainger. A. Stöcker provided valuable insights into both the practical and theoretical challenges of covariance smoothing. J. Grainger assisted with code verification, and shared many practical tips that greatly improved the efficiency of the package over its preliminary version.
We analyze the computational complexity of the operations involved. The time or space complexity of multiplying two square matrices of size \(r \times r\) is denoted as \(O(r^{2+\Delta})\), where \(\Delta > 0\) depends on the specific algorithm used. The classical approach, Strassen’s algorithm [90], and the state-of-the-art approaches [91], [92] achieve \(\Delta = 1\), \(\Delta = \log_{2}7-2 \approx 0.807\), and \(\Delta \approx 0.376\), respectively. However, in our setting, the matrix multiplications are generally block-wise and not necessarily square. For the typical block dimensions encountered in covariance smoothing, these advanced algorithms often underperform due to their intricate data layout requirements and larger constant factors. Therefore, we adopt the standard assumption that multiplying an \(a \times b\) matrix with a \(b \times c\) matrix incurs a computational cost of \(O(abc)\). The following proposition shows that the main computational cost of the second-order forward action is governed by the partitioned outer products rather than the elimination operator, as listed in 2 in the Introduction.
Proposition 6. The computational complexity of the elimination operator \(\mathscr{O}\), the partitioned outer products \(\mathbf{\Phi} \odot \mathbf{\Phi}\) (or \(\mathbf{\Phi}^{*} \odot \mathbf{\Phi}^{*}\)), and \(\mathbf{K} \odot \mathbf{K}\) are given by \(O(|\mathbf{r}|)\), \(O(|\mathbf{r}| p^{2} + |\mathbf{r}^{2}|p)\), and \(O(|\mathbf{r}| |\mathbf{r}^{2}|)\), respectively.
Proof of 6. It is trivial that the complexity of \(\mathscr{O}\) is \(\sum_{i=1}^{n} O(r_{i}) = O(|\mathbf{r}|)\). Since \(\mathbf{\Phi}_{i} \in \mathbb{R}^{r_{i} \times p}\), the complexity of both \((\mathbf{\Phi}_{i} \otimes \mathbf{\Phi}_{i})\) and \((\mathbf{\Phi}_{i}^{\top} \otimes \mathbf{\Phi}_{i}^{\top})\) is given by \(O(r_{i}p^{2} + r_{i}^{2}p)\), hence the complexity of both \(\mathbf{\Phi} \odot \mathbf{\Phi}\) and \(\mathbf{\Phi}^{*} \odot \mathbf{\Phi}^{*}\) is given by \[\sum_{i=1}^{n} O(r_{i}p^{2} + r_{i}^{2}p) = O(|\mathbf{r}| p^{2} + |\mathbf{r}^{2}|p).\] Finally, for each \(1 \le i \le n\), the complexity of \((\mathbf{K}_{i\cdot} \otimes \mathbf{K}_{i\cdot})\) is given by \(\sum_{i'=1}^{n} O(r_{i}r_{i'}^{2} + r_{i}^{2}r_{i'})\), hence the complexity of \(\mathbf{K} \odot \mathbf{K}\) is given by \[\begin{align} O \left( \sum_{i, i'=1}^{n} (r_{i}r_{i'}^{2} + r_{i}^{2}r_{i'}) \right) = O \left( \sum_{i, i'=1}^{n} r_{i}r_{i'}^{2} \right) = O(|\mathbf{r}| |\mathbf{r}^{2}|). \end{align}\] ◻
In all of the outlined methods, a core computation is the symmetric sandwich product \(\mathbf{D}_{i} = \mathbf{F}_{i} \mathbf{C}_{i} \mathbf{F}_{i}^{\top}\) across all blocks. The matrix block \(\mathbf{C}_{i}\) is symmetric (of size \(p \times p\) for dictionary methods, and \(r_{i} \times r_{i}\) for RKHS methods), which guarantees that the resulting
block \(\mathbf{D}_{i}\) (of size \(r_{i} \times r_{i}\)) is also symmetric. Theoretically, this known symmetry could be exploited by specialized routines to reduce floating-point operations
(FLOPs). However, modern numerical libraries in Julia
, R
, and Python
, delegate matrix operations to highly optimized Basic Linear Algebra Subprograms (BLAS
), the de facto standard low-level API
for such tasks. As outlined in 9, the standard approach uses two general matrix-matrix multiplications (via the GEMM
routine). While one might expect to accelerate this algorithm by signaling the symmetry to
invoke a specialized routine SYMM
, we empirically observed that, for the block dimensions typical in covariance smoothing, this provides no discernible speedup and can often be slower. Arguably, this discrepancy stems from the fact that the
usual GEMM
routine is the most heavily optimized computational kernel, engineered to maximize memory throughput and exploit instruction-level parallelism (SIMD
) on modern CPUs [93]. These hardware-specific optimizations often outweigh the theoretical FLOP reduction offered by SYMM
.
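For reference, the two-GEMM evaluation with a single reusable buffer can be written as follows (a sketch with illustrative names):

```julia
# Sandwich product D_i = F_i C_i F_iᵀ via two general matrix-matrix multiplications
# (mul! dispatches to BLAS gemm); one buffer of size r_i × p (dictionary) or
# r_i × r_i (RKHS) is preallocated and reused across blocks.
using LinearAlgebra

function sandwich!(D::AbstractMatrix, buf::AbstractMatrix, F::AbstractMatrix, C::AbstractMatrix)
    mul!(buf, F, C)                 # first GEMM:  buf = F_i C_i
    mul!(D, buf, transpose(F))      # second GEMM: D   = buf F_iᵀ
    return D
end
```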
Therefore, we should decouple the meanings of fast and cheap in existing implementations of covariance smoothers when solving the matrix-vector form of the penalized normal equation: \[\begin{align} (\mathscr{F}^{*} \mathscr{F} + \lambda \mathscr{D}^{*} \mathscr{D}) \hat{\mathbf{A}}(\lambda) = \mathscr{F}^{*} (\mathbf{y} \odot \mathbf{y}). \end{align}\] In their direct methods, exploiting symmetry via half-vectorization in the preprocessing step, where they explicitly construct the forward matrix, is primarily a memory-saving strategy (cheap), which reduces the storage by a factor of \(\approx 2^{2}\). This smaller memory footprint then makes the subsequent solver step faster by a factor of \(\approx 2^{3}\). However, it is important to note that the underlying tensor structure is lost in the solving step.
In contrast, the TReK
method takes an operator-based approach and is therefore cheaper than the above methods by an order of magnitude. Instead of storing an operator \(\mathscr{F}\) (of
size \(|\mathbf{r}^{2}| \times p^{2}\) for dictionary methods, or \(|\mathbf{r}^{2}| \times |\mathbf{r}^{2}|\) for RKHS methods), it only requires storing its components and a small,
reusable buffer for computing operator-matrix products (9) on-the-fly. This approach preserves the underlying tensor structure, making the action of \(\mathscr{F}\) highly efficient, as
listed in 2. As detailed in Table 3, the memory savings are substantial
compared to direct methods, requiring only minimal reusable workspace for the operator action and the Krylov solver [42], [44], [48].
Method | Dictionary-based | RKHS-based |
---|---|---|
Storing Data Matrix \(\mathbf{y} \odot \mathbf{y}\) | \(|\mathbf{r}^{2}|\) | \(|\mathbf{r}^{2}|\) |
Workspace for Operator \(\mathscr{F}\) | \(\max(r_{i}) \times p\) | \(\max(r_{i}) \times \max(r_{i})\) |
Workspace for Krylov Solver | \(4|\mathbf{r}^{2}| + 2 p^{2}\) (LSQR) | \(6|\mathbf{r}^{2}|\) (MINRES) |
Proof of 2. Note that \[\begin{align} \Gamma(t_{ij_{1}}, t_{ij_{2}}) &= \sum_{l_{1}, l_{2} = 1}^{p} a_{l_{1} l_{2}} \phi_{l_{1}}(t_{ij_{1}}) \phi_{l_{2}}(t_{ij_{2}}) \\ &= \sum_{l_{1}, l_{2} = 1}^{p} \mathbf{\Phi}_{i}[j_{1}, l_{1}] \mathbf{A}[l_{1}, l_{2}] \mathbf{\Phi}_{i}[j_{2}, l_{2}] = (\mathbf{\Phi}_{i} \mathbf{A} \mathbf{\Phi}_{i}^{\top})[j_{1}, j_{2}], \end{align}\] which results in \[\begin{align} \mathcal{L}_{\otimes}(\mathbf{A}) &= \sum_{i=1}^{n} \sum_{(j_{1}, j_{2}) \in \mathcal{O}_{i}} \left(y_{ij_{1}} y_{ij_{2}} - \Gamma(t_{ij_{1}}, t_{ij_{2}}) \right)^{2} \\ &= \sum_{i=1}^{n} \left\lVert \mathscr{O}_{i}[\mathbf{y}_{i} \mathbf{y}_{i}^{\top} - (\mathbf{\Phi}_{i} \otimes \mathbf{\Phi}_{i}) \mathbf{A}] \right\rVert^{2} = \left\lVert \mathscr{O} (\mathbf{y} \odot \mathbf{y}) - \mathscr{O} (\mathbf{\Phi} \odot \mathbf{\Phi}) \mathbf{A}] \right\rVert^{2}. \end{align}\] Since \(\mathscr{O}\) is an orthogonal projection, this can be decomposed into \[\begin{align} \mathcal{L}_{\otimes}(\mathbf{A}) &= \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{O} (\mathbf{\Phi} \odot \mathbf{\Phi}) \mathbf{A}] \right\rVert^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2} \\ &= \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{F} \mathbf{A}] \right\rVert^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2}, \end{align}\] and the Gâteaux derivative [54] becomes \(D \mathcal{L}_{\otimes}(\mathbf{A}) = 2 \mathscr{F}^{*} [\mathscr{F} \mathbf{A}- (\mathbf{y} \odot \mathbf{y})]\). Also, the Gâteaux derivative of the penalty term is \(2 \mathscr{P} \mathbf{A}\), leading to the normal equation in ?? . ◻
Proof of 3. Note that \[\begin{align} \Gamma(t_{ij_{1}}, t_{ij_{2}}) &= {\left\langle {\Gamma},{\mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}}}\right\rangle} = \sum_{i'=1}^{n} \sum_{j'_{1}, j'_{2} =1}^{r_{i'}} a_{i'j'_{1}j'_{2}} {\left\langle {\mathrm{k}_{i'j'_{1}} \otimes \mathrm{k}_{i'j'_{2}}},{\mathrm{k}_{ij_{1}} \otimes \mathrm{k}_{ij_{2}}}\right\rangle} \\ &= \sum_{i'=1}^{n} \sum_{j'_{1}, j'_{2} =1}^{r_{i'}} a_{i'j'_{1}j'_{2}} {\left\langle {\mathrm{k}_{i'j'_{1}}},{\mathrm{k}_{ij_{1}}}\right\rangle} {\left\langle {\mathrm{k}_{i'j'_{2}}},{\mathrm{k}_{ij_{2}}}\right\rangle} \\ &= \sum_{i'=1}^{n} \sum_{j'_{1}, j'_{2} =1}^{r_{i'}} \mathbf{K}_{ii'}[j_{1}, j'_{1}] \mathbf{A}_{i'}[j'_{1}, j'_{2}] \mathbf{K}_{i'i}[j'_{2}, j_{2}] \\ &= \sum_{i'=1}^{n} (\mathbf{K}_{ii'} \mathbf{A}_{i'} \mathbf{K}_{i'i}) [j'_{2}, j_{2}] =[(\mathbf{K}_{i\cdot} \otimes \mathbf{K}_{i\cdot}) \mathbf{A}] [j'_{2}, j_{2}]. \end{align}\] This leads to \[\begin{align} \mathcal{L}_{\otimes}(\mathbf{A}) &= \sum_{i=1}^{n} \sum_{(j_{1}, j_{2}) \in \mathcal{O}_{i}} \left(y_{ij_{1}} y_{ij_{2}} - \Gamma(t_{ij_{1}}, t_{ij_{2}}) \right)^{2} \\ &= \sum_{i=1}^{n} \left\lVert \mathscr{O}_{i}[\mathbf{y}_{i} \mathbf{y}_{i}^{\top} - (\mathbf{K}_{i\cdot} \otimes \mathbf{K}_{i\cdot}) \mathbf{A}] \right\rVert^{2} = \left\lVert \mathscr{O} (\mathbf{y} \odot \mathbf{y}) - \mathscr{O} (\mathbf{K} \odot \mathbf{K}) \mathbf{A}] \right\rVert^{2}. \end{align}\] Since \(\mathbf{A} \in \mathcal{R}(\mathscr{O})\) and \(\mathscr{O}\) is an orthogonal projection, this can be formed into \[\begin{align} \mathcal{L}_{\otimes}(\mathbf{A}) &= \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{O} (\mathbf{K} \odot \mathbf{K}) \mathscr{O} \mathbf{A}] \right\rVert^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2} \\ &= \left\lVert (\mathbf{y} \odot \mathbf{y}) - \mathscr{F} \mathbf{A}] \right\rVert^{2} - \left\lVert (\mathscr{I} - \mathscr{O}) (\mathbf{y} \odot \mathbf{y}) \right\rVert^{2}, \end{align}\] and the rest of the proof is equivalent to 2. ◻
We omit the proof of 4 as it is the same as the proof given below:
Proof of 5. First, we show the \(\mathbf{K}\)-orthogonality: \[\begin{align} \delta_{k_{1}k_{2}} = {\left\langle {f_{k_{1}}},{f_{k_{2}}}\right\rangle}_{\mathbb{H}} &= {\left\langle {\sum_{i_{1}=1}^{n} \sum_{j_{1}=1}^{r_{i_{1}}} v_{i_{1}j_{1}, k_{1}} \mathrm{k}_{i_{1}j_{1}}},{\sum_{i_{2}=1}^{n} \sum_{j_{2}=1}^{r_{i_{2}}} v_{i_{2}j_{2}, k_{2}} \mathrm{k}_{i_{2}j_{2}}}\right\rangle}_{\mathbb{H}} \\ &= \sum_{i_{1}j_{1}} \sum_{i_{2}j_{2}} v_{i_{1}j_{1}, k_{1}} {\left\langle {\mathrm{k}_{i_{1}j_{1}}},{\mathrm{k}_{i_{2}j_{2}}}\right\rangle}_{\mathbb{H}} v_{i_{2}j_{2}, k_{2}} \\ &= \sum_{i_{1}j_{1}} \sum_{i_{2}j_{2}} \mathbf{V}^{*}[k_{1}, i_{1}j_{1}] \mathbf{K}[i_{1}j_{1}, i_{2}j_{2}] \mathbf{V}[i_{2}j_{2}, k_{2}] = [\mathbf{V}^{*} \mathbf{K} \mathbf{V}] [k_{1}, k_{2}]. \end{align}\] Next, observe that \[\begin{align} \lambda_{k} f_{k} &= \Sigma f_{k} = \left( \sum_{i_{1}j_{1}} \sum_{i_{2}j_{2}} \tilde{a}_{i_{1} j_{1}, i_{2} j_{2}} \mathrm{k}_{i_{1} j_{1}} \otimes_{\mathbb{H}} \mathrm{k}_{i_{2} j_{2}} \right) \sum_{i_{3}j_{3}} v_{i_{3}j_{3}, k} \mathrm{k}_{i_{3}j_{3}} \\ &= \sum_{i_{1}j_{1}} \sum_{i_{2}j_{2}} \sum_{i_{3}j_{3}} \tilde{a}_{i_{1} j_{1}, i_{2} j_{2}} {\left\langle {\mathrm{k}_{i_{2}j_{2}}},{\mathrm{k}_{i_{3}j_{3}}}\right\rangle}_{\mathbb{H}} v_{i_{3}j_{3}, k} \mathrm{k}_{i_{1} j_{1}} \\ &= \sum_{i_{1}j_{1}} \left( \sum_{i_{2}j_{2}} \sum_{i_{3}j_{3}} \tilde{\mathbf{A}}[i_{1} j_{1}, i_{2} j_{2}] \mathbf{K}[i_{2}j_{2}, i_{3}j_{3}] \mathbf{V}[i_{3}j_{3}, k] \right) \mathrm{k}_{i_{1} j_{1}} = \sum_{i_{1}j_{1}} [\tilde{\mathbf{A}} \mathbf{K} \mathbf{V}][i_{1}j_{1}, k] \mathrm{k}_{i_{1} j_{1}}, \end{align}\] which leads to \[\begin{align} \label{eq:sq:root:eigen} &0 = \Sigma f_{k} - \lambda_{k} f_{k} = \sum_{ij} [\tilde{\mathbf{A}} \mathbf{K} \mathbf{V} - \mathbf{V} \mathbf{\Lambda}][ij, k] \mathrm{k}_{i_{1} j_{1}} = 0, \quad 1 \le k \le |\mathbf{r}| \nonumber \\ &\Longleftrightarrow \quad [\tilde{\mathbf{A}} \mathbf{K} \mathbf{V} - \mathbf{V} \mathbf{\Lambda}]^{\top} \mathbf{K} [\tilde{\mathbf{A}} \mathbf{K} \mathbf{V} - \mathbf{V} \mathbf{\Lambda}] = \mathbf{0} \nonumber \\ &\Longleftrightarrow \quad [\mathbf{K}^{1/2} \tilde{\mathbf{A}} \mathbf{K}^{1/2}] (\mathbf{K}^{1/2} \mathbf{v}_{k}) = \lambda_{k} (\mathbf{K}^{1/2} \mathbf{v}_{k}), \quad 1 \le k \le |\mathbf{r}|. \end{align}\tag{17}\] Note that whenever \(\mathbf{K}^{1/2} \mathbf{v}_{k} = \mathbf{K}^{1/2} \tilde{\mathbf{v}}_{k}\), we have \[\left\lVert \sum_{ij} (v_{ij, k} - \tilde{v}_{ij, k}) \mathrm{k}_{ij} \right\rVert_{\mathbb{H}}^{2} = (\mathbf{v}_{k} - \tilde{\mathbf{v}}_{k})^{\top} \mathbf{K} (\mathbf{v}_{k} - \tilde{\mathbf{v}}_{k}) = 0,\] which demonstrates that the FPCA does not depend on the specific choice of \(\mathbf{v}_{k}\) satisfying 17 . In this regard, for each \(1 \le k \le |\mathbf{r}|\), and choose any \(\mathbf{v}_{k}\) that satisfies 17 and define \(\tilde{\mathbf{v}}_{k} := \mathbf{v}_{k} - \mathbf{n}_{k}\) where \[\lambda_{k} \mathbf{n}_{k} = (\tilde{\mathbf{A}} \mathbf{K} \mathbf{v}_{k} - \lambda_{k} \mathbf{v}_{k}).\] Then \(\tilde{\mathbf{v}}_{k}\) satisfies \[\tilde{\mathbf{A}} \mathbf{K} \tilde{\mathbf{v}}_{k} - \lambda_{k} \tilde{\mathbf{v}}_{k} = (\tilde{\mathbf{A}} \mathbf{K} \mathbf{v}_{k} - \lambda_{k} \mathbf{v}_{k}) - \lambda_{k} \mathbf{n}_{k} = 0,\] and \(\tilde{\mathbf{V}} = [\mathbf{v}_{1} \vert \dots \vert \mathbf{v}_{|\mathbf{r}|}]\) satisfies \(\tilde{\mathbf{A}} \mathbf{K} \tilde{\mathbf{V}} = \tilde{\mathbf{V}} \mathbf{\Lambda}\), which completes the proof. ◻
The results presented in this paper can be reproduced using our Julia
software package, which is publicly available on Github
. A tutorial in the repository explains
the package’s functionality:
(Data Generation): Enables simulation from Gaussian processes using both built-in and custom covariance functions.
(Multiple Dispatch): Supports multiple data precisions (e.g., Float32
or Float64
) for multi-threading and scalability to larger datasets.
(Estimation Methods): Supports B-splines of arbitrary order and knot configurations, as well as any built-in or user-defined reproducing kernels. We use LSQR and MINRES as their respective default Krylov solvers, but any other
Krylov method implemented in Krylov.jl
is compatible with our package.
All results in this paper were generated using double precision (Float64
).