October 02, 2025
This paper studies the mixed-precision iterative refinement (IR) algorithm for the low-rank continuous-time Lyapunov equation, which has the form \[\label{eq:lyap} AX + XA^T + W =0, \quad A, W\in\mathbb{R}^{n\times n},\tag{1}\] where \(A\) is Hurwitz (asymptotically stable), and \(W\) is positive semidefinite with rank \(m\ll n\), admitting a factorization \[\label{eq:lyap-W-fac} W=LL^T \quad\text{or}\quad W=LSL^T, \quad L\in\mathbb{R}^{n\times m}.\tag{2}\] The coefficient matrix \(A\) being Hurwitz implies that the solution \(X\) is symmetric positive semidefinite and hence the factorization \(X=ZZ^T\) (or \(X=ZYZ^T\)) exists. The class of equations given in 1 plays a crucial role in numerous applications in control theory [1], system balancing [2], [3], and model reduction [4], [5]. The dimension \(n\) of equations arising in practical settings can be of order \(10^5\) or beyond [6]. For such large-scale Lyapunov equations, iterative solvers are typically preferred for computing an approximate solution, since factorization-based methods become unduly expensive in terms of both computational overhead and memory storage.
The critical importance of the low-rank Lyapunov equation in many applications has spurred an extensive literature devoted to the computation of its solution. Taking advantage of the numerically low-rank property of the solution [7], [8], iterative schemes for solving 1 typically iterate on tall-and-skinny (or short-and-fat) low-rank factors of an approximate solution; these mainly include the method based on rational iterations for the matrix sign function [9], the low-rank alternating direction implicit (LR-ADI) method [10], [11], and projection-type methods based on Krylov subspaces [12], [13], [14]; see [15] for a survey. The idea of seeking factored solutions to matrix equations can be traced back to Hammarling [16], who exploited it for solving stable and non-negative definite Lyapunov equations.
Modern hardware increasingly supports native precisions lower than the traditional IEEE binary64 (fp64) and binary32 (fp32) formats [17], and this has fostered the development of mixed-precision algorithms. Utilizing the low precisions appropriately within numerical algorithms can accelerate computation, reduce data storage and communication, and improve energy efficiency on computational units, without sacrificing accuracy or stability. We refer the reader to [18] for a survey of recent developments of mixed-precision algorithms in numerical linear algebra. Theoretically, the half precisions, namely the IEEE binary16 format (fp16) and the bfloat16 format (bf16) developed by Google Brain, offer a \(2\times\) or \(4\times\) speedup over the performance of fp32 or fp64, respectively. The advent of tensor core accelerators on modern GPUs has, however, pushed the limit of the theoretical acceleration of fp16 to \(8\times\) or \(16\times\) faster than fp32 or fp64, respectively [19], and practical performance evaluation has revealed that the use of tensor cores can boost the GEMM (general matrix–matrix multiply) performance by up to \(6\times\) when multiplying large matrices and \(12\times\) when multiplying small matrices in parallel [20]. In particular, using fp16 tensor cores within an fp16–fp64 IR scheme can provide up to a \(4\times\) speedup over the standard LU-based LAPACK routine dgesv for solving systems of linear equations, while delivering full fp64 accuracy [21]. Table [table:fp-precs] reports the key parameters of the four floating-point arithmetics considered in this work.
\begin{table}[t]
\caption{Parameters for bf16, fp16, fp32, and fp64 arithmetic: number of binary digits in the significand (including the implicit bit) $t$ and in the exponent $e$, unit roundoff $u$, smallest positive normalized floating-point number $x_{\min}$, and largest floating-point number $x_{\max}$, all to three significant figures.}
\label{table:fp-precs}
\centering \setlength\tabcolsep{8pt}
\begin{tabular}{c r r l l l}
\toprule
& $t$ & $e$ & \qquad$u$ &
\;\quad$x_{\min}$ & \;\quad$x_{\max}$ \\ \midrule
bf16 & $8$ & $8$ & $3.91\times 10^{-3}$
& $1.18\times 10^{-38}$ & $3.39\times 10^{38}$
\\
fp16 & $11$ & $5$ & $4.88\times 10^{-4}$
& $6.10\times 10^{-5}$ & $6.55\times 10^{4}$
\\
fp32 & $24$ & $8$ & $5.96\times 10^{-8}$
& $1.18\times 10^{-38}$ & $3.40\times 10^{38}$
\\
fp64 & $53$ & $11$ & $1.11\times 10^{-16}$
& $2.22\times 10^{-308}$ & $1.80\times 10^{308}$
\\
\bottomrule
\end{tabular}
\end{table}
Despite the wide use of mixed precision in the numerical linear algebra community, its potential has remained largely unexploited in approximating the solution of matrix equations. Benner et al. [6] developed an algorithm for computing a factored solution of the low-rank Lyapunov equation 1 , in which the idea of IR in fp32 and fp64 was exploited, albeit not fully spelled out; their focus was on the implementation on hybrid CPU–GPU platforms rather than on algorithmic development. More recently, based on a fixed-precision IR scheme for the quasi-triangular Sylvester equation, a mixed-precision Schur-based method was devised for computing the full solution of the Sylvester matrix equation [22].
In this paper, we develop a mixed-precision IR framework for solving low-rank Lyapunov matrix equations. We provide a rounding error analysis of the algorithms to guide the choice of the precisions and algorithmic parameters. We examine the IR framework by using the sign function Newton iteration as the solver. We begin with the mixed-precision IR framework in Section 2, followed by rounding error analyses of both the Cholesky-type and \(LDL^T\)-type formulations. In Section 3 we discuss the use of the sign function Newton iteration within the IR framework. Numerical experiments are presented in Section 4 to verify our analysis and the quality of the solutions computed by the new mixed-precision algorithms. Conclusions are drawn in Section 5.
We use the phrase “precision \(u\)” (perhaps with subscripts) to indicate a floating-point arithmetic with unit roundoff \(u\). Hats denote quantities computed in floating-point arithmetic, and \(\mathop{\mathrm{\operatorname{f\kern.2ptl}}}_r(\cdot)\) denotes the computed result of an arithmetic process performed in precision \(u_r\). Given an integer \(n\), we define \(\gamma_n=nu/(1-nu)\) and \(\tilde{\gamma}_n= cnu / (1-cnu)\), where \(c\) is a small integer constant whose exact value is unimportant. When \(\gamma_n\) or \(\tilde{\gamma}_n\) carries a superscript, that superscript denotes the index of the corresponding \(u\) appearing as a subscript; for example, \(\gamma_n^s=nu_s/(1-nu_s)\). We use MATLAB-style colon notation to denote index ranges; for example, \(A(:, j)\) selects the entire \(j\)-th column, and \(A(i, :)\) selects the entire \(i\)-th row. The spectral radius of a square matrix \(A\) is denoted by \(\rho(A):=\max\bigl\{|\lambda| : \lambda\ \text{is an eigenvalue of}\ A\bigr\}\). The \(\mathop{\mathrm{diag}}(\cdot)\) operator creates a diagonal matrix from its input scalar elements, while \(\mathop{\mathrm{blkdiag}}(\cdot)\) returns a block-diagonal matrix from its input matrices; and \(\norm{\cdot}\) denotes any consistent operator norm.
The low-rank Lyapunov equation 1 can be recast as the \(n^2 \times n^2\) Kronecker linear system \[\label{eq:lyap-ls} M \mathop{\mathrm{vec}}(X) = w,\qquad M := I_n \otimes A + A \otimes I_n,\quad w := \mathop{\mathrm{vec}}(-W),\tag{3}\] where \(I_n\) denotes the identity matrix of order \(n\), and \(\mathop{\mathrm{vec}}\) stacks the columns of an \(m\times n\) matrix into a vector of length \(mn\). In theory, one can apply any linear system solver to the equivalent system 3 for the solution of the Lyapunov equation 1 . This approach should nonetheless be avoided in practice, not only due to its prohibitively expensive storage requirements, but also because it is unclear how the low-rank structure can be exploited.
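Although the Kronecker form is impractical beyond very small \(n\), it is convenient for constructing reference solutions when testing low-rank solvers. The following Python/NumPy sketch (function and variable names are ours) solves a tiny instance of 1 via 3 ; note that the storage is \(O(n^4)\) and the cost of the dense solve is \(O(n^6)\).
\begin{verbatim}
import numpy as np

def lyap_kron(A, W):
    """Solve A X + X A^T + W = 0 via the n^2 x n^2 Kronecker system (3)."""
    n = A.shape[0]
    I = np.eye(n)
    M = np.kron(I, A) + np.kron(A, I)   # M = I (x) A + A (x) I
    w = -W.reshape(-1, order="F")       # w = vec(-W), column-major stacking
    x = np.linalg.solve(M, w)
    return x.reshape((n, n), order="F")

rng = np.random.default_rng(0)
n, m = 6, 2
A = -np.eye(n) + 0.1 * rng.standard_normal((n, n))  # heuristically Hurwitz
L = rng.standard_normal((n, m))
X = lyap_kron(A, L @ L.T)
print(np.linalg.norm(A @ X + X @ A.T + L @ L.T))    # ~ machine precision
\end{verbatim}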
The authors of [6] design an IR scheme for the factored solution of the low-rank Lyapunov equation 1 , where the solver step is carried out in fp32 and the other steps are performed in the usual fp64 environment. The major difference between this IR scheme and that for the linear system lies in the residual computation and solution update steps (see Line [alg46line46ir3-lyap46resfac46chol] and Line [alg46line46ir3-lyap46solupt46chol] of Algorithm [alg46ir3-lyap-chol] below): the former might involve more complex computational kernels, such as the QR factorization and spectral decomposition.
Fragment [alg46res-fac-chol] (Cholesky-type residual factorization):
1. Form \(F_i = \begin{bmatrix} Z_i & A Z_i & L \end{bmatrix}\).
2. Compute a thin QR decomposition \(F_i = U_i T_i\).
3. Form \(H_i=T_i P_i T_i^T\) and compute a spectral decomposition \(H_i = Q_i\Lambda_i Q_i^T\).
4. \(\Lambda_{i,t}^{+}= \mathop{\mathrm{diag}}(\lambda_j)\), \(j\in J^+:= \{j\;\vert\; \lambda_j \ge \eta_r\}\), \(Q_{i,t}^+= Q_i(\colon, J^+)\).
5. \(\Lambda_{i,t}^-= \mathop{\mathrm{diag}}(\lambda_j)\), \(j\in J^-:= \{j\;\vert\; \lambda_j \le -\eta_r\}\), \(Q_{i,t}^-= Q_i(\colon, J^-)\).
6. \(L_i^+ = U_i Q_{i,t}^+ (\Lambda_{i,t}^+)^{1/2}\), \(L_i^- = U_i Q_{i,t}^- (-\Lambda_{i,t}^-)^{1/2}\).
The residual of an approximate Cholesky-type factor \(Z_i\) of the solution to 1 has the form \[\label{eq:lyap-res-chol} \mathcal{R}(Z_i) := AZ_iZ_i^T + Z_iZ_i^TA^T + LL^T =: \mathcal{L}(Z_iZ_i^T) + LL^T,\tag{4}\] where \[\mathcal{L}\colon \mathbb{R}^{n\times n}\to \mathbb{R}^{n\times n},\quad \mathcal{L}(X) = AX+XA^T,\] is the Lyapunov operator, whose condition number is defined as \(\kappa_{F}(\mathcal{L}) = \norm{\mathcal{L}}_F\norm{\mathcal{L}^{-1}}_F\). In practice, the residual \(\mathcal{R}(Z_i)\) is typically indefinite; this indefiniteness is not intrinsic but stems from the inexactness of the solver and from rounding errors. The Cholesky-type solver of [6] requires that the constant matrix be positive semidefinite, so the (symmetric) positive semidefinite (PSD) and negative semidefinite (NSD) parts of the residual are extracted as \[\label{eq:res-fac} \mathcal{R}(Z_i)= L^+_i ({L^+_i})^T - L^-_i (L^-_i)^T =: \mathcal{R}^+(Z_i) - \mathcal{R}^-(Z_i),\tag{5}\] and the solution updates can be obtained from the two correction equations \[\tag{6} \begin{align} AX_i^+ + X_i^+A^T + \mathcal{R}^+(Z_i) &= 0, \quad X_i^+ = Z_i^+ (Z_i^+)^T, \tag{7}\\ AX_i^- + X_i^-A^T + \mathcal{R}^-(Z_i) &= 0, \quad X_i^- = Z_i^- (Z_i^-)^T. \tag{8} \end{align}\] Since the indefiniteness of \(\mathcal{R}(Z_i)\) originates from approximation and rounding errors, \(\norm{\mathcal{R}^-(Z_i)}\) is in general expected to be much smaller than \(\norm{\mathcal{R}^+(Z_i)}\), especially as the refinement proceeds. The residual 4 can be rewritten as the product \[\mathcal{R}(Z_i) = F_i P_i F_i^T, \quad F_i:= \begin{bmatrix} Z_i & AZ_i & L \end{bmatrix}, \quad P_i := \begin{bmatrix} 0 & I_{c_i} & 0 \\ I_{c_i} & 0 & 0 \\ 0 & 0 & I_{m} \end{bmatrix},\] where \(c_i\) is the smaller dimension of \(Z_i\), and so \(P_i\) is of size \((2c_i+m) \times (2c_i+m)\). The residual factorization 5 can then be performed via a QR factorization \(F_i=U_iT_i\) without explicitly forming the residual, followed by a spectral decomposition of the small kernel matrix \(H_i:=T_iP_iT_i^T\) such that \(H_i= Q_i\Lambda_i Q_i^T\). Then \(L_i^+ = U_i Q_i^+ (\Lambda_i^+)^{1/2}\) and \(L_i^- = U_iQ_i^- (-\Lambda_i^-)^{1/2}\), where \(\Lambda_i^+=\mathop{\mathrm{diag}}(\lambda_j)\), \(j\in J^+:=\{j\;\vert\; \lambda_j>0\}\) and \(\Lambda_i^-=\mathop{\mathrm{diag}}(\lambda_j)\), \(j\in J^-:=\{j\;\vert\; \lambda_j\le0\}\); and \(Q_i^+\) and \(Q_i^-\) contain the corresponding eigenvectors.
In practice, it is necessary to impose a rank truncation in the residual factorization, so eigenvalues of magnitude smaller than a certain threshold \(\eta_r>0\) are dropped. This enhances the robustness of the algorithm in the presence of rounding errors accumulated in the iterations and factorizations, and it also reduces the algorithmic cost by potentially removing redundant dimensions in the iterates \(Z_i^+\) and \(Z_i^-\). For example, the authors of [6] observed that \(\eta_r=10^{-4}\) generally works well for their algorithm. The overall residual factorization scheme is presented as Fragment [alg46res-fac-chol], where we use a double subscript to denote the truncated eigenvector and eigenvalue matrices.
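For concreteness, the following Python/NumPy sketch (function names are ours, and everything runs in a single precision, whereas the fragment is executed at precision \(u_r\) in the algorithm) implements Fragment [alg46res-fac-chol]; the returned eigenvalues also give \(\norm{\mathcal{R}(Z_i)}_F\), which is used in the stopping test of the IR framework below.
\begin{verbatim}
import numpy as np

def residual_factors_chol(A, Z, L, eta_r=1e-4):
    """Split R(Z) = A Z Z^T + Z Z^T A^T + L L^T into truncated PSD/NSD
    factors, cf. (5), without forming any n x n matrix.  Also returns the
    eigenvalues lam of the kernel matrix, so ||R(Z)||_F = ||lam||_2."""
    c, m = Z.shape[1], L.shape[1]
    F = np.hstack([Z, A @ Z, L])                 # F_i = [Z_i, A Z_i, L]
    U, T = np.linalg.qr(F, mode="reduced")       # thin QR, F_i = U_i T_i
    P = np.zeros((2 * c + m, 2 * c + m))         # permutation P_i of (4)-(5)
    P[:c, c:2 * c] = np.eye(c)
    P[c:2 * c, :c] = np.eye(c)
    P[2 * c:, 2 * c:] = np.eye(m)
    H = T @ P @ T.T                              # small kernel matrix H_i
    lam, Q = np.linalg.eigh((H + H.T) / 2)       # spectral decomposition
    Jp, Jm = lam >= eta_r, lam <= -eta_r         # rank truncation at eta_r
    Lp = U @ Q[:, Jp] * np.sqrt(lam[Jp])         # L_i^+ = U Q^+ (Lam^+)^{1/2}
    Lm = U @ Q[:, Jm] * np.sqrt(-lam[Jm])        # L_i^- = U Q^- (-Lam^-)^{1/2}
    return Lp, Lm, lam
\end{verbatim}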
Fragment [alg46sol-update-chol] (Cholesky-type solution update with projection):
1. Form \(G_i = \begin{bmatrix} Z_i & Z_i^+ & Z_i^- \end{bmatrix}\).
2. Compute a thin QR decomposition \(G_i = V_i\Gamma_i\).
3. Form \(K_i=\Gamma_i J_i \Gamma_i^T\) and compute a spectral decomposition \(K_i = \Theta_i \Sigma_i \Theta_i^T\).
4. \(\Sigma_{i,t}^+= \mathop{\mathrm{diag}}(\sigma_j)\), \(j\in J^+:= \{j\;\vert\; \sigma_j\ge \eta_s\}\), \(\Theta_{i,t}^+= \Theta_i(\colon, J^+)\).
5. \(Z_{i+1} = V_i \Theta_{i,t}^+ (\Sigma_{i,t}^+)^{1/2}\).
Mathematically, once the correction equations 6 have been solved, the full solution update takes the form \(X^{bp}_{i+1} := X_{i} + X_i^+ - X_i^-\), where both \(X_i^+\) and \(X_i^-\) are symmetric positive semidefinite. In our case, however, the sought-after solution \(X\) is positive semidefinite, so the updated approximant \(X_{i+1}\) has to be taken as the projection of \(X^{bp}_{i+1}\) onto the convex cone of PSD matrices, in order to guarantee the convergence of the sequence \(\{X_i\}\) of PSD matrices towards \(X\) in the presence of approximation errors from the solver and rounding errors in the floating-point computations. This projection can be done in a manner similar to the implicit splitting of the residual into PSD and NSD factors; there is no need to explicitly form the \(n\times n\) matrices when updating the iterated factor \(Z_i\). Define \(X_{i+1} := X_{i} + X_i^+ - X_i^- - \Delta X_{i+1}\), where \(\Delta X_{i+1} = X^{bp}_{i+1} -X_{i+1}\) represents the negative semidefinite perturbation made in the projection. The condition \(\norm{ \Delta X_{i+1}} \ll \norm{X_{i+1}^{bp}}\) holds provided that the solver for the correction equations 6 is relatively stable and the floating-point arithmetic used is accurate enough.
Suppose the smaller dimensions of the solution factor increments \(Z_i^+\) and \(Z_i^-\) are \(c_i^+\) and \(c_i^-\), respectively. Writing \[G_i := \begin{bmatrix} Z_i & Z_i^+ & Z_i^- \end{bmatrix}, \quad J_i := \mathop{\mathrm{blkdiag}}(I_{c_i}, I_{c_i^+}, -I_{c_i^-}),\] the projected solution update is performed by a QR factorization \(G_i=V_i\Gamma_i\), followed by a spectral decomposition of \(K_i :=\Gamma_iJ_i\Gamma_i^T\) such that \(K_i = \Theta_i\Sigma_i \Theta_i^T\). The factor of the updated approximant is taken to be \(Z_{i+1} = V_i \Theta_i^+ (\Sigma_i^+)^{1/2}\), where \(\Sigma_i^+=\mathop{\mathrm{diag}}(\sigma_j)\), \(j\in J^+:=\{j\;\vert\; \sigma_j > 0\}\) and \(\Theta_i^+\) collects the corresponding eigenvectors.
As in the case of residual factorization, applying rank truncation is also beneficial when updating the solution factor with projection—eigenvalues of small magnitude are truncated such that \(\Sigma_{i,t}^+=\mathop{\mathrm{diag}}(\sigma_j)\), \(j\in\{j\;\vert\; \sigma_j \ge \eta_s\}\) for some tolerance \(\eta_s>0\), and the corresponding eigenvectors are kept. The solution update scheme is presented in Fragment [alg46sol-update-chol].
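A matching sketch of Fragment [alg46sol-update-chol] (again with our names and a single precision; the algorithm runs this step at precision \(u_c\)) reads:
\begin{verbatim}
import numpy as np

def update_solution_chol(Z, Zp, Zm, eta_s=1e-12):
    """Projected solution update: return a factor Z1 with Z1 Z1^T equal to
    the PSD part of Z Z^T + Zp Zp^T - Zm Zm^T, truncated at eta_s."""
    c, cp, cm = Z.shape[1], Zp.shape[1], Zm.shape[1]
    G = np.hstack([Z, Zp, Zm])                   # G_i = [Z_i, Z_i^+, Z_i^-]
    V, Gam = np.linalg.qr(G, mode="reduced")     # thin QR, G_i = V_i Gamma_i
    J = np.diag(np.concatenate([np.ones(c + cp), -np.ones(cm)]))
    K = Gam @ J @ Gam.T                          # K_i = Gamma_i J_i Gamma_i^T
    sig, Th = np.linalg.eigh((K + K.T) / 2)      # spectral decomposition
    keep = sig >= eta_s                          # projection + truncation
    return V @ Th[:, keep] * np.sqrt(sig[keep])  # Z_{i+1}
\end{verbatim}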
Algorithm [alg46ir3-lyap-chol] (mixed-precision IR for the Cholesky-type factored solution):
1. Solve \(AX+XA^T + LL^T=0\) at precision \(u_s\) and store the solution factor \(Z_1\) at precision \(u\).
2. For \(i=1,2,\dots\) until converged:
   (a) compute the factors \(L_i^+\) and \(L_i^-\) of the residual \(\mathcal{R}(Z_i)\) via Fragment [alg46res-fac-chol] at precision \(u_r\);
   (b) solve the correction equations 6 at precision \(u_s\) for the factors \(Z_i^+\) and \(Z_i^-\);
   (c) update the solution factor to \(Z_{i+1}\) via Fragment [alg46sol-update-chol] at precision \(u_c\).
3. Return \(Z = Z_{i+1}\).
The mixed-precision IR framework, presented as Algorithm [alg46ir3-lyap-chol], is essentially a projected fixed-point iteration for solving the low-rank Lyapunov equation 1 . The stopping criterion depends on the size of the residual, which we gauge by \(\norm{\mathcal{R}(Z_{i})}_F\); note that this is readily available from the eigenvalues calculated in Fragment [alg46res-fac-chol]. The precisions are parameterized by the working precision \(u\), the solver precision \(u_s\ge u\), the residual factorization precision \(u_r\le u\), and the solution update and projection precision \(u_c\le u\). This basic precision setting largely aligns with the choice of precisions in the IR framework [23] for the solution of the linear system \(Ax=b\).
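Putting the pieces together, a single-precision sketch of the loop (names are ours; in Algorithm [alg46ir3-lyap-chol] the three phases run at precisions \(u_r\), \(u_s\), and \(u_c\), with data held at precision \(u\)) might look as follows, where solver(A, L) is any black-box low-rank Lyapunov solver returning a factor \(Z\) with \(AZZ^T + ZZ^TA^T + LL^T\approx 0\):
\begin{verbatim}
import numpy as np

def ir_lyap_chol(A, L, solver, tol=1e-12, max_iter=20,
                 eta_r=1e-4, eta_s=1e-12):
    """Sketch of the Cholesky-type IR framework, uniform precision."""
    Z = solver(A, L)                                  # initial solve
    for i in range(max_iter):
        Lp, Lm, lam = residual_factors_chol(A, Z, L, eta_r)
        if np.linalg.norm(lam) <= tol:                # ||R(Z_i)||_F
            break
        Zp = solver(A, Lp)                            # correction eq. (7)
        Zm = solver(A, Lm) if Lm.shape[1] else Lm     # correction eq. (8)
        Z = update_solution_chol(Z, Zp, Zm, eta_s)    # projected update
    return Z
\end{verbatim}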
We carry out a first-order rounding error analysis of Algorithm [alg46ir3-lyap-chol] in this section. The accuracy is measured with respect to the full solution approximants, which are not formed by the algorithm. Therefore, we assume approximants to the full solution or its increments are computed exactly from the approximated factors, namely, \[\widehat{X}_i = \widehat{Z}_i \widehat{Z}_i^T, \quad \widehat{X}_i^+ = \widehat{Z}_i^+ (\widehat{Z}_i^+)^T, \quad \widehat{X}_i^- = \widehat{Z}_i^- (\widehat{Z}_i^-)^T.\] Similarly, since the residual 4 or its splitting 5 are never explicitly computed, we assume \[\widehat{\mathcal{R}}^+(\widehat{Z}_i) = \widehat{L}^+_i (\widehat{L}^+_i)^T, \quad \widehat{\mathcal{R}}^-(\widehat{Z}_i) = \widehat{L}^-_i (\widehat{L}^-_i)^T.\] Finally, we assume that the solver used on Line [alg46line46ir3-lyap46solver46chol] of Algorithm [alg46ir3-lyap-chol] produces computed solution factors \(\widehat{Z}_i^+\) and \(\widehat{Z}_i^-\) to \(AX_i^+ + X_i^+A^T + \widehat{L}_{i}^+(\widehat{L}_{i}^+)^T=0\) and \(AX_i^- + X_i^- A^T + \widehat{L}_{i}^-(\widehat{L}_{i}^-)^T=0\), respectively, such that \[\label{eq:assump-chol} \norm{ \mathcal{L}(\widehat{X}_i^+ - \widehat{X}_i^-) + \widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i) }_F \le u_s \big( d_1 \norm{\mathcal{L}}_F \norm{\widehat{X}_i^+ - \widehat{X}_i^-}_F + d_2 \norm{\widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)}_F \big),\tag{9}\] where the two constants \(d_1, d_2>0\) depend on \(A\), \(\widehat{\mathcal{R}}^+(\widehat{Z}_i)\), \(\widehat{\mathcal{R}}^-(\widehat{Z}_i)\), the dimension \(n\), as well as the solver precision \(u_s\). The assumption implies that the normwise relative residual of the computed solution to \(\mathcal{L}(X_i^+ -X_i^-) + \widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)=0\) is of order at most \(\max(d_1,d_2)u_s\). It is reasonable to consider the rounding errors in solving the two correction equations simultaneously, as they are best solved together because they share the same coefficient matrix \(A\) (see Section 3.3 below).
We start with analysing the rounding errors of Fragment [alg46res-fac-chol]. The computed tall-and-skinny factor \(\widehat{F}_i\) satisfies \[\label{eq:Fihat-def} \widehat{F}_i = \begin{bmatrix} \widehat{Z}_i & A\widehat{Z}_i + \Delta F_i^{(2)} & L \end{bmatrix}, \quad |\Delta F_i^{(2)}|\le \gamma^r_n |A||\widehat{Z}_i|.\tag{10}\] Let \(m_i:= 3\max\{m, c_i\}\), which satisfies \(m_i\ll n\). If the subsequent thin QR factorization of \(\widehat{F}_i\) is computed by the Householder QR algorithm, we have [24] \[\label{eq:Fihat-qr-bound} \widehat{U}_i \widehat{T}_i = \widehat{F}_i + \Delta \widetilde{F}_i, \quad \norm{\Delta \widetilde{F}_i}_F \le \sqrt{m_i}\tilde{\gamma}^r_{m_in} \norm{\widehat{F}_i}_F,\tag{11}\] where \[\label{eq:Uhat-def} \widehat{U}_i = U_i +\Delta U_i, \quad \norm{\Delta U_i}_F \le\sqrt{m_i}\tilde{\gamma}^r_{m_in} \le m_i^{3/2}\tilde{\gamma}^r_{n},\tag{12}\] and we customarily assume \(\sqrt{m_i}\tilde{\gamma}^r_{m_in}<1\). Suppose \[\label{eq:assump-c1} \norm{|A||\widehat{Z}_i|}_F = b_1\norm{F_i}_F,\tag{13}\] where the constant \(b_1\equiv b_1(n,i,A,L)\) depends on the coefficient matrices in the Lyapunov equation as well as the dimension \(n\) and iteration index \(i\). The constant essentially characterizes the stability of the matrix multiplication \(A\widehat{Z}_i\) with respect to potential numerical cancellation; for example, \(b_1=O(1)\) if \(\norm{|A||\widehat{Z}_i|}_F \approx \norm{A\widehat{Z}_i}_F\).
Writing \(\widehat{F}_i + \Delta \widetilde{F}_i=: F_i + \Delta F_i\) and combining 10 –11 and 13 , we obtain the first-order rounding error bound \[\widehat{U}_i \widehat{T}_i = F_i + \Delta F_i, \quad \norm{\Delta F_i}_F \le \big(b_1 + m_i^{3/2}\big) \tilde{\gamma}^r_{n} \norm{F_i}_F.\] Pre-multiplying both sides of the equality by \(U_i^T\) and using the column orthogonality of \(U_i\) gives \[\label{eq:That} (I + U_i^T\Delta U_i) \widehat{T}_i = T_i + U_i^T\Delta F_i.\tag{14}\] Since \(U_iU_i^T\) is an orthogonal projector, we have \[\begin{align} \norm{U_i^T\Delta U_i}_F^2 = & \mathop{\mathrm{tr}}\left((U_i^T\Delta U_i)^T(U_i^T\Delta U_i)\right) = \mathop{\mathrm{tr}}(\Delta U_i\Delta U_i^TU_i U_i^T) \\ \le & \mathop{\mathrm{tr}}(\Delta U_i\Delta U_i^T) = \norm{\Delta U_i}_F^2<1. \end{align}\] Define \(\Delta T_i:= \widehat{T}_i - T_i\) and assume \(\rho(U_i^T\Delta U_i)<1\). Pre-multiplying both sides of 14 by \((I + U_i^T\Delta U_i)^{-1}\) and using the Neumann series expansion \[(I + U_i^T\Delta U_i)^{-1} = I - U_i^T\Delta U_i + (U_i^T\Delta U_i)^2 - (U_i^T\Delta U_i)^3 + \cdots,\] we get \(\Delta T_i = U_i^T (\Delta F_i - \Delta U_iT_i) + O(u_r^2)\) and hence \[\begin{align} \label{eq:errbnd-deltaT} \norm{\Delta T_i}_F \lesssim & \norm{\Delta F_i - \Delta U_iT_i}_F\le \norm{\Delta F_i}_F + \norm{\Delta U_i}_F\norm{T_i}_F \nonumber \\ \le & \big(b_1 + 2m_i^{3/2}\big) \tilde{\gamma}^r_{n}\norm{T_i}_F. \end{align}\tag{15}\]
For Line [alg46res-fac-chol-line46spec46dec] of Fragment [alg46res-fac-chol], define \(\widehat{H}_i:= \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_r((\widehat{T}_iP_i)\widehat{T}_i^T)\), where the product \(\widehat{T}_iP_i\) is computed exactly as \(P_i\) is a permutation matrix. Therefore, we have \[\widehat{H}_i = \widehat{T}_iP_i\widehat{T}_i^T + \Delta \widetilde{H}_i, \quad |\Delta \widetilde{H}_i|\le \tilde{\gamma}^r_{m_i} |\widehat{T}_iP_i||\widehat{T}_i|^T,\] and then \[\label{eq:deltaH-def} \Delta H_i := \widehat{H}_i - H_i = T_iP_i\Delta T_i^T + \Delta T_iP_iT_i^T + \Delta T_iP_i\Delta T_i^T + \Delta \widetilde{H}_i.\tag{16}\] Using 15 , one can get the first-order rounding error bound \[\norm{\Delta H_i}_F \le \big( (2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + \tilde{\gamma}^r_{m_i} \big)\norm{T_i}_F^2.\] Suppose \[\label{eq:assump-c2} \norm{T_i T_i^T}_F = b_2\norm{T_iP_iT_i^T}_F,\tag{17}\] where the constant \(b_2\equiv b_2(m,i,A,L)\) is large if there is significant cancellation in forming the product \(H_i = T_iP_iT_i^T\); note the usual bound is \(\norm{T_iP_iT_i^T}_F \le \norm{T_i}_F^2\). Let \(\lambda_j\) denote an eigenvalue of \(H_i\), and let \(\widetilde{\lambda}_j\) denote an exact eigenvalue of \(\widehat{H}_i\). Then using Weyl’s inequality, we have \[\label{eq:weyl-ineq-conseq} |\widetilde{\lambda}_j - {\lambda}_j|\le \norm{\Delta H_i}_F \le \big( b_2(2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + b_2\tilde{\gamma}^r_{m_i} \big)\norm{H_i}_F.\tag{18}\]
The spectral decomposition on Line [alg46res-fac-chol-line46spec46dec] of Fragment [alg46res-fac-chol] is usually computed by the symmetric QR algorithm or the divide-and-conquer algorithm after a tridiagonalization [25], [26], [27], and the computed eigensystem of \(\widehat{H}_i\) satisfies \[\label{eq:Hhat-eig-bnd} \widehat{Q}_i^T (\widehat{H}_i + \Delta \widehat{H}_i)\widehat{Q}_i = \widehat{\Lambda}_i, \quad \|\Delta \widehat{H}_i\|_{2} \le \tilde{\gamma}^r_{m_i}\|\widehat{H}_i\|_{2},\tag{19}\] where \(\widehat{\Lambda}_i=\mathop{\mathrm{diag}}(\widehat{\lambda}_j)\) contains the computed eigenvalues of \(\widehat{H}_i\) and \(\widehat{Q}_i\) is (numerically) orthogonal in precision \(u_r\). The expression 19 means that the eigensystem of \(\widehat{H}_i\) is computed backward stably. It follows from 18 and [26] that, for any computed eigenvalue \(\widehat{\lambda}_j\) of \(\widehat{H}_i\), \[\begin{gather} \label{eq:H-respec-eig-bnd} |\widehat{\lambda}_j - \lambda_j| \le |\widehat{\lambda}_j - \widetilde{\lambda}_j| + |\widetilde{\lambda}_j - \lambda_j| \lesssim \tilde{\gamma}^r_{m_i} \|H_i+\Delta H_i\|_{2} + \norm{\Delta H_i}_F \\ \lesssim \big( b_2(2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + b_2\tilde{\gamma}^r_{m_i} \big)\norm{H_i}_F \le \big( b_2(2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + b_2\tilde{\gamma}^r_{m_i} \big)\textstyle \sum_j |\lambda_j|. \end{gather}\tag{20}\] The bound 20 implies that the eigenvalues of large magnitude of the residual \(\mathcal{R}(\widehat{Z}_i)\) will be computed to high relative accuracy, if the constants \(b_1\) and \(b_2\) are of moderate size. But, in any case, there is no guarantee of relative accuracy for computed eigenvalues of small magnitude.
Recall that the exact residual of \(\widehat{Z}_i\) satisfies \[\label{eq:res-Zi-def} \mathcal{R}(\widehat{Z}_i) = L^+_i ({L^+_i})^T - L^-_i (L^-_i)^T = U_iH_iU_i^T,\tag{21}\] where the residual and the products \(L^+_i ({L^+_i})^T\) and \(L^-_i (L^-_i)^T\) are not explicitly formed in Algorithm [alg46ir3-lyap-chol]. Therefore, to quantify the rounding errors in the computed residual \(\widehat{\mathcal{R}}(\widehat{Z}_i)\), we only need to consider the inexact computation of the factors \(L^+_i\) and \(L^-_i\) and its effect on the residual, and so we can write \[\begin{align} \label{eq:res-comput-Zi-def} \widehat{\mathcal{R}}(\widehat{Z}_i) & = \widehat{L}^+_i ({\widehat{L}^+_i})^T - \widehat{L}^-_i (\widehat{L}^-_i)^T + \Delta R_i^s, \nonumber \\ & = \widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i) + \Delta R_i^s, \quad \norm{\Delta R_i^s}_F \le 2u_s \norm{\mathcal{\widehat{R}}(\widehat{Z}_i)}_F, \end{align}\tag{22}\] where the last term \(\Delta R_i^s\) accounts for the error caused by rounding \(\widehat{L}^+_i\) and \(\widehat{L}^-_i\) down to precision \(u_s\). The precision conversion incurs componentwise relative perturbations of size at most \(u_s\) in these factors, and, since \(\norm{\widehat{\mathcal{R}}^-(\widehat{Z}_i)} \ll \norm{\widehat{\mathcal{R}}^+(\widehat{Z}_i)}\), the bound holds to first order.
On completion of Fragment [alg46res-fac-chol], we have \[\widehat{L}^+_i = \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_r \big( \widehat{U}_i \widehat{Q}_{i,t}^+ (\widehat{\Lambda}_{i,t}^+)^{1/2} \big), \quad \widehat{L}^-_i = \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_r \big( \widehat{U}_i \widehat{Q}_{i,t}^- (-\widehat{\Lambda}_{i,t}^-)^{1/2} \big),\] where \(\widehat{Q}_{i,t}^\pm\) and \(\widehat{\Lambda}_{i,t}^\pm\) contain the eigenvectors and eigenvalues, respectively, after the rank truncation to the computed full eigensystem \(\widehat{Q}_i \widehat{\Lambda}_i \widehat{Q}_i^T\), where \[\widehat{Q}_i := \begin{bmatrix} \widehat{Q}_{i,t}^+ & \widehat{Q}_{i,t}^- & \widehat{Q}_{i,0}^+ & \widehat{Q}_{i,0}^- \end{bmatrix},\quad \widehat{\Lambda}_i := \mathop{\mathrm{blkdiag}}(\widehat{\Lambda}_{i,t}^+, \widehat{\Lambda}_{i,t}^-, \widehat{\Lambda}_{i,0}^+, \widehat{\Lambda}_{i,0}^-).\] Using the standard model of floating-point arithmetic [24], it is straightforward to show \[\widehat{L}^+_i =: \widehat{U}_i \widehat{Q}_{i,t}^+ (\widehat{\Lambda}_{i,t}^+)^{1/2} + \Delta \widehat{L}^+_i, \quad |\Delta \widehat{L}^+_i| \le (u_r+2\gamma^r_{m_i}) |\widehat{U}_i| |\widehat{Q}_{i,t}^+| |(\widehat{\Lambda}_{i,t}^+)^{1/2}|\] and then \[\widehat{L}^+_i(\widehat{L}^+_i)^T =: \widehat{U}_i \widehat{Q}_{i,t}^+ \widehat{\Lambda}_{i,t}^+ (\widehat{Q}_{i,t}^+)^T \widehat{U}_i^T + \Delta R^+_i,\] with \(|\Delta R^+_i| \le (2u_r+4\gamma^r_{m_i}) |\widehat{U}_i| |\widehat{Q}_{i,t}^+| |\widehat{\Lambda}_{i,t}^+| |(\widehat{Q}_{i,t}^+)^T| |\widehat{U}_i^T|\), where the bound holds to the first order of the unit roundoff. Similarly, one can show \[\widehat{L}^-_i =: \widehat{U}_i \widehat{Q}_{i,t}^- (-\widehat{\Lambda}_{i,t}^-)^{1/2} + \Delta \widehat{L}^-_i, \quad |\Delta \widehat{L}^-_i| \le (u_r+2\gamma^r_{m_i}) |\widehat{U}_i| |\widehat{Q}_{i,t}^-| |(-\widehat{\Lambda}_{i,t}^-)^{1/2}|\] and \[\widehat{L}^-_i(\widehat{L}^-_i)^T =: -\widehat{U}_i \widehat{Q}_{i,t}^- \widehat{\Lambda}_{i,t}^- (\widehat{Q}_{i,t}^-)^T \widehat{U}_i^T + \Delta R^-_i,\] with \(|\Delta R^-_i| \le (2u_r+4\gamma^r_{m_i}) |\widehat{U}_{i}| |\widehat{Q}_{i,t}^-| |\widehat{\Lambda}_{i,t}^-| |(\widehat{Q}_{i,t}^-)^T| |\widehat{U}_i^T|\). Hence, by 19 the computed residual 22 can be rewritten as \[\begin{gather} \widehat{\mathcal{R}}(\widehat{Z}_i) = \widehat{U}_i (\widehat{H}_i + \Delta \widehat{H}_i)\widehat{U}_i^T - \widehat{U}_i \big(\widehat{Q}_{i,0}^+ \widehat{\Lambda}_{i,0}^+ (\widehat{Q}_{i,0}^+)^T + \widehat{Q}_{i,0}^- \widehat{\Lambda}_{i,0}^- (\widehat{Q}_{i,0}^-)^T \big) \widehat{U}_i^T + \Delta R_i + \Delta R_i^s, \\ \max(\|\widehat{\Lambda}_{i,0}^+\|_{2}, \|\widehat{\Lambda}_{i,0}^-\|_{2}) \le \eta_r \|\widehat{\Lambda}_{i}\|_{2}, \quad |\Delta R_i| \le (2u_r+4\gamma^r_{m_i}) |\widehat{U}_i| |\widehat{Q}_i| |\widehat{\Lambda}_i| |\widehat{Q}_i^T| |\widehat{U}_i^T|. \end{gather}\] Define \(\widehat{H}_{i}^0 := \widehat{Q}_{i,0}^+ \widehat{\Lambda}_{i,0}^+ (\widehat{Q}_{i,0}^+)^T + \widehat{Q}_{i,0}^- \widehat{\Lambda}_{i,0}^- (\widehat{Q}_{i,0}^-)^T\); since both \(\widehat{Q}_{i,0}^+\) and \(\widehat{Q}_{i,0}^-\) have numerically orthonormal columns, we have \(\|\widehat{H}_{i}^0\|_{2}\le 2\eta_r \|\widehat{\Lambda}_i\|_{2} = 2\eta_r \|\widehat{H}_i + \Delta \widehat{H}_i\|_{2}\).
Hence, using 12 , 16 , and 21 , we obtain the first-order equality \[\begin{gather} \label{eq:res-delta-comput-Zi-def} \Delta \mathcal{R}(\widehat{Z}_i) := \widehat{\mathcal{R}}(\widehat{Z}_i) - \mathcal{R}(\widehat{Z}_i) \\ = U_i(H_i-\widehat{H}_{i}^0) \Delta U_i^T + U_i(\Delta H_i + \Delta \widehat{H}_i -\widehat{H}_{i}^0)U_i^T + \Delta U_i (H_i-\widehat{H}_{i}^0)U_i^T +\Delta R_i + \Delta R_i^s. \end{gather}\tag{23}\] Considering the rank truncation tolerance \(\eta_r\ll 1\), a first-order normwise bound follows immediately by using 12 and 18 –19 , and we have \[\label{eq:res-comput-Zi-bnd} \norm{\Delta \mathcal{R}(\widehat{Z}_i)}_F \lesssim \left( 2\sqrt{m_i}\eta_r + 2u_s + \big( 2b_1b_2 + (4b_2 +2)m_i^{3/2} \big) \tilde{\gamma}^r_n \right)\norm{\mathcal{R}(\widehat{Z}_i)}_F.\tag{24}\] The bound 24 clearly shows the inaccuracy in the computed residual \(\widehat{\mathcal{R}}(\widehat{Z}_i)\) will be dominated by the rank truncation error, if a tolerance \(\eta_r\gg \max(nu_r, u_s)\) is chosen and the constants \(b_1\) and \(b_2\) in 13 and 17 are of moderate size.
The analysis of Fragment [alg46sol-update-chol] is similar to the discussion above. The thin QR factorization of \(\widehat{G}_i := \begin{bmatrix} \widehat{Z}_i & \widehat{Z}_i^+ & \widehat{Z}_i^- \end{bmatrix}\) satisfies [24] \[\label{eq:qr-Gi-hat} \widehat{V}_i \widehat{\Gamma}_i = \widehat{G}_i + \Delta G_i, \quad \norm{\Delta G_i}_F \le \sqrt{p_i}\tilde{\gamma}^c_{p_in} \norm{G_i}_F, \quad \sqrt{p_i}\tilde{\gamma}^c_{p_in}<1,\tag{25}\] where \(p_i:= 3\max\{c_i, c_i^+, c_i^-\} \ll n\) and \[\label{eq:Vhat-def} \widehat{V}_i = V_i +\Delta V_i, \quad \norm{\Delta V_i}_F \le\sqrt{p_i}\tilde{\gamma}^c_{p_in} \le p_i^{3/2}\tilde{\gamma}^c_{n}.\tag{26}\] Define \(\widehat{K}_i:= \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_c(\widehat{\Gamma}_i J_i \widehat{\Gamma}_i^T)\). One can then show, similarly to the derivation of 14 –16 , \[\begin{align} \label{eq:deltaK-def} \Delta K_i := \widehat{K}_i - K_i, \quad \norm{\Delta K_i}_F \le b_3( 4p_i^{3/2}\tilde{\gamma}^c_{n} + \tilde{\gamma}^c_{p_i} )\norm{K_i}_F, \end{align}\tag{27}\] where the constant \(b_3\equiv b_3(m,i,A,L)\) is such that \[\label{eq:assump-c3} \norm{\Gamma_i \Gamma_i^T}_F = b_3\norm{\Gamma_iJ_i\Gamma_i^T}_F.\tag{28}\] For the spectral decomposition \(K_i=\Theta_i\Sigma_i \Theta_i^T\) at Line [alg46sol-fac-chol-line46spec46dec] of Fragment [alg46sol-update-chol], the computed eigensystem of \(\widehat{K}_i\) satisfies [25], [26], [27] \[\label{eq:Khat-eig-bnd} \widehat{\Theta}_i^T (\widehat{K}_i + \Delta \widehat{K}_i)\widehat{\Theta}_i = \widehat{\Sigma}_i, \quad \|\Delta \widehat{K}_i\|_{2} \le \tilde{\gamma}^c_{p_i}\|\widehat{K}_i\|_{2},\tag{29}\] where \(\widehat{\Sigma}_i\) contains the computed eigenvalues of \(\widehat{K}_i\) and \(\widehat{\Theta}_i\) is (numerically) orthogonal in precision \(u_c\). Overall, we have, to the first order, \[\widehat{Z}_{i+1} =: \widehat{V}_i \widehat{\Theta}_{i,t}^+ (\widehat{\Sigma}_{i,t}^+)^{1/2} + \Delta \widehat{Z}_{i+1}, \quad |\Delta \widehat{Z}_{i+1}| \le (u_c+2\gamma^c_{p_i}) |\widehat{V}_i| |\widehat{\Theta}_{i,t}^+| |(\widehat{\Sigma}_{i,t}^+)^{1/2}|,\] and hence \[\label{eq:Xhat-i431} \widehat{X}_{i+1} = \widehat{V}_i \widehat{\Theta}_{i,t}^+ \widehat{\Sigma}_{i,t}^+ (\widehat{\Theta}_{i,t}^+)^T \widehat{V}_i ^T + \Delta \widehat{X}_{i+1}, \quad \norm{\Delta \widehat{X}_{i+1}}_F \le (2u_c+ 4\gamma^c_{p_i}) \norm{\widehat{X}_{i+1}}_F,\tag{30}\] where \(\widehat{\Theta}_{i,t}^+\) and \(\widehat{\Sigma}_{i,t}^+\) collect the eigenvectors and eigenvalues, respectively, that are retained after the rank truncation of the computed full eigensystem \[\label{eq:comput-full-eigsystem} \widehat{\Theta}_i \widehat{\Sigma}_i \widehat{\Theta}_i^T =: \begin{bmatrix} \widehat{\Theta}_{i,t}^+ & \widehat{\Theta}_{i,0}^+ & \widehat{\Theta}_{i}^- \end{bmatrix} \cdot \mathop{\mathrm{blkdiag}}(\widehat{\Sigma}_{i,t}^+, \widehat{\Sigma}_{i,0}^+, \widehat{\Sigma}_{i}^-) \cdot \begin{bmatrix} \widehat{\Theta}_{i,t}^+ & \widehat{\Theta}_{i,0}^+ & \widehat{\Theta}_{i}^- \end{bmatrix}^T.\tag{31}\] Define \[\label{eq:32Ki-0-minus} K_{i}^0 := \widehat{\Theta}_{i,0}^+ \widehat{\Sigma}_{i,0}^+ (\widehat{\Theta}_{i,0}^+)^T, \quad K_{i}^- := \widehat{\Theta}_{i}^- \widehat{\Sigma}_{i}^- (\widehat{\Theta}_{i}^-)^T.\tag{32}\] Since \(\widehat{\Theta}_{i,0}^+\) has numerically orthonormal columns, we get \(\|K_{i}^0\|_{2}\le 2\eta_s \|\widehat{\Sigma}_i\|_{2} = 2\eta_s \|\widehat{K}_i + \Delta \widehat{K}_i\|_{2}\) for some \(\eta_s \ll 1\).
Here, \(\Theta^-_i \Sigma^-_i (\Theta_i^-)^T = \Delta X_{i+1}\) corresponds to the negative semidefinite part of \(X^{bp}_{i+1}\), so \(\norm{ \Sigma^-_i}_F \ll \norm{X^{bp}_{i+1}}_F = \norm{ K_i }_F\). Consequently, by the absolute perturbation bound 20 and the numerical orthonormality of the columns of \(\widehat{\Theta}_{i}^-\), the computed negative eigenvalues satisfy \(\norm{K_{i}^-}_F = \eta_n \norm{ K_i }_F\) for some \(\eta_n \ll 1\).
Note that \(\widehat{X}_i + \widehat{X}_i^+ - \widehat{X}_i^- = V_i K_i V_i^T\), where the summation can be considered exact as it is performed implicitly via Fragment [alg46sol-update-chol]. We have, from 29 –32 , \[\widehat{X}_{i+1} = \widehat{V}_i (\widehat{K}_i + \Delta \widehat{K}_i)\widehat{V}_i^T - \widehat{V}_i (K_i^0 + K_i^-)\widehat{V}_i^T + \Delta \widehat{X}_{i+1},\] and then the first-order perturbation identity \[\begin{align} \label{eq:sol-upt-delta-comput-def} \Delta \Xi_i &:= \widehat{X}_{i+1} - (\widehat{X}_i + \widehat{X}_i^+ - \widehat{X}_i^-) \\ \nonumber & = V_iK_i \Delta V_i^T + V_i(\Delta K_i + \Delta \widehat{K}_i -K_i^0-K_i^-)V_i^T + \Delta V_i K_iV_i^T + \Delta \widehat{X}_{i+1}. \end{align}\tag{33}\] By using 26 –27 and 29 –30 , a first-order normwise bound follows immediately: \[\begin{align} \label{eq:sol-upt-delta-compute-normbnd} \norm{\Delta \Xi_i }_F & \le 2 p_i^{3/2}\tilde{\gamma}^c_{n} \norm{K_i}_F + \norm{\Delta K_i + \Delta \widehat{K}_i -K_i^0-K_i^-}_F + (2u_c+4\gamma^c_{p_i}) \norm{K_i}_F \nonumber \\ & \lesssim \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\widehat{X}_{i+1}}_F. \end{align}\tag{34}\] From 33 and two invocations of 4 we obtain \[\label{eq:res-Zi431} \mathcal{R}(\widehat{Z}_{i+1}) = \mathcal{L}(\widehat{X}_{i+1}) - \mathcal{L}(X) = \mathcal{R}(\widehat{Z}_{i}) + \mathcal{L}(\widehat{X}_i^+-\widehat{X}_i^-) + \mathcal{L}(\Delta \Xi_i).\tag{35}\] Defining \(W_i := \mathcal{L}(\widehat{X}_i^+ - \widehat{X}_i^-) + \widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)\), by assumption 9 we have \[\begin{align} \norm{W_i}_F & \le u_s \big( d_1 \norm{\mathcal{L}}_F \norm{\widehat{X}_i^+ - \widehat{X}_i^-}_F + d_2 \norm{\widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)}_F \big) \\ & \le u_s \big( d_1 \kappa_F(\mathcal{L}) (\norm{\widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)}_F + \norm{W_i}_F) + d_2 \norm{\widehat{\mathcal{R}}^+(\widehat{Z}_i) - \widehat{\mathcal{R}}^-(\widehat{Z}_i)}_F \big). \end{align}\] Rearranging and using 22 –23 gives the first-order bound \[\label{eq:Wi-bnd} \norm{W_i}_F \le u_s \frac{d_1 \kappa_F(\mathcal{L}) +d_2}{1 - d_1 \kappa_F(\mathcal{L}) u_s} \norm{\mathcal{R}(\widehat{Z}_i)}_F,\tag{36}\] where \(d_1 \kappa_F(\mathcal{L}) u_s < 1\) is assumed to hold. Substituting \(W_i\) back into 35 and using 22 –23 gives \[\mathcal{R}(\widehat{Z}_{i+1}) = W_i + \Delta R_i^s - \Delta \mathcal{R}(\widehat{Z}_{i}) + \mathcal{L}(\Delta \Xi_i).\] Taking the norm on both sides and applying the bounds 22 , 24 , 34 , and 36 , \[\begin{gather} \norm{\mathcal{R}(\widehat{Z}_{i+1})}_F \le \norm{W_i}_F + \norm{\Delta R_i^s}_F + \norm{\Delta \mathcal{R}(\widehat{Z}_{i})}_F + \norm{\mathcal{L}}_F\norm{\Delta \Xi_i}_F \\ \le \left( u_s \Big( 4 + \frac{d_1 \kappa_F(\mathcal{L}) +d_2}{1 - d_1 \kappa_F(\mathcal{L}) u_s} \Big) + 2\sqrt{m_i}\eta_r + \big( 2b_1b_2 + (4b_2 +2)m_i^{3/2} \big) \tilde{\gamma}^r_n \right)\norm{\mathcal{R}(\widehat{Z}_i)}_F \\ + \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\mathcal{L}}_F \norm{\widehat{X}_{i+1}}_F. \end{gather}\] The next theorem summarizes our analysis of the behavior of the residual of Algorithm [alg46ir3-lyap-chol].
Theorem 1. Let Algorithm [alg46ir3-lyap-chol] be applied to a Lyapunov equation \(\mathcal{L}(X) + LL^T =0\) with a nonsingular Lyapunov operator \(\mathcal{L}(X) = AX+XA^T\) on \(\mathbb{R}^{n\times n}\) satisfying \(d_1 \kappa_F(\mathcal{L}) u_s < 1\), and assume that the solver used on Line [alg46line46ir3-lyap46solver46chol] satisfies 9 . If \(\psi := (d_1 \kappa_F(\mathcal{L}) +d_2 ) u_s\) is sufficiently less than \(1\), then the normwise residual is reduced on the \(i\)th iteration by a factor approximately \(\phi := \psi + 2\sqrt{m_i}\eta_r + \big(2b_1b_2 + (4b_2 +2)m_i^{3/2} \big) \tilde{\gamma}^r_n\) until an iterate \(\widehat{Z}\) is produced for which \[\norm{ \mathcal{R}(\widehat{Z}) }_F \lesssim \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\mathcal{L}}_F \norm{\widehat{Z}\widehat{Z}^T}_F,\] where \(b_1\), \(b_2\), and \(b_3\) are constants defined in 13 , 17 , and 28 , respectively.
In iterative solvers for the low-rank Lyapunov equation, one can alternatively use the \(LDL^T\)-type formulation of the solution, which has been extensively used in ADI-based solvers [28]. This formulation writes the solution as \(X=ZYZ^T\), where \(Z\) is tall-and-skinny and \(Y\) is block diagonal.
We keep the notation consistent between the Cholesky-type and \(LDL^T\)-type IR schemes, to show the correspondence of the quantities and reduce repetition in our error analysis later on. The residual of an approximate solution \(X_i=Z_iY_iZ_i^T\) to 1 (with \(W=LSL^T\)) takes the form \[\label{eq:lyap-res-ldlt} \mathcal{R}(Z_i, Y_i) := AZ_iY_iZ_i^T + Z_iY_iZ_i^TA^T + LSL^T.\tag{37}\] This indefinite residual can be handled by the \(LDL^T\)-type solver, which carries any indefiniteness of the matrix into the block-diagonal matrix \(Y\). As such, we have \[\label{eq:res-Zi-ldlt} \mathcal{R}(Z_i, Y_i) = F_i N_i F_i^T,\quad F_i := \begin{bmatrix} Z_i & AZ_i & L \end{bmatrix}, \quad N_i := \begin{bmatrix} 0 & Y_i & 0 \\ Y_i & 0 & 0 \\ 0 & 0 & S \end{bmatrix}.\tag{38}\] A rank truncation of the reshaped residual of the \(LDL^T\)-type solver is also necessary for convergence in floating-point arithmetic; an efficient strategy is similar to that used for the Cholesky-type residual factorization. The overall scheme is stated as Fragment [alg46res-fac-ldlt], which returns the \(LDL^T\)-type factors \(L_i^{\Delta}\) and \(S_i^{\Delta}\) of the truncated residual \(L_i^{\Delta} S_i^{\Delta} (L_i^{\Delta})^T\).
Fragment [alg46res-fac-ldlt] (\(LDL^T\)-type residual factorization):
1. Form \(F_i = \begin{bmatrix} Z_i & A Z_i & L \end{bmatrix}\) and \(N_i = \begin{bmatrix} 0 & Y_i & 0 \\ Y_i & 0 & 0 \\ 0 & 0 & S \end{bmatrix}\).
2. Compute a thin QR decomposition \(F_i = U_i T_i\).
3. Form \(H_i=T_i N_i T_i^T\) and compute a spectral decomposition \(H_i = Q_i\Lambda_i Q_i^T\).
4. \(S_i^\Delta = \mathop{\mathrm{diag}}(\lambda_j)\), \(j\in J^t:= \{j\;\vert\; |\lambda_j| \ge \eta_r\}\), \(Q_{i,t}= Q_i(\colon, J^t)\).
5. \(L_i^\Delta = U_i Q_{i,t}\).
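A Python/NumPy sketch of Fragment [alg46res-fac-ldlt] (names ours, single precision) follows the same pattern as in the Cholesky-type case, with the permutation \(P_i\) replaced by the kernel \(N_i\):
\begin{verbatim}
import numpy as np
import scipy.linalg

def residual_factors_ldlt(A, Z, Y, L, S, eta_r=1e-4):
    """Truncated LDL^T factors of R(Z,Y) = F N F^T, cf. (38)."""
    c = Z.shape[1]
    F = np.hstack([Z, A @ Z, L])                 # F_i = [Z_i, A Z_i, L]
    N = scipy.linalg.block_diag(np.zeros((2 * c, 2 * c)), S)
    N[:c, c:2 * c] = Y                           # N_i carries the Y_i blocks
    N[c:2 * c, :c] = Y                           # in its off-diagonal part
    U, T = np.linalg.qr(F, mode="reduced")       # thin QR, F_i = U_i T_i
    H = T @ N @ T.T                              # small kernel matrix H_i
    lam, Q = np.linalg.eigh((H + H.T) / 2)       # spectral decomposition
    keep = np.abs(lam) >= eta_r                  # keep |lambda_j| >= eta_r
    return U @ Q[:, keep], np.diag(lam[keep])    # L_i^Delta, S_i^Delta
\end{verbatim}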
Upon solving the correction equation with the truncated residual, \[\label{eq:lyap-cor-ldlt} AX_i^{\Delta} + X_i^{\Delta} A^T + L_i^{\Delta} S_i^{\Delta} (L_i^{\Delta})^T = 0, \quad X_i^{\Delta} = Z_i^{\Delta} Y_i^{\Delta} (Z_i^{\Delta})^T,\tag{39}\] we obtain the factors \(Z_i^{\Delta}\in\mathbb{R}^{n\times c_i^{\Delta}}\) and \(Y_i^{\Delta}\in\mathbb{R}^{c_i^{\Delta} \times c_i^{\Delta}}\) of the solution increment \(X_i^{\Delta}\). The last step is to update the solution with projection onto the nearest PSD matrix, which mathematically takes the form \(X_{i+1} := X_{i} + X_i^{\Delta} - \Delta X_{i+1} =: X_{i+1}^{bp} - \Delta X_{i+1}\), where \(\Delta X_{i+1}\) is the negative semidefinite perturbation made by the projection. Again, it is reasonable to assume \(\norm{ \Delta X_{i+1}} \ll \norm{X_{i+1}^{bp}}\) when the solver for 39 is relatively stable and the floating-point arithmetic used is accurate enough.
Fragment [alg46sol-update-ldlt] (\(LDL^T\)-type solution update with projection):
1. Form \(G_i = \begin{bmatrix} Z_i & Z_i^{\Delta} \end{bmatrix}\) and \(\Upsilon_i = \mathop{\mathrm{blkdiag}}(Y_i,\;Y_i^{\Delta})\).
2. Compute a thin QR decomposition \(G_i = V_i\Gamma_i\).
3. Form \(K_i=\Gamma_i \Upsilon_i \Gamma_i^T\) and compute a spectral decomposition \(K_i = \Theta_i \Sigma_i \Theta_i^T\).
4. \(Y_{i+1} = \mathop{\mathrm{diag}}(\sigma_j)\), \(j\in J^+:= \{j\;\vert\; \sigma_j\ge \eta_s\}\), \(\Theta_{i,t}^+= \Theta_i(\colon, J^+)\).
5. \(Z_{i+1} = V_i \Theta_{i,t}^+\).
The algorithmic procedure of the solution update resembles that of the Cholesky-type solver described earlier, including the truncation of small eigenvalues. We therefore omit the repeated textual description for brevity; see Fragment [alg46sol-update-ldlt].
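A corresponding sketch of Fragment [alg46sol-update-ldlt] (names ours, single precision):
\begin{verbatim}
import numpy as np
import scipy.linalg

def update_solution_ldlt(Z, Y, Zd, Yd, eta_s=1e-12):
    """Projected update of the LDL^T-type factors: X_{i+1} ~ Z1 Y1 Z1^T."""
    G = np.hstack([Z, Zd])                        # G_i = [Z_i, Z_i^Delta]
    Ups = scipy.linalg.block_diag(Y, Yd)          # Upsilon_i
    V, Gam = np.linalg.qr(G, mode="reduced")      # thin QR, G_i = V_i Gamma_i
    K = Gam @ Ups @ Gam.T                         # K_i = Gamma Upsilon Gamma^T
    sig, Th = np.linalg.eigh((K + K.T) / 2)       # spectral decomposition
    keep = sig >= eta_s                           # projection + truncation
    return V @ Th[:, keep], np.diag(sig[keep])    # Z_{i+1}, Y_{i+1}
\end{verbatim}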
Algorithm [alg46ir3-lyap-ldlt] (mixed-precision IR for the \(LDL^T\)-type factored solution):
1. Solve \(AX+XA^T + LSL^T=0\) at precision \(u_s\) and store the solution factors \(Z_1\) and \(Y_1\) at precision \(u\).
2. For \(i=1,2,\dots\) until converged:
   (a) compute the factors \(L_i^\Delta\) and \(S_i^\Delta\) of the truncated residual via Fragment [alg46res-fac-ldlt] at precision \(u_r\);
   (b) solve the correction equation 39 at precision \(u_s\) for the factors \(Z_i^\Delta\) and \(Y_i^\Delta\);
   (c) update the solution factors to \(Z_{i+1}\) and \(Y_{i+1}\) via Fragment [alg46sol-update-ldlt] at precision \(u_c\).
3. Return \(Z = Z_{i+1}\), \(Y= Y_{i+1}\).
The mixed-precision \(LDL^T\)-type IR framework is presented in Algorithm [alg46ir3-lyap-ldlt].
As done at the beginning of Section 2.2, we can make the reasonable assumptions that \[\widehat{X}_i = \widehat{Z}_i \widehat{Y}_i \widehat{Z}_i^T, \quad \widehat{X}_i^\Delta = \widehat{Z}_i^\Delta \widehat{Y}_i^\Delta (\widehat{Z}_i^\Delta)^T, \quad \widehat{\mathcal{R}}(\widehat{Z}_i,\widehat{Y}_i) = \widehat{L}^\Delta_i \widehat{S}^\Delta_i (\widehat{L}^\Delta_i)^T.\] Also, we assume that the solver used on Line [alg46line46ir3-lyap46solver46ldlt] of Algorithm [alg46ir3-lyap-ldlt] produces computed solution factors \(\widehat{Z}_i^\Delta\) and \(\widehat{Y}_i^\Delta\) to \(AX_i^\Delta + X_i^\Delta A^T + \widehat{L}_{i}^\Delta \widehat{S}_i^\Delta (\widehat{L}_{i}^\Delta)^T=0\) such that \[\label{eq:assump-ldlt} \norm{ \mathcal{L}(\widehat{X}_i^\Delta) + \widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i)}_F \le u_s \big( d_1 \norm{\mathcal{L}}_F \norm{\widehat{X}_i^\Delta}_F + d_2 \norm{ \widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i)}_F \big),\tag{40}\] where the two constants \(d_1, d_2>0\) depend on \(A\), \(\widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i)\), the dimension \(n\), as well as the solver precision \(u_s\). The assumption means that the normwise relative residual of the computed solution to \(\mathcal{L}(X_i^\Delta) + \widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i) =0\) is of order at most \(\max(d_1,d_2)u_s\).
The matrix \(\widehat{F}_i\) computed in Fragment [alg46res-fac-ldlt] and its thin QR factorization satisfy 10 –12 , and under the same assumption 13 , the first-order bound 15 holds. Defining \(\widehat{H}_i:= \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_r((\widehat{T}_i\widehat{N}_i)\widehat{T}_i^T)\), we have \[\widehat{H}_i = \widehat{T}_i\widehat{N}_i\widehat{T}_i^T + \Delta \widetilde{H}_i, \quad |\Delta \widetilde{H}_i|\le 2\tilde{\gamma}^r_{m_i} |\widehat{T}_i||\widehat{N}_i||\widehat{T}_i|^T,\] and, to the first order, \[\label{eq:deltaH-def-ldlt} \Delta H_i := \widehat{H}_i - H_i, \quad \norm{\Delta H_i}_F \le \big( (2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + 2\tilde{\gamma}^r_{m_i} \big)\norm{N_i}_F\norm{T_i T_i^T}_F.\tag{41}\] Under the assumption \[\label{eq:assump-c2-ldlt} \norm{N_i}_F\norm{T_i T_i^T}_F = b_2\norm{T_iN_iT_i^T}_F, \quad b_2\equiv b_2(m,i,A,L),\tag{42}\] the perturbation bound in 41 becomes \[\label{eq:deltaH-bnd} \norm{\Delta H_i}_F \le \big( b_2(2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + 2b_2\tilde{\gamma}^r_{m_i} \big)\norm{H_i}_F.\tag{43}\] For the spectral decomposition on Line [alg46res-fac-ldlt46line-spec-dec], the condition 19 still holds for the computed eigensystem of \(\widehat{H}_i\). Similar to 20 , we can show that, for any eigenvalue of \(H_i\), the corresponding computed eigenvalue \(\widehat{\lambda}_j\) of \(\widehat{H}_i\) satisfies \[\label{eq:H-respec-eig-bnd-ldlt} |\widehat{\lambda}_j - \lambda_j| \lesssim \big( b_2(2b_1 + 4m_i^{3/2})\tilde{\gamma}^r_{n} + 2b_2\tilde{\gamma}^r_{m_i} \big)\textstyle \sum_j |\lambda_j|.\tag{44}\] The residual 38 is not explicitly formed, so from Lines [alg46res-fac-ldlt46line-Si]–[alg46res-fac-ldlt46line-Li] of Fragment [alg46res-fac-ldlt] we can write \[\label{eq:res-comput-Zi-def-ldlt} \widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i) = \widehat{L}^\Delta_i \widehat{S}^\Delta_i (\widehat{L}^\Delta_i )^T + \Delta R_i^s, \quad \norm{\Delta R_i^s}_F \le 3u_s \norm{\mathcal{\widehat{R}}(\widehat{Z}_i, \widehat{Y}_i)}_F,\tag{45}\] where \(\Delta R_i^s\) accounts for the error from rounding the factors \(\widehat{S}^\Delta_i\) and \(\widehat{L}^\Delta_i\) down to precision \(u_s\). Writing the full computed eigensystem as \[\widehat{Q}_i \widehat{\Lambda}_i \widehat{Q}_i^T = \begin{bmatrix} \widehat{Q}_{i,t} & \widehat{Q}_{i,0} \end{bmatrix} \cdot \mathop{\mathrm{blkdiag}}(\widehat{\Lambda}_{i,t}, \widehat{\Lambda}_{i,0}) \cdot \begin{bmatrix} \widehat{Q}_{i,t} & \widehat{Q}_{i,0} \end{bmatrix}^T,\] we have \[\widehat{S}^\Delta_i = \widehat{\Lambda}_{i,t},\quad \widehat{L}^\Delta_i = \widehat{U}_i \widehat{Q}_{i,t} + \Delta \widehat{L}^\Delta_i, \; |\Delta \widehat{L}^\Delta_i| \le \gamma^r_{m_i} |\widehat{U}_i| |\widehat{Q}_{i,t}|,\] and hence, \[\widehat{L}^\Delta_i \widehat{S}^\Delta_i (\widehat{L}^\Delta_i )^T =: \widehat{U}_i \widehat{Q}_{i,t} \widehat{\Lambda}_{i,t} \widehat{Q}_{i,t}^T \widehat{U}_i^T + \Delta R_i, \quad |\Delta R_i| \le 2 \gamma^r_{m_i} |\widehat{U}_{i}| |\widehat{Q}_{i,t}| |\widehat{\Lambda}_{i,t}| |\widehat{Q}_{i,t}^T| |\widehat{U}_i^T|.\] By 19 , the computed residual 45 can be written as \[\widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i) = \widehat{U}_i (\widehat{H}_i + \Delta \widehat{H}_i)\widehat{U}_i^T - \widehat{U}_i \widehat{Q}_{i,0} \widehat{\Lambda}_{i,0} \widehat{Q}_{i,0}^T \widehat{U}_i^T + \Delta R_i + \Delta R_i^s,\] where \(\|\widehat{\Lambda}_{i,0}\|_{2} \le \eta_r \|\widehat{\Lambda}_{i}\|_{2}\).
Defining \(\widehat{H}_{i}^0 := \widehat{Q}_{i,0} \widehat{\Lambda}_{i,0} \widehat{Q}_{i,0}^T\), the bound \(\|\widehat{H}_{i}^0\|_{2}\le 2\eta_r \|\widehat{H}_i + \Delta \widehat{H}_i\|_{2}\) follows from the numerical orthonormality of the columns of \(\widehat{Q}_{i,0}\). Hence, using 12 , 38 , and 41 , we obtain the first-order equality \[\begin{gather} \label{eq:res-delta-comput-Zi-def-ldlt} \Delta \mathcal{R}(\widehat{Z}_i, \widehat{Y}_i) := \widehat{\mathcal{R}}(\widehat{Z}_i, \widehat{Y}_i) - \mathcal{R}(\widehat{Z}_i, \widehat{Y}_i) \\ = U_i(H_i-\widehat{H}_{i}^0) \Delta U_i^T + U_i(\Delta H_i + \Delta \widehat{H}_i -\widehat{H}_{i}^0)U_i^T + \Delta U_i (H_i-\widehat{H}_{i}^0)U_i^T +\Delta R_i + \Delta R_i^s. \end{gather}\tag{46}\] With \(\eta_r\ll 1\), a first-order normwise bound follows immediately by using 12 , 19 , and 43 : \[\label{eq:res-comput-Zi-bnd-ldlt} \norm{\Delta \mathcal{R}(\widehat{Z}_i, \widehat{Y}_i)}_F \lesssim \big( 2\sqrt{m_i}\eta_r + 3u_s + ( 2b_1b_2 + (4b_2 +2)m_i^{3/2}) \tilde{\gamma}^r_n \big)\norm{\mathcal{R}(\widehat{Z}_i, \widehat{Y}_i)}_F.\tag{47}\]
For Fragment [alg46sol-update-ldlt], the thin QR factorization of \(\widehat{G}_i := \begin{bmatrix} \widehat{Z}_i & \widehat{Z}_i^\Delta \end{bmatrix}\) satisfies 25 and 26 with \(p_i:= 2\max\{c_i, c_i^\Delta \} \ll n\). Defining \(\widehat{K}_i:= \mathop{\mathrm{\operatorname{f\kern.2ptl}}}_c(\widehat{\Gamma}_i \widehat{\Upsilon}_i \widehat{\Gamma}_i^T)\), we can show, similarly to the derivation of 41 –43 , that \[\begin{align} \label{eq:deltaK-def-ldlt} \Delta K_i := \widehat{K}_i - K_i, \quad \norm{\Delta K_i}_F \le b_3( 4p_i^{3/2}\tilde{\gamma}^c_{n} + 2\tilde{\gamma}^c_{p_i} )\norm{K_i}_F, \end{align}\tag{48}\] where the constant \(b_3\equiv b_3(m,i,A,L)\) is such that \[\label{eq:assump-c3-ldlt} \norm{\Upsilon_i}_F \norm{\Gamma_i \Gamma_i^T}_F = b_3\norm{\Gamma_i \Upsilon_i \Gamma_i^T}_F.\tag{49}\] For the spectral decomposition of \(K_i\) on Line [alg46sol-update-ldlt-line-spec46dec], the backward error bound 29 remains valid. Overall we have, to the first order, \[\widehat{Y}_{i+1} = \widehat{\Sigma}_{i,t}^+,\quad \widehat{Z}_{i+1} =: \widehat{V}_i \widehat{\Theta}_{i,t}^+ + \Delta \widehat{Z}_{i+1}, \; |\Delta \widehat{Z}_{i+1}| \le \gamma^c_{p_i} |\widehat{V}_i| |\widehat{\Theta}_{i,t}^+|,\] and \[\label{eq:Xhat-i431-ldlt} \widehat{X}_{i+1} = \widehat{V}_i \widehat{\Theta}_{i,t}^+ \widehat{\Sigma}_{i,t}^+ (\widehat{\Theta}_{i,t}^+)^T \widehat{V}_i ^T + \Delta \widehat{X}_{i+1}, \quad \norm{\Delta \widehat{X}_{i+1}}_F \le 2\gamma^c_{p_i} \norm{\widehat{X}_{i+1}}_F,\tag{50}\] where \(\widehat{\Theta}_{i,t}^+\) and \(\widehat{\Sigma}_{i,t}^+\) collect the eigenvectors and eigenvalues, respectively, retained after the rank truncation of the computed full eigensystem, which is of the same form as 31 .
Now we have \(\widehat{X}_i + \widehat{X}_i^\Delta = V_i K_i V_i^T\), where the summation can be considered exact as it is performed implicitly. Combining 29 , 31 , 32 , and 50 , \[\widehat{X}_{i+1} = \widehat{V}_i (\widehat{K}_i + \Delta \widehat{K}_i)\widehat{V}_i^T - \widehat{V}_i (K_i^0 + K_i^-)\widehat{V}_i^T + \Delta \widehat{X}_{i+1},\] where \(K_{i}^0\) and \(K_{i}^-\) are defined as in 32 , such that \(\|K_{i}^0\|_{2}\le 2\eta_s \|\widehat{K}_i + \Delta \widehat{K}_i\|_{2}\), for some \(\eta_s \ll 1\), and \(\norm{K_{i}^-}_F = \eta_n \norm{ K_i }_F\) for some \(\eta_n \ll 1\). We then obtain the first-order perturbation \[\begin{align} \label{eq:sol-upt-delta-comput-def-ldlt} \Delta \Xi_i & := \widehat{X}_{i+1} - (\widehat{X}_i + \widehat{X}_i^\Delta ) \\ & = V_iK_i \Delta V_i^T + V_i(\Delta K_i + \Delta \widehat{K}_i -K_i^0-K_i^-)V_i^T + \Delta V_i K_iV_i^T + \Delta \widehat{X}_{i+1}, \nonumber \end{align}\tag{51}\] from which a first-order normwise bound similar to 34 follows, \[\norm{\Delta \Xi_i }_F \lesssim \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\widehat{X}_{i+1}}_F.\] Using 51 and two invocations of 37 gives \[\mathcal{R}(\widehat{Z}_{i+1}, \widehat{Y}_{i+1}) = \mathcal{L}(\widehat{X}_{i+1}) - \mathcal{L}(X) = \mathcal{R}(\widehat{Z}_{i}, \widehat{Y}_{i}) + \mathcal{L}(\widehat{X}_i^\Delta) + \mathcal{L}(\Delta \Xi_i).\] Define \(W_i := \mathcal{L}(\widehat{X}_i^\Delta) + \widehat{\mathcal{R}}(\widehat{Z}_{i}, \widehat{Y}_{i})\). Under the assumptions 40 and \(d_1 \kappa_F(\mathcal{L}) u_s < 1\), it is straightforward to show that \[\norm{W_i}_F \le u_s \frac{d_1 \kappa_F(\mathcal{L}) +d_2}{1 - d_1 \kappa_F(\mathcal{L}) u_s} \norm{\mathcal{R}(\widehat{Z}_{i}, \widehat{Y}_{i})}_F,\] and \[\begin{gather} \norm{\mathcal{R}(\widehat{Z}_{i+1},\widehat{Y}_{i+1})}_F \le \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\mathcal{L}}_F \norm{\widehat{X}_{i+1}}_F \\ + \Big( u_s \Big( 6 + \frac{d_1 \kappa_F(\mathcal{L}) +d_2}{1 - d_1 \kappa_F(\mathcal{L}) u_s} \Big) + 2\sqrt{m_i}\eta_r + \big( 2b_1b_2 + (4b_2 +2)m_i^{3/2} \big) \tilde{\gamma}^r_n \Big)\norm{\mathcal{R}(\widehat{Z}_i,\widehat{Y}_{i})}_F. \end{gather}\] We summarize our analysis in the next theorem.
Theorem 2. Let Algorithm [alg46ir3-lyap-ldlt] be applied to a Lyapunov equation \(\mathcal{L}(X) + LSL^T =0\) with a nonsingular Lyapunov operator \(\mathcal{L}(X) = AX+XA^T\) on \(\mathbb{R}^{n\times n}\) satisfying \(d_1 \kappa_F(\mathcal{L}) u_s < 1\), and assume the solver used on Line [alg46line46ir3-lyap46solver46ldlt] satisfies 40 . If \(\psi := (d_1 \kappa_F(\mathcal{L}) +d_2 ) u_s\) is sufficiently less than \(1\), then the normwise residual is reduced on the \(i\)th iteration by a factor approximately \(\phi := \psi + 2\sqrt{m_i}\eta_r + \big(2b_1b_2 + (4b_2 +2)m_i^{3/2} \big) \tilde{\gamma}^r_n\) until an iterate pair \((\widehat{Z}, \widehat{Y})\) is produced for which \[\norm{ \mathcal{R}(\widehat{Z}, \widehat{Y}) }_F \lesssim \big( 2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n \big) \norm{\mathcal{L}}_F \norm{\widehat{Z}\widehat{Y}\widehat{Z}^T}_F,\] where \(b_1\), \(b_2\), and \(b_3\) are constants defined in 13 , 42 , and 49 , respectively.
\begin{table}[t]
\caption{Different combinations of floating-point arithmetics and corresponding bounds on $\kappa_F(\L)$ for attaining limiting relative residuals of the order of magnitude of the unit roundoff, provided $\eta_n\lesssim nu$.}
\label{table:ir3}
\centering \setlength\tabcolsep{8pt}
\begin{tabular}{c c c c}
\toprule
$u_s$ & $u_r=u_c =u$ & Bound on $\kappa_F(\L)$ & Limiting residual \\
\midrule[1pt]
bf16 & \multirow{3}{*}{fp32} & $10^3$ & \multirow{3}{*}{fp32} \\
fp16 & & $10^4$ & \\
fp32 & & $10^8$ & \\
\midrule[0.1pt]
bf16 & \multirow{4}{*}{fp64} & $10^3$ & \multirow{4}{*}{fp64} \\
fp16 & & $10^4$ & \\
fp32 & & $10^8$ & \\
fp64 & & $\;10^{16}$ & \\
\bottomrule
\end{tabular}
\end{table}
For Algorithm [alg46ir3-lyap-chol] and Algorithm [alg46ir3-lyap-ldlt], Theorem 1 and Theorem 2 imply that the limiting relative residual (see 63 ) is bounded above by \(2\sqrt{p_i}\eta_s + \eta_n + (4b_3 +2)p_i^{3/2} \tilde{\gamma}^c_n\). To attain a relative residual of order \(nu\), one can set \(\eta_s =O(nu)\) and choose \(u_c =u\) if the solution update is stable and the respective \(b_3\) is of moderate size. Otherwise, a higher precision should be used for the solution update, with \(u_c =u/b_3\).
The residual reduction rate \(\phi\) depends not only on \(\psi\), which essentially concerns \(\kappa_F(\mathcal{L})u_s\), but also on the residual truncation parameter \(\eta_r\) and the precision \(u_r\) at which the residual factorization is performed. Given that \(u_r<u_s< \psi\), possibly by a large margin, the potential instability in the residual factorization at precision \(u_r\), as indicated by the factors \(b_1\) and \(b_2\), is unlikely to affect the residual reduction rate of the algorithm. Also, the optimal value of \(\eta_r\) should satisfy \(\eta_r = O(\kappa_F(\mathcal{L})u_s)\) to achieve best efficiency and balanced terms in the residual reduction rate.
We conclude this section by presenting Table [table:ir3], which lists different combinations of floating-point arithmetics that are applicable to Algorithm [alg46ir3-lyap-chol] and Algorithm [alg46ir3-lyap-ldlt], as well as the limiting relative residuals, subject to the corresponding conditioning bound and the assumptions 9 or 40 , respectively.
The matrix sign function method for solving the Lyapunov equation was introduced by Roberts in 1971 [29], and it has since been one of the most widely used methods, owing to its easy yet robust implementation, excellent parallelism, and richness in level-3 BLAS operations [30].
We focus on the use of the sign function Newton iteration as the solver, but the precision settings of Table [table:ir3] remain valid if different solvers, such as the low-rank ADI-based methods [10] or the Krylov subspace methods [13], [14], are used.
The (scaled) sign function Newton iteration for the solution \(X\) of Lyapunov equations is \[\tag{52} \begin{align} A_{k} &= \frac{1}{2} \bigl( \mu_{k-1}A_{k-1} + \mu_{k-1}^{-1}A_{k-1}^{-1}\bigr), & A_0 &= A, \tag{53} \\ W_{k} &= \frac{1}{2} \bigl( \mu_{k-1}W_{k-1} +\mu_{k-1}^{-1} A_{k-1}^{-1}W_{k-1}A_{k-1}^{-T}\bigr), & W_0 &= W, \tag{54} \end{align}\] where the scaling parameter \(\mu_{k-1}>0\) can be used to accelerate the convergence of the method in its initial steps. The choice \(\mu_{k}\equiv 1\) yields the unscaled Newton iteration, for which \(A_k\) and \(W_k\) converge quadratically to \(-I_n\) and \(2X\), respectively [31]. Common scaling techniques include the determinantal scaling \(\mu_k = |\det(A_k)|^{-1/n}\), the spectral scaling \(\mu_k = \rho(A_k^{-1})^{1/2} / \rho(A_k)^{1/2}\), and the \(2\)-norm scaling \(\mu_k = \|A_k^{-1}\|_{2}^{1/2} / \|A_k\|_{2}^{1/2}\). The spectral norm is often approximated by the Frobenius norm when it is expensive to calculate [32], which leads to the Frobenius-norm scaling \(\mu_k = \norm{A_k^{-1}}_F^{1/2} / \norm{A_k}_F^{1/2}\); the computation of the Frobenius norm parallelizes quite well, as each matrix entry contributes independently to the final result. In general, there is no single scaling strategy that is superior to the rest [31], [33].
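All four scalings are cheap relative to the matrix inversion that dominates each Newton step; a sketch of the candidates (function name ours, with \(A_k^{-1}\) recomputed here for clarity) follows.
\begin{verbatim}
import numpy as np

def scaling(Ak, kind="frobenius"):
    """Candidate scaling parameters mu_k for the Newton iteration (52)."""
    if kind == "determinantal":              # mu_k = |det(A_k)|^{-1/n}
        _, logabsdet = np.linalg.slogdet(Ak)
        return np.exp(-logabsdet / Ak.shape[0])
    Ainv = np.linalg.inv(Ak)
    if kind == "spectral":                   # (rho(A_k^{-1})/rho(A_k))^{1/2}
        rho = lambda M: np.abs(np.linalg.eigvals(M)).max()
        return np.sqrt(rho(Ainv) / rho(Ak))
    if kind == "2-norm":                     # (||A_k^{-1}||_2/||A_k||_2)^{1/2}
        return np.sqrt(np.linalg.norm(Ainv, 2) / np.linalg.norm(Ak, 2))
    # Frobenius-norm scaling: each entry contributes independently
    return np.sqrt(np.linalg.norm(Ainv, "fro") / np.linalg.norm(Ak, "fro"))
\end{verbatim}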
In the case of \(W=LL^T\), Larin and Aliev [34] propose to rewrite the iteration for \(W_k\) in 54 as \[W_{k} = \frac{\mu_{k-1}}{2} \begin{bmatrix} Z_{k-1} & \mu_{k-1}^{-1}A_{k-1}^{-1}Z_{k-1} \end{bmatrix} \begin{bmatrix} Z_{k-1}^T \\ \mu_{k-1}^{-1}Z_{k-1}^TA_{k-1}^{-T} \end{bmatrix}, \quad W_0 = Z_0Z_0^T \equiv LL^T,\] and thus obtain the iterations for the Cholesky-type factored solution \[\tag{55} \begin{align} A_{k} &= \frac{1}{2} \bigl( \mu_{k-1}A_{k-1} + \mu_{k-1}^{-1}A_{k-1}^{-1}\bigr),& A_0 &= A, \tag{56}\\ Z_{k} &= \sqrt{\frac{\mu_{k-1}}{2}} \begin{bmatrix} Z_{k-1} & \mu_{k-1}^{-1}A_{k-1}^{-1}Z_{k-1} \end{bmatrix},& Z_0 &= L, \tag{57} \end{align}\] where \(Z_k/\sqrt{2}\) converges to the full-rank factor \(Z\) of the solution \(X=ZZ^T\).
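The following is a minimal MATLAB sketch of iteration 55 with Frobenius-norm scaling and a convergence test on \(\norm{A_k+I_n}_1\) (cf. 62 below); kmax and tau are illustrative parameters, and the rank truncation discussed below is omitted.
Zk = L;  Ak = A;  n = size(A, 1);
for k = 1:kmax
    Akinv = inv(Ak);                                % dominant O(n^3) cost
    mu = sqrt(norm(Akinv, 'fro') / norm(Ak, 'fro'));
    Zk = sqrt(mu/2) * [Zk, (Akinv*Zk)/mu];          % column dimension doubles
    Ak = (mu*Ak + Akinv/mu) / 2;
    if norm(Ak + eye(n), 1) <= tau, break, end      % A_k -> -I_n at convergence
end
Z = Zk / sqrt(2);                                   % X is approximated by Z*Z'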
Writing \(W=LSL^T\) for some symmetric positive semidefinite matrix \(S\), we can obtain the iterations for the solution factors \[\tag{58} \begin{align} A_{k} &= \frac{1}{2} \bigl( \mu_{k-1}A_{k-1} + \mu_{k-1}^{-1}A_{k-1}^{-1}\bigr),& A_0 &= A, \tag{59}\\ Y_{k} &= \begin{bmatrix} \frac{\mu_{k-1}}{2} Y_{k-1} & \\ & \frac{1}{2\mu_{k-1}} Y_{k-1} \end{bmatrix},& Y_0 &= S, \tag{60}\\ Z_{k} &= \begin{bmatrix} Z_{k-1} & A_{k-1}^{-1}Z_{k-1} \end{bmatrix},& Z_0 &= L, \tag{61} \end{align}\] where \(Z_k\) converges to \(Z\) and \(Y_k/2\) converges to \(Y\) such that \(X=ZYZ^T\). Note that this \(LDL^T\)-type formulation avoids scaling on the tall-and-skinny solution factor \(Z\), which could be expensive when the columns of \(Z\) accumulate as the number of iterations grows. The iteration on the inner factor \(Y_k\) can be reduced to one involving only the scaling parameters \(\mu_{k}\), and \(Y_k\) can be formed at the end as \[Y_{k} = \begin{bmatrix} \frac{\mu_{k-1}}{2} & \\ & \frac{1}{2\mu_{k-1}} \end{bmatrix}\otimes \cdots \otimes \begin{bmatrix} \frac{\mu_1}{2} & \\ & \frac{1}{2\mu_1} \end{bmatrix}\otimes \begin{bmatrix} \frac{\mu_0}{2} & \\ & \frac{1}{2\mu_0} \end{bmatrix}\otimes S.\] The \(LDL^T\)-type formulation has essentially the same storage requirement as the Cholesky-type formulation.
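As an illustration, \(Y_k\) can be assembled a posteriori from the stored scaling parameters with the MATLAB sketch below, assuming mu(j) holds \(\mu_{j-1}\); note that kron([a 0; 0 b], Y) is blkdiag(a*Y, b*Y), matching the block order of \(Z_k\).
Yk = S;                                             % innermost factor
for j = 1:k
    Yk = kron([mu(j)/2, 0; 0, 1/(2*mu(j))], Yk);    % mu(j) stores mu_{j-1}
end
Y = Yk / 2;                                         % X is approximated by Zk*Y*Zk'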
The factorized iterations 55–58 can substantially reduce the computational cost, as they avoid the two full \(n\)-by-\(n\) matrix multiplications required in 52. Nevertheless, a potential concern is the growth of the iterate \(Z_k\), whose column dimension \(c_k\) doubles at each iteration, so the required storage grows exponentially. Therefore, low-rank truncations are often conditionally performed within the iteration to limit the increase of the smaller dimension of the low-rank factors [6], [9], [34], [35].
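One possible realization of such a truncation is sketched below in MATLAB, assuming a column-pivoted QR is used and with an illustrative drop tolerance tol; it exploits the fact that \(Z_kZ_k^T\) does not depend on the orthogonal factor of the decomposition.
[~, R, e] = qr(Zk', 0);                   % Zk'(:,e) = Q*R, |diag(R)| non-increasing
r = find(abs(diag(R)) > tol*abs(R(1,1)), 1, 'last');   % numerical rank estimate
Znew = zeros(size(Zk, 1), r);
Znew(e, :) = R(1:r, :)';                  % Znew*Znew' approximates Zk*Zk' with r columns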
[Algorithm [alg46newton-lyapu-chol]: sign function Newton iteration for the Cholesky-type factor; input \(Z_0 = L\), \(A_0 = A\); output \(Z = Z_{k}/\sqrt{2}\).]
[Algorithm [alg46newton-lyapu-ldlt]: sign function Newton iteration for the \(LDL^T\)-type factors; input \(Z_0 = L\), \(Y_0 = S\), \(A_0 = A\); output \(Y = Y_k/2\).]
The overall algorithms for the two types of solvers are presented as Algorithm [alg46newton-lyapu-chol] and Algorithm [alg46newton-lyapu-ldlt]. The stopping criterion for both iterations depends only on the coefficient matrix \(A\) and is set as \[\label{eq:newton-stop-tau} \norm{A_k + I_n}_1 \le \tau_N,\quad \tau_N = 10\sqrt{nu},\tag{62}\] with two additional iterations performed after the tolerance has been reached. This avoids potential stagnation, which occurs when the stopping criterion is too stringent, while still letting the algorithm approach its attainable accuracy; see [9] for more details on the setting of the convergence test.
We now discuss the computational cost of the mixed-precision IR frameworks, Algorithm [alg46ir3-lyap-chol] and Algorithm [alg46ir3-lyap-ldlt], when they use as the solver the sign function Newton iterations, Algorithm [alg46newton-lyapu-chol] and Algorithm [alg46newton-lyapu-ldlt], respectively.
Recall that the coefficient matrix \(L\) has \(m\ll n\) columns and the sought solution factor \(Z\) has small numerical rank with respect to \(n\), so one can safely choose \(\rho\ll 1\) for deciding when to perform a rank truncation. Ideally, the rank truncation threshold should be chosen such that \(\rho n\) is greater than \(m\) and the numerical rank of \(Z\); the latter is, however, not known a priori.
As a baseline, we can assume that the total number of rank truncations \(t\) is in general much smaller than the total number of iterations \(k\) for convergence, i.e., \(t\ll k\). Since the possible rank truncation imposes a constraint on the size of the iterates, in any iteration, the smaller matrix in the matrix product in Line [algline46newton-lyapu-matmul] of Algorithm [alg46newton-lyapu-chol] or Line [algline46newton-lyapu-matmul-ldlt] of Algorithm [alg46newton-lyapu-ldlt] is at most of size \(n\times \rho n\), and the rank truncation in Line [algline46newton-lyapu-ranktrunc] of Algorithm [alg46newton-lyapu-chol] or Line [algline46newton-lyapu-ranktrunc-ldlt] of Algorithm [alg46newton-lyapu-ldlt] is performed on a matrix no larger than \(n\times 2\rho n\). It follows that, in each iteration, the matrix product requires \(2\rho n^3\) flops (floating-point operations) and the rank-revealing QR costs \(O(\rho^2n^3)\) flops; see [26] for the flop count of Householder QR with column pivoting, for example. Overall, Algorithm [alg46newton-lyapu-chol] and Algorithm [alg46newton-lyapu-ldlt] require \(2k n^3 + O(\rho k n^3)\) flops for \(k\) Newton iterations.
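For concreteness, the two counts follow from the standard estimates for a matrix product and for Householder QR with column pivoting: \[2 \cdot n \cdot n \cdot \rho n = 2\rho n^3 \quad \text{flops for the } (n\times n)\,(n\times \rho n) \text{ product}, \qquad O\bigl(n\,(2\rho n)^2\bigr) = O(\rho^2 n^3) \quad \text{flops for the QR}.\]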
The Cholesky-type formulation in each refinement step solves two correction equations 6, which share the common coefficient matrix \(A\). Therefore, one can easily adapt Lines [algline46newton-lyapu-matmul]–[algline46newton-lyapu-ranktrunc] of Algorithm [alg46newton-lyapu-chol] to solve the two equations simultaneously, such that the adapted algorithm produces the two solution factors \(Z_i^+\) and \(Z_i^-\) together. This saves a full \(n\times n\) matrix inversion per iteration but preserves the ability to perform the rank truncations of the factors independently. Therefore, the asymptotic cost of invoking Algorithm [alg46newton-lyapu-chol] for solving the correction equations 6 remains \(2 k n^3 + O(\rho k n^3)\) flops in total.
For Fragment [alg46res-fac-chol]–Fragment [alg46sol-update-chol] and Fragment [alg46res-fac-ldlt]–Fragment [alg46sol-update-ldlt], the thin QR decompositions are performed on a tall-and-skinny matrix (with the smaller dimension much lower than \(n\)). As a result, the cost of a single call to one of these subroutines is only \(O(n^2)\) flops, which is negligible compared with an invocation of the solver, under the practical assumption that a flop at precision \(u_c\) or \(u_r\) is not much more expensive than \(n\) flops at precision \(u_s\).
To sum up, when the sign function Newton iterations, Algorithm [alg46newton-lyapu-chol] and Algorithm [alg46newton-lyapu-ldlt], are used as the solver, the asymptotic cost of the IR scheme with \(i\) refinement steps performed is \(2(k_0+k_1+\cdots+k_i)n^3\) flops at precision \(u_s\), where \(k_\ell\), \(\ell=0\colon i\), denotes the number of inner Newton iterations carried out at the \(\ell\)th outer refinement step. The dominant cost of the algorithm therefore depends on both the precision of the solver and the total number of Newton iterations performed throughout. When a lower precision is used for \(u_s\), the convergence of the Newton iteration is expected to be slower, leading to a higher number of iterations. This is reflected by the varying condition bounds for different combinations of \(u_s\), \(u_c\), and \(u_r\) in Table [table:ir3].
In principle, there is a balance to strike in choosing \(u_s\), provided that the condition bound is satisfied and thus the Newton iteration is convergent. For a given problem, suppose the total number of Newton iterations to reach convergence is \(k_h^{\sigma} := \sum_{\ell=0}^{i}k_\ell^{(h)}\) for \(u_s=\) fp16 (or bf16), \(k_s^{\sigma} := \sum_{\ell=0}^{i}k_\ell^{(s)}\) for \(u_s=\) fp32, and \(k_d^{\sigma} := \sum_{\ell=0}^{i}k_\ell^{(d)}\) for \(u_s=\) fp64, respectively. Since the theoretical speedup of fp16 over fp32 or fp64 is \(8 \times\) or \(16\times\), respectively, on modern GPUs [19] (see the discussion in the introduction), the computational costs at different precisions are largely comparable if \(4 \le k_h^{\sigma}/k_d^{\sigma} \le 16\) and \(2 \le k_s^{\sigma}/k_d^{\sigma} \le 8\).
According to our discussion in the previous section, the dominant cost of the IR scheme comes from the \(n\times n\) matrix inversions in the Newton iteration solver. A crucial observation is that the sequence of required matrix inverses \(A_1^{-1}, A_2^{-1}, \dots\) is the same across the different calls to the solver in all refinement steps. Therefore, one can store the sequence of computed matrix inverses \(\widehat{A}_1^{-1}, \widehat{A}_2^{-1}, \dots\), and only compute \(A_k^{-1}\) for a larger \(k\) when it is needed in a Newton iteration. In this case, the asymptotic cost of the IR scheme with \(i\) refinement steps performed is \(\max_{0\le \ell\le i} 2k_\ell n^3\) flops at precision \(u_s\), where \(k_\ell\) denotes the number of Newton iterations performed at the \(\ell\)th refinement step. Suppose this maximal number of Newton iterations in a call of Algorithm [alg46newton-lyapu-chol] (or Algorithm [alg46newton-lyapu-ldlt]) across all refinement steps of Algorithm [alg46ir3-lyap-chol] (or Algorithm [alg46ir3-lyap-ldlt]) is \(k_h^{\max} := \max_{0\le \ell\le i} k_\ell^{(h)}\) for \(u_s=\) fp16 (or bf16), \(k_s^{\max} := \max_{0\le \ell\le i} k_\ell^{(s)}\) for \(u_s=\) fp32, and \(k_d^{\max} := \max_{0\le \ell\le i} k_\ell^{(d)}\) for \(u_s=\) fp64, respectively. Then, to compare the computational costs at different solver precisions, we need to gauge the ratios \(k_h^{\max}/k_d^{\max}\) and \(k_s^{\max}/k_d^{\max}\).
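A simple way to realize this reuse is to memoize the computed inverses across solver calls, as in the following MATLAB sketch; invCache is an illustrative cell array kept alive across all refinement steps.
function [Akinv, invCache] = cached_inverse(Ak, k, invCache)
% Illustrative sketch: reuse A_k^{-1} across calls of the Newton solver.
    if k <= numel(invCache) && ~isempty(invCache{k})
        Akinv = invCache{k};              % reuse: no O(n^3) work
    else
        Akinv = inv(Ak);                  % compute once and store
        invCache{k} = Akinv;
    end
end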
The precomputed sequence \(\widehat{A}_1^{-1}, \widehat{A}_2^{-1}, \dots\) can be stored in the cache, or in RAM if it does not fit. In either case, the price to pay for reducing the computational cost is the increased data movement associated with accessing the sequence. This approach is therefore most effective when the precomputed sequence fits into the cache, so that the additional communication cost becomes negligible. Nevertheless, the matrix sequence computed by the solver running in lower precisions requires less storage, thus mitigating the additional communication cost.
In this section we evaluate the performance of the mixed-precision IR frameworks using as the solver the sign function Newton iterations, Algorithm [alg46newton-lyapu-chol] or Algorithm [alg46newton-lyapu-ldlt]. We gauge the quality of the computed solution factors by the relative residual of the equation 1 in the Frobenius norm, given by \[\label{eq:res-fac-sol} {\text{res}}(\widehat{Z}) = \frac{\norm{A\widehat{Z}\widehat{Z}^T+\widehat{Z}\widehat{Z}^T A^T+W}_F}{\norm{W}_F + 2\norm{\widehat{Z}\widehat{Z}^T}_F\norm{A}_F}, \quad or \quad {\text{res}}(\widehat{Z}, \widehat{Y}) = \frac{\norm{A\widehat{Z}\widehat{Y}\widehat{Z}^T+\widehat{Z}\widehat{Y}\widehat{Z}^T A^T+W}_F}{\norm{W}_F + 2\norm{\widehat{Z}\widehat{Y}\widehat{Z}^T}_F\norm{A}_F},\tag{63}\] depending on the type of solver used in the algorithm.
The residuals are computed in fp64 throughout. The experiments were run using the 64-bit GNU/Linux version of MATLAB 24.2 (R2024b Update 3) on a desktop computer equipped with an Intel i5-12600K processor running at 3.70 GHz and with 32 GiB of RAM. The code that produces the results in this section is available on GitHub.2 We use bf16 as the half precision format, which was simulated by using the chop3 function [36].
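For reference, the residual check 63 amounts to the following MATLAB sketch in the Cholesky-type case, where Zhat denotes the computed factor; the \(LDL^T\)-type case inserts \(\widehat{Y}\) accordingly.
Xhat = Zhat*Zhat';                        % Zhat*Yhat*Zhat' for the LDL^T type
res  = norm(A*Xhat + Xhat*A' + W, 'fro') / ...
       (norm(W, 'fro') + 2*norm(Xhat, 'fro')*norm(A, 'fro'));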
We tried the different scaling schemes mentioned in Section 3, as well as the combined scaling (with the Frobenius norm and the determinant) recommended in [33] for the Newton iteration, but no technique brought a decisive benefit over the others on our test sets. In particular, scalings that require the computation of the determinant of a large matrix are more prone to underflow and overflow issues in low precision. In our implementation we use the Frobenius-norm scaling, and we also monitor the relative change of the iterates, \[\delta_k := \norm{A_k - A_{k-1}}_F / \norm{A_k}_F,\] both to control the scaling and to detect premature termination of the iterations. We adopt the strategy of Higham [31] and stop scaling the iterations once \(\delta_k< 10^{-2}\), which avoids the interference of nonoptimal scaling parameters with the convergence once the region of convergence is reached. Higham also found that \(\delta_k>\delta_{k-1}/2\) (\(\delta_k\) has not decreased by at least a factor of \(2\)) is a good indicator of the dominance of roundoff errors. Therefore, together with the stopping criterion 62, we use this condition to decide whether to terminate the Newton iteration after two more steps.
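In sketch form, the two heuristics amount to the following MATLAB fragment, with illustrative variable names.
delta = norm(Ak - Akm1, 'fro') / norm(Ak, 'fro');
if delta < 1e-2, scaling_on = false; end  % stop scaling near convergence
if delta > delta_prev/2                   % roundoff dominates the progress
    extra_steps = 2;                      % run two more steps, then stop
end
delta_prev = delta;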
In our implementation of Algorithm [alg46ir3-lyap-chol] and Algorithm [alg46ir3-lyap-ldlt], we monitor the ratio of two successive relative residuals 63, defined by \[\label{eq:it-early-terminate} \theta_i := {\text{res}}_i / {\text{res}}_{i-1}.\tag{64}\] A ratio \(\theta_i\) close to \(1\) means that little improvement was made in the previous refinement step. We therefore terminate the refinement process if \(\theta_i>0.9\) (the residual is reduced by less than \(10\%\)) for two consecutive iterations. We found that this scheme effectively signals stagnation of the refinement process, especially for ill-conditioned problems.
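A sketch of this test, with res_hist an illustrative array holding the residual history:
theta = res_hist(i) / res_hist(i-1);      % ratio (64)
stalled = theta > 0.9;                    % residual reduced by less than 10%
if stalled && stalled_prev, break, end    % two consecutive slow steps: stop
stalled_prev = stalled;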
The global convergence tolerance of the IR framework is set to \(\tau_I=nu\), with a maximum of \(i_{\max} = 50\) refinement steps. The maximal number of iterations for Algorithm [alg46newton-lyapu-chol] and Algorithm [alg46newton-lyapu-ldlt] is set to \(k_{\max}=50\). The parameter \(\rho\ll 1\) controls the timing of the rank truncation in the solver, but an optimal, or even suitable, value of \(\rho\) depends on the numerical rank of the sought solution factor, as discussed in Section 3.3. Since the coefficient matrix \(L\) in 1 has smaller dimension \(m< 0.1 n\) in our experiments, we set \(\rho=0.1\) accordingly.
According to our analyses in Section 2.2 and Section 2.4, the spectral splitting tolerance \(\eta_s>0\) in Fragment [alg46sol-update-chol] and Fragment [alg46sol-update-ldlt] is set to \(10u\), and we choose the other spectral splitting threshold \(\eta_r=10^{-4}\) in Fragment [alg46res-fac-chol] and Fragment [alg46res-fac-ldlt].
The experiments of this section are on low-rank Lyapunov equations constructed from pseudorandom matrices with a prescribed order of condition number. The coefficient matrices in 1 were generated with the MATLAB code
L = randn(n, m);                        % low-rank factor of the right-hand side
W = L * L.';                            % W is positive semidefinite of rank m
V = gallery('orthog', n);               % orthogonal matrix
A = - V .* (logspace(0, q, n)) * V.';   % A = -V*D*V', D = diag(logspace(0,q,n))
setting \(m=3\). Note that this code generates a symmetric coefficient matrix \(A\), but the symmetry is not exploited in the algorithms; the construction via \(V\) serves to control the condition of the Lyapunov equation, such that \(\kappa_F(\mathcal{L})\approx 10^q\), \(q\ge 0\). For the \(LDL^T\)-type solver, the inner factor matrix \(S\) of \(W=LSL^T\) is initialized to the identity matrix of order \(m\).
We examine the quality of the computed solutions via 63 under the precision settings listed in Table [table:ir3]. The order of the coefficient matrix \(A\) is set to \(n=100\) and \(n=1000\). In total, with varying condition numbers and sizes, \(26\) different low-rank Lyapunov equations 1 are tested.
Cholesky-type | \(u=\) fp32 | \(u=\) fp64 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
\(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) fp64 | ||||||||||||
\(n\) | \(\mathop{\mathrm{cond}}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) |
\(100\) | 1.3e0 | 1.4e-6 | 4(2) | 6 | 6.4e-7 | 2(2) | 6 | 2.7e-16 | 16(2) | 15 | 6.2e-15 | 6(2) | 16 | 6.9e-17 | 4(4) | 15 |
3.2e0 | 1.8e-7 | 8(2) | 9 | 3.5e-7 | 3(3) | 9 | 2.8e-15 | 18(2) | 24 | 4.6e-15 | 9(3) | 25 | 7.1e-17 | 5(5) | 24 | |
1.0e1 | 7.3e-7 | 8(2) | 12 | 4.3e-8 | 4(4) | 12 | 6.4e-16 | 22(2) | 33 | 7.1e-17 | 12(4) | 33 | 1.8e-15 | 5(5) | 33 | |
3.2e1 | 2.0e-6 | 12(2) | 13 | 3.0e-7 | 4(4) | 13 | 7.0e-15 | 36(2) | 43 | 4.2e-15 | 12(4) | 45 | 9.5e-17 | 6(6) | 42 | |
1.0e2 | 3.1e-6 | 22(2) | 15 | 4.9e-8 | 5(5) | 15 | 7.7e-15 | 76(2) | 49 | 6.1e-16 | 15(5) | 49 | 1.0e-16 | 6(6) | 49 | |
3.2e2 | 4.7e-6 | 27(3) | 20 | 6.8e-8 | 5(5) | 16 | 3.7e-15 | 75(3) | 57 | 8.3e-16 | 15(5) | 57 | 7.4e-17 | 7(7) | 56 | |
1.0e3 | 3.8e-4 | 9(3) | – | 3.2e-8 | 5(5) | 16 | 2.9e-4 | 21(3) | – | 1.8e-16 | 15(5) | 62 | 6.3e-17 | 7(7) | 62 | |
3.2e3 | 5.7e-4 | 30(5) | – | 6.0e-8 | 5(5) | 16 | 8.4e-4 | 15(5) | – | 7.9e-16 | 15(5) | 69 | 8.3e-17 | 7(7) | 68 | |
\(1000\) | 1.3e0 | 2.9e-7 | 4(2) | 6 | 2.4e-7 | 2(2) | 6 | 1.1e-13 | 16(2) | 43 | 1.9e-15 | 6(2) | 12 | 1.5e-17 | 4(4) | 12 |
3.2e0 | 1.1e-6 | 4(2) | 6 | 1.3e-7 | 3(3) | 6 | 6.5e-14 | 20(2) | 72 | 1.9e-15 | 9(3) | 22 | 1.4e-17 | 5(5) | 22 | |
1.0e1 | 1.1e-5 | 4(2) | 9 | 7.3e-6 | 3(3) | 9 | 1.6e-14 | 28(2) | 31 | 9.6e-15 | 12(3) | 31 | 8.6e-16 | 5(5) | 31 | |
3.2e1 | 2.6e-5 | 6(2) | 10 | 1.2e-7 | 4(4) | 10 | 2.6e-5 | 10(2) | – | 1.9e-15 | 12(4) | 39 | 2.8e-16 | 6(6) | 39 | |
1.0e2 | 7.4e-5 | 10(2) | – | 5.7e-7 | 4(4) | 11 | 7.4e-5 | 10(2) | – | 1.6e-16 | 16(4) | 48 | 2.1e-16 | 6(6) | 48 | |
3.2e2 | 2.0e-4 | 10(2) | – | 5.8e-9 | 5(5) | 12 | 2.0e-4 | 10(2) | – | 2.2e-16 | 15(5) | 55 | 3.4e-16 | 7(7) | 55 | |
1.0e3 | 1.1e-4 | 9(3) | – | 7.0e-9 | 5(5) | 12 | 1.1e-4 | 9(3) | – | 1.8e-15 | 15(5) | 70 | 2.7e-16 | 7(7) | 61 | |
3.2e3 | 5.8e-5 | 12(3) | 16 | 1.0e-8 | 5(5) | 10 | 9.6e-6 | 48(3) | – | 3.2e-14 | 15(5) | 100 | 2.9e-16 | 7(7) | 69 | |
\(LDL^T\)-type | \(u=\) fp32 | \(u=\) fp64 | ||||||||||||||
\(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) fp64 | ||||||||||||
\(n\) | \(\mathop{\mathrm{cond}}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) |
\(100\) | 1.3e0 | 1.6e-6 | 4(2) | 6 | 6.4e-7 | 2(2) | 6 | 1.0e-15 | 12(2) | 15 | 6.8e-15 | 6(2) | 18 | 8.4e-17 | 4(4) | 15 |
3.2e0 | 3.9e-6 | 4(2) | 10 | 3.3e-7 | 3(3) | 9 | 1.3e-16 | 14(2) | 24 | 4.7e-15 | 9(3) | 25 | 9.1e-17 | 5(5) | 24 | |
1.0e1 | 2.8e-6 | 6(2) | 12 | 2.6e-8 | 4(4) | 12 | 2.5e-15 | 20(2) | 33 | 1.2e-16 | 12(4) | 33 | 1.8e-15 | 5(5) | 33 | |
3.2e1 | 2.9e-6 | 10(2) | 13 | 3.0e-7 | 4(4) | 13 | 7.8e-15 | 36(2) | 42 | 4.4e-15 | 12(4) | 44 | 6.7e-17 | 6(6) | 42 | |
1.0e2 | 4.1e-6 | 18(2) | 16 | 3.3e-8 | 5(5) | 15 | 6.8e-15 | 74(2) | 49 | 2.9e-16 | 15(5) | 49 | 6.2e-17 | 6(6) | 49 | |
3.2e2 | 3.3e-6 | 18(3) | 16 | 2.6e-8 | 5(5) | 16 | 8.9e-15 | 66(3) | 62 | 2.6e-16 | 15(5) | 56 | 7.7e-17 | 7(7) | 56 | |
1.0e3 | 1.1e-4 | 33(3) | – | 4.4e-8 | 5(5) | 16 | 1.3e-4 | 30(3) | – | 3.2e-16 | 15(5) | 62 | 8.8e-17 | 7(7) | 62 | |
3.2e3 | 1.0e-3 | 15(5) | – | 3.3e-8 | 5(5) | 16 | 7.1e-4 | 15(5) | – | 1.3e-15 | 15(5) | 68 | 9.7e-17 | 7(7) | 68 | |
\(1000\) | 1.3e0 | 2.5e-7 | 4(2) | 6 | 2.4e-7 | 2(2) | 6 | 5.2e-14 | 10(2) | 22 | 1.4e-15 | 6(2) | 12 | 1.5e-17 | 4(4) | 12 |
3.2e0 | 1.3e-6 | 4(2) | 6 | 1.3e-7 | 3(3) | 6 | 2.7e-14 | 12(2) | 33 | 1.8e-15 | 9(3) | 22 | 1.4e-17 | 5(5) | 22 | |
1.0e1 | 1.2e-5 | 4(2) | 9 | 7.3e-6 | 3(3) | 9 | 8.0e-15 | 18(2) | 31 | 9.6e-15 | 12(3) | 31 | 8.6e-16 | 5(5) | 31 | |
3.2e1 | 2.5e-5 | 6(2) | 11 | 1.2e-7 | 4(4) | 10 | 7.8e-14 | 32(2) | 90 | 1.2e-15 | 12(4) | 39 | 2.1e-17 | 6(6) | 39 | |
1.0e2 | 3.6e-5 | 8(2) | 13 | 5.7e-7 | 4(4) | 11 | 6.6e-14 | 72(2) | 48 | 1.3e-16 | 16(4) | 48 | 2.7e-17 | 6(6) | 48 | |
3.2e2 | 2.0e-4 | 102(2) | – | 5.8e-9 | 5(5) | 12 | 2.0e-4 | 102(2) | – | 8.5e-17 | 15(5) | 55 | 2.6e-17 | 7(7) | 55 | |
1.0e3 | 1.1e-4 | 21(3) | – | 6.9e-9 | 5(5) | 12 | 1.1e-4 | 21(3) | – | 1.2e-16 | 15(5) | 61 | 2.6e-17 | 7(7) | 61 | |
3.2e3 | 5.3e-5 | 12(3) | 16 | 1.0e-8 | 5(5) | 10 | 1.5e-7 | 84(3) | – | 1.7e-15 | 15(5) | 77 | 2.6e-17 | 7(7) | 69 |
The results are presented in Table 1. Clearly, the required number of Newton iterations and the numerical rank of the computed solutions increase as the problem becomes more ill conditioned. The Cholesky-type and \(LDL^T\)-type IR schemes generally behave similarly, though the \(LDL^T\)-type IR appears to converge slightly faster, especially when the solver precision \(u_s=\) bf16. We found that both algorithms converge in both single and double working precision for Lyapunov equations with condition number up to about \(10^7\) when \(u_s=\) fp32 (not presented). In contrast, decreasing the solver precision \(u_s\) to bf16 limits the range of problems over which the IR scheme converges: for \(n=100\), it converges for problems with condition number up to about \(10^{2.5}\); for \(n=1000\), this threshold reduces to approximately \(10^{2}\), though the algorithm appears to converge on one problem with condition number approximately \(10^{3.5}\).
The results are largely in agreement with the condition number bounds in Table [table:ir3], but they also display the instability of the sign function Newton iteration in floating-point arithmetic, which occurs when the matrix \(A\) has eigenvalues close to the imaginary axis [37] and may be indicated by a large \(\kappa_{\mathop{\mathrm{sign}}}(B)\) [31], where \(B = \begin{bsmallmatrix} A & W \\ 0 & -A^T \end{bsmallmatrix}\). In particular, we found that the solution update of Fragment [alg46sol-update-chol] and Fragment [alg46sol-update-ldlt] was performed accurately with \(u_c=u\): the \(b_3\) of 28 or 49 remains moderate and approaches \(1\) as the refinement proceeds.
For each working precision \(u\), we see that \(k^{\sigma}_h/k^{\sigma}_s\), the ratio between the total number of Newton iterations with \(u_s=\) bf16 and that with \(u_s=\) fp32, is approximately \(2\) for well-conditioned problems, say, those with condition number no larger than \(10^{1.5}\). But for problems with condition number close to \(10^3\), this ratio can be much larger. A similar trend is observed for the ratio \(k^{\sigma}_s/k^{\sigma}_d\), the ratio between the total number of Newton iterations with \(u_s=\) fp32 and that with \(u_s=\) fp64. This implies that a speedup by a factor of up to four can be achieved when solving well-conditioned problems (with respect to the solver precision \(u_s\)), by reducing \(u_s\) from fp64 to fp32 or from fp32 to bf16.
If we turn to the maximal number of Newton iterations in a single call of the solver, we see that the ratios \(k_h^{\max}/k_s^{\max}\) and \(k_s^{\max}/k_d^{\max}\) are never larger than \(1\), owing to the looser stopping tolerance in the reduced precision. This reveals the large potential of exploiting reduced precisions in the IR framework to lower the computational cost and hence accelerate the solver in the cache-fit scenario; see the discussion in Section 3.4. Consider the case where \(n=1000\) and cond \(=10^2\), for example. In the working precision \(u=\) fp64, the \(LDL^T\)-type solver only needs to compute two \(n\times n\) matrix inversions in \(u_s=\) bf16, whereas six such inversions are required if \(u_s=\) fp64. This amounts to a \(12\times\) to \(48\times\) theoretical speedup from switching to the low-precision solver, provided the communication cost and the other non-dominant computational costs are negligible. The caveat is the limited range of problems on which the lower-precision solver converges.
Dataset | \(n\) | \(m\) | Nonzeros | \(\kappa_F(A)\) | \(\kappa_{\mathop{\mathrm{sign}}}(B)\) |
---|---|---|---|---|---|
beam | 348 | 1 | 60,726 | \(1.2\mathrm{e}{7}\) | \(1.4\mathrm{e}{9}\) |
build | 48 | 1 | 1,176 | \(7.5\mathrm{e}{4}\) | \(2.0\mathrm{e}{6}\) |
CDplayer | 120 | 2 | 240 | \(1.4\mathrm{e}{5}\) | \(4.1\mathrm{e}{10}\) |
eady | 598 | 1 | 357,406 | \(3.6\mathrm{e}{3}\) | \(3.8\mathrm{e}{6}\) |
fom | 1,006 | 1 | 1,012 | \(2.3\mathrm{e}{4}\) | \(5.2\mathrm{e}{5}\) |
heat-cont | 200 | 1 | 598 | \(1.5\mathrm{e}{5}\) | \(1.0\mathrm{e}{4}\) |
iss | 270 | 3 | 405 | \(2.5\mathrm{e}{5}\) | \(1.4\mathrm{e}{11}\) |
pde | 84 | 1 | 382 | \(1.2\mathrm{e}{2}\) | \(1.6\mathrm{e}{2}\) |
random | 200 | 1 | 2,132 | \(3.2\mathrm{e}{3}\) | \(2.3\mathrm{e}{10}\) |
Cholesky-type | \(u=\) fp32 | \(u=\) fp64 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
\(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) fp64 | |||||||||||
\({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | |
beam | 1.5e-3 | 165(15) | – | 1.6e-7 | 14(14) | 54 | 2.8e-3 | 48(16) | – | 1.6e-7 | 42(14) | – | 3.1e-16 | 16(16) | 134 |
build | 2.0e-4 | 36(12) | – | 7.8e-8 | 14(14) | 35 | 2.1e-4 | 36(12) | – | 3.7e-16 | 84(14) | 48 | 3.9e-17 | 15(15) | 48 |
CDplayer | 4.3e-6 | 8(8) | 2 | 1.4e-9 | 16(16) | 10 | 5.9e-8 | 112(8) | – | 1.4e-16 | 48(16) | 116 | 1.9e-16 | 18(18) | 116 |
eady | 2.1e-3 | 36(12) | – | 1.3e-7 | 15(15) | 12 | 2.3e-3 | 47(12) | – | 2.8e-14 | 45(15) | 98 | 2.6e-16 | 16(16) | 88 |
fom | 2.4e-3 | 12(3) | – | 6.6e-9 | 13(13) | 12 | 2.4e-3 | 12(3) | – | 8.2e-16 | 39(13) | 29 | 7.8e-17 | 15(15) | 27 |
heat-cont | 4.2e-4 | 12(4) | – | 2.0e-8 | 7(7) | 10 | 4.1e-4 | 12(4) | – | 2.2e-15 | 28(7) | 27 | 3.7e-17 | 9(9) | 26 |
iss | 1.2e-5 | 14(14) | 11 | 2.0e-8 | 21(21) | 46 | 6.5e-6 | 60(15) | – | 2.0e-8 | 63(21) | – | 4.9e-17 | 23(23) | 223 |
pde | 7.4e-7 | 8(2) | 5 | 2.9e-8 | 4(4) | 5 | 9.2e-16 | 22(2) | 11 | 5.7e-17 | 12(4) | 11 | 1.0e-16 | 6(6) | 11 |
random | 6.7e-4 | 21(7) | – | 6.5e-8 | 15(15) | 2 | 8.1e-4 | 19(7) | – | 2.2e-8 | 75(15) | – | 1.3e-16 | 16(16) | 24 |
\(LDL^T\)-type | \(u=\) fp32 | \(u=\) fp64 | |||||||||||||
\(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) bf16 | \(u_s=\) fp32 | \(u_s=\) fp64 | |||||||||||
\({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | \({\text{res}}\) | \(\text{iter}\) | \(\text{rank}\) | |
beam | 3.1e-3 | 45(15) | – | 2.2e-7 | 14(14) | 54 | 9.2e-4 | 80(16) | – | 2.0e-14 | 70(14) | 143 | 3.2e-16 | 16(16) | 134 |
build | 1.5e-4 | 36(12) | – | 4.5e-8 | 14(14) | 35 | 1.5e-4 | 60(12) | – | 7.6e-16 | 70(14) | 48 | 6.0e-17 | 15(15) | 48 |
CDplayer | 4.1e-6 | 8(8) | 2 | 7.4e-8 | 16(16) | 10 | 3.3e-8 | 128(8) | – | 1.8e-17 | 64(16) | 116 | 6.7e-17 | 18(18) | 116 |
eady | 2.5e-5 | 71(12) | 21 | 9.2e-8 | 15(15) | 12 | 2.2e-7 | 263(12) | – | 7.3e-12 | 60(15) | – | 2.3e-16 | 16(16) | 88 |
fom | 2.4e-3 | 12(3) | – | 2.6e-7 | 13(13) | 12 | 2.4e-3 | 12(3) | – | 3.6e-15 | 39(13) | 29 | 5.7e-16 | 15(15) | 27 |
heat-cont | 4.1e-4 | 12(4) | – | 2.4e-8 | 7(7) | 10 | 3.9e-4 | 12(4) | – | 1.0e-16 | 28(7) | 26 | 5.3e-17 | 9(9) | 26 |
iss | 1.2e-5 | 14(14) | 6 | 1.8e-8 | 21(21) | 46 | 3.1e-5 | 60(15) | – | 1.8e-8 | 63(21) | – | 2.4e-17 | 23(23) | 223 |
pde | 3.6e-7 | 8(2) | 5 | 2.8e-8 | 4(4) | 5 | 3.4e-15 | 18(2) | 11 | 1.4e-16 | 12(4) | 11 | 5.4e-17 | 6(6) | 11 |
random | 6.4e-5 | 91(7) | – | 4.8e-8 | 15(15) | 2 | 9.9e-5 | 60(7) | – | 4.6e-16 | 60(15) | 24 | 9.8e-17 | 16(16) | 24 |
Finally, we evaluate the performance of the IR algorithms on Lyapunov equations from the SLICOT library4 of benchmark examples of model reduction problems [38]. Key characteristics of the test problems are listed in Table 2.
For each dataset, we estimate \(\kappa_F(A)\) as well as \(\kappa_{\mathop{\mathrm{sign}}}(B)\) in the Frobenius norm. The latter was estimated in double precision by using the funm_condest_fro function from the Matrix Function Toolbox [31]. Since the sign function Newton iteration 52 for solving the Lyapunov equation 1 essentially computes \(\mathop{\mathrm{sign}}(B)\), the value of \(\kappa_{\mathop{\mathrm{sign}}}(B)\) is useful for predicting the accuracy of the Newton solver in floating-point arithmetic. Indeed, the sign function Newton iteration can be numerically rather unstable even for mildly ill-conditioned small and dense problems [31].
The numerical results are presented in Table 3. Perhaps not surprisingly, when \(u_s=\) bf16 the IR framework with both types of solvers reached convergence only on the few problems that are relatively well conditioned. For the other problems, the limiting residual presented in Table [table:ir3] is clearly irrelevant, as the iterates failed to approach the solution sufficiently closely. With \(u_s=\) fp32, the algorithm converges in most cases, except on three problems for which \(\kappa_{\mathop{\mathrm{sign}}}(B)\) has magnitude \(10^{10}\); this ill-conditioning appears to have prevented the algorithm from reaching a relative residual of the order of the unit roundoff when \(u=\) fp64. Since most problems in the dataset are moderately to severely ill conditioned, the total number of Newton iterations across all refinement steps typically more than doubles when the solver precision \(u_s\) decreases from fp64 to fp32. However, the ratios \(k_h^{\max}/k_s^{\max}\) and \(k_s^{\max}/k_d^{\max}\) are below \(1\) in all cases where both solvers converge, which once again demonstrates the potential for acceleration on well-conditioned problems by using low precision in the solver.
We have developed a mixed-precision IR framework for the factored solution of low-rank Lyapunov equations, in either Cholesky-type or \(LDL^T\)-type formulation. Guided by rounding error analysis, we analyzed how to utilize mixed precision and choose the algorithmic parameters within the IR framework. We then focused on the case where the solver is the sign function Newton iteration, and we developed an \(LDL^T\)-type sign function Newton iteration, enabling the refinement of a computed solution from an indefinite residual. Our numerical experiments indicate that reduced precision can be employed as the solver precision to accelerate the solution of Lyapunov equations without compromising accuracy.
This work is a first step towards exploiting emerging reduced-precision formats, such as half precision, in solving the low-rank Lyapunov equation. Future lines of research include the implementation of the IR algorithms on hardware that natively supports low-precision formats. Investigating the use of reduced precision with other popular Lyapunov equation solvers, such as ADI-based and Krylov-based methods, within the mixed-precision IR framework is also an interesting problem. It might also be possible to accelerate the IR algorithm by replacing the sign function Newton iteration with an inexact Kleinman–Newton solver [39] that adaptively tightens the convergence tolerance as the refinement proceeds.
The authors thank Massimiliano Fasi and Jonas Schulze for their helpful comments on a draft manuscript.