April 01, 2024

*This paper develops a numerical procedure to accelerate the convergence of the Favre-averaged Non-Linear Harmonic (FNLH) method. The scheme provides a unified mathematical framework for solving the sparse linear systems formed by the mean flow and
the time-linearized harmonic flows of FNLH in an explicit or implicit fashion. The approach explores the similarity of the sparse linear systems of FNLH and leads to a memory efficient procedure, so that its memory consumption does not depend on the number
of harmonics to compute. The proposed method has been implemented in the industrial CFD solver HYDRA. Two test cases are used to conduct a comparative study of explicit and implicit schemes in terms of convergence, computational efficiency, and memory
consumption. Comparisons show that the implicit scheme yields better convergence than the explicit scheme and is also roughly 7 to 10 times more computationally efficient than the explicit scheme with 4 levels of multigrid. Furthermore, the implicit scheme
consumes only approximately \(50\%\) of the explicit scheme with four levels of multigrid. Compared with the full annulus unsteady Reynolds averaged Navier-Stokes (URANS) simulations, the implicit scheme produces comparable
results to URANS with computational time and memory consumption that are two orders of magnitude smaller.*

Computational Fluid Dynamics (CFD) has been routinely used in the aerodynamic design of fluid machinery. A fluid machinery (i.e. turbomachinery) can involve a series of rotational and stationary rows of blades to exchange energy between the machine and the working fluid. Even though the exchange of energy between the working fluid and the machine is dependent on the unsteadiness of the flow [@comp_aero_book], steady Reynolds Averaged Navier-Stokes (RANS) has been the workhorse in the industry for designing turbomachinery. However, unsteady interactions among turbomachinery components can be significant and should be considered at the design stage. Unsteady RANS (URANS) simulation is an approach to model these unsteady interactions, but they can increase the computational cost by several orders of magnitude compared to the steady RANS approach. The industry has had a significant interest in developing computationally efficient design methods to model these unsteady interactions. This work is a step in this direction.

If the effect of stochastic flow perturbations on the mean flow can be approximated by a suitable turbulence model, the unsteady flow in a turbomachine can be effectively decomposed into a time-mean flow and an unsteady periodic perturbation. Periodic fluctuations are deterministic and can be computed efficiently by harmonic methods, which transform unsteady time domain problems into steady problems in the frequency domain. This removes the constraint of global time stepping in the nonlinear time marching approach (i.e. URANS). For multistage simulations, phase-lag boundaries can be applied to the periodic boundaries, and this effectively avoids the need to model multiple passages or the whole wheel in the simulations. Nevertheless, it is noted that this approach ignores the interactions between the stochastic and the periodic perturbations; besides, the frequencies of the periodic perturbations normally need to be known a priori. Addressing these limitations is out of the scope of this work and will be investigated in future work. This work concerns the family of harmonic methods that couples a mean flow solver and a time-linearized NS flow solver in the frequency domain. The coupling between the mean flow and the perturbations can be evaluated using the flow perturbations of the linearized NS solver and the mean flow variables. He and Ning [@He1998] proposed the first type of such approach, which is called the Non-Linear Harmonic (NLH) method, but the authors did not elucidate the conversion among conservative and primitive variables in the mean flow. Wang and di Mare [@fnlh] proposed a more elaborate and rigorous approach to include the non-linear coupling between the mean flow and the flow perturbations for compressible flows, and the approach is termed the Favre-averaged NLH (FNLH) method. FNLH can be used to simulate unsteady rotor-stator interactions [@fnlh], multi-row [@fnlh_multirow] interactions and more recently multi-shaft [@Wang2023] unsteady interactions. Furthermore, FNLH can also be used to efficiently simulate non-uniform flow fields in an assembly of bladerows with non-uniform geometries [@sfnlh_aiaa; @fan_sfnlh_jot]. The other family of harmonic methods is the harmonic balance method [@Hall_hb_2002]. This approach does not require a linearized NS solver, but it solves a finite set of steady-state solutions at fixed time levels in the time domain simultaneously. All of these time-level solutions are coupled through periodic boundaries and a spectral source term. Readers can refer to Hall et al. [@Hall2013] for more details on this approach.

Harmonic methods (i.e. FNLH) use pseudo-time stepping to solve the unsteady problem in the Fourier space [@fnlh]. For each pseudo-time step, the mean flow and the time linearized flow need to be computed. Convergence acceleration techniques are crucial to the computational performance of harmonic methods. NLH used an explicit scheme with geometric multigrid to accelerate its convergence. To the best knowledge of the authors, there is no previous research in the public domain on developing an implicit scheme for NLH or FNLH. However, it is worth mentioning that there has been previous work on developing implicit schemes to accelerate the convergence of the Harmonic Balance (HB) method, such as the work of Ref [@Su_2009; @Woodgate_2009; @Sicot2008; @Frey2014], as harmonic balance method does not need to solve a dedicated time-linearized solver, and hence these approaches cannot be used directly to accelerate the convergence of FNLH or NLH.

There has been extensive research on the acceleration of convergence of a steady NS flow solver [@JAMESON1991; @Yoon1988; @saad_book] and also a time-linearized flow solver [@Campobasso2003] using explicit or implicit approaches. This work will bridge the gap between convergence acceleration techniques in these two aspects and develop an implicit scheme to accelerate the convergence of FNLH. This is the first contribution of this paper. The second contribution of this paper is to formulate a unified formulation to solve the sparse linear systems of FNLH in an explicit or implicit fashion. The similarity of the sparse pattern for the linear systems of the mean flow and the time-linearized flow is explored to develop a memory-efficient scheme. The third contribution is to conduct a comparative study of explicit and implicit schemes regarding convergence, computational efficiency, and memory usage. The proposed method is implemented in the industrial CFD solver HYDRA [@moinier_thesis], which a major aero-engine manufacturer routinely uses for aerothermal design.

The paper is organized as follows: the numerical algorithms of HYDRA are briefly introduced in the first place, and the FNLH method is then described. The generalized RK scheme for FNLH is then detailed. The performance of this generalized explicit/implicit RK scheme is demonstrated. This is followed by conclusions and future work.

The numerics of HYDRA are briefly described here, and readers can refer to Ref [@hydra_mg; @Campobasso2003; @moinier_thesis] for more details. HYDRA is a finite-volume CFD solver and uses the edge-based data structure, and this can be illustrated by Fig. 1. The solver is second-order accurate in space. Inviscid flux is calculated with the Jameson-Schmidt-Turkel (JST) scheme using matrix dissipation [@JAMESON1981]. The viscous flux is computed using the central difference. To calculate the flow gradient, the gradient on each edge \((i,j)\) is first calculated using a mixture of the central difference and the average flow gradient at nodes \(i\) and \(j\). Gradient estimates for each node are given by the Green-Gauss gradients [@moinier_thesis]. A set of turbulence models is implemented in HYDRA. In this work, the Spalart-Allmaras (SA) [@spalart1992one] turbulent model with wall function is used. With respect to time integration, the hybrid five-stage Runge-Kutta (RK) scheme is implemented, where the non-linear viscous residuals are evaluated only at odd stages. This scheme has been widely used in the literature and demonstrated good performance [@Swanson2007; @moinier_thesis; @Turkel1993] . To accelerate the convergence of the non-linear solver, geometric multigrid [@Moinier2002] and incomplete matrix factorization [@misev_thesis] are implemented. The former leads to an explicit RK scheme, and the latter leads to an implicit RK scheme.

HYDRA also has an existing time-linearized flow solver and TAPENADE [@Hascoet2013] is used to perform the source code transformation to automatically differentiate the non-linear solver. Only the subroutines that contribute to the right-hand side of the nonlinear solver are differentiated, but the time-stepping for the time-linearized solver is still coded manually. Although a time-linearized flow solver solves the real and imaginary parts of the Fourier components of the flow perturbations, the subroutines of the differentiated code from TAPENDA are for flow variables in real numbers and they will be used sequentially to evaluate the real and imaginary parts of the residuals of the time-linearized solver . The explicit RK scheme with multigrid is used to accelerate the convergence of the time-linearized solver, but no implicit solver is implemented so far. FNLH requires a mean flow solver and a time-linearized flow solver, both of which are available in HYDRA, therefore, implementing FNLH in HYDRA does not require a significant modification of the code.

The Favre-averaged Non-Linear Harmonic (FNLH) method is a computational framework which efficiently couples the mean flow and its finite-amplitude periodic perturbations for compressible flows in the frequency domain. Here we provide a brieft description of FNLH and the details of FNLH can be found in Wang and di Mare [@fnlh; @fnlh_multirow; @sfnlh_aiaa].

For a flow that is subject to periodic disturbance, the primitive variables of the \(i^{\text{th}}\) control volume \(\mathbf{W}_i=(\rho, u, v, w, p)^T\) can be decomposed as: \[\mathbf{W}_i = \hat{\mathbf{W}}_i + {\mathbf{W}}''_i \label{eqn:flow95decomp}\tag{1}\] in which \(\hat{\mathbf{W}}_i\) is the Favre-averaged primitive variable and \({\mathbf{W}}''_i\) is the flow perturbation relative to the Favre-averaged value. It is noted that, for Favre-averaging, density \(\rho\) and pressure \(p\) are still time-averaged. Their perturbations are denoted as \(\rho'\) and \(p'\). The conversion between Favre-averaged and time-averaged conservative variables \(\bar{\mathbf{Q}}_i=(\bar{\rho}, \overline{\rho u}, \overline{\rho v}, \overline{\rho w}, \overline{\rho E})^T\) is trivial [@cebeci_book], for example \(\overline{\rho u} = \bar{\rho} \tilde{u}\).

In the context of periodic unsteady flows in turbomachinery, considering the schematic in Fig. 2, which features a stator-rotor-stator setting. The flow field of B3 can be decomposed to the combination of a passage-averaged time-mean value, an unsteady disturbance and a stationary passage-to-passage disturbance. If we consider the stationary disturbance as a "slowly moving" unsteady disturbance and its frequency is approaching zero. The flow field in B3 can be decomposed as a passage-averaged time-mean value \(\tilde{\phi}\) and an unsteady flow disturbance \(\phi''\), whose frequency can be zero.

If the perturbation of a dummy variable \({\phi}''\) of control volume \(i\) can be represented as the harmonic form: \[\label{eqn::phi95fft} \phi''_i \approx \sum^{l=N_h}_{l=-N_h} \hat{\phi}_{i,l} e^{I \omega_{i,l} t}\tag{2}\] where \(l\) is the harmonic index, \(N_h\) is the number of harmonics, \(I\) is the imaginary unit \(\sqrt{-1}\), \(\omega_{i,l}\) is the angular frequency of the \(l^{\text{th}}\) harmonic and \(t\) is time. Using the flow decomposition and the representation of the flow perturbation in the harmonic form, the coupled system of mean flow and harmonic flow perturbations can be obtained, and this can be represented symbolically in Equation 3 : \[\begin{align} \label{eqn:fnlh95system} \mathbf{R}_i + \text{DF}_i &= 0 \nonumber \\ I \omega_{i,l} \mathbf{M} V_i \mathbf{\hat{W}}_{i,l} + \frac{\partial\mathbf{R}_i}{\partial \mathbf{W}_i} \mathbf{\hat{W}}_{i,l} &= 0 \end{align}\tag{3}\] in which \(l\) is the harmonic index, \(V_i\) is the volume of the control volume \(i\) and \(\mathbf{M}\) is the matrix that converts primitive variable \(\mathbf{W}\) to conservative variables \(\mathbf{Q}\). \(\mathbf{R}_i\) is the non-linear residual of the mean flow and \(\frac{\partial\mathbf{R}_i}{\partial \mathbf{W}_i} \mathbf{W}''_i\) the linearized fluxes. DF is the deterministic fluxes [@df_jot]. It represents the coupling between the mean flow and harmonic flow perturbations and can be determined using the mean flow variable and harmonic flow perturbations [@fnlh]. It should be noted that HYDRA works with primitive variables, and this is why the coupled system in Equation 3 is written in the form of primitive variables, and the formulation of FNLH is favoured in HYDRA.

In a multistage setting of FNLH, bladerow interfaces for the mean flow and the time-linearized harmonic flow are treated separately. For the mean flow, the original mixing plane formulation [@Denton1992] is corrected by DF to take into account the unsteady effect [@fnlh]. For the time-linearized flow, a unified formulation has been developed to transmit disturbances in a multi-row environment [@fnlh_multirow]. Quasi-3D non-reflective treatment [@giles_nrbc_report] are applied to the mean flow and time-linearized flow.

Regarding the general process of FNLH, for the \(n^{th}\) non-linear iteration (or outer iteration), the harmonic flow perturbations are evaluated in the first place, DF is then computed using the available flow perturbations in the harmonic form, and the mean flow is then updated with the inclusion of DF. It is noted that the cross-coupling of the harmonics is ignored in the current work. This will be included in our future work.

This work devises a unified formulation to accelerate the convergence of FNLH. The scheme allows both explicit and implicit schemes to be implemented. Although the mean flow and the time-linearized harmonic flow are solved separately, it will be shown that they can be connected with each other due to the identical sparse patterns of their linear systems. This feature will also be explored to develop a memory-efficient scheme.

HYDRA uses the generalized Runge-Kutta (RK) scheme for time integration, both explicit [@Moinier2002] and implicit [@Misev2018] time integration are available. However, this procedure is only for steady flows, and in this paper it is extended to solve the mean flow and the time-linearized harmonic flows of FNLH. The generalized RK scheme uses the hybrid five-stage scheme RK (5,3), where the non-linear viscous residuals are evaluated at odd stages. It can be written as:

\[\begin{align} \label{eqn:rk} \mathbf{Q}^{n,(0)}_i & = \mathbf{Q}^n_i \nonumber \\ \mathbf{Q}^{n,(k)}_i & = \mathbf{Q}^{n,(0)}_i - \Gamma \alpha^{(k)} \mathbf{K}^{-1} (\mathbf{R}^{(k)}_i - \mathbf{R}^*_i), k = [1:5] \nonumber \\ \mathbf{Q}^{n+1}_i & = \mathbf{Q}^{n,(5)}_i \end{align}\tag{4}\]

In Equation 4 , \(n\) is the current pseudo-time step index, \(k\) is the RK stage index, \(\mathbf{R}^{(k)}_i\) is the non-linear residual of the \(k^{\text{th}}\) RK stage, \(\mathbf{R}^*_i\) is the additional term related to the multigrid operation. Its formulation can be found in Monier [@moinier_thesis] and for the implicit scheme, this term is set to zero. \(\Gamma\) can be considered as a relaxation factor and \(\mathbf{K}\) is the matrix preconditioner. To improve the convergence rate and stability, the non-linear residual of the \(k^{\text{th}}\) stage \(\mathbf{R}^{(k)}_i\) is decomposed into two components, namely the inviscid contribution \(\mathbf{R}^{(k),inv}_i\) and the dissipative or viscous contribution \(\mathbf{R}^{(k),vis}_i\): \[\begin{align} \mathbf{R}^{(k)}_i &= \mathbf{R}^{(k),inv}_i + \mathbf{R}^{(k),vis}_i \end{align}\] in which, the inviscid and viscous contributions can be evaluated as: \[\begin{align} \mathbf{R}^{(k),inv}_i &= \mathbf{R}^{inv}_i (\mathbf{Q}^{n,k-1)}) \\ \mathbf{R}^{(k),vis}_i &= \beta_k \mathbf{R}^{vis}_i(\mathbf{Q}^{n,k-1}) + (1-\beta_k) \mathbf{R}^{(k-1),vis}_i \end{align}\]

\(\mathbf{R}^{inv}_i (\mathbf{Q}^{n,k-1})\) is the inviscid contribution using the flow variable \(\mathbf{Q}^{n,k-1}\), \(\mathbf{R}^{(k),vis}_i\) is the viscous contribution and is a mixture of the viscous flux using the flow variable \(\mathbf{Q}^{n,k-1}_i\) and the viscous residual \(\mathbf{R}^{(k-1),vis}_i\) from the previous RK stage. The coefficients \(\alpha^{(k)}\) and \(\beta^{(k)}\) are summarized in Table 1. These coefficients are optimized to damp high-frequency errors and lead to better convergence when used with multigrid. Furthermore, these coefficients allow the CFL number of the explicit scheme to be increased to 2.

k | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|

\(\alpha^{(k)}\) | \(\frac{1}{4}\) | \(\frac{1}{6}\) | \(\frac{3}{8}\) | \(\frac{1}{2}\) | 1 |

\(\beta^{(k)}\) | 1 | 0 | \(\frac{14}{25}\) | 0 | \(\frac{11}{25}\) |

The above RK scheme will be applied both to the mean flow and the time-linearized harmonic flow. For the mean flow, \(\Gamma\) and \(\mathbf{K}\) can be written as [@misev_thesis]: \[\Gamma = \left\{ \begin{array}{ll} \sigma_{\text{CFL}} & \text{Explicit} \\ \frac{1}{\varepsilon} & \text{Implicit} \end{array} \right.\]

\[\label{eqn:precon} \mathbf{K} = \left\{ \begin{array}{ll} \mathbf{P} & \text{Explicit} \\ \frac{\mathbf{P}}{\varepsilon \sigma_{\text{CFL}}} + \mathbf{J} & \text{Implicit} \end{array} \right.\tag{5}\] where \(\mathbf{J}\) is the flux Jacobian matrix \(\frac{\partial{\mathbf{R}}}{\partial \mathbf{Q}}\). The flux Jacobian is assembled using the information of the immediate neighbors with first-order spatial accuracy, and hence it is an approximation of the true Jacobian. \(\mathbf{P}\) is the diagonal block of the Jacobian matrix \(\mathbf{J}\). For the explicit scheme, \(\Gamma\) is the the Courant–Friedrichs–Levy (CFL) number. For the implicit scheme, \(\Gamma\) is related to the relaxation factor \(\varepsilon\) and in this work a value of \(0.6\) [@Misev2018] is used.

With respect to the time-linearized flow, the contribution of the spectral source term must be taken into account, and consequently a diagonal matrix \(I \omega_l \mathbf{I} \mathbf{V}\) is added to \(\mathbf{K}\), where \(\omega_l\) is the angular frequency. This leads to the preconditioner matrix for each harmonic \(\mathbf{\hat{K}}_l\), and this is shown in Equation 6 .

\[\label{eqn:precon95linear} \mathbf{\hat{K}}_l = \left\{ \begin{array}{ll} \mathbf{P} + I \omega_l \mathbf{I} \mathbf{V} & \text{Explicit} \\ \frac{\mathbf{P}}{\varepsilon \sigma_{\text{CFL}}} + \mathbf{J} + I \omega_l \mathbf{I} \mathbf{V} & \text{Implicit} \end{array} \right.\tag{6}\] in which \(I\) is the imaginary unit and \(\mathbf{I}\) is the unit matrix. Since \(\omega_l\) is generally different for each harmonic, \(\mathbf{\hat{K}}_l\) also differs from harmonic to harmonic. It is not practical to store the LHS for each harmonic separately, as this will make the scheme not memory efficient. Since the sparse pattern of the matrix remains the same for each harmonic, this feature can be explored to develop a memory-efficient approach to solving FNLH.

For each RK stage (either for the mean flow or the time-linearized harmonic flow), an update of the solution \(\Delta \mathbf{Q}^{(k)}\) to the vector of the intermediary solution \(\mathbf{Q}^{n,(k)}\) must be calculated, and this reads as follows:

\[\label{eqn:rk95update} \Delta \mathbf{Q}^{(k)} = \Gamma \alpha^{(k)} \mathbf{K}^{-1} (\mathbf{R}^{(k)}_i - \mathbf{R}^*)\tag{7}\]

To use more traditional symbols for the discussion of solving linear systems, Equation 7 can be written in a more familiar form. For example, the one for the mean flow can be written as \(\mathbf{A} \mathbf{x} = \mathbf{B}\), where \(\mathbf{A} = (\Gamma \alpha^{(k)}\mathbf{K}^{-1})^{-1}\), \(\mathbf{B} = (\mathbf{R}^{(k-1)}_i - \mathbf{R}^*_i)\) and \(\mathbf{x}\) is \(\Delta \mathbf{Q}^{(k)}_i\). For the explicit scheme, \(\mathbf{A}\) is a block diagonal matrix, and its inversion can be performed point by point. This is essentially a block Jacobi method, and inverting the block diagonal matrix is a simple and fast process. For the implicit scheme, the LHS of the mean flow or the time-linearized harmonic flows is a large sparse matrix. Applying an exact inversion of this matrix is seldom economical, because the improvement of the convergence rate is not good enough to justify the high computational cost of inverting this matrix exactly. In this work, the procedure similar to that proposed by Swanson et al. [@Swanson2007] is followed. This matrix is inexactly inverted and the linear system is solved iteratively. The incomplete lower-upper decomposition with zero fill (ILU(0)) is used to approximate the matrix inversion.

With respect to the time-linearized harmonic flow, after inspecting Equation 6 and Equation 5 , it can be seen that the LHS of the linearized flow has the same sparse pattern as the one for the nonlinear flow. The difference comes mainly from the diagonal block. Furthermore, if there are periodic boundaries in the simulation, phase-lagged boundary conditions are applied on the periodic boundaries. Since the interblade phase angle will be different for each harmonic, this will also lead to a difference of the LHS for each harmonic, but it has no impact on the sparse pattern of the LHS. Based on the above observation, the coupled system of FNLH for each nonlinear iteration can be symbolically represented as: \[\begin{align} \label{eqn:fnlh95couple95sys} \mathbf{A} \mathbf{x} &= \mathbf{B} & \text{mean flow} \nonumber \\ (\mathbf{A} + I \omega_l \mathbf{I}\mathbf{V}) \hat{\mathbf{x}}_l &= \hat{\mathbf{B}}_l & \text{linearized flow} \end{align}\tag{8}\] in which \(\mathbf{A}\) represents the LHS of the non-linear flow. \(\mathbf{B}\) and \(\mathbf{\hat{B}}_l\) denote the RHS of the mean flow and the RHS of the \(l^{\text{th}}\) harmonic. Regarding the coherence of the sparse pattern of the preconditioner matrix from the nonlinear flow and the linearized flow, the flux Jacobian matrix \(\mathbf{J}\) only needs to be assembled once by the nonlinear flow for each nonlinear iteration, and is then reused by the linearized flow. From an implementation point of view, the mean flow and the first harmonic of the linearized flow will allocate memory for its LHS, respectively, but the LHS of all the harmonics of the linearized flow reuses the memory space of that of the first harmonic. This treatment is used for both the explicit and implicit schemes and allows the memory usage of FNLH to not depend on the number of harmonics to compute; therefore, the resulting procedure is memory efficient.

The sparse linear system from the time-linearized flow consists of complex numbers. We can either cast this complex linear system into its equivalent system with real numbers, which is shown in Equation 9 , or directly solve this complex linear system. \[\label{eqn:matrix95z95real} \begin{bmatrix} \mathbf{A} & -{\omega}_l \mathbf{I}\mathbf{V} \\ {\omega}_l \mathbf{I}\mathbf{V} & \mathbf{A} \end{bmatrix} \begin{bmatrix} \text{Re}\{\Delta \hat{\mathbf{x}}_l\} \\ \text{Im}\{\Delta \hat{\mathbf{x}}_l\} \end{bmatrix} = \begin{bmatrix} \text{Re} \{ \hat{\mathbf{B}}_l\} \\ \text{Im} \{ \hat{\mathbf{B}}_l\} \end{bmatrix}\tag{9}\]

The benefit of using real numbers is that existing implementation of incomplete matrix factorization can be reused. However, as shown in Equation 9 , the sparse pattern of the new LHS differs from that of the mean flow solver and this means that a new lookup table is required for the matrix factorization (i.e. ILU(0)). For the approach using complex numbers, the sparse pattern of the LHS is the same as that of the mean flow solver, and this also facilitates parallel implementation (i.e. parallel domain decomposition) of incomplete matrix factorization. In the authors’ opinion, these benefits outweigh the effort to implement an incomplete matrix factorization with complex numbers. Therefore, in this work, the approach with complex numbers is chosen to solve the sparse linear system of the time-linearized flow.

For parallel implementation of the implicit scheme, a domain decomposition strategy is required to partition the LHS of the mean flow or the linearized flow for different processors. As the sparse patterns of the linearized flow and the mean flow are identical, the same domain decomposition strategy is used. In this work, the partition is done in such a way that the entries related to two nodes that are in different partitions are ignored. This can be demonstrated in Fig. 3, which shows the sparse pattern and the partition of the LHS on four processors. Each of the 4 boxes encloses a sub-linear system for each processor. These sub-linear systems only couple nodes that are partitioned into the same processor. This effectively leads to a parallelisable set of subproblems for each processor. Furthermore, as was discussed in the previous section, the linear system with complex numbers is solved for the linearized flow. If an equivalent linear system with real numbers is solved, the off-diagonal part, which contains the contributions of the spectral source term, could be ignored for parallel computations. This could lead to potentially numerical problems.

In addition, as the coupling of the nodes that are in different processors is ignored, the convergence rate of the implicit solver will degrade as more processors are used. For the extreme case, where the number of processors is equal to the number of nodes, the parallel implicit solver will degrade to the explicit scheme. It is found that serious convergence degradation can occur if the total number of nodes to the number of processors is less than \(O(10^3)\) [@Misev2018]. On the other hand, in reality, to obtain good scalability for parallel computation, this ratio normally is at least \(O(10^4)\). Therefore, in practice, the convergence rate will only have a weak dependence on the number of processors, and this effect will be demonstrated in the test cases.

We demonstrate the computational performance of the explicit and implicit versions of the generalized RK scheme for FNLH and perform a comparative study of both approaches in terms of convergence, wall clock time and memory consumption. Two test cases are used to serve this purpose. The first case is to model the rotor-rotor interactions in a compressor, and the second case is to trace hot streak migrations in turbine stages. Regarding verification and validation, the performance of FNLH for these two cases has been demonstrated in the previous work [@fnlh_multirow] for multirow coupling. Nevertheless, comparisons with URANS are still provided to demonstrate that the explicit/implicit scheme converges to the correct results. Since FNLH can be considered as a reduced model of URANS, comparisons with URANS suffice for validation. Besides, the capability of the bladerow interface in HYDRA to correctly transmit disturbances in a multi-row environment is also verified by comparing with analytical solutions.

The key parameters of the generalized RK scheme are summarized in Table 2. For the implicit solver, the incomplete LU factorization with zero fill-in level is used and is denoted as ILU(0). A fixed threshold of 0.01 is used for the linear residual to iteratively solve the sparse linear system for both the mean flow and the time-linearized harmonic flow. The maximum number of iterations to solve the sparse linear systems is 10. For the explicit scheme, a V cycle is used for the multigrid and the order of spatial accuracy for the coarse grid levels is set to 1.

Scheme | Component | Value |
---|---|---|

Implicit | ILU fill levels | 0 |

Linear residual convergence tolerance | 1e-2 | |

Linear iteration limit | 10 | |

Explicit | MG Cycle | V-cycle |

Spatial order on coarse grid | 1 |

The effectiveness of the bladerow interface in transmitting disturbances in a multiple bladerow setting is demonstrated here. Figure 4 a) shows the configuration of three ducts that are connected to each other in the streamwise direction. "duct2" is rotating in the direction indicated in the figure. "duct1" and "duct 3" are stationary. The pitch of the ducts are \(\frac{1}{36} \pi\), \(\frac{1}{24} \pi\), and \(\frac{1}{36} \pi\), respectively. The flow is assumed to be inviscid, and a total temperature distortion is specified at the inlet of "duct1". The magnitude of the distortion is \(5\%\) of the inlet total temperature. As there is no work input or heat exchange, theoretically the total temperature distortion should be transported downstream by the background flow without attenuation.

Structured meshes are used for all the three ducts and two sets of grid resolution are employed. The first mesh has a resolution of \(60\times150\times1\) and the second has \(100\times200\times1\). They represent resolutions in the streamwise, circumferential, and spanwise directions, respectively. Two background flows are used. For the first, the flow is axial, and for the second, the whirl angle is \(45^{\circ}\). This is shown in Fig. 4 b) and c). The flow condition (e.g. reduced frequency) is representative of typical turbomachinery flow conditions. The total temperature distribution in any downstream position can be obtained analytically by applying the phase shift of the inlet total temperature distribution if the background flow is non-axial. Figure 5 shows the comparisons between the FNLH solution and the analytical solution. There is good agreement between FNLH and analytical solutions. As the mesh is refined, the agreement is improved due to reduced numerical dissipation.

This test case is the mid-span stream tube of the front 1.5 stages of an 8-stage high-speed machine [@simod_jot], and the stream tube accounts for roughly \(4.8\%\) of the span height. The first- and second-stage rotors of this 8-stage compressor are used [@fnlh_multirow]. Blade counts are scaled to the ratio of 3(R1):5(S1):4(R2) to reduce the computational cost of URANS. The in-house meshing tool [@mesh_jpp] is used to create multiblock structured meshes to discretize the computational domain in the blade-to-blade section and will be used to mesh the computational domain for all test cases. In the spanwise direction, one layer of hexahedral mesh is used and the total hexahedral element is 33090. The average \(y^{+}\) on the blade surface is around 5. The Sparlart-Allmaras turbulence model [@SPALART1992] is used by default in all test cases. Non-reflective treatment for mean flow [@Saxer1993] and linearized flow [@giles_nrbc_report] is applied at the bladerow interfaces. For the URANS simulation, single-passage meshes are replicated in the circumferential direction to form a \(30^{\circ}\) sector. 500 time steps are used for this \(30^{\circ}\) sector simulation, which corresponds to 100 time steps for the stator to sweep one passage of the rotor that has the smaller pitch.

Table 3 shows the FNLH setup for this compressor case, and \(\Omega\) is the shaft speed. 15 harmonics are used in total. In the current FNLH implementation in Hydra, all bladerows compute the same total number of harmonics. For R1, 15 harmonics are used to calculate the potential fields of S1. For S1, 12 harmonics are used to model the R1 wake and 3 harmonics are used to compute the potential field of R2. For R2, 8 harmonics are used for the wake of S1 and 5 harmonics are used to model the wake of R1. For the harmonic representation of the R1 wake in R2, its wave number is a linear combination of the blade counts of R1 and S1, and its temporal frequency is related to the shaft speed \(\Omega\). Such an FNLH setup was used in our previous work, and for more details, the readers can refer to the previous work [@fnlh_multirow].

Bladerow | Temporal mode | Spatial mode | Note |
---|---|---|---|

R1 | \(i\text{N}_{S1}\Omega, i\in[1:15]\) | \(i\text{N}_{S1}, i\in[1:15]\) | exit, S1 potential field |

S1 | \(i\text{N}_{R1}\Omega, i\in[1:12]\) | \(i\text{N}_{R1}, i\in[1:12]\) | inlet, R1 wake |

\(-i\text{N}_{R2}\Omega, i\in[1:3]\) | \(i\text{N}_{R2}, i\in[1:3]\) | exit, R2 potential field | |

R2 | \(i\text{N}_{S1}\Omega, i\in[1:8]\) | \(i\text{N}_{S2}, i\in[1:8]\) | inlet, S1 wake |

\(-n\text{N}_{S1}\Omega, n \in \{0,1,2\}\) | \(m\text{N}_{R1} + n \text{N}_{S1}, m \in \{1\}\) | inlet, R1 wake-1 | |

\(-n\text{N}_{S1}\Omega, n \in \{0,1\}\) | \(m\text{N}_{R1} + n \text{N}_{S1}, m \in \{ 2 \}\) | inlet, R1 wake-2 | |

\(-n\text{N}_{S1}\Omega, n \in \{0,1\}\) | \(m\text{N}_{R1} + n \text{N}_{S1}, m \in \{ 3 \}\) | inlet, R1 wake-3 |

Figure 6 shows the unsteady entropy fields for this 1.5 stage transonic compressor computed by FNLH and URANS. Entropy is evaluated as \(s = \frac{R}{\gamma - 1}(\ln(p) - \gamma \ln(\rho))\). FNLH shows good agreement with URANS, in particular, the wake of R1 is reconstructed satisfactorily in R2. Figure 7 shows the circumferential distribution of the entropy at two streamwise locations, which are located at the midway betweeen S1 inlet and LE, R2 inlet and LE, respectively. Good agreement with URANS data is observed. The results are computed by the implicit RK scheme, hence we are confident that the generalized RK scheme still produces the correct results.

Figure 8 shows the RMS of the residuals for the mean flow and the harmonic flows from the implicit scheme. For example, the residual for "Harmonic-0" is the residual of the 1st harmonic computed by all the blade rows. According to the FNLH setup in Table 3, Harmonic [0-7] represents adjacent row interactions for all the blade rows. Harmonic [8:14] still computes adjacent row interaction for R1 and S1, but for R2 they model R1 wake in the computational domain of R2. For adjacent row interactions, harmonics with larger indices have shorter wave lengths and higher frequencies; Fig. 8 shows that these harmonics tend to have faster convergence. This can be explained by the fact that, because of the shorter wave length, the wave length to grid size ratio is smaller, and this means that these harmonics are more dissipated by the numerical scheme. Furthermore, as the spectral source term only contributes to the diagonal part of the LHS of the linearized flow (Equation 6 ), this will enhance the diagonal dominance and consequently improve the convergence rate.

Since a dozen harmonics could be computed in a FNLH run, it is not convenient to inspect the residuals for each harmonic. Therefore, the RMS of the residuals for all the harmonics is monitored. This is defined as: \[R_{z} = \sqrt{\frac{1}{N_{h}} (R^2_{z,1} + R^2_{z,2} + R^2_{z,3} + \cdots R^2_{z,N_h})}\] where \(R_{z,i}, i \in [1:N_h],\) is the RMS of the non-linear residual for the \(i^{\text{th}}\) harmonic. In the following text, the residual of the mean flow and the RMS of the residuals for all harmonics, namely \(R_z\), is used to compare the performance of the explicit and implicit RK scheme.

For the performance of the explicit scheme with multigrid, Fig. 9 shows the convergence history of FNLH for the case shown in Fig. 6 and the explicit RK scheme with V-cycle multigrid is used. It can be seen from Fig. 9 that only one multigrid level yields poor convergence. As the number of MG levels increases, the convergence rate is significantly improved, but the benefit of adding more MG levels is also diminishing. The non-linear mean flow seems to benefit more from multigrid compared to the time-linearized harmonic flow. The oscillations of residuals between 250 - 500 iterations are likely caused by the transonic flows in the rotors. As the residuals are further reduced, and the flow becomes more established; such an oscillation is no longer present.

It should be noted that as the number of MG levels increases, the computational cost of each nonlinear iteration also increases. Figure 10 shows the wall-clock time against the nonlinear residual. The wall clock is expressed in TauBench Work Units (WU). One WU is the wall clock time of the following command:

mpirun -np 1 ./TauBench -n 250000 -s 10

The computational cost of the CFD simulation in terms of WU is then expressed as \(\frac{N_{\text{proc}}T_{\text{FNLH}}}{T_1}\). \(N_{\text{proc}}\) is the number of CPU cores that are used and \(T_{\text{FNLH}}\) is the wall clock time of the FNLH simulation. \(T_1\) is the wall clock time of the aforementioned TauBench command. For the explicit scheme, although each nonlinear iteration is more expensive, 4 level of MG is still the most computationally efficient one, and using only one level of MG is not computationally efficient and also shows poor convergence. In the following text, the explicit scheme with 4 levels of MG will be used as the default explicit scheme and compared with the implicit RK schemes.

Figure 11 shows the comparison of the convergence histories of 4 levels of MG and implicit RK schemes with different CFL numbers. It can be seen that in terms of non-linear iterations, implicit RK schemes show significant improvement over the explicit scheme with 4 levels of MG. In addition, the oscillations of the residuals observed between 200 and 500 iterations are not present in the implicit scheme. This is because the implicit scheme uses a full Jacobian matrix with first-order spatial accuracy for the preconditioner matrix shown in Equation 5 and Equation 6 while the explicit scheme only uses the diagonal block of this matrix. Therefore, the solution update could be in a more accurate direction for the implicit scheme to reduce the residual. When there are strong non-linearity in the flow, this is beneficial to improve the convergence and also the robustness of the solution.

An approximate Jacobian matrix is used in the current implicit scheme for the mean flow (Equation 5 ) and the linearized flow (Equation 6 ). The CFL number acts as a relaxation factor. A smaller CFL number leads to better diagonal dominance, but this will also lead to slower convergence and vice versa. Therefore, a judicious choice of the CFL number will strike a good balance in terms of convergence speed, stability, and computational efficiency. Figure 12 shows that as the CFL number is gradually increased from 10 to 50, the convergence rate improves, but the benefit of increasing the CFL also decreases. A CFL number of 40 or 50 seems to be a good choice, and this is consistent with the previous experience of developing the implicit scheme for the non-linear mean flow [@misev_thesis].

As shown in Fig. 3, in the parallel implementation of the implicit scheme, the entries in the Jacobian matrix that correspond to the coupling of two nodes will be ignored if they are in different partitions. Therefore, the convergence rate of the implicit scheme will depend on the number of processors. Figure 13 shows the convergence history with 5, 10, 20 and 30 CPU cores and the CFL number is set to 50. For the case with 30 CPU cores, the number of elements per core is around 1100.

In terms of nonlinear iterations, when the nonlinear residual is greater than \(10^{-13}\), the number of cores only has a marginal impact on the convergence rate. When a deeper convergence is sought, adding more CPU cores leads to a slightly lower convergence rate, especially for the linearized harmonic flow. This is because the quality of the Jacobian matrix reduces as more CPUs are used. For a fixed number of linear iterations, solution updates that use more CPU cores are less accurate than those that use fewer CPU cores.

With respect to computational efficiency, which can be measured in terms of the work unit, Fig. 13 shows that adding more CPUs leads to a less efficient implicit scheme, but it has marginal effect on the number of non-linear iterations. This is because the accuracy of the Jacobian matrix decreases as more CPU cores are used, and this will lead to a greater number of linear iterations to reach the convergence threshold of the linear residual. Hence, each nonlinear iteration becomes more expensive, but the resulting total number of nonlinear iterations can still be similar due to the fixed linear residual threshold.

The second case is based on the MT1 turbine stage [@mt1_hotstreak] and was modified to a 1.5 stage turbine in our previous work [@fnlh_multirow] to validate FNLH for hot streak migrations. The original blade count of the MT1 stage is 32 (NGV):60 (R1) and the blade counts of the redesigned 1.5-stage turbine are scaled to 30 (NGV): 60 (R1): 30 (S2). A stream tube of the geometry is simulated, and the stream tube accounts for roughly \(5\%\) of the blade span. The NGV is modeled as a pair to prescribe a simplified hot streak at the NGV inlet. One layer of hexahedral elements is used to discretize the computational domain, and there are 72328 hexahedral elements in total. The average \(y^{+}\) on the blade surface is around 5. The hot streak, which spans two NGV passages, has a sinusoidal shape with an amplitude of \(\Delta T_{hs}\). The resulting inlet total temperature field \(T_{\text{in}}\) is represented as the following: \[T_{\text{in}} = T_{0} + \Delta T_{hs} \sin(N_{\text{hs}} \theta) \label{eqn:hotstreak95def}\tag{10}\] in which \(T_{0}\) is the mean total temperature at the inlet and its value is 444 K, and \(N_{\text{hs}} = 15\), which means the hot streak will span two NGV passages. The measured peak-to-mean and minimum-to-mean temperature ratios are approximately 1.08 and 0.93 in the MT1 stage [@mt1_hotstreak], respectively. In this study \(\Delta T_{hs} = 50\)K is used. The ideal-gas model is used in this study, although in the real scenario, the working fluid in the turbine should be a mixture of air and the productions of the combustion process. The ideal gas model can still be adequate for the preliminary design and for validating computational methods. For the URANS simulation, single-passage meshes are replicated in the circumferential direction to form a \(12^{\circ}\) sector. 200 time steps are used for this \(12^{\circ}\) sector simulation, which corresponds to 100 time steps for the stator to sweep one passage of the rotor.

The setup for the harmonics in FNLH is summarized in Table 4 and this setup is based on our previous study in this case [@fnlh_multirow]. 16 harmonics are computed in total. To model the hot streak in S2, 7 harmonics are used. 2 fundamental modes of the hot streak and their related scattered modes are computed. Their temporal frequency is related to the shaft speed \(\Omega\) and its wave number is related to the linear combination of the wave number of the hot streak and the blade count of R1.

Bladerow | Temporal mode | Spatial mode | Note |
---|---|---|---|

NGV | \(-i\text{N}_{R1}\Omega, i\in[1:16]\) | \(i\text{N}_{R1}, i\in[1:16]\) | exit, R1 potential field |

R1 | \(i\text{N}_{NGV}\Omega, i\in[1:14]\) | \(i\text{N}_{NGV}, i\in[1:14]\) | inlet, NGV wake + hot streak |

\(i\text{N}_{S2}\Omega, i\in[1:2]\) | \(i\text{N}_{S2}, i\in[1:2]\) | exit, S2 potential field | |

S2 | \(i\text{N}_{R1}\Omega, i\in[1:9]\) | \(i\text{N}_{R1}, i\in[1:9]\) | inlet, R1 wake |

\(-n\text{N}_{R1}\Omega, n \in \{-1,0,1,2\}\) | \(m\text{N}_{hs} + n \text{N}_{R1}, m \in \{1\}\) | inlet,Hotstreak-1 | |

\(-n\text{N}_{R1}\Omega, n \in \{0,1,2\}\) | \(m\text{N}_{hs} + n \text{N}_{R1}, m \in \{2\}\) | inlet,Hotstreak-2 |

The explicit scheme with 4 level of MG and the implicit scheme with CFL\(=50\) are compared. According to the previous test case, these two are the most efficient explicit and implicit schemes shown, respectively. The residuals of the explicit and implicit schemes are shown in Figure 14 and 30 CPU cores are used. Similarly to the previous case, the implicit scheme shows a significant speed-up of roughly 10x compared to the explicit scheme in terms of computational time. For the explicit scheme with MG, the convergence rate is degraded as the residual decreases. This indicates that the explicit scheme with MG is not effective in damping the short-wavelength errors in the computation. However, the implicit scheme shows is less affected by this issue.

In terms of parallel performance of the implicit scheme, Fig. 15 shows the convergence history with 5, 10, 20, and 30 CPU cores. For the case with 30 CPU cores, the number of elements per CPU core is roughly 2410. It can be seen that in terms of non-linear iterations, the number of CPU cores has a negligible effect. With respect to WU, it can be seen that the implicit solver becomes slightly less efficient as the number of CPUs increases. This observation is consistent with the previous compressor case.

Figure 16 compares the predicted instantaneous temperature field between FNLH and URANS. For the FNLH, the implicit scheme is used. There is good agreement between URANS and FNLH on the predicted temperature field, and the hot streak is correctly transported downstream the turbine bladerows. For a quantitative comparison, Fig. 17 shows the circumferential distribution of instantaneous temperature at two streamwise locations. The first is the midway between the R1 inlet and the R1 LE, and the second is the midway between the S2 inlet and the S2 LE. It can be seen that there is good agreement between the implicit FNLH and URANS. More validations of FNLH in this case can be found in previous work [@fnlh_multirow].

Table 5 compares the memory consumption of the explicit and implicit RK schemes for the test cases. It can be seen that the memory consumption of the implicit solver is roughly half of the explicit solver using 4 MG levels. Therefore, the implicit RK scheme not only delivers faster and faster convergence compared to the explicit scheme but also has significantly less memory consumption. The memory consumption of the corresponding whole annulus URANS simulations is also summarized in Table 5. The memory consumption of FNLH is roughly two orders of magnitude smaller compared to whole annulus URANS.

Explicit (4-MG) | Implicit | URANS (\(360^{\circ}\)) | |
---|---|---|---|

Compressor case | 3.9 | 2.1 | 309.6 |

Turbine case | 7.4 | 3.2 | 378.1 |

Figure 18 shows the convergence history of the implicit scheme (CFL = 50) for the test cases, but the x-axis of the plots is the normalized time, which is defined as the wall clock time of the FNLH simulation divided by the wall clock time of execution of the full annulus URANS simulation for one revolution. It is typical to run 4-5 revolutions for URANS before collecting data from URANS. Figure 18 shows that the implicit FNLH is at least two orders of magnitude more efficient than the corresponding full-annulus URANS simulations in both test cases.

A unified formulation has been developed and implemented in the industrial CFD HYDRA to accelerate the convergence of the FNLH method. A comparative study of the resulting explicit and implicit schemes has been undertaken. The proposed method is memory efficient as memory consumption does not depend on the number of harmonics to compute. For the implicit scheme, complex algebra is used for the matrix factorization for the purpose of re-using the look-up table for matrix factorization of the mean flow. This significantly reduces the effort of implementing the implicit FNLH scheme in parallel.

The implicit scheme is found to consistently show a speed-up of 7-10x compared to the explicit schemes with multigrid. Furthermore, the memory consumption of the implicit scheme consumes approximately \(50\%\) of that of the explicit scheme with 4 levels of multigrid. The parallel performance of the implicit scheme is also studied, and it shows that the implicit solver is slightly less efficient as more CPU cores are used. However, in a typical industrial simulation, the number of elements per CPU core is at least \(O(10^4)\), therefore, the parallel performance of the implicit scheme should not be a major concern.

The authors are grateful to Rolls-Royce plc and Aerospace Technology Institute for funding this work under the CORDITE project and granting permission for its publication. The authors would like to thank Dr. David Liliedahl from Rolls-Royce Corporation (USA) for discussions on the implicit solver of Hydra. The authors would like to thank Dr. Paolo Adami for his support of this work.

Corresponding author, email: feng.wang@eng.ox.ac.uk↩︎