FourPhonon_GPU: A GPU-accelerated framework for calculating phonon scattering rates and thermal conductivity


Abstract

Accurately predicting phonon scattering is crucial for understanding thermal transport properties. However, the computational cost of such calculations, especially for four-phonon scattering, often becomes prohibitive when a large number of phonon branches and scattering processes is involved. In this work, we present FourPhonon_GPU, a GPU-accelerated framework for three-phonon and four-phonon scattering rate calculations based on the FourPhonon package. By leveraging OpenACC and adopting a heterogeneous CPU–GPU computing strategy, we efficiently offload massive, parallelizable tasks to the GPU while using the CPU for process enumeration and control-heavy operations. Our approach achieves over 25\(\times\) acceleration for the scattering rate computation step and over 10\(\times\) total runtime speedup without sacrificing accuracy. Benchmarking on various GPU architectures confirms the method’s scalability and highlights the importance of aligning parallelization strategies with hardware capabilities. This work provides an efficient and accurate computational tool for phonon transport modeling and opens pathways for accelerated materials discovery.

1 Introduction

Thermal conductivity is a fundamental material property that plays a critical role in a wide range of applications, including thermal management in electronic devices [1], [2] and thermoelectric energy conversion [3], [4], among others. Understanding and accurately predicting thermal conductivity is essential for optimizing material performance. The primary heat carriers in dielectrics and semiconductors are phonons. Phonon scattering rates, which limit the phonon mean free paths, are the central mechanism determining thermal conductivity [5]. Accurately predicting phonon scattering rates is therefore key to predicting thermal transport behavior in materials.

Recent advances in first-principles calculations of phonon scattering coupled with the Boltzmann transport equation (BTE) have enabled accurate, parameter-free predictions of lattice thermal conductivity. The theoretical foundation for the phonon BTE was first laid by Peierls [6]. Later, Maradudin et al. [7] developed a comprehensive framework for three-phonon (3ph) scattering. Broido et al. [8] further combined ab initio force constants with these approaches, leading to robust first-principles predictions of thermal conductivity. For decades it was believed that 3ph scattering was adequate for describing thermal transport except at very high temperatures. Moreover, the general theory and computational method for four-phonon (4ph) scattering were lacking, preventing quantitative evaluation of its role. Feng and Ruan [9], [10] developed the general theory and computational method for 4ph scattering, demonstrating that 4ph processes can significantly impact the thermal conductivity of many materials at elevated temperatures or even at room temperature. Their theoretical predictions for boron arsenide (BAs) were later confirmed experimentally [11]–[13], proving that four-phonon scattering is a critical factor in thermal transport.

While first-principles calculations have greatly improved the fundamental understanding of thermal transport, they are extremely computationally expensive when including 4ph scattering. The high cost arises from the need to enumerate and compute a large number of phonon scattering processes, which scale as \(N^3\) and \(N^4\) for 3ph and 4ph scattering, respectively (\(N\) is the number of q-points in the Brillouin zone). This leads to a dramatic increase in computational complexity. For example, for a silicon calculation using a 16\(\times\)16\(\times\)16 q-mesh (a discretized grid in reciprocal space), there are approximately 9.0\(\times 10^{5}\) and 7.6\(\times 10^{9}\) processes for 3ph and 4ph scattering, respectively, which could take over 7000 CPU hours to calculate.

To address the high or even prohibitive computational cost, several approaches have been explored, including machine learning surrogates [14], [15] and statistical sampling methods [16]. These techniques have been successfully applied to estimate thermal and radiative properties of solids [17]–[24]. Although these methods dramatically reduce computational cost, they achieve this by introducing approximations that sacrifice some degree of accuracy. As a result, they are particularly useful for applications where a certain level of error is acceptable. However, in scenarios that require rigorous and fully resolved calculations of every scattering process, such approximate methods are inadequate.

An alternative path for improving computational efficiency is leveraging hardware acceleration through Graphics Processing Units (GPUs). Originally developed for rendering computer graphics, GPUs have evolved into a powerful tool for scientific computing [25], machine learning [26]–[28], and high-performance simulations [29] due to their massively parallel architecture. With thousands of cores capable of executing millions of lightweight threads simultaneously, GPUs are ideally suited for workloads that involve a large number of independent operations [30]. The advantage of GPUs has already been demonstrated across various fields of atomistic simulations. Popular first-principles simulation packages such as Abinit [31] and VASP [32], [33] have incorporated GPU support, achieving orders-of-magnitude acceleration over traditional CPU implementations. In the context of phonon scattering calculations, Wei et al. [34] first identified the performance bottlenecks in ShengBTE [35] and offloaded the scattering rate calculations onto GPUs. Building on this foundation, Zhang et al. [36] developed the GPU_PBTE package, which employed a two-kernel strategy to further accelerate phonon scattering calculations. It has been shown that, under the relaxation time approximation (RTA), each phonon scattering process is fully decoupled and can be independently evaluated, making the problem highly suitable for GPU parallelization.

In this work, we develop a GPU-accelerated framework for both 3ph and 4ph phonon scattering calculations by combining CPU-based preprocessing with GPU-based large-scale parallel computing. Starting from the original FourPhonon [37] package, we adopt OpenACC to enable GPU acceleration with minimal code restructuring. However, due to the challenges of divergent branching, direct GPU offloading alone is insufficient to achieve optimal performance. To overcome these limitations, we propose a heterogeneous CPU–GPU computing scheme, in which the CPU enumerates momentum- and energy-conserving scattering processes, and the GPU efficiently evaluates the corresponding scattering rates in parallel (Fig. 1). In addition, we implement several optimization techniques to enhance parallelism and further improve computational efficiency. Through systematic comparisons, we demonstrate that our method preserves the full accuracy of the original calculations while achieving over 25\(\times\) acceleration for the scattering rate computation step and over 10\(\times\) total speedup. Our work establishes an efficient and scalable pathway for rigorous phonon scattering calculations on modern high-performance computing architectures.

Figure 1: Workflow of CPU-GPU heterogeneous computing.

2 Methodology

The original ShengBTE [35] and FourPhonon [37] packages are written in Fortran and parallelized using MPI, optimized for CPU-based high-performance computing environments. To enable GPU acceleration with minimal code restructuring and maximize cross-platform portability, we adopt OpenACC, a directive-based programming model designed to simplify the development of heterogeneous applications targeting both CPUs and GPUs. Since the phonon scattering processes are independent of each other, there are no loop-carried dependencies and the task is highly suitable for GPU calculation. This computational pattern aligns well with the Single Instruction, Multiple Threads (SIMT) execution model of GPUs, where thousands of lightweight threads execute the same instruction on different data. As a result, the calculation of scattering rates across different scattering channels can be efficiently mapped onto parallel GPU threads.

To illustrate our GPU acceleration strategy, we use the 3ph absorption process as an example, with detailed pseudocode provided in the Appendix. In the original CPU implementation (Algorithm [algo-CPU_only] in the Appendix), the code loops over all phonon modes in a driver function, where each iteration calls a subroutine to evaluate the scattering rate for an individual mode. This subroutine performs nested loops over all possible combinations of phonon wavevectors and branches. Energy conservation checks are conducted to filter out forbidden scattering processes before computing the weighted phase space (WP) and scattering rate (\(\Gamma\)). Due to the vast number of phonon combinations, this calculation is computationally expensive. Our first strategy is to directly offload the entire computation to the GPU (see Algorithm [algo-GPU_only] in the Appendix). All necessary data is preloaded into GPU memory to avoid host-device data transfer overhead during runtime. After GPU computation, the results are transferred back to CPU memory, and GPU memory is released. We parallelize over all possible scattering processes and, beyond that, across multiple phonon modes simultaneously to enhance concurrency. This all-modes parallelization approach outperforms the mode-by-mode parallelization used in prior GPU implementations [34], [36] (see Algorithm [algo-GPU_hybrid-mode-wise]), offering a higher degree of parallelism and improved speedup at the cost of higher GPU memory usage. Both methods are implemented, and a comparison is provided in the Results section.
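
As a schematic of these two granularities (not the FourPhonon_GPU source; the array names such as weight and the placeholder expression standing in for the scattering kernel are ours), the OpenACC Fortran sketch below launches one reduction kernel per mode in the mode-by-mode variant, versus a single gang/vector kernel spanning the whole mode–process space in the all-modes variant, with the shared input kept resident on the device.

! Schematic contrast between mode-by-mode and all-modes parallelization.
! Array names and the dummy per-process expression are illustrative only.
program parallel_granularity
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nmodes = 256, nprocs = 100000
  real(dp) :: weight(nprocs), rate_modewise(nmodes), rate_allmodes(nmodes)
  real(dp) :: partial
  integer  :: mode, p

  weight = 1.0_dp
  !$acc enter data copyin(weight)   ! preload shared input once; keep it resident

  ! Strategy A: mode-by-mode -- one kernel launch (and one reduction) per mode.
  do mode = 1, nmodes
     partial = 0.0_dp
     !$acc parallel loop reduction(+:partial) present(weight)
     do p = 1, nprocs
        partial = partial + weight(p) / real(mode + p, dp)   ! stand-in for the rate expression
     end do
     rate_modewise(mode) = partial
  end do

  ! Strategy B: all-modes -- a single kernel spans the whole mode x process space.
  !$acc parallel loop gang private(partial) present(weight) copyout(rate_allmodes)
  do mode = 1, nmodes
     partial = 0.0_dp
     !$acc loop vector reduction(+:partial)
     do p = 1, nprocs
        partial = partial + weight(p) / real(mode + p, dp)
     end do
     rate_allmodes(mode) = partial
  end do

  !$acc exit data delete(weight)
  print *, 'max difference between strategies:', maxval(abs(rate_modewise - rate_allmodes))
end program parallel_granularity

In both variants the unstructured data region keeps the input resident on the device, so the two strategies differ mainly in how many kernels are launched and how much state must be held on the GPU at once.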

However, directly applying OpenACC directives to existing MPI-based Fortran code is insufficient due to architectural differences between CPUs and GPUs. Several optimizations have been implemented to make the code GPU-compatible and efficient. First, we apply loop flattening using the collapse clause to combine nested loops and expose more parallelism within each phonon mode. Second, the loop order is rearranged to achieve memory coalescing, allowing consecutive threads to access contiguous memory locations, which improves memory access efficiency. Third, we inline the computations for the broadening factor (\(\sigma\)) and the matrix element (\(V_p\)) to avoid the overhead of function and subroutine calls on the GPU. Finally, the accumulation of the scattering rate and weighted phase space is handled using the reduction clause, which efficiently manages parallel updates and avoids race conditions, outperforming the use of atomic operations.
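
The sketch below illustrates these loop-level optimizations on a simplified stand-in for the nested 3ph loops of a single mode (the loop bounds, the constant broadening, and the dummy arithmetic are illustrative, not taken from the package): the three loops are flattened with collapse(3), per-process temporaries are privatized and computed inline, and the two accumulators are combined through a reduction clause.

! Sketch of the collapse + reduction pattern described above; the arithmetic
! is a placeholder, not the actual FourPhonon_GPU kernel.
program collapsed_reduction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nbands = 6, nqpoints = 4096
  real(dp) :: gamma_plus, wp3_plus, wp, sigma, delta
  integer :: ii, j, k

  gamma_plus = 0.0_dp
  wp3_plus   = 0.0_dp

  ! collapse(3) exposes nqpoints*nbands*nbands iterations as one parallel
  ! index space; reduction(+:...) accumulates without race conditions.
  !$acc parallel loop collapse(3) private(sigma, delta, wp) reduction(+:gamma_plus, wp3_plus)
  do ii = 1, nqpoints
     do j = 1, nbands
        do k = 1, nbands
           ! Inlined stand-ins for compute_sigma / compute_Vp and the
           ! Gaussian-broadened energy-conservation weight.
           sigma = 0.1_dp
           delta = real(ii + j - k, dp) * 1.0e-4_dp
           wp    = exp(-(delta / sigma)**2) / sigma
           wp3_plus   = wp3_plus + wp
           gamma_plus = gamma_plus + wp * 1.0e-6_dp
        end do
     end do
  end do

  print *, 'Gamma_plus =', gamma_plus, '  WP3_plus =', wp3_plus
end program collapsed_reduction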

While this direct GPU-offloading approach preserves the original CPU workflow and achieves acceleration, it suffers performance degradation due to divergent branching. Specifically, the conditional statements used to check energy conservation introduce warp divergence, which undermines the efficiency of SIMT execution. In our scenario, threads associated with forbidden scattering processes stall while others continue, leading to serialization and reduced throughput. Given the sparsity of allowed scattering processes, this divergence leads to significant underutilization of GPU resources (Fig. 2, upper panel). This problem has also been reported in prior work [36]. Moreover, for 4ph scattering calculations, symmetry considerations restrict the iteration domain to a triangular region (see Fig. 3). However, OpenACC’s collapse directive requires rectangular iteration domains, preventing the use of these symmetry conditions on the GPU and further impacting performance.
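
As a concrete illustration of the second issue (with schematic loop bounds, not the actual 4ph index ranges), the triangular loop below cannot be collapsed because its inner bound depends on the outer index; the GPU-friendly alternative iterates over the full rectangle and filters inside the loop, reintroducing exactly the kind of branch that causes divergence.

! Schematic only: triangular vs. rectangular iteration over symmetric phonon pairs.
program triangular_domain
  implicit none
  integer, parameter :: n = 8
  integer :: i2, i3, count_tri, count_rect

  ! Symmetry-reduced triangular loop (CPU-friendly): the inner bound depends
  ! on the outer index, so collapse(2) cannot legally be applied.
  count_tri = 0
  do i2 = 1, n
     do i3 = i2, n
        count_tri = count_tri + 1
     end do
  end do

  ! Rectangular form (collapse(2) is legal), but the interior condition
  ! reintroduces branching over the half of the domain that is skipped.
  count_rect = 0
  !$acc parallel loop collapse(2) reduction(+:count_rect)
  do i2 = 1, n
     do i3 = 1, n
        if (i3 >= i2) count_rect = count_rect + 1
     end do
  end do

  print *, 'triangular count:', count_tri, '  rectangular+condition count:', count_rect
end program triangular_domain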

Figure 2: GPU-only vs. CPU+GPU heterogeneous computing. The GPU-only approach suffers from warp divergence due to conditional branching when filtering forbidden phonon scattering processes, leading to reduced computational efficiency. In contrast, the CPU+GPU heterogeneous strategy first filters and prepares valid scattering processes on the CPU, allowing the GPU to execute the computation with higher parallel efficiency.
Figure 3: Triangular iteration region due to symmetry.

To address these limitations, we further propose a hybrid CPU–GPU computing framework that partitions the workload between the CPU and GPU (Fig. 2, lower panel). In this scheme, the CPU is responsible for enumerating all symmetry-allowed and energy-conserving scattering processes, as described in Algorithm [algo-CPU_hybrid]. Once the allowable scattering process indices are generated, they are transferred to the GPU. The GPU then performs the expensive scattering rate calculations in parallel, following Algorithm [algo-GPU_hybrid]. By transferring only the relevant scattering events, we eliminate divergent branching on the GPU and ensure maximum occupancy of GPU resources during computation. Although this hybrid approach introduces additional CPU overhead for process enumeration, it significantly reduces the cost of the scattering rate evaluations and improves overall efficiency. In the Results section, we provide a detailed comparison between the CPU-only, GPU-only, and hybrid schemes, demonstrating the performance advantages of the proposed heterogeneous workflow. Our open-source code adopts this hybrid approach for the phonon scattering calculations.
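
A minimal sketch of this two-phase pattern is given below (the toy divisibility test stands in for the real energy- and momentum-conservation check, and names such as allowed_idx are ours): the CPU pass compacts the indices of surviving processes into a contiguous array, and the GPU pass then runs a branch-free parallel reduction over only those entries.

! Minimal sketch of the CPU-enumeration / GPU-evaluation split; the
! "conservation" test and array names are illustrative placeholders.
program hybrid_enumeration
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ntotal = 1000000        ! candidate scattering processes
  integer, allocatable :: allowed_idx(:)
  real(dp) :: gamma_sum
  integer :: p, idx, nallowed

  ! Phase 1 (CPU): enumerate the processes that pass the (toy) conservation test.
  allocate(allowed_idx(ntotal))
  nallowed = 0
  do p = 1, ntotal
     if (mod(p, 97) == 0) then                  ! stand-in for the energy/momentum check
        nallowed = nallowed + 1
        allowed_idx(nallowed) = p
     end if
  end do

  ! Phase 2 (GPU): branch-free parallel evaluation over the compacted index list.
  gamma_sum = 0.0_dp
  !$acc parallel loop private(p) reduction(+:gamma_sum) copyin(allowed_idx(1:nallowed))
  do idx = 1, nallowed
     p = allowed_idx(idx)
     gamma_sum = gamma_sum + 1.0_dp / real(p, dp)   ! stand-in for the rate expression
  end do

  print *, 'allowed processes:', nallowed, '  accumulated rate:', gamma_sum
end program hybrid_enumeration

Because every GPU iteration now does useful work, occupancy is no longer limited by the sparsity of allowed processes; the price is the extra CPU pass and the memory needed for the index arrays, which is the trade-off examined in the Results section.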

3 Results

The performance of both the original and GPU-accelerated codes is evaluated using silicon as a benchmark material. All calculations are performed at 300 K under the RTA for both 3ph and 3ph+4ph scattering processes. The broadening factor is chosen as 0.1 for the 3ph+4ph calculations, following our previous work [37]. Simulations are carried out on the Gilbreth cluster at Purdue University’s Rosen Center for Advanced Computing (RCAC). The software and hardware configurations are summarized in Table 1.

Table 1: Experimental hardware and software configurations
Item            | Description
CPU             | AMD EPYC 7543 32-Core Processor
GPU             | Primary: NVIDIA A100 (80 GB); tests: NVIDIA A10 and A30 GPUs
Compiler (GPU)  | NVIDIA HPC Compiler (nvc 23.5-0)
CUDA version    | CUDA 12.6.0
Compiler (CPU)  | Intel oneAPI Compilers 2024.2.1
MPI version     | Intel MPI 2021.13

For CPU-only calculations, parallel computing is performed with 32 CPU cores since the wall time for serial computing is impractical. The reported time is CPU time, which represents the summed computational time over all cores. For GPU-accelerated calculations, the simulations are performed using one CPU core and one GPU, and the reported runtime is the sum of the CPU and GPU computation times. The relative difference in the calculated thermal conductivity between CPU-only and GPU-accelerated runs is less than 0.1%, which is primarily attributed to minor numerical variations between different compilers. These results confirm that GPU acceleration does not compromise the accuracy of the original code.

We first analyze the acceleration achieved by our CPU–GPU heterogeneous computing implementation. Figure 4 compares the total computational cost of the original CPU-based method and the CPU–GPU hybrid version across different q-mesh densities. For all tested q-meshes, we observe a consistent acceleration of over 10\(\times\) for both 3ph and 3ph+4ph scattering calculations. Note that this computational time includes not only the phonon scattering step but also other overheads that are inherently performed on the CPU, such as the calculation of harmonic properties, the phonon phase space, and the post-processing steps. If we consider only the CPU+GPU phonon scattering step, we observe even more substantial speedups of over 18\(\times\) and over 25\(\times\) for the 3ph and 4ph scattering rate calculations, respectively, as shown in the insets of Fig. 4. We expect an even higher acceleration if the broadening factor is set to unity for the 3ph+4ph calculation. These results clearly demonstrate the effectiveness of our GPU acceleration strategy.

Figure 4: Comparison of total computational cost between CPU-only and CPU-GPU hybrid implementations across different q-mesh sizes. (a) 3ph scattering, (b) 3ph+4ph scattering. The insets show the isolated computational cost of the 3ph and 4ph scattering step alone.

We then compare two GPU parallelization strategies for phonon scattering calculations: mode-by-mode parallelization (where each phonon mode is processed independently) and all-modes parallelization (where all scattering processes are computed collectively). Figure 5 (a) shows the comparison for representative cases. Note that for 4ph scattering, we choose a smaller q-mesh due to the large computational cost, but the trend should be similar for larger mesh sizes. We observe that all-modes parallelization provides additional acceleration of 21% and 9% for the 3ph case with a 32\(\times\)32\(\times\)32 q-mesh and the 3ph+4ph case with a 10\(\times\)10\(\times\)10 q-mesh, respectively. This performance gain is due to two factors: (1) enhanced parallelism enabled by concurrent execution of all scattering processes, and (2) reduced data transfer overhead, since launching a single kernel avoids frequent host–device memory traffic. However, this gain comes with a trade-off. As shown in Fig. 5 (b), preloading all scattering processes into GPU memory significantly increases the GPU memory cost. For dense q-meshes, this demand may exceed the available GPU memory. This is the case for the 3ph+4ph calculation on a 16\(\times\)16\(\times\)16 mesh, which could not be completed using the all-modes strategy due to excessive memory usage (>80 GB). To address this, we fall back to the mode-by-mode parallelization approach for the 16\(\times\)16\(\times\)16 case. While the computational efficiency is reduced, it still delivers a notable acceleration of approximately 7\(\times\) compared with the original CPU-based implementation (Fig. 5 (c)). This trade-off between memory usage and speed highlights the importance of selecting an appropriate parallelization strategy based on problem size and hardware constraints.

Figure 5: Comparison between all-modes and mode-by-mode parallelization strategies. (a) Computational cost and (b) GPU memory usage for 3ph and 3ph+4ph scattering calculations using q-meshes of 32\(\times\)32\(\times\)32 and 10\(\times\)10\(\times\)10, respectively. (c) Computational cost for 3ph+4ph scattering with a 16\(\times\)16\(\times\)16 q-mesh, comparing the CPU-only and the CPU–GPU mode-by-mode parallelization implementations.

We further compared our CPU–GPU hybrid method with a GPU-only implementation, as illustrated in Fig. 6. For the 3ph scattering calculation with a 32\(\times\)32\(\times\)32 q-mesh, while both approaches are faster than the CPU-only baseline, the CPU–GPU hybrid approach achieves a 5\(\times\) speedup over the GPU-only method. Moreover, for the 3ph+4ph calculation using a 10\(\times\)10\(\times\)10 q-mesh, the GPU-only implementation takes roughly twice as long as the CPU-only baseline. This degradation is primarily due to the sparsity of the 4ph scattering matrix, which results in severe warp divergence and significantly reduces parallel efficiency. These results further demonstrate the effectiveness of our heterogeneous computing strategy of assigning highly parallel workloads to the GPU while handling irregular, branch-heavy operations on the CPU.

Figure 6: Comparison between CPU–GPU hybrid and GPU-only implementations. (a) 3ph scattering with a q-mesh of 32\(\times\)32\(\times\)32, (b) 3ph+4ph scattering with a q-mesh of 10\(\times\)10\(\times\)10.

Finally, we evaluated the performance of our method on different GPU architectures. Since FourPhonon relies on double-precision arithmetic (FP64), performance is strongly influenced by the GPU's available FP64 floating-point operations per second (FLOPS). Specifically, we tested three NVIDIA GPUs: A100, A30, and A10. The key specifications of these GPUs, including memory capacity and peak FLOPS for both single-precision (FP32) and FP64 operations, are summarized in Table 2. To isolate GPU performance, we measured only the execution time of the GPU kernel, excluding CPU-side overheads. For our double-precision workloads, the computational speed ranks A100 > A30 > A10 (Fig. 7), consistent with the FP64 FLOPS ranking. Since the A10 is optimized for FP32 workloads and has significantly lower FP64 performance, it is less suitable for scientific computing applications and shows a substantial performance drop in our task. Additionally, the A100's larger memory capacity provides an advantage in dense q-mesh scenarios. For example, for a 14\(\times\)14\(\times\)14 q-mesh with the all-modes parallelization strategy, the total GPU memory requirement is approximately 63 GB. This exceeds the memory capacity of the A10 and A30, which must therefore fall back to mode-by-mode parallelization, sacrificing performance to stay within hardware limits. These results highlight the importance of selecting GPU hardware that matches the computational precision and memory demands of the targeted simulation workload.

Figure 7: Comparison of GPU kernel time on different NVIDIA GPUs (A10, A30, A100). (a) 3ph scattering, (b) 4ph scattering.
Table 2: Specifications of selected NVIDIA GPUs used in this work.
GPU Model | FP32 Performance (TFLOPS) | FP64 Performance (TFLOPS) | Memory (GB)
A100      | 19.5                      | 9.7                       | 80
A30       | 10.3                      | 5.2                       | 24
A10       | 31.2                      | 0.39                      | 24

There are several directions for further improving the performance. Currently, our method is limited to the RTA due to the high memory cost and the overhead associated with data transfer between CPU and GPU memory; future work could extend the framework to support iterative solvers. Moreover, combining GPU acceleration with sampling-estimation-based approaches [16] would further reduce both time and memory costs. In addition, the enumeration step on the CPU and the parallel computation step on the GPU are currently executed sequentially; introducing asynchronous operations to overlap the CPU and GPU work could further reduce total runtime. Lastly, while our implementation focuses on double-precision arithmetic for accuracy, many modern GPUs, particularly those optimized for machine learning workloads, deliver significantly higher single-precision performance. Exploring mixed-precision or single-precision approaches, where appropriate, could yield additional performance gains without compromising accuracy for certain tasks.
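
As a rough illustration of the asynchronous overlap suggested above (a sketch of the idea only, not an implemented feature of FourPhonon_GPU; the chunked arrays and placeholder kernel are ours), OpenACC async queues would allow the GPU to process one chunk of work while the CPU prepares the next:

! Rough sketch of overlapping CPU-side preparation with GPU execution using
! OpenACC async queues; chunk contents and the kernel body are placeholders.
program async_overlap
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nchunks = 4, nper = 1000000
  real(dp), allocatable :: buf_in(:, :), buf_out(:, :)
  integer :: c, p

  allocate(buf_in(nper, nchunks), buf_out(nper, nchunks))
  !$acc enter data create(buf_in, buf_out)

  buf_in(:, 1) = 1.0_dp                    ! CPU prepares the first chunk

  do c = 1, nchunks
     !$acc update device(buf_in(:, c)) async(1)           ! non-blocking host-to-device copy
     !$acc parallel loop present(buf_in, buf_out) async(1) ! non-blocking kernel launch on queue 1
     do p = 1, nper
        buf_out(p, c) = sqrt(buf_in(p, c)) + real(c, dp)   ! placeholder computation
     end do
     ! While queue 1 works on chunk c, the CPU prepares chunk c+1.
     if (c < nchunks) buf_in(:, c + 1) = real(c + 1, dp)
  end do

  !$acc wait(1)                            ! make sure all queued work has finished
  !$acc exit data copyout(buf_out) delete(buf_in)
  print *, 'sample results:', buf_out(1, 1), buf_out(1, nchunks)
end program async_overlap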

In conclusion, we developed FourPhonon_GPU, a GPU-accelerated framework for phonon scattering calculations using a CPU–GPU heterogeneous computing strategy. By offloading highly parallel tasks to the GPU and retaining enumeration and control-heavy operations on the CPU, our approach achieves substantial speedups, with over 10\(\times\) improvement in total runtime and over 25\(\times\) acceleration in the scattering rate calculation step. With comprehensive benchmarking tests, we demonstrated the effectiveness of this framework and highlighted the importance of aligning algorithm design with hardware capabilities. This work provides an efficient computational tool for evaluating materials’ thermal properties and paves the way for accelerating materials discovery.

4 Appendix

Algorithm [algo-CPU_only]: original CPU implementation of the 3ph absorption process.

Initialize global arrays: rate_scatt_plus(\(N_{bands},N_{list}\)), WP3_plus_array(\(N_{bands},N_{list}\))
for each phonon mode \(mm\) do  // driver loop
  Call the subroutine below for mode \(mm\)
end for

// Subroutine: compute scattering for mode \(mm\)
Get \(i, ll\) from \(mm\); get \(q\), \(\omega\), \(v\) from \(i, ll\); initialize \(\text{WP3}_{\text{plus}} \gets 0\), \(\Gamma_{\text{plus}} \gets 0\)
for all q-points \(ii\) and branch pairs \(j, k\) do  // nested loops over phonon combinations
  Get \(q'\), \(\omega'\), \(v'\) from \(j, ii\); \(f' \gets \text{BE}(\omega')\)
  \(q'' \gets \text{modulo}(q + q', N_{\text{grid}})\); get \(ss\) from \(q''\); get \(\omega''\), \(v''\) from \(ss, k\); \(f'' \gets \text{BE}(\omega'')\)
  \(\sigma \gets \text{compute\_sigma}(v' - v'')\)
  if energy conservation (\(\omega + \omega' \approx \omega''\)) is satisfied within the broadening \(\sigma\) then
    \(\text{WP} \gets \frac{(f' - f'') \cdot \exp\left[-\frac{(\omega + \omega' - \omega'')^2}{\sigma^2}\right]}{\sigma \sqrt{\pi} \cdot \omega \omega' \omega''}\)
    \(\text{WP3}_{\text{plus}} \gets \text{WP3}_{\text{plus}} + \text{WP}\)
    \(V_p \gets \text{compute\_Vp}(\ldots)\); \(\Gamma_{\text{plus}} \gets \Gamma_{\text{plus}} + \text{WP} \cdot |V_p|^2\)
  end if
end for
\(\text{rate\_scatt\_plus}(i,ll) \gets \Gamma_{\text{plus}}\); \(\text{WP3\_plus\_array}(i,ll) \gets \text{WP3}_{\text{plus}}\)
// End subroutine

// Driver, after the loop over modes:
\(\text{WP3\_plus\_array} \gets \text{WP3\_plus\_array} / nptk\); \(\text{rate\_scatt\_plus} \gets \text{rate\_scatt\_plus} \cdot \text{const} / nptk\)

Algorithm [algo-GPU_only]: direct GPU offloading with all-modes parallelization.

Initialize global arrays: rate_scatt_plus(\(N_{bands},N_{list}\)), WP3_plus_array(\(N_{bands},N_{list}\)); copy required data to GPU memory

// GPU: launch parallel loop over all phonon modes
for each phonon mode \(mm\) do
  Get \(i, ll\) from \(mm\); get \(q\), \(\omega\), \(v\) from \(i, ll\); initialize \(\text{WP3}_{\text{plus}} \gets 0\), \(\Gamma_{\text{plus}} \gets 0\)
  // GPU: parallel loop, collapse(3), reduction over \(\Gamma_{\text{plus}}\) and \(\text{WP3}_{\text{plus}}\)
  for all q-points \(ii\) and branch pairs \(j, k\) do
    Get \(q'\), \(\omega'\), \(v'\) from \(j, ii\)
    \(q'' \gets \text{modulo}(q + q', N_{\text{grid}})\); get \(ss\) from \(q''\); get \(\omega''\), \(v''\) from \(ss, k\); \(f' \gets \text{BE}(\omega')\), \(f'' \gets \text{BE}(\omega'')\)
    // GPU: inner parallel loop, reduction over \(\sigma\)
    \(\sigma \gets \text{inline\_compute\_sigma}(v' - v'')\)
    if energy conservation is satisfied within \(\sigma\) then  // divergent branch on the GPU
      \(\text{WP} \gets \frac{(f' - f'') \cdot \exp\left[-\frac{(\omega + \omega' - \omega'')^2}{\sigma^2}\right]}{\sigma \sqrt{\pi} \cdot \omega \omega' \omega''}\); \(\text{WP3}_{\text{plus}} \gets \text{WP3}_{\text{plus}} + \text{WP}\)
      \(V_p \gets \text{inline\_compute\_Vp}(\ldots)\); \(\Gamma_{\text{plus}} \gets \Gamma_{\text{plus}} + \text{WP} \cdot |V_p|^2\)
    end if
  end for
  \(\text{rate\_scatt\_plus}(i,ll) \gets \Gamma_{\text{plus}}\); \(\text{WP3\_plus\_array}(i,ll) \gets \text{WP3}_{\text{plus}}\)
end for
Copy results back to CPU memory and release GPU memory
\(\text{WP3\_plus\_array} \gets \text{WP3\_plus\_array} / nptk\); \(\text{rate\_scatt\_plus} \gets \text{rate\_scatt\_plus} \cdot \text{const} / nptk\)

Algorithm [algo-CPU_hybrid]: CPU enumeration of all allowed scattering processes.

Compute cumulative offset array \(N_{\text{accum\_plus}}\) from \(N_{\text{plus}}\)
Initialize global arrays: \(\text{Ind2}_{\text{all}}(\text{sum}(N_{\text{plus}}))\), \(\text{Ind3}_{\text{all}}(\text{sum}(N_{\text{plus}}))\)
for each phonon mode \(mm\) do
  Get \(N_{\text{plus}}\) from \(mm\); get \(i, ll\) from \(mm\); get \(q\), \(\omega\), \(v\) from \(i, ll\)
  Initialize \(N_{\text{plus\_count}} \gets 0\); initialize arrays \(\text{Ind2}(N_{\text{plus}})\), \(\text{Ind3}(N_{\text{plus}})\)
  for all q-points \(ii\) and branch pairs \(j, k\) do
    Get \(q'\), \(\omega'\), \(v'\) from \(j, ii\); \(q'' \gets \text{modulo}(q + q', N_{\text{grid}})\)
    Get \(ss\) from \(q''\); get \(\omega''\), \(v''\) from \(ss, k\); \(\sigma \gets \text{compute\_sigma}(v' - v'')\)
    if energy conservation is satisfied within \(\sigma\) then
      \(N_{\text{plus\_count}} \gets N_{\text{plus\_count}} + 1\)
      \(\text{Ind2}(N_{\text{plus\_count}}) \gets (ii - 1) \cdot N_{\text{bands}} + j\); \(\text{Ind3}(N_{\text{plus\_count}}) \gets (ss - 1) \cdot N_{\text{bands}} + k\)
    end if
  end for
  Copy \(\text{Ind2}\) to \(\text{Ind2}_{\text{all}}\) and \(\text{Ind3}\) to \(\text{Ind3}_{\text{all}}\) starting at offset \(N_{\text{accum\_plus}}(mm) + 1\)
end for

Algorithm [algo-GPU_hybrid]: GPU evaluation of the enumerated processes with all-modes parallelization.

Compute cumulative offset array \(N_{\text{accum\_plus}}\) from \(N_{\text{plus}}\); copy data and index arrays to GPU memory

// GPU: launch parallel loop over all phonon modes
for each phonon mode \(mm\) do
  Get \(i, ll\) from \(mm\); get \(q\), \(\omega\) from \(i, ll\); initialize \(\Gamma_{\text{plus}} \gets 0\), \(\text{WP3}_{\text{plus}} \gets 0\)
  // GPU: parallel loop over the allowed processes of mode \(mm\), reduction over \(\Gamma_{\text{plus}}\) and \(\text{WP3}_{\text{plus}}\)
  for each allowed process of mode \(mm\), located via \(N_{\text{accum\_plus}}(mm)\) do
    Get phonon indices \((j,ii)\), \((k,ss)\) from \(\text{Ind2}\), \(\text{Ind3}\)
    Get \(q'\), \(\omega'\), \(v'\) from \(j, ii\); \(q'' \gets \text{modulo}(q + q', N_{\text{grid}})\); get \(\omega''\), \(v''\) from \(k, ss\)
    // GPU: inner parallel loop, reduction over \(\sigma\)
    \(\sigma \gets \text{inline\_compute\_sigma}(v' - v'')\); \(f' \gets \text{BE}(\omega')\), \(f'' \gets \text{BE}(\omega'')\)
    \(\text{WP} \gets \frac{(f' - f'') \cdot \exp\left[-\frac{(\omega + \omega' - \omega'')^2}{\sigma^2}\right]}{\sigma \sqrt{\pi} \cdot \omega \omega' \omega''}\); \(\text{WP3}_{\text{plus}} \gets \text{WP3}_{\text{plus}} + \text{WP}\)
    \(V_p \gets \text{inline\_compute\_Vp}(\ldots)\); \(\Gamma_{\text{plus}} \gets \Gamma_{\text{plus}} + \text{WP} \cdot |V_p|^2\)
  end for
  \(\text{rate\_scatt\_plus}(i,ll) \gets \Gamma_{\text{plus}}\); \(\text{WP3\_plus\_array}(i,ll) \gets \text{WP3}_{\text{plus}}\)
end for
\(\text{WP3\_plus\_array} \gets \text{WP3\_plus\_array} / nptk\); \(\text{rate\_scatt\_plus} \gets \text{rate\_scatt\_plus} \cdot \text{const} / nptk\)

Algorithm [algo-GPU_hybrid-mode-wise]: GPU evaluation of the enumerated processes with mode-by-mode parallelization.

Compute cumulative offset array \(N_{\text{accum\_plus}}\) from \(N_{\text{plus}}\)
for each phonon mode \(mm\) do  // CPU loop; one GPU kernel is launched per mode
  Get \(i, ll\) from \(mm\); get \(q\), \(\omega\) from \(i, ll\); initialize \(\Gamma_{\text{plus}} \gets 0\), \(\text{WP3}_{\text{plus}} \gets 0\)
  // GPU: launch parallel loop over the allowed processes of mode \(mm\), reduction over \(\Gamma_{\text{plus}}\) and \(\text{WP3}_{\text{plus}}\)
  for each allowed process of mode \(mm\), located via \(N_{\text{accum\_plus}}(mm)\) do
    Get phonon indices \((j,ii)\), \((k,ss)\) from \(\text{Ind2}\), \(\text{Ind3}\)
    Get \(q'\), \(\omega'\), \(v'\) from \(j, ii\); \(q'' \gets \text{modulo}(q + q', N_{\text{grid}})\); get \(\omega''\), \(v''\) from \(k, ss\)
    // GPU: inner parallel loop, reduction over \(\sigma\)
    \(\sigma \gets \text{inline\_compute\_sigma}(v' - v'')\); \(f' \gets \text{BE}(\omega')\), \(f'' \gets \text{BE}(\omega'')\)
    \(\text{WP} \gets \frac{(f' - f'') \cdot \exp\left[-\frac{(\omega + \omega' - \omega'')^2}{\sigma^2}\right]}{\sigma \sqrt{\pi} \cdot \omega \omega' \omega''}\); \(\text{WP3}_{\text{plus}} \gets \text{WP3}_{\text{plus}} + \text{WP}\)
    \(V_p \gets \text{inline\_compute\_Vp}(\ldots)\); \(\Gamma_{\text{plus}} \gets \Gamma_{\text{plus}} + \text{WP} \cdot |V_p|^2\)
  end for
  \(\text{rate\_scatt\_plus}(i,ll) \gets \Gamma_{\text{plus}}\); \(\text{WP3\_plus\_array}(i,ll) \gets \text{WP3}_{\text{plus}}\)
end for
\(\text{WP3\_plus\_array} \gets \text{WP3\_plus\_array} / nptk\); \(\text{rate\_scatt\_plus} \gets \text{rate\_scatt\_plus} \cdot \text{const} / nptk\)

Data availability

The original results of the study are available from the corresponding authors upon reasonable request.

Code availability

The work is incorporated as a new feature of the FourPhonon package and is available at https://github.com/FourPhonon/FourPhonon.

Competing interests

The authors declare no competing interests.

Author contributions

G.L., X.R. and Z.G. conceived the study. Z.G. designed and implemented the software, did the simulations, analyzed the results, and wrote the manuscript. G.L. and X.R. supervised the project. All authors contributed to discussions and revisions of the manuscript.

Acknowledgements

Z.G. and X.R. acknowledge partial support from NSF Awards 2311848 and 2321301. Z.G. is partly supported by the System Fellows 2024 Doctoral Fellowship provided by the Purdue Systems Collaboratory at Purdue University. Simulations were performed at the Rosen Center for Advanced Computing (RCAC) of Purdue University. G.L. acknowledges the National Science Foundation under grants DMS-2053746, DMS-2134209, ECCS-2328241, CBET-2347401, and OAC-2311848. The U.S. Department of Energy also supports this work through the Office of Science Advanced Scientific Computing Research program (DE-SC0023161) and the Office of Fusion Energy Sciences (DE-SC0024583).

References

[1]
A. L. Moore and L. Shi, “Emerging challenges and materials for thermal management of electronics,” Materials today 17, 163–174 (2014).
[2]
S. He, Z. Ma, W. Deng, Z. Zhang, Z. Guo, W. Liu, and Z. Liu, “Experimental investigation on the start-up performance of a novel flat loop heat pipe with dual evaporators,” Energy Reports 8, 7500–7507 (2022).
[3]
M. S. Dresselhaus, G. Chen, M. Y. Tang, R. Yang, H. Lee, D. Wang, Z. Ren, J.-P. Fleurial, and P. Gogna, “New directions for low-dimensional thermoelectric materials,” Advanced materials 19, 1043–1053 (2007).
[4]
G. Chen, M. Dresselhaus, G. Dresselhaus, J.-P. Fleurial, and T. Caillat, “Recent developments in thermoelectric materials,” International materials reviews 48, 45–66 (2003).
[5]
J. Ziman, Electrons and Phonons: The Theory of Transport Phenomena in Solids (Oxford University Press, Northamptonshire, 2001), http://dx.doi.org/10.1093/acprof:oso/9780198507796.001.0001.
[6]
R. Peierls, “Zur kinetischen theorie der wärmeleitung in kristallen,” Annalen der Physik 395, 1055–1101 (1929).
[7]
A. Maradudin and A. Fein, “Scattering of neutrons by an anharmonic crystal,” Phys. Rev. 128, 2589 (1962).
[8]
D. A. Broido, M. Malorny, G. Birner, N. Mingo, and D. Stewart, “Intrinsic lattice thermal conductivity of semiconductors from first principles,” Appl. Phys. Lett. 91, 231922 (2007).
[9]
T. Feng and X. Ruan, “Quantum mechanical prediction of four-phonon scattering rates and reduced thermal conductivity of solids,” Phys. Rev. B 93, 045202 (2016).
[10]
T. Feng, L. Lindsay, and X. Ruan, “Four-phonon scattering significantly reduces intrinsic thermal conductivity of solids,” Phys. Rev. B 96, 161201 (2017).
[11]
J. S. Kang, M. Li, H. Wu, H. Nguyen, and Y. Hu, “Experimental observation of high thermal conductivity in boron arsenide,” Science 361, 575–578 (2018).
[12]
F. Tian, B. Song, X. Chen, N. K. Ravichandran, Y. Lv, K. Chen, S. Sullivan, J. Kim, Y. Zhou, T.-H. Liu, et al., “Unusual high thermal conductivity in boron arsenide bulk crystals,” Science 361, 582–585 (2018).
[13]
S. Li, Q. Zheng, Y. Lv, X. Liu, X. Wang, P. Y. Huang, D. G. Cahill, and B. Lv, “High thermal conductivity in cubic boron arsenide crystals,” Science 361, 579–581 (2018).
[14]
Z. Guo, P. Roy Chowdhury, Z. Han, Y. Sun, D. Feng, G. Lin, and X. Ruan, “Fast and accurate machine learning prediction of phonon scattering rates and lattice thermal conductivity,” Npj Comput. Mater. 9, 95 (2023).
[15]
Y. Srivastava and A. Jain, “Accelerating thermal conductivity prediction through machine-learning: Two orders of magnitude reduction in phonon-phonon scattering rates calculation,” Materials Today Physics 41, 101345 (2024).
[16]
Z. Guo, Z. Han, D. Feng, G. Lin, and X. Ruan, “Sampling-accelerated prediction of phonon scattering rates for converged thermal conductivity and radiative properties,” Npj Comput. Mater. 10, 31 (2024).
[17]
Z. Guo, Z. Han, A. Alkandari, K. Khot, and X. Ruan, “First-principles prediction of thermal conductivity of bulk hexagonal boron nitride,” Applied Physics Letters 124 (2024).
[18]
X. Zhang, C. Shao, and H. Bao, “Cryogenic thermal transport properties from accelerated first-principles calculations: Role of boundary and isotope scattering,” Physical Review B 110, 224301 (2024).
[19]
Z. Guo, P. Sokalski, Z. Han, Y. Cheng, L. Shi, T. Taniguchi, K. Watanabe, and X. Ruan, “First-principles prediction of zone-center optical phonon linewidths and IR spectra of hexagonal boron nitride,” Applied Physics Letters 125 (2024).
[20]
Z. Tang, X. Wang, C. He, J. Li, M. Chen, C. Tang, and T. Ouyang, “Effects of thermal expansion and four-phonon interactions on the lattice thermal conductivity of the negative thermal expansion material ScF\(_3\),” Physical Review B 110, 134320 (2024).
[21]
A. Alkandari, Z. Han, Z. Guo, T. E. Beechem, and X. Ruan, “Anisotropic anharmonicity dictates the thermal conductivity of \(\beta\)-Ga\(_2\)O\(_3\),” Physical Review B 111, 094308 (2025).
[22]
J. Wei, Z. Xia, Y. Xia, and J. He, “Hierarchy of exchange-correlation functionals in computing lattice thermal conductivities of rocksalt and zinc-blende semiconductors,” Physical Review B 110, 035205 (2024).
[23]
Z. Guo, I. Katsamba, D. Carne, D. Feng, K. Moss, E. Barber, Z. Fang, A. Felicelli, and X. Ruan, “Electronic and phononic characteristics of high-performance radiative cooling pigments h-BN: A comparative study to BaSO\(_4\),” https://doi.org/10.1016/j.mtphys.2025.101721.
[24]
K. Khot, B. Xiao, Z. Han, Z. Guo, Z. Xiong, and X. Ruan, “Phonon local non-equilibrium at Al/Si interface from machine learning molecular dynamics,” http://dx.doi.org/10.1063/5.0243641.
[25]
J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, “A survey of general-purpose computation on graphics hardware,” in Computer graphics forum, Vol. 26 (Wiley Online Library, 2007), pp. 80–113.
[26]
R. Raina, A. Madhavan, A. Y. Ng, et al., “Large-scale deep unsupervised learning using graphics processors,” in ICML, Vol. 9 (2009), pp. 873–880.
[27]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems 25 (2012).
[28]
D. Carne, Z. Guo, and X. Ruan, arXiv:2509.22890 (2025), https://arxiv.org/abs/2509.22890.
[29]
J. E. Stone, J. C. Phillips, L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, “Accelerating molecular modeling applications with graphics processors,” Journal of computational chemistry 28, 2618–2640 (2007).
[30]
D. B. Kirk and W. H. Wen-Mei, Programming massively parallel processors: a hands-on approach (Morgan Kaufmann, 2016).
[31]
X. Gonze, F. Jollet, F. A. Araujo, D. Adams, B. Amadon, T. Applencourt, C. Audouze, J.-M. Beuken, J. Bieder, A. Bokhanchuk, et al., “Recent developments in the abinit software package,” Computer physics communications 205, 106–131 (2016).
[32]
M. Hacene, A. Anciaux-Sedrakian, X. Rozanska, D. Klahr, T. Guignon, and P. Fleurat-Lessard, “Accelerating vasp electronic structure calculations using graphic processing units,” Journal of computational chemistry 33, 2581–2589 (2012).
[33]
M. Hutchinson and M. Widom, “Vasp on a gpu: Application to exact-exchange calculations of the stability of elemental boron,” Computer Physics Communications 183, 1422–1426 (2012).
[34]
Y. Wei, X. You, H. Yang, Z. Luan, and D. Qian, “Towards gpu acceleration of phonon computation with shengbte,” in Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020(Association for Computing Machinery, New York, NY, USA, 2020) p. 32–42.
[35]
W. Li, J. Carrete, N. A. Katcho, and N. Mingo, “Shengbte: A solver of the boltzmann transport equation for phonons,” Computer Physics Communications 185, 1747–1758 (2014).
[36]
B. Zhang, Z. Fan, C. Zhao, and X. Gu, “Gpu_pbte: an efficient solver for three and four phonon scattering rates on graphics processing units,” J. Phys. Condens. Matter 33, 495901 (2021).
[37]
Z. Han, X. Yang, W. Li, T. Feng, and X. Ruan, “Fourphonon: An extension module to shengbte for computing four-phonon scattering rates and thermal conductivity,” Computer Physics Communications 270, 108179 (2022).