October 15, 2025
We present forecasts for constraints on the matter density (\(\Omega_m\)) and the amplitude of matter density fluctuations at 8h\(^{-1}\)Mpc (\(\sigma_8\)) from CMB lensing convergence (\(\kappa_\text{CMB}\)) maps and galaxy weak lensing convergence (\(\kappa_\text{WL}\)) maps. For \(\kappa_\text{CMB}\) auto statistics, we compare the angular power spectra (\(C_\ell\)’s) to the wavelet scattering transform (WST) coefficients. For \(\kappa_\text{CMB}\times\kappa_\text{WL}\) statistics, we compare the cross angular power spectra to wavelet phase harmonics (WPH). This work also serves as the first application of WST and WPH to these probes. For \(\kappa_\text{CMB}\), we find that WST and \(C_\ell\)’s yield similar constraints in forecasts for the Simons Observatory and the South Pole Telescope. However, WST gives a tighter
constraint on \(\sigma_8\) by a factor of \(1.7\) for Planck data. When \(\kappa_\text{CMB}\) is crossed with \(\kappa_\text{WL}\) projected from Euclid Data Release 2 (DR2), we find that WPH outperforms cross-\(C_\ell\)’s by factors between \(2.4\) and \(3.8\) for individual parameter constraints. To compare these different summary statistics we develop a novel learned binning approach. This method compresses summary statistics while maintaining interpretability. We find this
leads to improved constraints compared to more naive binning schemes for \(C_\ell\)’s, WST, and most significantly WPH. By learning the binning and measuring constraints on distinct data sets, our method is robust to
overfitting by construction.
Gravitational lensing of photons occurs when light is deflected by a mass distribution in between the source and the observer [1]. By observing distortions in the shapes of numerous galaxies from weak lensing (hereafter WL), statistical methods can be used to obtain integrated line-of-sight mass maps [2]. Current and upcoming surveys, such as the Euclid mission [3], the Dark Energy Survey [4], the Kilo-Degree Survey [5], the Hyper Suprime-Cam [6], and the Vera C. Rubin Observatory’s Legacy Survey of Space and Time [7], will measure galaxy shapes, enabling measurements of these integrated mass maps. Additionally, by observing the lensing of the cosmic microwave background (hereafter CMBL), one can obtain mass maps integrated all the way back to the surface of last scattering [8]. Examples of surveys that perform this measurement include Planck [9], the Atacama Cosmology Telescope (ACT) [10], the South Pole Telescope (SPT) [11], and the Simons Observatory (SO) [12]. Lensing-convergence maps from these surveys are used to constrain cosmology by comparing measured summary statistics with predictions from analytic theories or forward-modeled simulations, within either a likelihood-based or simulation-based inference framework.
If the convergence maps were Gaussian random fields, all of their information would be captured by two-point statistics. However, due to non-linear gravitational evolution, the large-scale structure of the universe has non-Gaussianities that become more pronounced at low redshift. Since two-point statistics fail to capture non-Gaussian information, other summary statistics are desirable. Numerous summary statistics have been used to try to extract non-Gaussian information, including the wavelet scattering transform [13]–[18] and wavelet phase harmonics [17]. Other approaches include higher-order N-point statistics [19]–[23], density-split clustering [24], topological descriptors such as Minkowski functionals [25], [26], and machine-learning-based summaries like convolutional neural networks [27].
In this work, we compare the information content of selected summary statistics applied to the lensing of the cosmic microwave background and weak lensing fields using Fisher forecasts. Each summary statistic compresses the information in a field and typically depends on a handful of hyper-parameters (such as the range of scales in a power spectrum or the number of different opening angles in three-point statistics). Hyper-parameter choices change the effective scales and the dimensionality of the summary statistics; this can impact the stability of the covariance inversion in Gaussian-likelihood pipelines and, for simulation-based inference pipelines, inflate model complexity and training variance. Constraints then become highly dependent on the hyper-parameters. We address this by developing a novel learned binning approach, which produces compact and informative summaries.
In our approach, each summary statistic has access to all scales present in our maps. The statistics are then compressed via binning to the same size while maximizing the information content (the determinant of the Fisher matrix). By learning our binning and providing parameter constraints on distinct data sets, our method is robust to overfitting and yields a compressed statistic without sacrificing interpretability.
The rest of the paper is organized as follows: in Section 2 we introduce the simulations used in our analysis and explain how to use them to construct convergence maps. In Section 3, we introduce the summary statistics used. In Section 4, we present an overview of our learned binning approach, which is then applied to simulated data in Section 5. Finally, we conclude in Section 6. All parameter constraints are presented in Appendix 8, and their convergence is tested in Appendix 9. Additional details on the learned binning approach are given in Appendix 10. We verify the Gaussianity of our summary statistics required for a reliable Fisher forecast in Appendix 11.
We use weak gravitational lensing convergence as our simulated observable. Weak gravitational lensing distorts distant images and can be summarized by the lensing potential \(\psi\), which is an integral of the gravitational potential \(\Phi\) along the line of sight:
\[\psi(\vec{\theta},D_S) = \frac{2}{c^2D_S}\int_0^{D_S}dD_L \frac{D_{LS}}{D_L}\Phi(D_L\vec{\theta},D_L),\] where \(D_L\), \(D_S\), and \(D_{LS}\) are the angular distances from the observer to the lens, the observer to the source, and the lens to the source respectively, \(c\) is the speed of light, and \(\vec{\theta}\) is the angular position on the sky. From the lensing potential, we can then construct the convergence map \(\kappa\) as:
\[\kappa(\vec{\theta},D_S) \equiv \frac{1}{2}\nabla^2_\theta \psi(\vec{\theta},D_S).\]
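On the full sky this Laplacian acts diagonally in harmonic space, so \(\kappa_{\ell m} = -\tfrac{1}{2}\ell(\ell+1)\psi_{\ell m}\). As an illustration only (our pipeline below instead builds \(\kappa\) directly from particle lightcones), a minimal healpy sketch of this conversion:

```python
import numpy as np
import healpy as hp

def kappa_from_psi(psi_map, lmax=None):
    """Convergence map from a lensing-potential map via
    kappa_lm = -0.5 * l * (l + 1) * psi_lm (spherical Laplacian in harmonic space)."""
    nside = hp.get_nside(psi_map)
    lmax = lmax if lmax is not None else 3 * nside - 1
    psi_lm = hp.map2alm(psi_map, lmax=lmax)
    ells = np.arange(lmax + 1)
    kappa_lm = hp.almxfl(psi_lm, -0.5 * ells * (ells + 1.0))
    return hp.alm2map(kappa_lm, nside, lmax=lmax)
```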
In this work we exclusively use convergence maps integrated out to the cosmic microwave background (CMB) (\(\kappa_\text{CMB}(\vec{\theta})\)) and convergence maps integrated out to redshift ranges of galaxy surveys (\(\kappa_\text{WL}(\vec{\theta})\)). These are defined in the following way:
\[\begin{gather} \kappa_\text{CMB}(\vec{\theta})\equiv\kappa(\vec{\theta}, D_{S,\text{CMB}}), \\ \kappa_\text{WL}(\vec{\theta})\equiv\kappa(\vec{\theta}, D_{S,\text{WL}}), \end{gather}\] where \(D_{S,\text{CMB}}\) and \(D_{S,\text{WL}}\) are the angular distances to the CMB and redshift bin of galaxies, respectively. We now turn to the simulations used to generate convergence maps.
ULAGAM Suite
The analysis in this work uses the ULAGAM suite of N-body simulations [28]. The suite consists of 2000 simulations at a fiducial cosmology,
plus sets of 100 simulations where one parameter is varied at a time in a stepwise fashion for Fisher forecast purposes.
The simulations were run using PKDGRAV3 [29], and are performed in periodic cubic boxes of size \(L =
1\,h^{-1}\mathrm{Gpc} \approx 1.5\,\mathrm{Gpc}\), initialized at redshift \(z = 127\) and evolved with \(512^3\) dark matter particles. The initial conditions for all simulations
were sourced from the Quijote suite [30]. The fiducial cosmological model is \(\Lambda\)
Cold Dark Matter (\(\Lambda\)CDM) and assumes the following parameters: matter density \(\Omega_m = 0.317\), baryon density \(\Omega_b = 0.049\), reduced
Hubble constant \(h = 0.671\), spectral tilt \(n_s = 0.962\), and amplitude of matter density fluctuations at 8h\(^{-1}\) Mpc \(\sigma_8 = 0.834\).
Our analysis focuses on forecasting constraints for \(\Omega_m\) and \(\sigma_8\), which are the \(\Lambda\)CDM parameters determined by weak lensing. To
evaluate the derivatives necessary for a Fisher forecast, we use simulations where \(\Omega_m\) is shifted by \(\pm0.01\) and where \(\sigma_8\) is shifted
by \(\pm0.015\). We use 1000 realizations for the fiducial cosmology, and 100 realizations each for variations in \(\sigma_8\) and \(\Omega_m\).
PKDGRAV3 automatically generates particle lightcones in the form of HEALPix pixel maps [31] at varying redshifts. These maps are provided at \(\texttt{NSIDE}=1024\), corresponding to a pixel size of roughly 3.44 arcminutes. The simulation box is tiled and
repeated as needed to construct sufficiently large volumes for building full-sky lightcones up to a given redshift. We convert the particle maps into overdensity maps, and use them to generate weak lensing convergence maps, which we turn to in the next
subsection.
Table 1: Survey footprint areas (deg\(^2\)).

| Survey | Survey Area | Overlap with Euclid DR2 |
|---|---|---|
| ACT | 15472 | 2903 |
| Planck | 24915 | 7290 |
| SPT | 4001 | 2415 |
| SO | 27229 | 3075 |
We follow the procedure described in Section \(3.1\) of [32] to create noiseless, full-sky \(\kappa_\text{CMB}\) maps for each of our simulations. In particular, we use the Born approximation to integrate the overdensity maps up to redshift \(z = 3.5\). The CMB lensing power spectrum is then computed for the range \(3.5 < z < 1089\), and a Gaussian realization based on this spectrum is generated and added to obtain the full \(\kappa_\text{CMB}\) map.
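For concreteness, a minimal sketch of such a Born-approximation sum over overdensity shells (the high-redshift Gaussian contribution is added separately); the array names, the \(h^{-1}\mathrm{Mpc}\) distance units, and the shell layout are our assumptions, not the exact ULAGAM/[32] interface:

```python
import numpy as np

def kappa_born(delta_shells, chi, dchi, z, chi_cmb, omega_m):
    """Born-approximation CMB lensing convergence from overdensity shells.

    delta_shells : (n_shell, n_pix) HEALPix overdensity maps out to z = 3.5
    chi, dchi, z : comoving distance, shell width [Mpc/h], and redshift of each shell
    chi_cmb      : comoving distance to the last-scattering surface [Mpc/h]
    """
    c = 299792.458   # speed of light [km/s]
    H0 = 100.0       # Hubble constant in km/s per Mpc/h, matching the distance units
    # lensing efficiency kernel W(chi) = (3/2) Omega_m (H0/c)^2 chi (1+z) (chi_s - chi)/chi_s
    w = 1.5 * omega_m * (H0 / c) ** 2 * chi * (1.0 + z) * (chi_cmb - chi) / chi_cmb
    return np.einsum("s,sp->p", w * dchi, delta_shells)
```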
We generate mock survey \(\kappa_\text{CMB}\) maps by adding Gaussian noise and footprint masks to the full-sky, noiseless convergence maps, simulating various experiments. Specifically, we produce maps for Planck, ACT Data Release 6 (DR6), SPT main and summer fields, and SO. The noise is modeled as purely Gaussian and generated from the corresponding noise power spectra, taken from [12] and [33]. For ACT, our noise power spectra only go to \(\ell_\text{max}=2090\), so we set all signal beyond this to zero. Signal and noise power spectra for \(\kappa_\text{CMB}\) are shown for all surveys used in Figure 1.
For Planck, we further apply Gaussian beam smoothing with a full width at half maximum of 10 arcminutes after adding noise, mimicking the instrument’s beam. We neglect the beams of the other experiments since they are smaller than the pixel size of our maps. Survey masks are applied to each simulated map to match the respective sky coverage. For SPT, we consider the union of the main and summer fields, with the corresponding noise power spectrum applied to each region. The coverage area of all surveys can be found in Table 1. An example \(\kappa_\text{CMB}\) map with Planck noise levels is shown in Figure 2.
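Schematically, one mock map per survey is assembled as below; the array names (the noise power spectrum `nl_kk` and boolean footprint `mask`) are placeholders for the survey-specific inputs described above:

```python
import numpy as np
import healpy as hp

def mock_kappa_cmb(kappa_signal, nl_kk, mask, fwhm_arcmin=None, seed=0):
    """Add Gaussian reconstruction noise (and, for Planck, a 10-arcmin beam)
    to a noiseless full-sky convergence map, then apply the survey footprint."""
    nside = hp.get_nside(kappa_signal)
    np.random.seed(seed)
    noise = hp.synfast(nl_kk, nside, lmax=len(nl_kk) - 1)  # Gaussian noise realization
    kappa = kappa_signal + noise
    if fwhm_arcmin is not None:
        kappa = hp.smoothing(kappa, fwhm=np.radians(fwhm_arcmin / 60.0))
    return np.where(mask, kappa, hp.UNSEEN)                # mask outside the footprint
```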
To create mock \(\kappa_\text{WL}\) maps, we use a simplified version of the pipeline described in [17], tailored to resemble Euclid DR2 data. First, we generate noiseless convergence maps as a function of redshift from the density field using the Born approximation.
We assume that the Euclid DR2 weak-lensing source sample is divided into six tomographic bins of roughly equal number density. For simplicity, we only use two tomographic bins: the bin with the lowest redshift and the bin with the highest redshift. The first (last) tomographic bin has a mean redshift of 0.33 (1.37) and a standard deviation of 0.09 (0.16). For each bin, we produce a convergence map by summing the redshift shells weighted by the corresponding redshift distribution. We add Gaussian shape noise, assuming an ellipticity dispersion of \(\sigma_e = 0.26\) and a galaxy number density of 30 galaxies per arcmin\(^2\), uniformly distributed across the tomographic bins [3]. Finally, we mask the mock \(\kappa_\text{WL}\) maps by applying the Euclid DR2 footprint mask. For all \(\kappa_\text{CMB}\times\kappa_\text{WL}\) statistics, we apply both the Euclid and CMB-survey masks. The overlap coverage area for all surveys can be found in Table 1.
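The per-pixel shape-noise level follows from the ellipticity dispersion and the galaxy count per pixel. A minimal sketch, assuming the common \(\sigma_\text{pix}=\sigma_e/\sqrt{N_\text{gal}}\) convention and 5 galaxies per arcmin\(^2\) in each of the six bins (our reading of the numbers above):

```python
import numpy as np
import healpy as hp

def add_shape_noise(kappa_wl, sigma_e=0.26, n_gal_arcmin2=5.0, nside=1024, seed=0):
    """Add Gaussian shape noise to one tomographic-bin convergence map."""
    pix_area_arcmin2 = hp.nside2pixarea(nside, degrees=True) * 3600.0
    n_gal_per_pix = n_gal_arcmin2 * pix_area_arcmin2
    sigma_pix = sigma_e / np.sqrt(n_gal_per_pix)          # noise per pixel
    rng = np.random.default_rng(seed)
    return kappa_wl + rng.normal(0.0, sigma_pix, size=kappa_wl.size)
```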
For \(\kappa_\text{CMB}\) maps, we quantify information using the angular power spectrum (\(C_\ell\)) and the wavelet scattering transform (WST). For \(\kappa_\text{CMB}\times\kappa_\text{WL}\), we generalize these statistics to the cross angular power spectrum and wavelet phase harmonics (WPH).
For a function \(A(\Omega)\) on the sphere, \(a^A_{\ell m}\) values are defined as:
\[a^A_{\ell m} = \int A(\Omega)Y_{\ell,m}^*(\Omega)\,\,d\Omega ,\] with \(Y_{\ell,m}\) denoting the spherical harmonics and \(\Omega\) the solid angle. From the \(a_{\ell m}\)’s, we define the cross angular power spectra as:
\[C^{AB}_\ell = \frac{1}{2\ell+1}\sum_{m=-\ell}^\ell a^A_{\ell,m} a^{B*}_{\ell,m}.\]
The auto angular power spectra are defined with \(A=B\). From here on, we will omit superscripts as the context will make it clear which power spectrum we are referring to.
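In practice, both estimates can be obtained with healpy (a naive pseudo-\(C_\ell\) estimate on maps whose masked pixels are set to zero; any \(f_\text{sky}\) or mask-deconvolution correction is left implicit here):

```python
import healpy as hp

lmax = 3071                                            # 3 * NSIDE - 1 for NSIDE = 1024
cl_auto = hp.anafast(kappa_cmb, lmax=lmax)             # C_ell^{AA}
cl_cross = hp.anafast(kappa_cmb, kappa_wl, lmax=lmax)  # C_ell^{AB}
```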
The \(C_\ell\)’s are able to capture all Gaussian information about a field, but miss any information contained in non-Gaussianities. An example of a summary statistic for a single map that can capture non-Gaussian information is the WST, which we turn to next.
The WST was initially proposed in [34] and then used in cosmology for extracting information from weak lensing maps [13] and galaxy clustering [14]–[16]. WST consists of convolving the initial map with a directional, multi-scaled wavelet and taking the modulus of the result. Multiple successive convolutions can be performed to probe non-Gaussianities. An average over all pixels is used to convert any map into a scattering coefficient, the building block of this summary statistic.
More formally, starting with an input field, \(A_0\), one can construct:
\[A_{n+1} = |A_n \star \psi|.\]
In the above, \(\psi\) is the localized wavelet. In general, \(\psi\) will have a scale and a directional dependence, represented here by the integer-valued parameters \(j\) and \(l\) respectively. In this case, one can obtain the first three orders of scattering coefficients as:
\[\begin{gather} S_0=\langle A_0\rangle,\\ S_1(j_1, l_1) = \langle |A_0 \star \psi_{j_1, l_1}|\rangle, \\ S_2(j_1, l_1, j_2, l_2) = \langle ||A_0 \star \psi_{j_1, l_1}|\star\psi_{j_2, l_2}|\rangle, \end{gather}\] where \(\langle . . .\rangle\) is a spatial average. As in other works [17], [18], we only use \(S_1\) and \(S_2\) coefficients. If the overlap of wavelets \(\psi_{j,l}\) and \(\psi_{j,l +1}\) is small and we have statistical isotropy, we can average over orientations without losing information: \(S_1(j_1, l_1)\rightarrow S_1(j_1)\) and \(S_2(j_1, l_1, j_2, l_2) \rightarrow S_2(j_1, j_2, |l_1 - l_2|)\). In this work we assume isotropy for all relevant summary statistics and average over opening angles. Additionally, not all scale combinations contribute to \(S_2\). As shown in [35], only \(S_2\) coefficients with \(j_2>j_1\) are significant.
In this work, we use the Morlet wavelets, given by:
\[\begin{gather} \psi_{j,l}(\vec{x}) = \frac{1}{\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)[\exp(i\vec{k}_0\cdot\vec{x})-\beta],\label{eq:Morlet}\\ \tilde{\psi}_{j,l}(\vec{k}) = \exp\left(-\frac{(\vec{k}-\vec{k}_0)^2\sigma^2}{2}\right)- \beta\exp\left(-\frac{k^2\sigma^2}{2}\right), \end{gather}\tag{1}\] where \(\sigma=0.8\times 2^j\), \(k_0 = 0.75\pi\times 2^{-j}\), and \(\text{arg}(\vec{k}_0) = \pi\theta/L\), with \(\theta = \frac{L}{2}-1-l\). The counterterm \(\beta=\exp\left(-\frac{k_0^2\sigma^2}{2}\right)\) is introduced so that \(\tilde{\psi}(0)=0\). The \(j\) index will take integer values between \(0\) and \(J-1\), where \(J\) controls the number of scales and is at most \(\log_2(\text{Image Size})-1\). This keeps the wavelet peaks in Fourier space roughly between the fundamental frequency and the Nyquist frequency of the image. The \(l\) values range between \(0\) and \(L-1\), where \(L\) specifies the number of different angles to probe. While Morlet wavelets traditionally have the above dyadic spacing, we also experiment with finer spacing. Details on this will be given explicitly before any results are shown.
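As an illustration, the Fourier-space wavelet of Eq. (1) can be tabulated directly on a grid with unit pixel spacing (so that wavenumbers run over roughly \([-\pi,\pi)\)); our pipeline instead relies on the scattering package noted below:

```python
import numpy as np

def morlet_fourier(n, j, l, L=4):
    """Fourier-space Morlet wavelet of Eq. (1) on an n x n grid, scale j, orientation l."""
    sigma = 0.8 * 2 ** j
    k0 = 0.75 * np.pi * 2 ** (-j)
    angle = (L / 2 - 1 - l) * np.pi / L
    k0_vec = k0 * np.array([np.cos(angle), np.sin(angle)])
    beta = np.exp(-0.5 * k0 ** 2 * sigma ** 2)        # counterterm enforcing psi~(0) = 0
    k = 2.0 * np.pi * np.fft.fftfreq(n)               # angular wavenumbers, unit pixel spacing
    kx, ky = np.meshgrid(k, k, indexing="ij")
    dk2 = (kx - k0_vec[0]) ** 2 + (ky - k0_vec[1]) ** 2
    k2 = kx ** 2 + ky ** 2
    return np.exp(-0.5 * dk2 * sigma ** 2) - beta * np.exp(-0.5 * k2 * sigma ** 2)
```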
To apply the wavelet scattering transform, the input image must be two dimensional and flat. To achieve this with spherical data we split the data into 768 disjoint patches, which are the HEALPix pixels at \(\texttt{NSIDE} = 8\). The center of each of these patches is rotated to the north pole of the sphere. From here, gnomonic projection is used to project the high resolution patch onto a \(256\times
256\) grid [17], which sets our maximum \(J\) value at \(J=7\). Due to spherical
distortions, not all HEALPix pixels are mapped one-to-one onto a pixel in the final grid. We do not attempt to correct for this effect in this work. After getting scattering coefficients for each patch, we average across patches to get
scattering coefficients for the entire map.
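A sketch of the projection of a single patch, using healpy's gnomonic projector; the patch resolution `reso_arcmin` and the choice of nearest-pixel sampling are our own assumptions:

```python
import healpy as hp

def project_patch(full_map, patch_index, npix=256, reso_arcmin=1.7):
    """Gnomonically project one NSIDE=8 'patch' of a full-sky map onto an npix x npix grid.

    Centering the projection on the NSIDE=8 pixel center is equivalent to rotating
    that center to the pole before projecting."""
    nside = hp.get_nside(full_map)
    lon, lat = hp.pix2ang(8, patch_index, lonlat=True)     # patch center [deg]
    proj = hp.projector.GnomonicProj(rot=[lon, lat], xsize=npix, ysize=npix, reso=reso_arcmin)
    return proj.projmap(full_map, lambda x, y, z: hp.vec2pix(nside, x, y, z))
```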
Since our data is masked, we use the convention to only use a patch if at least 20% of its HEALPix pixels are unmasked. The unmasked fraction for a patch is used to weight all of its scattering coefficients prior to averaging across
patches. We use the wavelet scattering repository for calculating scattering coefficients, after modifying it to allow for non-dyadic spacing.
Wavelet phase harmonics (WPH) is a generalization of some of the concepts of WST to two maps, which we will denote as \(A\) and \(B\). Since WPH uses two maps instead of the one used for WST, it allows for the study of cross-correlations. Unlike WST, each map is only ever convolved with one wavelet, which we denote as:
\[A_{j,l} = A\star \psi_{j,l}.\] As in WST, \(\psi_{j,l}\) is a localized wavelet characterized by a scale parameter \(j\) and a directional parameter \(l\). In general, \(A_{j,l}\) will be complex valued. We apply a phase harmonic to this, defined as:
\[\text{PH}(r\exp(i\theta), q) = r\exp(iq\theta),\] with \(r\), \(\theta\), and \(q\) real-valued numbers.
From the equation above, we can define WPH coefficients in general. As in WST, because we are assuming isotropy, the only angular dependence will be in \(\Delta l\equiv |l_2-l_1|\), and we can write our WPH coefficients as:
\[\begin{align} \text{WPH}_{q_1q_2}(A,B,j_1,j_2,\Delta l) &= \nonumber\\ \langle \text{PH}(A_{j_1,l},q_1) &\text{PH}(B_{j_2,l+\Delta l},q_2)\rangle. \end{align}\]
The WPH coefficients that we use are:
\[\begin{gather} S_{00}(A,B,j) = \langle |A_{j,l}||B_{j,l}|\rangle,\\ S_{01}(A,B,j) = \langle |A_{j,l}|B_{j,l}\rangle,\\ S_{10}(A,B,j) = \langle A_{j,l}|B_{j,l}|\rangle,\\ S_{11}(A,B,j) = \langle A_{j,l} B_{j,l}\rangle,\\ C_{01}(A,B,j_1,j_2,\Delta l) = \langle |A_{j_1,l}|B_{j_2,l+\Delta l}\rangle,\\ C_{10}(A,B,j_1,j_2,\Delta l) = \langle A_{j_1,l}|B_{j_2,l+\Delta l}|\rangle, \end{gather}\] with \(j_1<j_2\) for \(C_{01}\) and \(j_2<j_1\) for \(C_{10}\).
These statistics are chosen following [17], with the only change being that we allow for more opening angles in \(C_{01}\) and \(C_{10}\) statistics. For simplicity, we only consider the real component of the final statistics.
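For concreteness, a small numpy sketch of how the \(S\)-type coefficients follow from two maps filtered with the same wavelet (our own illustration; in practice we use the PyWPH package described next, and also average over orientations \(l\)):

```python
import numpy as np

def phase_harmonic(z, q):
    """PH(r exp(i theta), q) = r exp(i q theta)."""
    return np.abs(z) * np.exp(1j * q * np.angle(z))

def wph_s_coefficients(A_jl, B_jl):
    """Real parts of S00, S01, S10, S11 for maps A and B convolved with psi_{j,l}.

    A_jl, B_jl : complex arrays, the wavelet-filtered maps."""
    pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return {f"S{q1}{q2}": np.mean(phase_harmonic(A_jl, q1) * phase_harmonic(B_jl, q2)).real
            for q1, q2 in pairs}
```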
In this work we use the package PyWPH [36] to apply WPH and use the bump-steerable wavelets, which are given by the following mother wavelet:
\[\psi(\vec{k}) = 0.693\exp\left(\frac{-(k-\xi_0)^2}{\xi_0^2-(k-\xi_0)^2}\right)\cos^3(\text{arg}(\vec{k})).\] The equation above is for Fourier wavenumbers in the range \(0<k<2\xi_0\) and \(|\text{arg}(k)|<\pi/2\); for all other values \(\psi =0\). The normalization and the central frequency \(\xi_0=0.85\pi\) are set as in [36]. All other wavelets are generated by dilating or rotating the mother wavelet. In real space:
\[\psi_{j,l}(\vec{x}) = 2^{-j} \psi\left(2^{-j}\text{Rot}_{-\pi l/L}(\vec{x})\right).\] In the above, “Rot" means to rotate the vector by the subscripted angle. We probe \(L=4\) different angles.
To apply WPH, the input images must be two-dimensional and flat, as for WST. We follow the same procedure as we used for WST to project patches of the sphere onto \(256\times 256\) grids. As in the case of WST, we only
use a patch if at least 20% of its HEALPix pixels are unmasked. The unmasked fraction for a patch is used to weight the WPH coefficients from each patch before averaging across patches to obtain the WPH coefficients for the entire map.
We use the Fisher information approach to compare the constraining power of our different summary statistics. In doing so, we are assuming that our summary statistics are Gaussian distributed, which we test and confirm in Appendix 11. To fairly compare different summary statistics, we compress them with a novel learned binning approach. This approach compresses the statistics to a set number of bins while maximizing the determinant of the Fisher matrix.
To generate our covariance matrix \(C\), we use 1000 fiducial simulations. Derivatives for the Fisher forecast are estimated numerically using \(100\) realizations per parameter shift. We concatenate our derivative vectors into a matrix \(D\), so we can write:
\[\label{eq:Fish} F = D^TC^{-1}D.\tag{2}\]
To unbias our inverse covariance, we apply the Hartlap correction [37]; however, this correction alone does not result in an unbiased Fisher matrix. Noise in the estimates of the derivatives causes the Fisher matrix to systematically overestimate the constraining power of a summary statistic [38].
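Before turning to that correction, a compact sketch of Eq. (2) with the Hartlap factor, assuming fiducial summaries `X_fid` of shape (n_sims, n_bins) and a derivative matrix `D` of shape (n_bins, n_params):

```python
import numpy as np

def fisher_hartlap(X_fid, D):
    """Fisher matrix F = D^T C^{-1} D with the Hartlap-corrected inverse covariance [37]."""
    n_sims, n_bins = X_fid.shape
    C = np.cov(X_fid, rowvar=False)
    hartlap = (n_sims - n_bins - 2) / (n_sims - 1)   # debiasing factor for C^{-1}
    return D.T @ (hartlap * np.linalg.inv(C)) @ D
```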
To address this, [38] introduce a compressed Fisher matrix, \(F^\text{comp}\). This Fisher matrix is constructed as in Eq. (2) but using summary statistics that have been compressed with the MOPED [39] algorithm. Perfect compression of the summary statistics would lead to no information loss, but the compression relies on the matrices \(D\) and \(C\). [38] show that \(F^\text{comp}\) will systematically underestimate the constraining power if \(D\) and \(C\) are estimated from simulations via Monte Carlo methods before compressing the statistics. Taking the geometric mean of \(F\) and \(F^\text{comp}\) gives the combined Fisher matrix, \(F^\text{Combined}\), in which the opposing biases of the two inputs cancel, yielding unbiased constraints. It is this combined Fisher matrix and the resulting parameter covariance matrix, \((F^\text{Combined})^{-1}\), that we use for all shown constraints.
The construction of \(F^\text{comp}\) requires a random splitting of the derivatives, leading to stochasticity in \(F^\text{Combined}\). To reduce shot noise, we generate \(100,000\) realizations of the parameter covariance matrix with different random splits of the derivatives. Our final parameter covariance matrix is the average of the middle \(90,000\) realizations ordered by determinant. We cut off the most extreme \(5,000\) realizations on either end to reduce the impact of extreme outliers, which can occur in this method. This averaging reduces shot noise while still leaving us sensitive to systematic convergence issues, which we test for.
Before constructing the combined Fisher matrix, we need to compress our summary statistics in order to get an invertible data covariance matrix. Compression is necessary since the finite number of realizations used to evaluate the covariance matrix places an upper bound on its rank. Since all of our summary statistics have scale dependence, in this work we implement a novel learned binning (over scale) approach to compress our statistics, while retaining interpretability. We explain our method below.
In this section, the initial summary statistic will be \(M\) dimensional and it will be compressed to \(N\) dimensions, with \(N<M\). Any binning scheme can be viewed as a particular linear transformation within \(\mathbb{R}^{N\times M}\). We will refer to the general binning matrix as \(B\), and to compress a data vector \(\vec{d}\) we perform \(B\vec{d}\).
Initially, with \(D\in \mathbb{R}^{M\times 2}\) (we have \(2\) cosmological parameters) and \(C\in \mathbb{R}^{M\times M}\), one would be unable to calculate the Fisher matrix as in Eq. (2) if \(M\) exceeds the number of simulations. With a binning matrix, one would instead perform:
\[F = (BD)^T(BCB^T)^{-1}BD.\] In this way, we are now inverting an \(N\times N\) matrix instead of an \(M\times M\) matrix. Therefore, we want the value of \(N\) to be much smaller than the number of fiducial simulations. Everything so far simply applies to general \(N\times M\) linear transformations beyond just binning matrices. Compared to general linear transformations, binning is more intuitive and interpretable while having fewer free parameters, thus reducing the chance of overfitting.
To enable low-dimensional parameterizations of binning matrices, we use a function \(f(x):[1,M]\rightarrow [1,N]\) to specify the positions of nonzero matrix entries. Specifically, the \(m^\text{th}\) column of the binning matrix has a single nonzero entry equal to \(1\) at the integer index closest to \(f(m)\). In this work, \(f(x)\) is a linear interpolation between \((1,1)\), learned pivots, and \((M,N)\). Further details on the conditions that \(f(x)\) must satisfy can be found in Appendix 10. An example of such a construction is shown in Figure 3. We will often refer to the summary statistic index and bin number of the pivot as the "\(x\)" or "\(y\)" components of the pivot.
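A minimal sketch of this construction with nearest-integer assignment (the fractional refinement is described in Appendix 10, and we assume pivots that already satisfy the monotonicity conditions given there):

```python
import numpy as np

def binning_matrix(M, N, pivots):
    """N x M binning matrix defined by a list of (x, y) pivots.

    f is the piecewise-linear interpolation through (1, 1), the pivots, and (M, N);
    column m has a single 1 at the bin index nearest f(m)."""
    pivots = sorted(pivots)                                  # sort by summary-statistic index x
    xs = np.array([1.0] + [p[0] for p in pivots] + [float(M)])
    ys = np.array([1.0] + [p[1] for p in pivots] + [float(N)])
    f = np.interp(np.arange(1, M + 1), xs, ys)               # f: [1, M] -> [1, N]
    bins = np.clip(np.rint(f).astype(int) - 1, 0, N - 1)     # 0-indexed bin per entry
    B = np.zeros((N, M))
    B[bins, np.arange(M)] = 1.0
    return B
```

For example, `binning_matrix(3070, 15, [(800, 9)])` compresses a 3070-entry \(C_\ell\) vector into 15 bins with a single pivot placing summary-statistic index 800 in bin 9 (numbers chosen purely for illustration).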
To compress summary statistics, we first pick a number of pivots for the binning matrix. Some summary statistics (cross-\(C_\ell\)’s, WST, and WPH, in this work) can be thought of as consisting of multiple subclasses of statistics that depend exclusively on scale. Taking WST as an example, \(S_1\) coefficients and \(S_2\) coefficients cannot be naturally compared based on scale. Additionally, within the \(S_2\) coefficients, statistics with different opening angles cannot be naturally ordered based on scale. Therefore, for WST we split the coefficients into \(5\) subclasses that have only scale dependence (\(S_1\) and the four different opening angle components of \(S_2\)). These statistics are individually ordered based on scale and then appended together to form our full ordered summary statistic. We fix the \(x\) coordinates of certain pivots to the transitions between subclasses. This effectively introduces parameters which, in combination with the total number of bins, exclusively control how many bins are allocated to each subclass.
Pivots are therefore defined by \(2\) free parameters if their \(x\) coordinate is not specified and \(1\) free parameter otherwise. If there are \(p\) pivots in total and \(p_x\) of them have specified \(x\) coordinates, the pivots can collectively be parameterized by a point in a space with dimension \(2(p-p_x)+p_x\). We call this the “pivot space".
Each point in pivot space corresponds to a binning of the summary statistics that can be used to create a unique Fisher matrix. To retain as much information as possible, we attempt to maximize \(\det(F)\). For the sake of efficiency, this optimization uses the Fisher matrix from Eq. (2) instead of the combined Fisher matrix. The relatively low number of free parameters (\(\mathcal{O}(10)\) in this work) reduces the probability of overfitting compared to a general linear compression with \(N\times M\) free parameters.
When only using one pivot, the pivot space is two dimensional, and \(\det(F)\) can be plotted as a heatmap. An example of this for Planck \(\kappa_\text{CMB}\) \(C_\ell\)’s is shown in Figure 4. This plot shows that \(\det(F)\) has a genuine signal as a function over pivot space, but that there is also numerical noise that comes from having a finite number of simulations. Since we maximize \(\det(F)\), we could end up fitting to a local maximum in the noise, which would lead to us overestimating the strength of constraints.
To avoid overfitting, we randomly split our fiducial and derivative simulations in half, learn pivot positions on one half of the data, and then generate contours based on the second half of the data. While we use the Fisher matrix from Eq. (2) to learn how to bin our summary statistics, once compressed, we use the combined Fisher matrix to create contour plots. We repeat this process for \(1000\) different splits and get a covariance matrix for our cosmological parameters for each split. We take our final covariance matrix to be the average of these matrices.
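Schematically, one such split looks as follows; `binning_matrix` and `fisher_hartlap` are the sketches above, `X_half_A`/`X_half_B` and `D_half_A`/`D_half_B` are placeholder names for the two halves of the fiducial summaries and derivative matrices, and for brevity the validation step uses the plain Hartlap-corrected Fisher matrix rather than the combined Fisher matrix of [38]:

```python
import numpy as np
from scipy.optimize import minimize

def neg_logdet_fisher(pivot_params, X, D, M, N):
    """Objective for the pivot search: -log det F (Eq. 2) for the induced binning."""
    pivots = [tuple(p) for p in pivot_params.reshape(-1, 2)]
    B = binning_matrix(M, N, pivots)
    F = fisher_hartlap(X @ B.T, B @ D)
    sign, logdet = np.linalg.slogdet(F)
    return -logdet if sign > 0 else np.inf

# Learn a single free pivot on half A of the simulations ...
M, N = D_half_A.shape[0], 15
res = minimize(neg_logdet_fisher, x0=np.array([M / 2.0, N / 2.0]),
               args=(X_half_A, D_half_A, M, N), method="Nelder-Mead")
B_opt = binning_matrix(M, N, [tuple(res.x)])

# ... then evaluate parameter constraints on half B and average over many splits.
F_val = fisher_hartlap(X_half_B @ B_opt.T, B_opt @ D_half_B)
param_cov = np.linalg.inv(F_val)
```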
We use \(15\) bins since we find that this choice yields binned summary statistics which follow a Gaussian distribution (see Appendix 11). With a total of \(15\) bins, it would be feasible to attempt to find the size of every bin individually. However, this approach would struggle to generalize to future works if more bins were used. Determining the optimal placements of bin boundaries for \(N\) bins consists of maximizing \(\det(F)\) over an \(N-1\) dimensional space. When using our pivots approach, the dimension of the pivot space over which \(\det(F)\) is maximized is completely decoupled from \(N\). This allows one to control the number of free parameters.
In general, we note that the compression of summary statistics is a non-trivial problem, and a variety of methods have been proposed in the literature [40]. Many of these techniques, however, require training on simulations at varying cosmologies, which is not feasible in our setting where only a fiducial covariance matrix and its derivatives are available. Other possible alternatives in this regime are Principal Component Analysis (PCA), finding ideal subsets of observables [41], and the more widely used MOPED algorithm [39]. In principle, MOPED would be the most principled choice, as it is designed to maximize the Fisher information. In practice, however, its performance is limited in our case: it relies on stable estimates of derivatives and on the inversion of a high-dimensional covariance matrix, both of which can be noisy when only a modest number of simulations are available. This issue is exacerbated by the need to split the simulations into independent subsets for training and validation to avoid overfitting, which further reduces the effective number available. Although our learned-binning scheme also maximizes Fisher information, the compression matrix is constrained to a simple, nearly diagonal form that depends on only a handful of pivot parameters. This structure makes it substantially less sensitive to noise than the dense compression vectors produced by MOPED, while preserving interpretability. The additional use of cross-validation further mitigates overfitting to fluctuations in the Fisher matrix, yielding a robust and practical method.
We now turn to forecasted constraints on \(\Omega_m\) and \(\sigma_8\) after binning our summary statistics. We first compare auto \(C_\ell\)’s and WST applied to \(\kappa_\text{CMB}\) maps in Section 5.1. Next, we look at cross-\(C_\ell\)’s and WPH applied to \(\kappa_\text{CMB}\times\kappa_\text{WL}\) maps in Section 5.2. We note that results for WST and WPH in the next section are not combined with those from the power spectrum.
Table 2: \(\kappa_\text{CMB}\) constraints: \(\sigma(\Omega_m)[\times100]\,\,/\,\,\sigma(\sigma_8)[\times 100]\).

| Survey | \(C_\ell\)’s | WST |
|---|---|---|
| ACT | 2.0 / 0.8 | 1.7 / 0.6 |
| Planck | 1.9 / 0.8 | 1.4 / 0.5 |
| SPT | 1.5 / 0.3 | 1.5 / 0.4 |
| SO | 0.8 / 0.2 | 0.9 / 0.2 |
For CMB lensing convergence maps, we use the angular power spectrum and the WST as our summary statistics.
For WST, we do not use dyadic spacing; instead we have \(j\) from Eq. (1) take \(50\) different values from \(0\) to \(6\). This allows for more flexibility in the learned binning by increasing the size of the initial data vector. For angular spacing, we pick \(L=4\). With these choices, there are 5150 WST coefficients. For \(C_\ell\)’s, our resolution limits the maximum \(\ell\) to 3071, so after discarding the monopole and dipole we have 3070 \(C_\ell\) values.
For the learned binning, ordering the original summary statistic is important. Entries that are near each other will likely end up being binned together, so we want these entries to be highly correlated to ensure minimal loss of information. For \(C_\ell\)’s this is done by ordering based on \(\ell\). \(S_1\) coefficients are likewise ordered based on \(j\). For \(S_2\) coefficients, we have two scales and an opening angle, making an ordering scheme more arbitrary. In this work, we divide the \(S_2\) coefficients into \(4\) different subclasses based on the opening angle. In each subclass, we order based on \(j_2\). When multiple coefficients have the same opening angle and \(j_2\) value, we order based on \(j_1\).
For the \(C_\ell\)’s, we find that allowing for one pivot can slightly improve constraints, but going to two gives little additional gain. Therefore, for \(C_\ell\)’s we search for one pivot. For WST, we lock four pivots to the transitions from \(S_1\) to \(S_2\) and between the different opening angles of \(S_2\). Since one dimension of the pivot is locked, this has four degrees of freedom. We find improvements by adding one more arbitrary pivot that can be fit anywhere, leading to a \(6\) dimensional search space for the pivots, with no significant improvements going beyond this.
After finding the pivots, we use the combined Fisher forecast to get constraints on the cosmological parameters \(\Omega_m\) and \(\sigma_8\). For SPT and SO, we find that WST and \(C_\ell\)’s give the same constraints on both \(\Omega_m\) and \(\sigma_8\) to within \(10\%\), showing that there is not much non-Gaussian information to extract from \(\kappa_\text{CMB}\). For ACT, WST gives tighter constraints by factors of \(1.2\) for \(\Omega_m\) and \(1.3\) for \(\sigma_8\). For Planck, WST gives tighter constraints by \(1.3\) for \(\Omega_m\) and \(1.7\) for \(\sigma_8\). Planck constraints are shown in Figure 5, and constraints for all surveys are shown in Appendix 8. Additionally, all 1\(\sigma\) constraints can be found in Table 2.
The relative performance across surveys reflects the interplay between noise levels and sky coverage. Planck has the highest noise but covers nearly the full sky, providing a vast number of modes; in this regime, the power spectrum alone is less constraining, allowing WST to recover additional information from non-Gaussian features. ACT lies in an intermediate regime, with lower noise but smaller area, where WST still improves constraints by capturing complementary statistics. SO and SPT deliver the lowest-noise reconstructions; in this limit, most of the cosmological information is already encoded in the power spectrum, leaving limited additional gain from WST.
Table 3: \(\kappa_\text{CMB}\times\kappa_\text{WL}\) constraints: \(\sigma(\Omega_m)[\times100]\,\,/\,\,\sigma(\sigma_8)[\times 100]\).

| Survey | Cross-\(C_\ell\)’s | WPH |
|---|---|---|
| ACT \(\times\) Euclid DR2 | 3.4 / 3.9 | 1.3 / 1.6 |
| Planck \(\times\) Euclid DR2 | 2.7 / 3.1 | 1.0 / 1.3 |
| SPT \(\times\) Euclid DR2 | 2.3 / 2.8 | 0.7 / 0.9 |
| SO \(\times\) Euclid DR2 | 2.6 / 3.1 | 0.7 / 0.9 |
Table 4: \(\kappa_\text{CMB}\times\kappa_\text{WL}\) constraints with linear binning (no learned pivots): \(\sigma(\Omega_m)[\times100]\,\,/\,\,\sigma(\sigma_8)[\times 100]\).

| Survey | Cross-\(C_\ell\)’s | WPH |
|---|---|---|
| ACT \(\times\) Euclid DR2 | 4.0 / 4.5 | 3.5 / 4.0 |
| Planck \(\times\) Euclid DR2 | 3.3 / 3.7 | 2.7 / 3.0 |
| SPT \(\times\) Euclid DR2 | 2.4 / 2.9 | 1.5 / 1.6 |
| SO \(\times\) Euclid DR2 | 2.7 / 3.2 | 1.4 / 1.6 |
For the cross-correlation of the convergence of WL and CMBL fields (\(\kappa_\text{CMB}\times\kappa_\text{WL}\)), we use cross-\(C_\ell\)’s and WPH as our summary statistics. For WPH we use \(J=7\) and \(L=4\), which leads to 392 coefficients. We use dyadic spacing for WPH (unlike WST) to reduce runtime. Generating WPH coefficients for one map and one survey takes about \(10\) minutes in our implementation.
For cross-\(C_\ell\)’s, we have two spectra coming from our two tomographic bins for WL, leading to 6140 \(C_\ell\)’s. We order each power spectrum based on \(\ell\) and then append them. We lock a pivot at the transition between tomographic bins. We find improvements by adding two more arbitrary pivots, leading to a \(5\) dimensional search space for the pivots, but no significant improvements going beyond this.
For WPH, for each tomographic bin we have \(S_{00}\), \(S_{01}\), \(S_{10}\), \(S_{11}\), \(C_{01}\), and \(C_{10}\) coefficients. For \(C_{01}\) and \(C_{10}\), we also have four different opening angles that we split into different subclasses of statistics. From this, for each of the two tomographic bins we have 12 subclasses within the WPH summary statistic, leading to a total of 24 subclasses. Within each subclass, we order based on \(j\). For \(C_{01}\) (\(C_{10}\)), we order based on \(j_2\) (\(j_1\)). When multiple coefficients have the same opening angle and \(j_2\) (\(j_1)\) value, we order based on \(j_1\) (\(j_2\)). We lock pivots at the transitions between each subclass of the WPH statistics. Since this configuration already has a relatively large number of degrees of freedom for binning, we do not add any additional pivots.
To illustrate the power of the learned binning, we can compare constraints obtained with the learned binning to those from an initial binning. As the simplest possible initial binning, we use linear binning with no pivots. Constraints are shown in Figure 6 for Planck WPH; they improve by factors of \(2.7\) for \(\Omega_m\) and \(2.3\) for \(\sigma_8\) when using the learned binning. We verify the convergence of this method in Appendix 9.
We find that constraints on \(\Omega_m\) and \(\sigma_8\) are both significantly tighter when using WPH compared to cross-\(C_\ell\)’s. Constraints on \(\Omega_m\) improve by factors between \(2.7\) for Planck \(\times\) Euclid DR2 and \(3.8\) for SO \(\times\) Euclid DR2. Improvements for \(\sigma_8\) are between factors of \(2.4\) for Planck \(\times\) Euclid DR2 and \(3.6\) for SO \(\times\) Euclid DR2. In Figure 7 we show this result for Planck \(\times\) Euclid DR2. Additionally, all 1\(\sigma\) constraints can be found in Table 3. Full contour results for other surveys are provided in Appendix 8. All 1\(\sigma\) constraints when no pivots are learned are also presented in Table 4.
All of these results show that WPH is able to extract information from non-Gaussianities in the lensing. These gains are substantially larger than those found for CMB lensing alone, which is expected: weak-lensing – CMB-lensing cross-correlations peak at significantly lower redshift (\(z\sim1\)) compared to CMB lensing auto-correlations (\(z\sim2\)–6). At these lower redshifts the matter distribution is more evolved and inherently more non-Gaussian, providing WPH with additional information to extract beyond that available to power spectra alone. These findings are consistent with those of previous WL applications in the literature [13], [17], [18], [26], [27].
To test the convergence of our results, we repeat this entire process (including finding pivots), while lowering the total number of derivatives used. We check how the constraints vary as a function of the number of derivative realizations used and find that the results have sufficiently converged. All results are shown in Appendix 9.
In this work, we study CMB lensing convergence and galaxy weak lensing convergence data with a variety of summary statistics. To reduce the dimensionality of our summary statistics, we introduce a novel, learned-binning approach. Like other data reduction methods such as MOPED [39], our algorithm attempts to maximize Fisher information in the compressed statistic. Since our method depends on fewer parameters, it is less susceptible to noise arising from a limited number of simulations. Our method also never requires the inversion of the original data covariance matrix, something that becomes impossible if the number of simulations is less than the length of the summary statistic. By construction, our learned binning approach is robust to overfitting and can improve constraints compared to linear binning by factors up to \(2.7\) for the surveys and observables considered in this work.
For CMB lensing convergence data, we compare the higher-order statistic of the wavelet scattering transform to the standard angular power spectrum. We find that these summary statistics lead to similar constraints for SPT and SO, while WST outperforms for Planck and ACT. When we cross our CMB lensing simulated data with Euclid DR2 galaxy weak lensing convergence, we find that the higher-order statistic of the wavelet phase harmonics significantly outperforms the cross power spectrum. Constraints are improved by factors of up to \(3.8\). Compared to constraints from CMB lensing convergence data alone, the cross-survey constraints exhibit stronger degeneracies. All contour plots are presented in Appendix 8.
With WPH showing the most promise of the two higher-order statistics, further generalizations of WPH could prove useful. One potential next step would be investigating the use of WPH statistics besides \(S_{00}\), \(S_{01}\), \(S_{10}\), \(S_{11}\), \(C_{01}\), and \(C_{10}\). WPH and WST also require us to break the sky into patches, which limits the scales that we are able to probe. Generalizing our pipeline to spherical wavelets [42] could circumvent this issue. Spherical wavelets have been previously used for CMB analysis [43], [44].
This work demonstrates that, even though higher-order statistics are largely unnecessary for CMB lensing applications, they can substantially improve cosmological parameter constraints from galaxy weak-lensing–CMB-lensing cross-correlation data.
We also provide a novel method for the comparison of multiple summary statistics without sacrificing interpretability, which will become increasingly important as new summary statistics continue to be introduced.
A physical interpretation of these results follows from the discussion in the previous section. WL-CMBL cross-correlations peak at lower redshifts (\(z\sim1\)) than CMBL convergence auto-correlations (\(z\sim2\)–6), where the matter distribution is more evolved and therefore more non-Gaussian. This makes non-Gaussian statistics particularly powerful in the cross-correlation case. While applications of non-Gaussian statistics to weak lensing alone are already well established [13], [17], [18], [22], [23], [26], [27], their use in WL–CMBL convergence cross-correlation data has not been explored before. One reason is that it requires forward modeling of the CMB lensing reconstruction process, including all relevant observational systematics, which is highly non-trivial. Nonetheless, the results presented here suggest that such an effort would be worthwhile.
The authors would like to thank Anmol Raina for his contributions during the initial stages of this project and for insightful discussions. GV acknowledges the support of the Eric and Wendy Schmidt AI in Science Fellowship at the University of Chicago, a program of Schmidt Sciences. MG is supported by the Kavli Institute for Cosmological Physics at the University of Chicago through an endowment from the Kavli Foundation.
In this Appendix, we show all the parameter constraints studied in this work. All the constraints shown here are obtained using our learned-binning approach.
We test the convergence of our methodology by reducing the number of derivatives available and checking how the contours change. The numbers of realizations used to generate the \(\Omega_m\) and \(\sigma_8\) derivatives are changed simultaneously. For brevity, we only show the convergence contours for \(\kappa_\text{CMB}\times\kappa_\text{WL}\) in Figures 11 and 12. Constraints for auto \(\kappa_\text{CMB}\) all converge to similar levels. Additionally, we show how \(1\sigma\) constraints on \(\Omega_m\) and \(\sigma_8\) converge individually for cross-\(C_\ell\)’s and WPH, in Figures 13 and 14, respectively.
In order to properly construct a binning matrix, the function \(f(x):[1,M]\rightarrow [1,N]\) defined in Section 4 must satisfy certain conditions. Since the first entry of the summary statistic is mapped to the first bin, we must have \(f(1)=1\). Similarly, requiring that the last entry of the summary statistic is mapped to the last bin gives the requirement \(f(M)=N\). Bins should consist of entries in the summary statistic which are all adjacent to one another (no gaps in the bins), which demands that \(f(x)\) is a non-decreasing function. Finally, we want to ensure that we have no empty bins. There are numerous constraints one could impose on \(f(x)\) to guarantee this property, and in this work we choose to require \(f'(x)\leq1\).
To construct a binning matrix, we would generally have the \(m^\text{th}\) column of the binning matrix be nonzero (and equal to \(1\)) only at the integer index closest to \(f(m)\). In this work, we slightly generalize binning to allow for one summary statistic entry to be mapped into two adjacent bins. For example, if \(f(2)=3.4\), the \(2^\text{nd}\) summary statistic entry would be divided \(60\%\) into bin \(3\) and \(40\%\) into bin \(4\). This splitting ensures that the determinant of the Fisher matrix is a continuous function with respect to pivot positions.
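A short sketch of this fractional assignment, given the values \(f(m)\) for \(m=1,\dots,M\) (it generalizes the nearest-integer rule used in the main-text sketch):

```python
import numpy as np

def binning_matrix_fractional(f_values, N):
    """Binning matrix whose m-th column splits entry m between the bins adjacent to f(m);
    e.g. f(m) = 3.4 places 60% of the entry in bin 3 and 40% in bin 4."""
    M = len(f_values)
    B = np.zeros((N, M))
    for m, f in enumerate(f_values):
        lower = int(np.floor(f)) - 1          # 0-indexed lower bin
        frac = f - np.floor(f)
        B[min(lower, N - 1), m] += 1.0 - frac
        if frac > 0.0 and lower + 1 <= N - 1:
            B[lower + 1, m] += frac
    return B
```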
The Fisher forecast methods we use rely on our summary statistics following a Gaussian distribution. With enough binning, the central limit theorem will ensure that our statistics are indeed Gaussian. We test whether this is the case by producing a \(\chi^2\) distribution of the statistics as in [45]. Specifically, if a summary statistic on any given realization is given by \(\boldsymbol{O}_i\) with an average value of \(\bar{\boldsymbol{O}}\) and covariance matrix \(C\), the \(\chi^2\) values are given by:
\[\chi^2 = (\boldsymbol{O}_i - \bar{\boldsymbol{O}})^T C^{-1}(\boldsymbol{O}_i - \bar{\boldsymbol{O}}).\]
We perform this test with our binned data. Specifically, we use \(500\) fiducial realizations in learning the binning, and then apply that binning to the other \(500\) realizations which we use to get \(\chi^2\) values. Since we use \(15\) bins, if our statistics are Gaussian distributed, the calculated \(\chi^2\) values should follow the theoretical \(\chi^2\) distribution with \(15\) degrees of freedom. We verify this in Figure 15 where we find no significant differences between the \(\chi^2\) distribution of our data, the theoretical \(\chi^2\) distribution, and a Gaussian distribution from samples randomly generated with the same mean and covariance as our data.
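A sketch of this test, assuming binned fiducial summaries `X` of shape (1000, 15) where the binning was learned from the first 500 realizations; the Kolmogorov-Smirnov comparison at the end is simply one way to quantify the agreement shown in Figure 15:

```python
import numpy as np
from scipy import stats

X_test = X[500:]                              # realizations not used to learn the binning
mean = X_test.mean(axis=0)
C_inv = np.linalg.inv(np.cov(X_test, rowvar=False))
resid = X_test - mean
chi2_vals = np.einsum("ib,bc,ic->i", resid, C_inv, resid)

# Compare to the chi^2 distribution with 15 degrees of freedom
ks = stats.kstest(chi2_vals, stats.chi2(df=15).cdf)
print(ks.statistic, ks.pvalue)
```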