New articles on Statistics


[1] 2404.11675

Decomposition of Longitudinal Disparities: an Application to the Fetal Growth-Singletons Study

Addressing health disparities among different demographic groups is a key challenge in public health. Despite many efforts, there is still a gap in understanding how these disparities unfold over time. Our paper focuses on this overlooked longitudinal aspect, which is crucial in both clinical and public health settings. In this paper, we introduce a longitudinal disparity decomposition method that decomposes disparities into three components: the explained disparity linked to differences in the exploratory variables' conditional distribution when the modifier distribution is identical between majority and minority groups, the explained disparity that emerges specifically from the unequal distribution of the modifier and its interaction with covariates, and the unexplained disparity. The proposed method offers a dynamic alternative to the traditional Peters-Belson decomposition approach, tackling both the potential reduction in disparity if the covariate distributions of minority groups matched those of the majority group and the evolving nature of disparity over time. We apply the proposed approach to a fetal growth study to gain insights into disparities between different race/ethnicity groups in fetal developmental progress throughout the course of pregnancy.


[2] 2404.11678

Corrected Correlation Estimates for Meta-Analysis

Meta-analysis allows rigorous aggregation of estimates and uncertainty across multiple studies. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across exposure groups, accounting for within-study correlations improves accuracy and efficiency of meta-analytic results. Canonical approaches of Greenland-Longnecker and Hamling estimate pseudo cases and non-cases for exposure groups to obtain within-study correlations. However, currently available implementations for both methods fail on simple examples. We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations a guaranteed to work for any feasible input. For RRs, we show the new GL approach is always guaranteed to succeed, but any Hamling approach may fail: we give counter-examples where no solutions exist. We derive a sufficient condition on reported RRs that guarantees success when reported variances are all equal.


[3] 2404.11713

Propensity Score Analysis with Guaranteed Subgroup Balance

Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity may be present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a new algorithm, the propensity score analysis with guaranteed subgroup balance (G-SBPS), to achieve covariate mean balance in all subgroups. We further incorporated nonparametric kernel regression for the propensity scores and developed a kernelized G-SBPS (kG-SBPS) to improve the subgroup mean balance of covariate transformations in a rich functional class. This extension is more robust to propensity score model misspecification. Extensive numerical studies showed that G-SBPS and kG-SBPS improve both subgroup covariate balance and subgroup treatment effect estimation, compared to existing approaches. We applied G-SBPS and kG-SBPS to a dataset on right heart catheterization to estimate the subgroup average treatment effects on the hospital length of stay and a dataset on diabetes self-management training to estimate the subgroup average treatment effects for the treated on the hospitalization rate.


[4] 2404.11747

Spatio-temporal patterns of diurnal temperature: a random matrix approach I-case of India

We consider the spatio-temporal gridded daily diurnal temperature range (DTR) data across India during the 72-year period 1951--2022. We augment this data with information on the El Nino-Southern Oscillation (ENSO) and on the climatic regions (Stamp's and Koeppen's classification) and four seasons of India. We use various matrix theory approaches to trim out strong but routine signals, random matrix theory to remove noise, and novel empirical generalised singular-value distributions to establish retention of essential signals in the trimmed data. We make use of the spatial Bergsma statistics to measure spatial association and identify temporal change points in the spatial-association. In particular, our investigation captures a yet unknown change-point over the 72 years under study with drastic changes in spatial-association of DTR in India. It also brings out changes in spatial association with regard to ENSO. We conclude that while studying/modelling Indian DTR data, due consideration should be granted to the strong spatial association that is being persistently exhibited over decades, and provision should be kept for potential change points in the temporal behaviour, which in turn can bring moderate to dramatic changes in the spatial association pattern. Some of our analysis also reaffirms the conclusions made by other authors, regarding spatial and temporal behavior of DTR, adding our own insights. We consider the data from the yearly, seasonal and climatic zones points of view, and discover several new and interesting statistical structures which should be of interest, especially to climatologists and statisticians. Our methods are not country specific and could be used profitably for DTR data from other geographical areas.


[5] 2404.11781

Asymmetric canonical correlation analysis of Riemannian and high-dimensional data

In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture.


[6] 2404.11802

Associations between pain-management treatments and opioid use disorder risk among Medicaid patients

Introduction: Chronic pain patients are at increased risk of opioid-misuse. Less is known about the unique risk conferred by each pain-management treatment, as treatments are typically implemented together, confounding their independent effects. We estimated the extent to which pain-management strategies were associated with risk of incident opioid use disorder (OUD) for those with chronic pain, controlling for baseline demographic and clinical confounding variables and holding other pain-management treatments at their observed levels. Methods: We used data from two chronic pain subgroups within a cohort of non-pregnant Medicaid patients aged 35-64 years, 2016-2019, from 25 states: 1) those with a chronic pain condition co-morbid with physical disability (N=6,133) or 2) those with chronic pain without disability (N=67,438). We considered 9 pain-management treatments: prescription opioid i) dose and ii) duration; iii) number of opioid prescribers; opioid co-prescription with iv) benzodiazepines, v) muscle relaxants, and vi) gabapentinoids; vii) non-opioid pain prescription, viii) physical therapy, and ix) other pain treatment modality. Our outcome was incident OUD. Results: Having an opioid and gabapentin co-prescription or an opioid and benzodiazepine co-prescription was statistically significantly associated with a 16-46% increased risk of OUD. Opioid dose and duration also were significantly associated with increased risk of OUD. Physical therapy was significantly associated with an 11% decreased risk of OUD in the subgroup with chronic pain but no disability. Conclusions: Co-prescription of opioids with either gabapentin or benzodiazepines may substantially increase risk of OUD. More positively, physical therapy may be a relatively accessible and safe pain-management strategy.


[7] 2404.11813

Detection of a structural break in intraday volatility pattern

We develop theory leading to testing procedures for the presence of a change point in the intraday volatility pattern. The new theory is developed in the framework of Functional Data Analysis. It is based on a model akin to the stochastic volatility model for scalar point-to-point returns. In our context, we study intraday curves, one curve per trading day. After postulating a suitable model for such functional data, we present three tests focusing, respectively, on changes in the shape, the magnitude and arbitrary changes in the sequences of the curves of interest. We justify the respective procedures by showing that they have asymptotically correct size and by deriving consistency rates for all tests. These rates involve the sample size (the number of trading days) and the grid size (the number of observations per day). We also derive the corresponding change point estimators and their consistency rates. All procedures are additionally validated by a simulation study and an application to US stocks.


[8] 2404.11965

Multi-fidelity Gaussian process surrogate modeling for regression problems in physics

One of the main challenges in surrogate modeling is the limited availability of data due to resource constraints associated with computationally expensive simulations. Multi-fidelity methods provide a solution by chaining models in a hierarchy with increasing fidelity, associated with lower error, but increasing cost. In this paper, we compare different multi-fidelity methods employed in constructing Gaussian process surrogates for regression. Non-linear autoregressive methods in the existing literature are primarily confined to two-fidelity models, and we extend these methods to handle more than two levels of fidelity. Additionally, we propose enhancements for an existing method incorporating delay terms by introducing a structured kernel. We demonstrate the performance of these methods across various academic and real-world scenarios. Our findings reveal that multi-fidelity methods generally have a smaller prediction error for the same computational cost as compared to the single-fidelity method, although their effectiveness varies across different scenarios.


[9] 2404.12137

Estimation of subcritical Galton Watson processes with correlated immigration

We consider an observed subcritical Galton Watson process $\{Y_n,\ n\in \mathbb{Z} \}$ with correlated stationary immigration process $\{\epsilon_n,\ n\in \mathbb{Z} \}$. Two situations are presented. The first one is when $\mbox{Cov}(\epsilon_0,\epsilon_k)=0$ for $k$ larger than some $k_0$: a consistent estimator for the reproduction and mean immigration rates is given, and a central limit theorem is proved. The second one is when $\{\epsilon_n,\ n\in \mathbb{Z} \}$ has general correlation structure: under mixing assumptions, we exhibit an estimator for the the logarithm of the reproduction rate and we prove that it converges in quadratic mean with explicit speed. In addition, when the mixing coefficients decrease fast enough, we provide and prove a two terms expansion for the estimator. Numerical illustrations are provided.


[10] 2404.12181

Estimation of the invariant measure of a multidimensional diffusion from noisy observations

We introduce a new approach for estimating the invariant density of a multidimensional diffusion when dealing with high-frequency observations blurred by independent noises. We consider the intermediate regime, where observations occur at discrete time instances $k\Delta_n$ for $k=0,\dots,n$, under the conditions $\Delta_n\to 0$ and $n\Delta_n\to\infty$. Our methodology involves the construction of a kernel density estimator that uses a pre-averaging technique to proficiently remove noise from the data while preserving the analytical characteristics of the underlying signal and its asymptotic properties. The rate of convergence of our estimator depends on both the anisotropic regularity of the density and the intensity of the noise. We establish conditions on the intensity of the noise that ensure the recovery of convergence rates similar to those achievable without any noise. Furthermore, we prove a Bernstein concentration inequality for our estimator, from which we derive an adaptive procedure for the kernel bandwidth selection.


[11] 2404.12290

Debiased Distribution Compression

Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein Kernel Thinning (SKT) returns $\sqrt{n}$ equal-weighted points with $\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\mathbb {P}$. For larger-scale compression tasks, Low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein Recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\operatorname{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering.


[12] 2404.12294

floZ: Evidence estimation from posterior samples with normalizing flows

We propose a novel method (floZ), based on normalizing flows, for estimating the Bayesian evidence (and its numerical uncertainty) from a set of samples drawn from the unnormalized posterior distribution. We validate it on distributions whose evidence is known analytically, up to 15 parameter space dimensions, and compare with two state-of-the-art techniques for estimating the evidence: nested sampling (which computes the evidence as its main target) and a k-nearest-neighbors technique that produces evidence estimates from posterior samples. Provided representative samples from the target posterior are available, our method is more robust to posterior distributions with sharp features, especially in higher dimensions. It has wide applicability, e.g., to estimate the evidence from variational inference, Markov-chain Monte Carlo samples, or any other method that delivers samples from the unnormalized posterior density.


[13] 2404.12319

Marginal Analysis of Count Time Series in the Presence of Missing Observations

Time series in real-world applications often have missing observations, making typical analytical methods unsuitable. One method for dealing with missing data is the concept of amplitude modulation. While this principle works with any data, here, missing data for unbounded and bounded count time series are investigated, where tailor-made dispersion and skewness statistics are used for model diagnostics. General closed-form asymptotic formulas are derived for such statistics with only weak assumptions on the underlying process. Moreover, closed-form formulas are derived for the popular special cases of Poisson and binomial autoregressive processes, always under the assumption that missingness occurs. The finite-sample performances of the considered asymptotic approximations are analyzed with simulations. The practical application of the corresponding dispersion and skewness tests under missing data is demonstrated with three real-data examples.


[14] 2404.12356

Improving the interpretability of GNN predictions through conformal-based graph sparsification

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in solving graph classification tasks. However, most GNN architectures aggregate information from all nodes and edges in a graph, regardless of their relevance to the task at hand, thus hindering the interpretability of their predictions. In contrast to prior work, in this paper we propose a GNN \emph{training} approach that jointly i) finds the most predictive subgraph by removing edges and/or nodes -- -\emph{without making assumptions about the subgraph structure} -- while ii) optimizing the performance of the graph classification task. To that end, we rely on reinforcement learning to solve the resulting bi-level optimization with a reward function based on conformal predictions to account for the current in-training uncertainty of the classifier. Our empirical results on nine different graph classification datasets show that our method competes in performance with baselines while relying on significantly sparser subgraphs, leading to more interpretable GNN-based predictions.


[15] 2404.11667

Deep Dependency Networks and Advanced Inference Schemes for Multi-Label Classification

We present a unified framework called deep dependency networks (DDNs) that combines dependency networks and deep learning architectures for multi-label classification, with a particular emphasis on image and video data. The primary advantage of dependency networks is their ease of training, in contrast to other probabilistic graphical models like Markov networks. In particular, when combined with deep learning architectures, they provide an intuitive, easy-to-use loss function for multi-label classification. A drawback of DDNs compared to Markov networks is their lack of advanced inference schemes, necessitating the use of Gibbs sampling. To address this challenge, we propose novel inference schemes based on local search and integer linear programming for computing the most likely assignment to the labels given observations. We evaluate our novel methods on three video datasets (Charades, TACoS, Wetlab) and three image datasets (MS-COCO, PASCAL VOC, NUS-WIDE), comparing their performance with (a) basic neural architectures and (b) neural architectures combined with Markov networks equipped with advanced inference and learning techniques. Our results demonstrate the superiority of our new DDN methods over the two competing approaches.


[16] 2404.11739

Testing Mechanisms

Economists are often interested in the mechanisms by which a particular treatment affects an outcome. This paper develops tests for the ``sharp null of full mediation'' that the treatment $D$ operates on the outcome $Y$ only through a particular conjectured mechanism (or set of mechanisms) $M$. A key observation is that if $D$ is randomly assigned and has a monotone effect on $M$, then $D$ is a valid instrumental variable for the local average treatment effect (LATE) of $M$ on $Y$. Existing tools for testing the validity of the LATE assumptions can thus be used to test the sharp null of full mediation when $M$ and $D$ are binary. We develop a more general framework that allows one to test whether the effect of $D$ on $Y$ is fully explained by a potentially multi-valued and multi-dimensional set of mechanisms $M$, allowing for relaxations of the monotonicity assumption. We further provide methods for lower-bounding the size of the alternative mechanisms when the sharp null is rejected. An advantage of our approach relative to existing tools for mediation analysis is that it does not require stringent assumptions about how $M$ is assigned; on the other hand, our approach helps to answer different questions than traditional mediation analysis by focusing on the sharp null rather than estimating average direct and indirect effects. We illustrate the usefulness of the testable implications in two empirical applications.


[17] 2404.11839

(Empirical) Bayes Approaches to Parallel Trends

We consider Bayes and Empirical Bayes (EB) approaches for dealing with violations of parallel trends. In the Bayes approach, the researcher specifies a prior over both the pre-treatment violations of parallel trends $\delta_{pre}$ and the post-treatment violations $\delta_{post}$. The researcher then updates their posterior about the post-treatment bias $\delta_{post}$ given an estimate of the pre-trends $\delta_{pre}$. This allows them to form posterior means and credible sets for the treatment effect of interest, $\tau_{post}$. In the EB approach, the prior on the violations of parallel trends is learned from the pre-treatment observations. We illustrate these approaches in two empirical applications.


[18] 2404.11917

Expected Coordinate Improvement for High-Dimensional Bayesian Optimization

Bayesian optimization (BO) algorithm is very popular for solving low-dimensional expensive optimization problems. Extending Bayesian optimization to high dimension is a meaningful but challenging task. One of the major challenges is that it is difficult to find good infill solutions as the acquisition functions are also high-dimensional. In this work, we propose the expected coordinate improvement (ECI) criterion for high-dimensional Bayesian optimization. The proposed ECI criterion measures the potential improvement we can get by moving the current best solution along one coordinate. The proposed approach selects the coordinate with the highest ECI value to refine in each iteration and covers all the coordinates gradually by iterating over the coordinates. The greatest advantage of the proposed ECI-BO (expected coordinate improvement based Bayesian optimization) algorithm over the standard BO algorithm is that the infill selection problem of the proposed algorithm is always a one-dimensional problem thus can be easily solved. Numerical experiments show that the proposed algorithm can achieve significantly better results than the standard BO algorithm and competitive results when compared with five state-of-the-art high-dimensional BOs. This work provides a simple but efficient approach for high-dimensional Bayesian optimization.


[19] 2404.11922

Redefining the Shortest Path Problem Formulation of the Linear Non-Gaussian Acyclic Model: Pairwise Likelihood Ratios, Prior Knowledge, and Path Enumeration

Effective causal discovery is essential for learning the causal graph from observational data. The linear non-Gaussian acyclic model (LiNGAM) operates under the assumption of a linear data generating process with non-Gaussian noise in determining the causal graph. Its assumption of unmeasured confounders being absent, however, poses practical limitations. In response, empirical research has shown that the reformulation of LiNGAM as a shortest path problem (LiNGAM-SPP) addresses this limitation. Within LiNGAM-SPP, mutual information is chosen to serve as the measure of independence. A challenge is introduced - parameter tuning is now needed due to its reliance on kNN mutual information estimators. The paper proposes a threefold enhancement to the LiNGAM-SPP framework. First, the need for parameter tuning is eliminated by using the pairwise likelihood ratio in lieu of kNN-based mutual information. This substitution is validated on a general data generating process and benchmark real-world data sets, outperforming existing methods especially when given a larger set of features. The incorporation of prior knowledge is then enabled by a node-skipping strategy implemented on the graph representation of all causal orderings to eliminate violations based on the provided input of relative orderings. Flexibility relative to existing approaches is achieved. Last among the three enhancements is the utilization of the distribution of paths in the graph representation of all causal orderings. From this, crucial properties of the true causal graph such as the presence of unmeasured confounders and sparsity may be inferred. To some extent, the expected performance of the causal discovery algorithm may be predicted. The refinements above advance the practicality and performance of LiNGAM-SPP, showcasing the potential of graph-search-based methodologies in advancing causal discovery.


[20] 2404.12070

Towards an Approximation Theory of Observable Operator Models

Observable operator models (OOMs) offer a powerful framework for modelling stochastic processes, surpassing the traditional hidden Markov models (HMMs) in generality and efficiency. However, using OOMs to model infinite-dimensional processes poses significant theoretical challenges. This article explores a rigorous approach to developing an approximation theory for OOMs of infinite-dimensional processes. Building upon foundational work outlined in an unpublished tutorial [Jae98], an inner product structure on the space of future distributions is rigorously established and the continuity of observable operators with respect to the associated 2-norm is proven. The original theorem proven in this thesis describes a fundamental obstacle in making an infinite-dimensional space of future distributions into a Hilbert space. The presented findings lay the groundwork for future research in approximating observable operators of infinite-dimensional processes, while a remedy to the encountered obstacle is suggested.


[21] 2404.12215

Quantifying Aleatoric and Epistemic Uncertainty with Proper Scoring Rules

Uncertainty representation and quantification are paramount in machine learning and constitute an important prerequisite for safety-critical applications. In this paper, we propose novel measures for the quantification of aleatoric and epistemic uncertainty based on proper scoring rules, which are loss functions with the meaningful property that they incentivize the learner to predict ground-truth (conditional) probabilities. We assume two common representations of (epistemic) uncertainty, namely, in terms of a credal set, i.e. a set of probability distributions, or a second-order distribution, i.e., a distribution over probability distributions. Our framework establishes a natural bridge between these representations. We provide a formal justification of our approach and introduce new measures of epistemic and aleatoric uncertainty as concrete instantiations.


[22] 2404.12219

A Quadrature Approach for General-Purpose Batch Bayesian Optimization via Probabilistic Lifting

Parallelisation in Bayesian optimisation is a common strategy but faces several challenges: the need for flexibility in acquisition functions and kernel choices, flexibility dealing with discrete and continuous variables simultaneously, model misspecification, and lastly fast massive parallelisation. To address these challenges, we introduce a versatile and modular framework for batch Bayesian optimisation via probabilistic lifting with kernel quadrature, called SOBER, which we present as a Python library based on GPyTorch/BoTorch. Our framework offers the following unique benefits: (1) Versatility in downstream tasks under a unified approach. (2) A gradient-free sampler, which does not require the gradient of acquisition functions, offering domain-agnostic sampling (e.g., discrete and mixed variables, non-Euclidean space). (3) Flexibility in domain prior distribution. (4) Adaptive batch size (autonomous determination of the optimal batch size). (5) Robustness against a misspecified reproducing kernel Hilbert space. (6) Natural stopping criterion.


[23] 2404.12238

Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation

In recent years, there has been a growing interest in using machine learning techniques for the estimation of treatment effects. Most of the best-performing methods rely on representation learning strategies that encourage shared behavior among potential outcomes to increase the precision of treatment effect estimates. In this paper we discuss and classify these models in terms of their algorithmic inductive biases and present a new model, NN-CGC, that considers additional information from the causal graph. NN-CGC tackles bias resulting from spurious variable interactions by implementing novel constraints on models, and it can be integrated with other representation learning methods. We test the effectiveness of our method using three different base models on common benchmarks. Our results indicate that our model constraints lead to significant improvements, achieving new state-of-the-art results in treatment effects estimation. We also show that our method is robust to imperfect causal graphs and that using partial causal information is preferable to ignoring it.


[24] 2404.12267

Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder

Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics integrated generative model. To this end, we use variational-autoencoder as a generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics as well as the trainable data-driven components using planar normalizng flow. Normalizng flow based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of generative model against noise injected in the model, we propose a modification in the encoder part of the normalizing flow based VAE. We designed the encoder to incorporate scaled dot product attention based contextual information in the noisy latent vector which will mitigate the adverse effect of noise in the latent vector and make the model more robust. We empirically evaluated our models on human locomotion dataset [33] and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected in the model.


[25] 2404.12312

A Mean-Field Analysis of Neural Gradient Descent-Ascent: Applications to Functional Conditional Moment Equations

We study minimax optimization problems defined over infinite-dimensional function classes. In particular, we restrict the functions to the class of overparameterized two-layer neural networks and study (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural network. As an initial step, we consider the minimax optimization problem stemming from estimating a functional equation defined by conditional expectations via adversarial estimation, where the objective function is quadratic in the functional space. For this problem, we establish convergence under the mean-field regime by considering the continuous-time and infinite-width limit of the optimization dynamics. Under this regime, gradient descent-ascent corresponds to a Wasserstein gradient flow over the space of probability measures defined over the space of neural network parameters. We prove that the Wasserstein gradient flow converges globally to a stationary point of the minimax objective at a $\mathcal{O}(T^{-1} + \alpha^{-1} ) $ sublinear rate, and additionally finds the solution to the functional equation when the regularizer of the minimax objective is strongly convex. Here $T$ denotes the time and $\alpha$ is a scaling parameter of the neural network. In terms of representation learning, our results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $\mathcal{O}(\alpha^{-1})$, measured in terms of the Wasserstein distance. Finally, we apply our general results to concrete examples including policy evaluation, nonparametric instrumental variable regression, and asset pricing.


[26] 2404.12376

Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent

The $k$-parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-parity problem with stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that SGD can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\le O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, thus matching the established $\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a trained neural network with SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors. Our theoretical results and findings are supported by empirical evidence, showcasing the efficiency and efficacy of our approach.