Randomized response has long been used in statistical surveys to estimate the proportion of sensitive groups in a population while protecting the privacy of respondents. More recently, this technique has been adopted by organizations that generate synthetic data from real personal binary data, enabling data storage and sharing for research or commercial purposes without compromising individual privacy. While the main aim in statistical surveys is the accurate estimation of sensitive group proportions, synthetic data generation prioritizes privacy preservation. To achieve precise estimation, statisticians typically determine the required sample size based on a pre-specified power of hypothesis testing. However, we find that designing randomized response studies to achieve high statistical power can come at the expense of increased privacy risk. In this work, we analyze how various established randomized response designs perform with respect to both statistical power and differential privacy (DP), the latter quantifying the risk of privacy leakage. We demonstrate that commonly used design strategies may result in either insufficient power or excessive privacy loss. To address this issue, we propose optimal choices of design parameters across different randomized response models to simultaneously achieve the desired statistical power and maintain differential privacy within acceptable bounds. The practical advantages of these optimal designs are illustrated through comprehensive simulation studies and a real data application. Additionally, we present a user-friendly Shiny app to assist researchers in designing randomized response studies and performing the associated data analyses.
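A minimal sketch may help fix ideas. In Warner's classical design (one of the established randomized response models), each respondent answers truthfully with probability $p$ and reports the opposite answer otherwise; the same parameter $p$ drives both the estimator's variance (hence the achievable power) and the DP level $\epsilon = \ln\big(p/(1-p)\big)$. The snippet below, with illustrative values only, shows this coupling:

```python
import numpy as np

def warner_randomize(truth, p, rng):
    """Warner's randomized response: report the true binary answer with
    probability p and its negation with probability 1 - p."""
    keep = rng.random(truth.shape) < p
    return np.where(keep, truth, 1 - truth)

def warner_epsilon(p):
    """DP level of the mechanism for p in (1/2, 1): eps = ln(p / (1 - p))."""
    return np.log(p / (1 - p))

def estimate_proportion(responses, p):
    """Unbiased estimator of the sensitive proportion:
    (lambda_hat - (1 - p)) / (2p - 1), with lambda_hat the observed 'yes' rate."""
    return (responses.mean() - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
truth = (rng.random(10_000) < 0.3).astype(int)   # true sensitive proportion 0.3
responses = warner_randomize(truth, p=0.75, rng=rng)
print(warner_epsilon(0.75))                      # ~1.10
print(estimate_proportion(responses, p=0.75))    # ~0.3
```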
We propose a novel Bayesian nonparametric classification model that combines a Gaussian process prior for the latent function with a Dirichlet process prior for the link function, extending the interpretative framework of the de Finetti representation theorem and the construction of random distribution functions introduced by Ferguson (1973). This approach allows for flexible uncertainty modeling in both the latent score and the mapping to probabilities. We demonstrate the method's performance on simulated data, where it outperforms standard logistic regression.
Mediation analysis examines the pathways through which mediators transmit the effect of an exposure to an outcome. In high-dimensional settings, the joint significance test is commonly applied using variable screening followed by statistical inference. However, when mediators are highly correlated, existing methods may experience reduced statistical power due to inaccurate screening and residual bias in asymptotic inference. To address these issues, we propose CHIMA (Correlation-aware High-dimensional Mediation Analysis), an extension of a recently developed high-dimensional mediation analysis framework that enhances performance under correlation by integrating two advances: (i) high-dimensional ordinary least squares projection for accurate screening under correlation; and (ii) approximate orthogonalization for bias reduction. Simulation studies demonstrate that CHIMA effectively identifies active mediators even in the presence of strong correlations and outperforms competing methods across various settings. We further apply CHIMA to ribonucleic acid sequencing (RNA-seq) data from the Living Brain Project, identifying genes that mediate the effect of Parkinson's disease on brain cell composition, thereby revealing cell-type-specific mechanisms of disease.
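For readers unfamiliar with the joint significance test that CHIMA builds on, the final inference step reduces to a max-P rule applied after screening and orthogonalization (neither of which is reproduced here); a minimal sketch:

```python
import numpy as np

def joint_significance(p_exposure_mediator, p_mediator_outcome):
    """Joint significance (max-P) test for mediation: a mediator is
    declared active only if both the exposure->mediator and the
    mediator->outcome effects are significant, so the combined
    p-value is the elementwise maximum of the two."""
    return np.maximum(p_exposure_mediator, p_mediator_outcome)
```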
We conducted a proof-of-concept evaluation of individualized treatment effect (ITE) estimation using survival data from a randomized trial of 475 men with advanced prostate cancer treated with high- versus low-dose diethylstilbestrol (DES). A Weibull accelerated failure time (AFT) model with interaction terms for treatment-by-age and treatment-by-log tumor size was used to capture subgroup-specific treatment effects. The estimated main effect of high-dose DES indicated a time ratio of 0.582 (95% CI: [0.306, 1.110]), reflecting reduced survival at the reference levels of age and tumor size. However, interaction-adjusted ITEs revealed marked effect modification: younger patients (e.g., age 50 years) had over fourfold expected survival gains (time ratio 4.09), whereas older patients (e.g., age 80 years) experienced reduced benefit (time ratio 0.71). Similarly, patients with larger tumors (log size $\sim$4.25, i.e., $\sim$70 cm$^2$) derived a stronger benefit (time ratio 1.89) than those with smaller tumors. To evaluate the reliability of these individualized estimates, both the delta method and bootstrap resampling were applied for uncertainty quantification, producing closely aligned intervals across the risk spectrum. This analysis illustrates how parametric survival models with clinically motivated interactions and robust inference procedures can yield interpretable patient-level treatment effect estimates, even in moderately sized oncology trials.
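The patient-level quantities reported above follow directly from the AFT coefficients: under a Weibull AFT model, the individualized time ratio is the exponential of the treatment main effect plus its interactions evaluated at the patient's covariates. The sketch below uses hypothetical coefficients and centering values (not the trial's fitted estimates) purely to illustrate the computation:

```python
import numpy as np

# Hypothetical AFT coefficients on the log-time scale; illustrative only.
beta_trt = -0.541         # main effect of high-dose DES at reference covariates
beta_trt_age = -0.058     # treatment x (age - age_ref) interaction
beta_trt_logsize = 0.45   # treatment x (log tumor size - logsize_ref) interaction
age_ref, logsize_ref = 70.0, 2.0   # assumed reference (centering) values

def time_ratio(age, logsize):
    """Individualized time ratio: exp of the treatment effect on log
    survival time at the given covariate values (> 1 favors treatment)."""
    return np.exp(beta_trt
                  + beta_trt_age * (age - age_ref)
                  + beta_trt_logsize * (logsize - logsize_ref))

print(time_ratio(age=50, logsize=2.0))   # younger patient: larger benefit
print(time_ratio(age=80, logsize=2.0))   # older patient: reduced benefit
```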
We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites and sharing individual-level data is not allowed. To improve the efficiency of the internal study, we propose a class of augmented one-step estimators that incorporate information from external sites through ``transfer functions.'' The proposed approach has several advantages. First, it is communication-efficient, requiring only one round of communication of summary-level statistics. Second, it satisfies a do-no-harm property in the sense that the augmented estimator is no less efficient than the original one based solely on the internal data. Third, it is statistically optimal, achieving the semiparametric efficiency bound when the transfer function is appropriately estimated from data. Finally, it is scalable, remaining asymptotically normal even when the number of external sites and the data dimension grow exponentially with the internal sample size. Simulation studies confirm both the statistical efficiency and computational feasibility of our method in distributed settings.
Accurate modelling and quantification of predictive uncertainty are crucial in deep learning, since they allow a model to make safer decisions when the data are ambiguous and facilitate the users' understanding of the model's confidence in its predictions. Alongside the tremendous increase in research focus on \emph{graph neural networks} (GNNs) in recent years, numerous techniques have been proposed that strive to capture the uncertainty in their predictions. However, most of these approaches are specifically designed for node- or link-level tasks and cannot be directly applied to graph-level learning problems. In this paper, we propose a novel variational modelling framework for the \emph{posterior predictive distribution}~(PPD) to obtain uncertainty-aware predictions in graph-level learning tasks. Based on a graph-level embedding derived from one of the existing GNNs, our framework can learn the PPD in a data-adaptive fashion. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach.
Standard techniques for aligning large language models (LLMs) utilize human-produced data, which could limit the capability of any aligned LLM to the human level. Label refinement and weak training have emerged as promising strategies to address this superalignment problem. In this work, we adopt probabilistic assumptions commonly used to study label refinement and analyze whether refinement can be outperformed by alternative approaches, including computationally intractable oracle methods. We show that both weak training and label refinement suffer from irreducible error, leaving a performance gap between label refinement and the oracle. These results motivate future research into weak-to-strong generalization methods that combine the practicality of label refinement or weak training with the optimality of the oracle procedure.
We introduce a two-round adaptive communication strategy that enables rate-optimal estimation in the white noise model without requiring prior knowledge of the underlying smoothness. In the first round, local machines send summary statistics using $(\log_2(n))^2$ bits to enable the central machine to select the tuning parameters of the procedure. In the second round, another set of statistics is transmitted using the optimal number of bits, enabling the central machine to aggregate and produce a final estimator that adapts to the true smoothness level. This approach achieves optimal convergence rates across a wider range of regularities, offering a potential improvement in the adaptability and efficiency of distributed estimation compared to existing one-round methods.
Experimental scientists increasingly rely on simulation-based inference (SBI) to invert complex nonlinear models with intractable likelihoods. However, posterior approximations obtained with SBI are often miscalibrated, causing credible regions to undercover true parameters. We develop $\texttt{CP4SBI}$, a model-agnostic conformal calibration framework that constructs credible sets with local Bayesian coverage. Our two proposed variants, namely local calibration via regression trees and CDF-based calibration, enable finite-sample local coverage guarantees for any scoring function, including HPD, symmetric, and quantile-based regions. Experiments on widely used SBI benchmarks demonstrate that our approach improves the quality of uncertainty quantification for neural posterior estimators using both normalizing flows and score-diffusion modeling.
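At its core, conformal calibration of SBI posteriors only requires an empirical quantile of calibration scores. A minimal sketch of the split-conformal step follows (the paper's local variants via regression trees or CDFs are not reproduced); q denotes an assumed, already-trained posterior density estimator:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    scores, as in split conformal prediction."""
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(scores)[k - 1]

# Usage sketch: with calibration pairs (theta_i, x_i) drawn from the
# joint model and an SBI posterior estimator q(theta | x), set
#   s_i = -q(theta_i | x_i)                     (HPD score)
#   t_hat = conformal_quantile(s, alpha)
# and report {theta : -q(theta | x) <= t_hat}, a credible set with
# finite-sample marginal coverage 1 - alpha.
```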
Many modern probabilistic models rely on SDEs, but their adoption is hampered by instability, poor inductive bias outside bounded domains, and reliance on restrictive dynamics or training tricks. While recent work constrains SDEs to compact spaces using reflected dynamics, these approaches lack continuous dynamics and efficient high-order solvers, limiting interpretability and applicability. We propose a novel class of neural SDEs on compact polyhedral spaces with continuous dynamics, amenable to higher-order solvers, and with favorable inductive bias.
Causality and causal inference have emerged as core research areas at the interface of modern statistics and domains including biomedical sciences, social sciences, computer science, and beyond. The field's inherently interdisciplinary nature -- particularly the central role of incorporating domain knowledge -- creates a rich and varied set of statistical challenges. Much progress has been made, especially in the last three decades, but there remain many open questions. Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners.
Differential privacy (DP) has recently emerged as a definition of privacy for releasing private estimates. DP calibrates noise to be on the order of an individual's contribution. Owing to this calibration, a private estimate obscures any individual while preserving the utility of the estimate. Since the original definition, many alternative definitions have been proposed, for various reasons including improved composition results, relaxations, and formalizations. Nevertheless, nearly all definitions of privacy thus far have used a divergence of densities as the basis of the definition. In this paper we take an information-geometric perspective on differential privacy. Specifically, rather than defining privacy via a divergence, we define privacy via the Rao distance. We show that our proposed definition of privacy shares the interpretation of previous definitions of privacy while improving on sequential composition.
We investigate the problem of estimating the average treatment effect (ATE) under a very general setup where the covariates can be high-dimensional, highly correlated, and can have sparse nonlinear effects on the propensity and outcome models. We present a Double Deep Learning strategy for estimation, combining recently developed factor-augmented deep learning-based estimators, FAST-NN, for both the response functions and propensity scores. By using FAST-NN, our method can select variables that contribute to propensity and outcome models in a completely nonparametric and algorithmic manner and adaptively learn low-dimensional function structures through neural networks. Our proposed novel estimator, FIDDLE (Factor Informed Double Deep Learning Estimator), estimates the ATE within the framework of augmented inverse propensity weighting (AIPW), using the FAST-NN-based response and propensity estimates. FIDDLE consistently estimates the ATE even under model misspecification and is flexible enough to also allow for low-dimensional covariates. Our method achieves semiparametric efficiency under a very flexible family of propensity and outcome models. We present extensive numerical studies on synthetic and real datasets to support our theoretical guarantees and establish the advantages of our methods over other traditional choices, especially when the data dimension is large.
This paper re-examines the first normalized incomplete moment, a well-established measure of inequality with wide applications in the economic and social sciences. Despite the popularity of the measure itself, existing statistical inference appears to lag behind the needs of modern-age analytics. To fill this gap, we propose an alternative solution that is intuitive, computationally efficient, mathematically equivalent to the existing solutions for "standard" cases, and easily adaptable to "non-standard" ones. The theoretical and practical advantages of the proposed methodology are demonstrated via both simulated and real-life examples. In particular, we discover that a common practice in industry can lead to highly non-trivial challenges for trustworthy statistical inference, or even to misleading decision making.
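For concreteness, the measure in question admits a simple plug-in estimator: the first normalized incomplete moment of a nonnegative variable $X$ is $\Phi(t) = E[X \mathbf{1}\{X \le t\}]/E[X]$. A minimal sketch (the paper's proposed inference procedure is not reproduced here):

```python
import numpy as np

def normalized_incomplete_moment(x, t):
    """Plug-in estimator of Phi(t) = E[X 1{X <= t}] / E[X] for a
    nonnegative sample x."""
    x = np.asarray(x, dtype=float)
    return x[x <= t].sum() / x.sum()

x = np.random.default_rng(1).lognormal(sigma=1.0, size=10_000)
print(normalized_incomplete_moment(x, t=np.median(x)))  # share of total held
                                                        # below the median
```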
In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class $\mathcal{G}$ with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of $\mathcal{G}$. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of $\mathcal{G}$ may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. These rates are achieved by a simple, semi-supervised algorithm via pseudo-labeling.
In many applications, the target parameter depends on a nuisance function defined by a conditional moment restriction, whose estimation often leads to an ill-posed inverse problem. Classical approaches, such as sieve-based GMM, approximate the restriction using a fixed set of test functions and may fail to capture important aspects of the solution. Adversarial estimators address this limitation by framing estimation as a game between an estimator and an adaptive critic. We study the class of Regularized Adversarial Stabilized (RAS) estimators that employ reproducing kernel Hilbert spaces (RKHSs) for both estimation and testing, with regularization via the RKHS norm. Our first contribution is a novel analysis that establishes finite-sample bounds for both the weak error and the root mean squared error (RMSE) of these estimators under interpretable source conditions, in contrast to existing results. Our second contribution is a detailed comparison of the assumptions underlying this RKHS-norm-regularized approach with those required for (i) RAS estimators using $\mathcal{L}^2$ penalties, and (ii) recently proposed, computationally stable Kernel Maximal Moment estimators.
The Wasserstein distance is a metric for assessing distributional differences. The measure originates in optimal transport theory and can be interpreted as the minimal cost of transforming one distribution into another. In this paper, the Wasserstein distance is applied to life table age-at-death distributions. The main finding is that, under certain conditions, the Wasserstein distance between two age-at-death distributions equals the corresponding gap in life expectancy at birth ($e_0$). More specifically, the paper shows mathematically and empirically that this equivalence holds whenever the survivorship functions do not cross. For example, this applies when comparing mortality between women and men from 1990 to 2020 using data from the Human Mortality Database. In such cases, the gap in $e_0$ reflects not only a difference in mean ages at death but can also be interpreted directly as a measure of distributional difference.
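The equivalence is easy to verify numerically: on a common age grid, $W_1$ equals the integral of $|F_1 - F_2|$, while the $e_0$ gap is the integral of $S_1 - S_2$; the two agree exactly when the survivorship difference never changes sign. A small sketch with toy (not HMD) survivorship curves:

```python
import numpy as np

def w1_and_e0_gap(S1, S2, dx=1.0):
    """1-Wasserstein distance between two age-at-death distributions vs.
    the gap in life expectancy at birth, from survivorship curves
    S1, S2 = l(x)/l(0) on a common age grid."""
    w1 = np.sum(np.abs(S1 - S2)) * dx      # W1 = integral of |F1 - F2|
    e0_gap = abs(np.sum(S1 - S2)) * dx     # |e0_1 - e0_2|
    return w1, e0_gap

ages = np.arange(0, 111)
S_f = np.exp(-(ages / 88.0) ** 9)   # toy survivorship, longer-lived group
S_m = np.exp(-(ages / 83.0) ** 9)   # toy survivorship, never crosses S_f
print(w1_and_e0_gap(S_f, S_m))      # the two values coincide
```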
The Sustainable Development Goals (SDGs) of the United Nations consist of 17 general objectives, subdivided into 169 targets to be achieved by 2030. Several SDG indices and indicators require continuous analysis and evaluation, and most of these indices are supported on the unit interval (0,1). To bring the flexibility of the modified Weibull (MW) distribution to doubly constrained datasets, the first objective of this work is to propose a new unit probability distribution based on the MW distribution. To this end, a transformation of the MW distribution is applied, through which the unit modified Weibull (UMW) distribution is obtained. The second objective of this work is to introduce a quantile regression model for random variables with the UMW distribution, reparameterized in terms of the quantiles of the distribution. Maximum likelihood estimators (MLEs) are used to estimate the model parameters. Monte Carlo simulations are performed to evaluate the finite-sample properties of the MLEs. The introduced methods are used to model some sustainability indicators related to the SDGs, also considering the reading skills of dyslexic children, which are indirectly associated with SDG 4 (Quality Education) and SDG 3 (Health and Well-Being).
Multi-source and multi-modal datasets are increasingly common in scientific research, yet they often exhibit block-wise missingness, where entire data sources or modalities are systematically absent for subsets of subjects. This structured form of missingness presents significant challenges for statistical analysis, particularly for two-sample hypothesis testing. Standard approaches such as imputation or complete-case analysis can introduce bias or result in substantial information loss, especially when the missingness mechanism is not random. To address this methodological gap, we propose the Block-Pattern Enhanced Test (BPET), a general framework for two-sample testing that directly accounts for block-wise missingness without requiring imputation or deletion of observations. As a concrete instantiation, we develop the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which extends rank-based similarity graph methods to settings with block-wise missing data. Under mild conditions, we establish that the null distribution of BRISE converges to a chi-squared distribution. Simulation studies show that BRISE consistently controls the type I error rate and achieves good statistical power under a wide range of alternatives. Applications to two real-world datasets with block-wise missingness further demonstrate the practical utility of our method in identifying meaningful distributional differences.
Topological Data Analysis (TDA) can be used to detect and characterize holes in an image, such as zero-dimensional holes (connected components) or one-dimensional holes (loops). However, there is currently no widely accepted statistical framework for modeling spatiotemporal dependence in the evolution of topological features, such as holes, within a time series of images. We propose a hypothesis testing framework to identify statistically significant topological features of images in space and time, simultaneously. This addition of time may induce higher-dimensional topological features, which can be used to establish temporal connections between the lower-dimensional features at each point in time. The temporal evolution of these lower-dimensional features is then represented on a zigzag persistence diagram, as a topological summary statistic focused on time dynamics. We demonstrate that the method effectively captures the emergence and progression of topological features in a study of a series of images of a wounded cell as it repairs. The proposed method outperforms a current approach in a simulation study that includes features of the wound healing process. Because the wounded cell images exhibit nonlinear, dynamic, spatial, and temporal structures during single-cell repair, they provide a good application for this method.
Langevin algorithms are popular Markov chain Monte Carlo (MCMC) methods for large-scale sampling problems that often arise in data science. We propose Monte Carlo algorithms based on discretizations of $P$-th order Langevin dynamics for any $P\geq 3$. We design the $P$-th order Langevin Monte Carlo (LMC) algorithms by combining splitting and accurate integration methods. We obtain Wasserstein convergence guarantees for sampling from distributions with log-concave and smooth densities. Specifically, the mixing time of the $P$-th order LMC algorithm scales as $O\left(d^{\frac{1}{R}}/\epsilon^{\frac{1}{2R}}\right)$ for $R=4\cdot 1_{\{ P=3\}}+ (2P-1)\cdot 1_{\{ P\geq 4\}}$, which gives a better dependence on the dimension $d$ and the accuracy level $\epsilon$ as $P$ grows. Numerical experiments illustrate the efficiency of our proposed algorithms.
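As a reference point, the $P = 1$ member of this family is the classical unadjusted Langevin algorithm, whose one-line update the higher-order schemes refine via splitting and more accurate integration (the $P \geq 3$ updates themselves are beyond a short sketch):

```python
import numpy as np

def lmc(grad_log_pi, x0, step, n_iters, rng):
    """First-order (unadjusted) Langevin Monte Carlo:
    x_{k+1} = x_k + step * grad log pi(x_k) + sqrt(2 * step) * xi_k."""
    x = np.asarray(x0, dtype=float)
    out = np.empty((n_iters, x.size))
    for k in range(n_iters):
        x = x + step * grad_log_pi(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
        out[k] = x
    return out

# Example: sampling a standard Gaussian, for which grad log pi(x) = -x.
rng = np.random.default_rng(4)
draws = lmc(lambda x: -x, np.zeros(2), step=0.1, n_iters=5_000, rng=rng)
print(draws[1_000:].mean(axis=0), draws[1_000:].var(axis=0))  # ~0, ~1
```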
We introduce Sequential Probability Ratio Bisection (SPRB), a novel stochastic approximation algorithm that adapts to the local behavior of the (regression) function of interest around its root. We establish theoretical guarantees for SPRB's asymptotic performance, showing that it achieves the optimal convergence rate and minimal asymptotic variance even when the target function's derivative at the root is small (at most half the step size), a regime where the classical Robbins-Monro procedure typically suffers reduced convergence rates. Further, we show that if the regression function is discontinuous at the root, Robbins-Monro converges at a rate of $1/n$ whilst SPRB attains exponential convergence. If the regression function has vanishing first-order derivative, SPRB attains a faster rate of convergence compared to stochastic approximation. As part of our analysis, we derive a nonasymptotic bound on the expected sample size and establish a generalized Central Limit Theorem under random stopping times. Remarkably, SPRB automatically provides nonasymptotic time-uniform confidence sequences that do not explicitly require knowledge of the convergence rate. We demonstrate the practical effectiveness of SPRB through simulation results.
Machine learning models must balance accuracy and fairness, but these goals often conflict, particularly when data come from multiple demographic groups. A useful tool for understanding this trade-off is the fairness-accuracy (FA) frontier, which characterizes the set of models that cannot be simultaneously improved in both fairness and accuracy. Prior analyses of the FA frontier provide a full characterization under the assumption of complete knowledge of population distributions -- an unrealistic ideal. We study the FA frontier in the finite-sample regime, showing how it deviates from its population counterpart and quantifying the worst-case gap between them. In particular, we derive minimax-optimal estimators that depend on the designer's knowledge of the covariate distribution. For each estimator, we characterize how finite-sample effects asymmetrically impact each group's risk, and identify optimal sample allocation strategies. Our results transform the FA frontier from a theoretical construct into a practical tool for policymakers and practitioners who must often design algorithms with limited data.
Publication bias (PB) is one of the most serious threats to the accuracy of meta-analysis. Adjustment or sensitivity analysis based on selection models, which describe the probability of a study being published, provides a more objective evaluation of PB than widely used simple graphical methods such as the trim-and-fill method. Most existing methods rely on parametric selection models. The Copas-Jackson bound (C-J bound) provides a worst-case bound of an analytical form over a nonparametric class of selection models, which yields more robust conclusions than parametric sensitivity analysis. The nonparametric class of selection models in the C-J bound is restrictive, however, covering only selection models monotonic in the standard errors of the outcomes. The novelty of this paper is to develop a method that constructs worst-case bounds over a general class of selection models, weakening the assumption in the C-J bound. We propose an efficient numerical method to obtain an approximate worst-case bound via tractable nonlinear programming with linear constraints. We substantiate the effectiveness of the proposed bound with extensive simulation studies and show its applicability with two real-world meta-analyses.
Optimizing wheat variety selection for high performance in different environmental conditions is critical for reliable food production and stable incomes for growers. We employ a statistical machine learning framework utilizing Gaussian Process (GP) models to capture the effects of genetic and environmental factors on wheat yield and protein content. In doing so, selecting suitable covariance kernels to account for the distinct characteristics of the information is essential. The GP approach is closely related to linear mixed-effects models for genotype × environment predictions, where random additive and interaction effects are modeled with covariance structures. However, while the linear mixed-effects models commonly used in plant breeding rely on Euclidean-based kernels, we also test kernels specifically designed for strings and time series. The resulting GP models are capable of competitively predicting outcomes for (1) new environmental conditions, and (2) new varieties, even in scenarios with little to no previous data for the new conditions or variety. While we focus on a wheat test case using a novel dataset collected in Switzerland, the GP approach presented here can be applied and extended to a wide range of agricultural applications and beyond, paving the way for improved decision-making and data acquisition strategies.
In many real-world applications, researchers aim to deploy models trained in a source domain to a target domain, where obtaining labeled data is often expensive, time-consuming, or even infeasible. While most existing literature assumes that the labeled source data and the unlabeled target data follow the same distribution, distribution shifts are common in practice. This paper focuses on label shift and develops efficient inference procedures for general parameters characterizing the unlabeled target population. A central idea is to model the outcome density ratio between the labeled and unlabeled data. To this end, we propose a progressive estimation strategy that unfolds in three stages: an initial heuristic guess, a consistent estimation, and ultimately, an efficient estimation. This self-evolving process is novel in the statistical literature and of independent interest. We also highlight the connection between our approach and prediction-powered inference (PPI), which uses machine learning models to improve statistical inference in related settings. We rigorously establish the asymptotic properties of the proposed estimators and demonstrate their superior performance compared to existing methods. Through simulation studies and multiple real-world applications, we illustrate both the theoretical contributions and practical benefits of our approach.
This paper investigates a perceptron, a simple neural network model, with ReLU activation and a ridge-regularized mean squared error (RR-MSE). Our approach leverages the fact that the RR-MSE for the ReLU perceptron is piecewise polynomial, enabling a systematic analysis using tools from computational algebra. In particular, we develop a Divide-Enumerate-Merge strategy that exhaustively enumerates all local minima of the RR-MSE. By virtue of the algebraic formulation, our approach can identify not only the typical zero-dimensional minima (i.e., isolated points) obtained by numerical optimization, but also higher-dimensional minima (i.e., connected sets such as curves, surfaces, or hypersurfaces). Although computational algebraic methods are computationally very intensive for perceptrons of practical size, as a proof of concept, we apply the proposed approach to minimal perceptrons with a few hidden units.
We consider statistical inference for a class of dynamic mixed-effects models described by stochastic differential equations whose drift and diffusion coefficients simultaneously depend on fixed- and random-effect parameters. Assuming that each process is observed at high frequency and that the number of individuals goes to infinity, we propose a stepwise inference procedure and prove its theoretical properties. The methodology is based on suitable quasi-likelihood functions, profiling out the random effect in the diffusion coefficient in the first stage and then taking the marginal distribution in the drift coefficient in the second stage, resulting in a fully explicit and computationally convenient method.
Handling outliers is a fundamental challenge in multivariate data analysis, as outliers may distort structures of correlation or conditional independence. Although robust Bayesian inference has been extensively studied for univariate settings, theoretical results ensuring posterior robustness in multivariate models are scarce. We propose a novel scale mixture of multivariate normals called the correlation-intact sandwich mixture, where the scale parameters are real-valued and follow the unfolded log-Pareto distribution. Our theoretical results on posterior robustness in multivariate settings emphasize that the use of a symmetric, super heavy-tailed distribution for the scale parameters is essential for achieving posterior robustness against element-wise contamination. Posterior inference for the proposed model is feasible via an efficient Gibbs sampling algorithm that we develop. The superiority of the proposed method is further illustrated in simulation and empirical studies using graphical models and multivariate regression in the presence of complex outlier structures.
Forecasts for uncertain future events should be probabilistic. Probabilistic forecasts are commonly issued as prediction intervals, which provide a measure of uncertainty in the unknown outcome whilst being easier to understand and communicate than full predictive distributions. The calibration of a $(1 - \alpha)$-level prediction interval can be assessed by checking whether the probability that the outcome falls within the interval is equal to $1 - \alpha$. However, such coverage checks are typically unconditional and therefore relatively weak. Although this is well known, there is a lack of methods to assess the conditional calibration of interval forecasts. In this work, we demonstrate how this can be achieved via decompositions of the well-known interval (or Winkler) score. We study notions of calibration for interval forecasts and then introduce a decomposition of the interval score based on isotonic distributional regression. This decomposition exhibits many desirable properties, both in theory and in practice, which allows users to accurately assess the conditional calibration of interval forecasts. This is illustrated on simulated data and in three applications to benchmark regression datasets.
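The object being decomposed is the classical interval (Winkler) score: for a central $(1 - \alpha)$ interval $[l, u]$ and outcome $y$, the score is the interval width plus a penalty of $2/\alpha$ times the distance by which $y$ misses the interval. A minimal sketch (the isotonic-regression-based decomposition itself is not reproduced):

```python
import numpy as np

def interval_score(l, u, y, alpha):
    """Interval (Winkler) score of a central (1 - alpha) prediction
    interval [l, u] at outcome y; negatively oriented (lower is better)."""
    return ((u - l)
            + (2.0 / alpha) * np.maximum(l - y, 0.0)
            + (2.0 / alpha) * np.maximum(y - u, 0.0))

y = np.random.default_rng(2).normal(size=1_000)
print(interval_score(-1.96, 1.96, y, alpha=0.05).mean())
```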
Randomized trials are viewed as the benchmark for assessing causal effects of treatments on outcomes of interest. Nonetheless, challenges such as measurement error can undermine the standard causal assumptions for randomized trials. In ASPIRE, a cluster-randomized trial, pediatric primary care clinics were assigned to one of two treatments aimed at promoting clinician delivery of a secure firearm program to parents during well-child visits. A key outcome of interest is thus parent receipt of the program at each visit. Clinicians documented program delivery in patients' electronic health records for all visits, but their reporting is a proxy measure for the parent receipt outcome. Parents were also surveyed to report directly on program receipt after their child's visit; however, only a small subset of them completed the survey. Here, we develop a causal inference framework for a binary outcome that is subject to misclassification through silver-standard measures (clinician reports), but gold-standard measures (parent reports) are only available for a non-random internal validation subset. We propose a method for identifying the average treatment effect (ATE) that addresses the risks of misclassification and selection bias, even when the outcome (parent receipt) may directly impact selection propensity (survey responsiveness). We show that ATE estimation relies on specifying the relationship between the gold- and silver-standard outcome measures in the validation subset, which may depend on treatment and covariates. Additionally, the clustered design is reflected in our causal assumptions and in our cluster-robust approach to estimation of the ATE. Simulation studies demonstrate acceptable finite-sample operating characteristics of our ATE estimator, supporting its application to ASPIRE.
Heterogeneous treatment effects, which vary according to individual covariates, are crucial in fields such as personalized medicine and tailored treatment strategies. In many applications, rather than considering the heterogeneity induced by all covariates, practitioners focus on a few key covariates to develop tailored treatment decisions. Based on this, we aim to estimate the group average treatment effects (GATEs), which represent heterogeneous treatment effects across subpopulations defined by certain key covariates. Previous strategies for estimating GATEs, such as weighting-based and regression-based methods, suffer from instability or extrapolation bias, especially when several propensity scores are close to zero or one. To address these limitations, we propose two novel nonparametric estimation methods: a matching-based method and a bias-corrected matching method for estimating GATEs. The matching-based method imputes potential outcomes using a matching technique, followed by a nonparametric regression. This method avoids the instability caused by extreme propensity scores but may introduce non-negligible bias when the dimension of full covariates is high. To mitigate this, the bias-corrected matching estimator incorporates additional outcome regression models, enhancing robustness and reducing bias. We show the consistency, double robustness, and asymptotic normality of the bias-corrected matching estimator. We empirically demonstrate the advantages of the proposed methods with extensive simulation studies and a real-world application. An open-source R package, MatchGATE, is available to implement the proposed methods.
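A minimal sketch of the matching step may clarify the construction (the bias correction of the paper's second estimator is omitted, and grouping by a key covariate is done here with simple quantile bins):

```python
import numpy as np

def matching_gate(X, T, Y, key_vals, n_bins=5):
    """Matching-based GATE sketch: impute each unit's missing potential
    outcome with its nearest covariate neighbor in the opposite arm,
    then average the imputed effects within groups defined by a key
    covariate (quantile bins of key_vals)."""
    tau = np.empty(len(Y))
    for arm in (0, 1):
        src = np.where(T == arm)[0]
        dst = np.where(T == 1 - arm)[0]
        for i in src:
            j = dst[np.argmin(np.linalg.norm(X[dst] - X[i], axis=1))]
            tau[i] = (Y[i] - Y[j]) if arm == 1 else (Y[j] - Y[i])
    edges = np.quantile(key_vals, np.linspace(0, 1, n_bins + 1))[1:-1]
    groups = np.digitize(key_vals, edges)
    return np.array([tau[groups == g].mean() for g in range(n_bins)])
```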
Background. Dengue outbreaks are a major public health issue, with Brazil reporting 71% of global cases in 2024. Purpose. This study aims to describe the profile of severe dengue patients admitted to Brazilian intensive care units (ICUs) (2012-2024), assess trends over time, describe new-onset complications while in the ICU, and determine the risk factors at admission for developing complications during the ICU stay. Methods. We performed a prospective study of dengue patients from 253 ICUs across 56 hospitals. We used descriptive statistics to describe the dengue ICU population, logistic regression to identify risk factors for complications during the ICU stay, and a machine learning framework to predict the risk of evolving to complications. Visualisations were generated using ISARIC VERTEX. Results. Of 11,047 admissions, 1,117 (10.1%) evolved to complications, including non-invasive ventilation (437 admissions), invasive ventilation (166), vasopressors (364), blood transfusion (353), and renal replacement therapy (103). Age >80 (OR: 3.10, 95% CI: 2.02-4.92), chronic kidney disease (OR: 2.94, 2.22-3.89), liver cirrhosis (OR: 3.65, 1.82-7.04), low platelets (<50,000 cells/mm3; OR: 2.25, 1.89-2.68), and high leukocytes (>7,000 cells/mm3; OR: 2.47, 2.02-3.03) were significant risk factors for complications. A machine learning tool for predicting complications was proposed, showing accurate discrimination and calibration. Conclusion. We described a large cohort of dengue patients admitted to ICUs and identified key risk factors for severe dengue complications, such as advanced age, presence of comorbidities, higher leukocyte levels, and lower platelet levels. The proposed prediction tool can be used for early identification and targeted interventions to improve outcomes in dengue-endemic regions.
Deepfakes, AI-generated multimedia content that mimics real media, are becoming increasingly prevalent, posing significant risks to political stability, social trust, and economic well-being, especially in developing societies with limited media literacy and technological infrastructure. This work aims to understand how these technologies are perceived and impact resource-limited communities. We conducted a survey to assess public awareness, perceptions, and experiences with deepfakes, leading to the development of a comprehensive framework for prevention, detection, and mitigation in tech-limited environments. Our findings reveal critical knowledge gaps and a lack of effective detection tools, emphasizing the need for targeted education and accessible verification solutions. This work offers actionable insights to support vulnerable populations and calls for further interdisciplinary efforts to tackle deepfake challenges globally, particularly in the Global South.
Time series foundation models (TSFMs) such as Lag-Llama, TimeGPT, Chronos, MOMENT, UniTS, and TimesFM have shown strong generalization and zero-shot capabilities for time series forecasting, anomaly detection, classification, and imputation. Despite these advantages, their predictions still suffer from variance, domain-specific bias, and limited uncertainty quantification when deployed on real operational data. This paper investigates a suite of statistical and ensemble-based enhancement techniques, including bootstrap-based bagging, regression-based stacking, prediction interval construction, statistical residual modeling, and iterative error feedback, to improve robustness and accuracy. Using the Belgium Electricity Short-Term Load Forecasting dataset as a case study, we demonstrate that the proposed hybrids consistently outperform standalone foundation models across multiple horizons. Regression-based ensembles achieve the lowest mean squared error; bootstrap aggregation markedly reduces long-context errors; residual modeling corrects systematic bias; and the resulting prediction intervals achieve near nominal coverage with widths shrinking as context length increases. The results indicate that integrating statistical reasoning with modern foundation models yields measurable gains in accuracy, reliability, and interpretability for real-world time series applications.
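As an illustration of the simplest of these enhancements, a block-bootstrap bagging wrapper can be written around any foundation-model forecaster in a few lines; model_fn below is an assumed callable wrapping a TSFM, and the block scheme is one reasonable choice for serially dependent data rather than the paper's exact recipe:

```python
import numpy as np

def bagged_forecast(model_fn, context, n_boot=50, block=24, seed=0):
    """Block-bootstrap bagging: resample the context window in
    contiguous blocks (to respect serial dependence), forecast each
    resample with model_fn, and aggregate into a point forecast plus
    a simple percentile interval."""
    rng = np.random.default_rng(seed)
    n = len(context)
    preds = []
    for _ in range(n_boot):
        starts = rng.integers(0, n - block + 1, size=n // block)
        resample = np.concatenate([context[s:s + block] for s in starts])
        preds.append(model_fn(resample))
    preds = np.stack(preds)
    return preds.mean(axis=0), np.percentile(preds, [5, 95], axis=0)
```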
Accurate quantification of uncertainty in neural network predictions remains a central challenge for scientific applications involving high-dimensional, correlated data. While existing methods capture either aleatoric or epistemic uncertainty, few offer closed-form, multidimensional distributions that preserve spatial correlation while remaining computationally tractable. In this work, we present a framework for training neural networks with a multidimensional Gaussian loss, generating closed-form predictive distributions over outputs with non-identically distributed and heteroscedastic structure. Our approach captures aleatoric uncertainty by iteratively estimating the means and covariance matrices, and is demonstrated on a super-resolution example. We leverage a Fourier representation of the covariance matrix to stabilize network training and preserve spatial correlation. We introduce a novel regularization strategy -- referred to as information sharing -- that interpolates between image-specific and global covariance estimates, enabling convergence of the super-resolution downscaling network trained on image-specific distributional loss functions. This framework allows for efficient sampling, explicit correlation modeling, and extensions to more complex distribution families all without disrupting prediction performance. We demonstrate the method on a surface wind speed downscaling task and discuss its broader applicability to uncertainty-aware prediction in scientific models.
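The training objective underlying such closed-form predictive distributions is the multivariate Gaussian negative log-likelihood. A minimal PyTorch sketch, parameterizing the covariance by a Cholesky factor for positive definiteness (the paper instead stabilizes training through a Fourier representation of the covariance):

```python
import torch

def mvn_nll(mean, scale_tril, target):
    """Negative log-likelihood of a multivariate Gaussian predictive
    distribution with mean `mean` and covariance L @ L.T, where L is
    the lower-triangular factor `scale_tril`."""
    dist = torch.distributions.MultivariateNormal(mean, scale_tril=scale_tril)
    return -dist.log_prob(target).mean()
```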
We develop a novel optimistic gradient-type algorithmic framework, combining both Nesterov's acceleration and variance-reduction techniques, to solve a class of generalized equations involving possibly nonmonotone operators in data-driven applications. Our framework covers a wide class of stochastic variance-reduced schemes, including mini-batching and control-variate unbiased and biased estimators. We establish that our method achieves $\mathcal{O}(1/k^2)$ convergence rates in expectation on the squared norm of the residual under Lipschitz continuity and a ``co-hypomonotonicity-type'' assumption, improving upon non-accelerated counterparts by a factor of $1/k$. We also prove faster $o(1/k^2)$ convergence rates, both in expectation and almost surely. In addition, we show that the sequence of iterates of our method almost surely converges to a solution of the underlying problem. We demonstrate the applicability of our method using general error bound criteria, covering mini-batch stochastic estimators as well as three well-known control-variate estimators: loopless SVRG, SAGA, and loopless SARAH, for which the last three variants attain significantly better oracle complexity compared to existing methods. We validate our framework and theoretical results through two numerical examples. The preliminary results illustrate promising performance of our accelerated method over its non-accelerated counterparts.
The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) or DeepPCR (arXiv:2309.16318) have shown that evaluating a state space model can be recast as solving a parallelizable optimization problem, and sometimes this approach can yield dramatic speed-ups in evaluation time. However, the factors that govern the difficulty of these optimization problems remain unclear, limiting the larger adoption of the technique. In this work, we establish a precise relationship between the dynamics of a nonlinear system and the conditioning of its corresponding optimization formulation. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior, impacts the number of optimization steps required for evaluation. In predictable systems, the state trajectory can be computed in $O((\log T)^2)$ time, where $T$ is the sequence length, a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis demonstrates that for predictable systems, the optimization problem is always well-conditioned, whereas for unpredictable systems, the conditioning degrades exponentially as a function of the sequence length. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized, and highlighting predictability as a key design principle for parallelizable models.
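The recasting at the heart of these methods treats the trajectory as the root of a block system. A heavily simplified sketch: the damped-free Picard iteration below updates every time step at once (DEER and DeepPCR use Newton-type updates instead), and for contractive, i.e., predictable, dynamics it converges in far fewer than $T$ sweeps:

```python
import numpy as np

def parallel_evaluate(f, x0, T, n_iters):
    """Evaluate x_t = f(x_{t-1}) by fixed-point iteration over the whole
    trajectory at once: X_t <- f(X_{t-1}) for all t in parallel."""
    X = np.tile(x0, (T, 1))
    for _ in range(n_iters):
        prev = np.vstack([x0[None, :], X[:-1]])
        X = np.apply_along_axis(f, 1, prev)
    return X

f = lambda x: 0.5 * np.tanh(x)                  # contractive dynamics
traj = parallel_evaluate(f, np.ones(2), T=100, n_iters=15)
seq = [np.ones(2)]
for _ in range(100):
    seq.append(f(seq[-1]))
# Error shrinks like the contraction factor per sweep.
print(np.max(np.abs(traj - np.array(seq[1:]))))
```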
Diffusion models have achieved state-of-the-art results in generative modelling but remain computationally intensive at inference time, often requiring thousands of discretization steps. To this end, we propose Sig-DEG (Signature-based Differential Equation Generator), a novel generator for distilling pre-trained diffusion models, which can universally approximate the backward diffusion process at a coarse temporal resolution. Inspired by high-order approximations of stochastic differential equations (SDEs), Sig-DEG leverages partial signatures to efficiently summarize Brownian motion over sub-intervals and adopts a recurrent structure to enable accurate global approximation of the SDE solution. Distillation is formulated as a supervised learning task, where Sig-DEG is trained to match the outputs of a fine-resolution diffusion model on a coarse time grid. During inference, Sig-DEG enables fast generation, as the partial signature terms can be simulated exactly without requiring fine-grained Brownian paths. Experiments demonstrate that Sig-DEG achieves competitive generation quality while reducing the number of inference steps by an order of magnitude. Our results highlight the effectiveness of signature-based approximations for efficient generative modeling.
This paper proposes a frequency-domain system identification method for learning low-order systems. The identification problem is formulated as the minimization of the $\ell_2$ norm between the identified and measured frequency responses, with the nuclear norm of the Loewner matrix serving as a regularization term. This formulation results in an optimization problem that can be efficiently solved using standard convex optimization techniques. We derive an upper bound on the sampled-frequency complexity of the identification process and subsequently extend this bound to characterize the identification error over all frequencies. A detailed analysis of the sample complexity is provided, along with a thorough interpretation of its terms and dependencies. Finally, the efficacy of the proposed method is demonstrated through an example, along with numerical simulations validating the growth rate of the sample complexity bound.
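Since the Loewner matrix is central to the formulation, a short sketch of its construction may help: its rank is tied to the order of the underlying system, which is why its nuclear norm acts as a convex surrogate for model order. The example system below is illustrative only:

```python
import numpy as np

def loewner_matrix(s_left, H_left, s_right, H_right):
    """Loewner matrix from two disjoint sets of frequency points s and
    measured responses H:
    L[i, j] = (H_left[i] - H_right[j]) / (s_left[i] - s_right[j])."""
    return ((H_left[:, None] - H_right[None, :])
            / (s_left[:, None] - s_right[None, :]))

# Example with a first-order system H(s) = 1 / (s + 1): the Loewner
# matrix built from its samples has (numerical) rank one.
s = 1j * np.linspace(0.1, 10.0, 8)
H = 1.0 / (s + 1.0)
L = loewner_matrix(s[::2], H[::2], s[1::2], H[1::2])
print(np.linalg.matrix_rank(L, tol=1e-8))   # -> 1
```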
Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures by exploiting the geometric properties of hyperbolic spaces, which are characterized by negative curvature. Curvature plays a crucial role in optimizing HNNs: inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, however, a theoretical foundation for the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound for HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. We then introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded and that the proposed method converges, by bounding the gradients of HNNs. Experiments in four settings (classification, learning from long-tailed data, learning from noisy data, and few-shot learning) show that our method can improve the performance of HNNs.
Deep neural networks often contain far more parameters than training examples, yet they still manage to generalize well in practice. Classical complexity measures such as VC-dimension or PAC-Bayes bounds usually become vacuous in this overparameterized regime, offering little explanation for the empirical success of models like Transformers. In this work, I explore an alternative notion of capacity for attention-based models, based on the effective rank of their attention matrices. The intuition is that, although the parameter count is enormous, the functional dimensionality of attention is often much lower. I show that this quantity leads to a generalization bound whose dependence on sample size matches empirical scaling laws observed in large language models, up to logarithmic factors. While the analysis is not a complete theory of overparameterized learning, it provides evidence that spectral properties of attention, rather than raw parameter counts, may be the right lens for understanding why these models generalize.
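To make the capacity notion concrete, the entropy-based effective rank of Roy and Vetterli (2007) is one standard choice, assumed here for illustration (the paper's exact definition may differ slightly); it is computed from the normalized singular value spectrum:

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution of A."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))

# A row-stochastic softmax attention matrix typically concentrates its
# spectrum, giving an effective rank well below the sequence length.
rng = np.random.default_rng(3)
Q, K = rng.normal(size=(128, 32)), rng.normal(size=(128, 32))
logits = Q @ K.T / np.sqrt(32)
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(effective_rank(attn))
```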
Recent breakthroughs in autonomous experimentation have demonstrated remarkable physical capabilities, yet their cognitive control remains limited--often relying on static heuristics or classical optimization. A core limitation is the absence of a principled mechanism to detect and adapt to the unexpected. While traditional surprise measures--such as Shannon or Bayesian Surprise--offer momentary detection of deviation, they fail to capture whether a system is truly learning and adapting. In this work, we introduce Mutual Information Surprise (MIS), a new framework that redefines surprise not as anomaly detection, but as a signal of epistemic growth. MIS quantifies the impact of new observations on mutual information, enabling autonomous systems to reflect on their learning progression. We develop a statistical test sequence to detect meaningful shifts in estimated mutual information and propose a mutual information surprise reaction policy (MISRP) that dynamically governs system behavior through sampling adjustment and process forking. Empirical evaluations--on both synthetic domains and a dynamic pollution map estimation task--show that MISRP-governed strategies significantly outperform classical surprise-based approaches in stability, responsiveness, and predictive accuracy. By shifting surprise from reactive to reflective, MIS offers a path toward more self-aware and adaptive autonomous systems.
We propose Anti-regularization (AR), which adds a sign-reversed reward term to the loss to intentionally increase model expressivity in the small-sample regime, and then attenuates this intervention with a power-law decay as the sample size grows. We formalize spectral safety and trust-region conditions, and design a lightweight stability safeguard that combines a projection operator with gradient clipping, ensuring stable intervention under stated assumptions. Our analysis spans linear smoothers and the Neural Tangent Kernel (NTK) regime, providing practical guidance on selecting the decay exponent by balancing empirical risk against variance. Empirically, AR reduces underfitting while preserving generalization and improving calibration in both regression and classification. Ablation studies confirm that the decay schedule and the stability safeguard are critical to preventing overfitting and numerical instability. We further examine a degrees-of-freedom targeting schedule that keeps per-sample complexity approximately constant. AR is simple to implement and reproducible, integrating cleanly into standard empirical risk minimization pipelines. It enables robust learning in data- and resource-constrained settings by intervening only when beneficial and fading away when unnecessary.
Error assessment for Approximate Query Processing (AQP) is a challenging problem. Bootstrap sampling can produce error assessment even when the population data distribution is unknown. However, bootstrap sampling needs to produce a large number of resamples with replacement, which is a computationally intensive procedure. In this paper, we introduce a quantum bootstrap sampling (QBS) framework to generate bootstrap samples on a quantum computer and produce an error assessment for AQP query estimations. The quantum circuit design is included in this framework.
We prove that a minimal Transformer architecture with frozen weights is capable of emulating a broad class of algorithms by in-context prompting. In particular, for any algorithm implementable by a fixed-weight attention head (e.g. one-step gradient descent or linear/ridge regression), there exists a prompt that drives a two-layer softmax attention module to reproduce the algorithm's output with arbitrary precision. This guarantee extends even to a single-head attention layer (using longer prompts if necessary), achieving architectural minimality. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, establishing a form of algorithmic universality in modern Transformer models.
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at this https URL.
In safety-critical applications, data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression, a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions, and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic, and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning, as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
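Of the metrics singled out as dependable, the ENCE is simple to state: bin predictions by their predicted standard deviation, then compare the root mean predicted variance with the empirical RMSE in each bin. A minimal sketch of one common formulation (definitions vary across the literature, which is part of the paper's point):

```python
import numpy as np

def ence(y_true, y_pred, y_std, n_bins=10):
    """Expected Normalized Calibration Error: average over bins (sorted
    by predicted std) of |RMV - RMSE| / RMV, where RMV is the root mean
    predicted variance and RMSE the empirical error in the bin."""
    bins = np.array_split(np.argsort(y_std), n_bins)
    terms = []
    for idx in bins:
        rmv = np.sqrt(np.mean(y_std[idx] ** 2))
        rmse = np.sqrt(np.mean((y_true[idx] - y_pred[idx]) ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))
```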
Message passing neural networks (MPNNs) are powerful models for node classification but suffer from performance limitations under heterophily (low same-class connectivity) and structural bottlenecks in the graph. We provide a unifying statistical framework exposing the relationship between heterophily and bottlenecks through the signal-to-noise ratio (SNR) of MPNN representations. The SNR decomposes model performance into feature-dependent parameters and feature-independent sensitivities. We prove that the sensitivity to class-wise signals is bounded by higher-order homophily -- a generalisation of classical homophily to multi-hop neighbourhoods -- and show that low higher-order homophily manifests locally as the interaction between structural bottlenecks and class labels (class-bottlenecks). Through analysis of graph ensembles, we provide a further quantitative decomposition of bottlenecking into underreaching (lack of depth implying signals cannot arrive) and oversquashing (lack of breadth implying signals arriving on fewer paths) with closed-form expressions. We prove that optimal graph structures for maximising higher-order homophily are disjoint unions of single-class and two-class-bipartite clusters. This yields BRIDGE, a graph ensemble-based rewiring algorithm that achieves near-perfect classification accuracy across all homophily regimes on synthetic benchmarks and significant improvements on real-world benchmarks, by eliminating the ``mid-homophily pitfall'' where MPNNs typically struggle, surpassing current standard rewiring techniques from the literature. Our framework, whose code we make available for public use, provides both diagnostic tools for assessing MPNN performance, and simple yet effective methods for enhancing performance through principled graph modification.
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.
In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.
In this PhD thesis, we propose a novel framework for uncertainty quantification in machine learning, which is based on proper scores. Uncertainty quantification is an important cornerstone for trustworthy and reliable machine learning applications in practice. Usually, approaches to uncertainty quantification are problem-specific, and solutions and insights cannot be readily transferred from one task to another. Proper scores are loss functions minimized by predicting the target distribution. Due to their very general definition, proper scores apply to regression, classification, or even generative modeling tasks. We contribute several theoretical results that connect epistemic uncertainty, aleatoric uncertainty, and model calibration with proper scores, resulting in a general and widely applicable framework. We achieve this by introducing a general bias-variance decomposition for strictly proper scores via functional Bregman divergences. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models in various domains, like image, audio, and natural language generation. This includes a novel approach for uncertainty estimation of large language models, which outperforms state-of-the-art baselines. Further, we generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors. We then introduce a novel estimator for proper calibration errors in classification, and a novel risk-based approach to compare different estimators for squared calibration errors. Last, we offer a decomposition of the kernel spherical score, another kernel-based proper score, allowing a more fine-grained and interpretable evaluation of generative image models.
Leveraging information from public data has become increasingly crucial in enhancing the utility of differentially private (DP) methods. Traditional DP approaches often require adding noise based solely on private data, which can significantly degrade utility. In this paper, we address this limitation in the context of the ordinary least squares estimator (OLSE) of linear regression based on sufficient statistics perturbation (SSP) under the unbounded data assumption. We propose a novel method that involves transforming private data using the public second-moment matrix to compute a transformed SSP-OLSE, whose second-moment matrix yields a better condition number and improves the OLSE accuracy and robustness. We derive theoretical error bounds for our method and for the standard SSP-OLSE relative to the non-DP OLSE, which reveal the improved robustness and accuracy achieved by our approach. Experiments on synthetic and real-world datasets demonstrate the utility and effectiveness of our method.
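A schematic sketch of the SSP idea and the public-matrix transformation described above follows; the noise scale here is a placeholder, and a real DP implementation must calibrate it to the privacy budget and data bounds.

```python
import numpy as np

rng = np.random.default_rng(1)

def ssp_olse(X, y, noise_scale):
    # Schematic sufficient statistics perturbation: add symmetric Gaussian
    # noise to X^T X and X^T y, then solve the noisy normal equations.
    d = X.shape[1]
    E = rng.normal(scale=noise_scale, size=(d, d))
    XtX = X.T @ X + (E + E.T) / 2
    Xty = X.T @ y + rng.normal(scale=noise_scale, size=d)
    return np.linalg.solve(XtX, Xty)

def transformed_ssp_olse(X, y, S_pub, noise_scale):
    # Schematic version of the abstract's idea: whiten the private design
    # with a public second-moment matrix S_pub to improve conditioning.
    L = np.linalg.cholesky(S_pub)
    Z = X @ np.linalg.inv(L).T           # transformed design
    beta_z = ssp_olse(Z, y, noise_scale)
    return np.linalg.solve(L.T, beta_z)  # map back to original coordinates
```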
In this paper, we analyze the asymptotic properties of the Two-Stage (TS) estimator -- a simulation-based parameter estimation method that constructs estimators offline from synthetic data. While TS offers significant computational advantages compared to standard approaches to estimation, its statistical properties have not been previously analyzed in the literature. Under simple assumptions, we establish that the TS estimator is strongly consistent and asymptotically normal, providing the first theoretical guarantees for this class of estimators.
We prove a bound on the finite sample error of sequential Monte Carlo (SMC) on static spaces using the $L_2$ distance between interpolating distributions and the mixing times of Markov kernels. This result is unique in that it is the first finite sample convergence result for SMC that does not require an upper bound on the importance weights. Using this bound we show that careful selection of the interpolating distributions can lead to substantial improvements in the computational complexity of the algorithm. This result also justifies the adaptive selection of SMC distributions using the relative effective sample size commonly used in the literature, and we establish conditions guaranteeing the approximation accuracy of the adaptive SMC approach. We show that the commonly used data tempering approach fails to satisfy these conditions, and introduce a modified data tempering algorithm under which our guarantees do hold. We then demonstrate empirically that this procedure provides nearly-optimal sequences of distributions in an automatic fashion for realistic examples.
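For context, the relative effective sample size adaptation that this result justifies typically chooses the next inverse temperature by root-finding on the incremental weights, as in the sketch below (standard practice for likelihood tempering, not the paper's modified data-tempering algorithm).

```python
import numpy as np
from scipy.optimize import brentq

def next_temperature(log_like, beta_prev, target_ress=0.5):
    # Pick beta so the relative ESS of the incremental weights
    # L_i^(beta - beta_prev) equals the target (assumes target_ress < 1).
    def gap(beta):
        lw = (beta - beta_prev) * log_like
        w = np.exp(lw - lw.max())
        return (w.sum() ** 2) / (len(w) * (w ** 2).sum()) - target_ress

    if gap(1.0) >= 0:   # even beta = 1 keeps the relative ESS above target
        return 1.0
    return brentq(gap, beta_prev, 1.0)
```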
This paper proposes novel methods to test for simultaneous diagonalization of possibly asymmetric matrices. Motivated by various applications, a two-sample test as well as a generalization for multiple matrices are proposed. A partial version of the test is also studied to check whether a partial set of eigenvectors is shared across samples. Additionally, a novel algorithm for the considered testing methods is introduced. Simulation studies demonstrate favorable performance for all designs. Finally, the theoretical results are utilized to decouple vector autoregression models into multiple univariate time series, and to test for the same stationary distribution in recurrent Markov chains. These applications are demonstrated using macroeconomic indices of 8 countries and streamflow data, respectively.
In longitudinal studies where units are embedded in space or a social network, interference may arise, meaning that a unit's outcome can depend on treatment histories of others. The presence of interference poses significant challenges for causal inference, particularly when the interference structure -- how a unit's outcome responds to others' influences -- is complex, heterogeneous, and unknown to researchers. This paper develops a general framework for identifying and estimating both direct and spillover effects of treatment histories under minimal assumptions about the interference structure. We introduce a class of causal estimands that capture the effects of treatment histories at any specified proximity level and show that they can be represented by a modified marginal structural model. Under sequential exchangeability, these estimands are identifiable and can be estimated using inverse probability weighting. We derive conditions for consistency and asymptotic normality of the estimators and provide procedures for constructing asymptotically conservative confidence intervals. The method's utility is demonstrated through applications in both social science and biomedical settings.
We consider likelihood-based two-step estimation of latent variable models, in which just the measurement model is estimated in the first step and the measurement parameters are then fixed at their estimated values in the second step where the structural model is estimated. We show how this approach can be implemented for latent trait models (item response theory models) where the latent variables are continuous and their measurement indicators are categorical variables. The properties of two-step estimators are examined using simulation studies and applied examples. They perform well, and have attractive practical and conceptual properties compared to the alternative one-step and three-step approaches. These results are in line with previous findings for other families of latent variable models. This provides strong evidence that two-step estimation is a flexible and useful general method of estimation for different types of latent variable models.
In the past decade, the increased availability of genome-wide association study summary data has popularized Mendelian Randomization (MR) for conducting causal inference. MR analyses, incorporating genetic variants as instrumental variables, are known for their robustness against reverse causation bias and unmeasured confounders. Nevertheless, classical MR analyses utilizing summary data may still produce biased causal effect estimates due to the winner's curse and pleiotropic issues. To address these two issues and establish valid causal conclusions, we propose a unified robust Mendelian Randomization framework with summary data, which systematically removes the winner's curse and screens out invalid genetic instruments with pleiotropic effects. Different from existing robust MR literature, our framework delivers valid statistical inference on the causal effect without requiring the genetic pleiotropy effects to follow any parametric distribution or relying on a perfect instrument screening property. Under appropriate conditions, we show that our proposed estimator converges to a normal distribution and that its variance can be well estimated. We demonstrate the performance of our proposed estimator through Monte Carlo simulations and two case studies. The codes implementing the procedures are available at this https URL.
We propose a conditional stochastic interpolation (CSI) method for learning conditional distributions. CSI is based on estimating probability flow equations or stochastic differential equations that transport a reference distribution to the target conditional distribution. This is achieved by first learning the conditional drift and score functions based on CSI, which are then used to construct a deterministic process governed by an ordinary differential equation or a diffusion process for conditional sampling. In our proposed approach, we incorporate an adaptive diffusion term to address the instability issues arising in the diffusion process. We derive explicit expressions for the conditional drift and score functions in terms of conditional expectations, which naturally lead to a nonparametric regression approach to estimating these functions. Furthermore, we establish nonasymptotic error bounds for learning the target conditional distribution. We illustrate the application of CSI on image generation using a benchmark image dataset.
A key step in separating signal from noise in time series by means of singular spectrum analysis (SSA) is grouping. We present a multiple testing method for the grouping step in SSA. As separability criterion, we utilize the weighted correlation between the signal and the noise component of the (reconstructed) time series, and we test whether this weighted correlation is equal to zero. This test has to be performed for several possible groupings, resulting in a multiple test problem. The null distributions of the corresponding test statistics are approximated by a wild bootstrap procedure. The performance of our proposed method is assessed in a simulation study, and we illustrate its practical application with an analysis of real world data.
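For concreteness, the weighted correlation in question is the standard SSA w-correlation; a sketch under the usual trajectory-matrix weights:

```python
import numpy as np

def w_correlation(f1, f2, L):
    # w-correlation between two reconstructed components of a length-N
    # series with window length L; w_i counts how many times element i
    # appears in the L x (N - L + 1) trajectory matrix.
    N = len(f1)
    i = np.arange(N)
    w = np.minimum(np.minimum(i + 1, N - i), min(L, N - L + 1))
    inner = lambda a, b: np.sum(w * a * b)
    return inner(f1, f2) / np.sqrt(inner(f1, f1) * inner(f2, f2))
```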
Bayesian Optimization (BO) is a powerful method for optimizing black-box functions by combining prior knowledge with ongoing function evaluations. BO constructs a probabilistic surrogate model of the objective function given the covariates, which is in turn used to inform the selection of future evaluation points through an acquisition function. For smooth continuous search spaces, Gaussian Processes (GPs) are commonly used as the surrogate model as they offer analytical access to posterior predictive distributions, thus facilitating the computation and optimization of acquisition functions. However, in complex scenarios involving optimization over categorical or mixed covariate spaces, GPs may not be ideal. This paper introduces Simulation Based Bayesian Optimization (SBBO) as a novel approach to optimizing acquisition functions that only requires sampling-based access to posterior predictive distributions. SBBO allows the use of surrogate probabilistic models tailored for combinatorial spaces with discrete variables. Any Bayesian model in which posterior inference is carried out through Markov chain Monte Carlo can be selected as the surrogate model in SBBO. We demonstrate empirically the effectiveness of SBBO using various choices of surrogate models in applications involving combinatorial optimization.
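As a minimal illustration of the sampling-based acquisition idea (names here are hypothetical placeholders, not the paper's API): with only MCMC draws of the surrogate's parameters, an acquisition such as expected improvement can be estimated by Monte Carlo averaging over posterior predictive samples.

```python
import numpy as np

def sbbo_select(candidates, posterior_draws, predict, best_so_far):
    # `predict(theta, x)` returns the surrogate's sampled prediction of the
    # objective at x under one posterior draw theta (e.g. from MCMC).
    ei = []
    for x in candidates:
        preds = np.array([predict(theta, x) for theta in posterior_draws])
        ei.append(np.mean(np.maximum(preds - best_so_far, 0.0)))
    return candidates[int(np.argmax(ei))]
```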
Linear Discriminant Analysis (LDA) is an important classification approach. Its simple linear form makes it easy to interpret, and it is well-suited for handling multi-class responses. It is closely related to other classical multivariate statistical techniques, such as Fisher's discriminant analysis, canonical correlation analysis and linear regression. In this paper we strengthen its connection to multivariate response regression by characterizing the explicit relationship between discriminant directions and regression coefficients. This key characterization leads to a new regression-based multi-class classification procedure that is flexible enough to deploy any existing structured, regularized, and even non-parametric, regression methods. In contrast with the existing regression-based approaches, our new formulation is particularly amenable to establish provable guarantees: we establish a general strategy of analyzing the excess misclassification risk of the proposed classifier for all aforementioned regression techniques. As applications, we provide complete theoretical guarantees for using the widely used $\ell_1$-regularization as well as for using the reduced-rank regression, neither of which has yet been fully analyzed in the LDA context. Our theoretical findings are corroborated by extensive simulation studies and real data analysis.
A dataset with two labels is linearly separable if it can be split into its two classes with a hyperplane. This inflicts a curse on some statistical tools (such as logistic regression) but forms a blessing for others (e.g. support vector machines). Recently, the following question has regained interest: What is the probability that the data are linearly separable? We provide a formula for the probability of linear separability for Gaussian features and labels depending only on one marginal of the features (as in generalized linear models). In this setting, we derive an upper bound that complements the recent result by Hayakawa, Lyons, and Oberhauser [2023], and a sharp upper bound for sign-flip noise. To prove our results, we exploit that this probability can be expressed as a sum of the intrinsic volumes of a polyhedral cone of the form $\text{span}\{v\}\oplus[0,\infty)^n$, as shown in Candès and Sur [2020]. After providing the inequality description for this cone, and an algorithm to project onto it, we calculate its intrinsic volumes. In doing so, we encounter Youden's demon problem, for which we provide a formula following Kabluchko and Zaporozhets [2020]. The key insight of this work is the following: The number of correctly labeled observations in the data affects the structure of this polyhedral cone, allowing the translation of insights from geometry into statistics.
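For orientation, the classical baseline for this question is Cover's count for uniformly random labels: for $n$ points in general position in $\mathbb{R}^d$ and homogeneous separating hyperplanes,

$$P(\text{linearly separable}) = 2^{1-n}\sum_{k=0}^{d-1}\binom{n-1}{k}.$$

The setting above differs precisely in that the labels depend on a marginal of the features, which changes the structure of the polyhedral cone and hence the probability.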
Post-clustering inference in single-cell RNA sequencing (scRNA-seq) analysis presents significant challenges in controlling Type I error during differential expression analysis. Data fission, a promising approach that aims to split data into two independent parts, relies on strong parametric assumptions of non-mixture distributions that are inherently violated in clustered data. To address this limitation, we introduce conditional data fission, an extension designed to decompose each mixture component into two independent parts. However, we demonstrate that applying such conditional data fission to mixture distributions requires prior knowledge of the clustering structure to ensure valid post-clustering inference. This arises from the need to accurately estimate component-specific scale parameters, which are critical for performing decomposition while maintaining independence. We theoretically quantify how biases in estimating these parameters lead to inflated Type I error rates due to deviations from independence. Given that mixture components are typically unknown in practice, our results underscore the fundamental difficulty of applying data fission in real-world settings, despite its prior proposal as a solution for post-clustering inference.
Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning LLMs with human preferences through a reward model -- reinforcement learning from human feedback (RLHF) -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT and Llama-family models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
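As a compressed sanity check of why a $-\log$ policy regularizer yields preference matching (our paraphrase; the paper derives the regularizer more generally via an ODE): maximizing

$$\mathbb{E}_{y\sim\pi(\cdot\mid x)}\bigl[r(x,y) - \log \pi(y\mid x)\bigr]$$

over $\pi$ gives the Gibbs solution $\pi^*(y\mid x) \propto e^{r(x,y)}$, which under a Bradley--Terry--Luce/Plackett--Luce reward model is exactly the preference distribution over responses.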
Researchers often turn to block randomization to increase the precision of their inference or due to practical considerations, such as in multisite trials. However, if the number of treatments under consideration is large, it might not be feasible or practical to assign all treatments within each block. We develop novel inference results under the finite-population design-based framework for natural alternatives to the complete block design that do not require reducing the number of treatment arms: the incomplete block design (IBD) and the balanced incomplete block design. This includes deriving the properties of two design-based estimators, developing a finite-population central limit theorem, and proposing conservative variance estimators. Comparisons of the design-based estimators are made to linear model-based estimators. Simulations and a data illustration further demonstrate the performance of IBD estimators. This work highlights IBDs as practical and currently underutilized designs.
We formulate factorial difference-in-differences (FDID) as a research design that extends the canonical difference-in-differences (DID) to settings without clean controls. Such situations often arise when researchers exploit cross-sectional variation in a baseline factor and temporal variation in an event affecting all units. In these applications, the exact estimand is often unspecified and the justification for using the DID estimator is unclear. We formalize FDID by characterizing its data structure, target parameters, and identifying assumptions. Framing FDID as a factorial design with two factors -- the baseline factor G and the exposure level Z -- we define effect modification and causal moderation as the associative and causal effects of G on the effect of Z. Under standard DID assumptions, including no anticipation and parallel trends, the DID estimator identifies effect modification but not causal moderation. To identify the latter, we propose an additional factorial parallel trends assumption. We also show that the canonical DID is a special case of FDID under an exclusion restriction. We extend the framework to conditionally valid assumptions and clarify regression-based implementations. We then discuss extensions to repeated cross-sectional data and continuous G. We illustrate the approach with an empirical example on the role of social capital in famine relief in China.
The importance of developing efficient image denoising methods is immense, especially for modern applications such as image comparisons, image monitoring, medical image diagnostics, and so forth. Available methods in the vast literature on image denoising can address certain issues, but no single method can solve all of them. For example, jump regression based methods can preserve linear edges well, but cannot preserve many other fine details of an image. On the other hand, local clustering based methods can preserve fine edge structures, but cannot perform well in the presence of heavy noise. The proposed method uses various shapes and sizes of local neighborhoods based on local information, and integrates this adaptive approach with local clustering based smoothing. Theoretical justifications and numerical studies show that the proposed method indeed performs better than these two individual methods and outperforms many other state-of-the-art techniques as well. Such performance demonstrates the vast potential of the proposed method in many modern-day applications.
Endogeneity poses significant challenges in causal inference across various research domains. This paper proposes a novel approach to identify and estimate causal effects in the presence of endogeneity. We consider a structural equation with endogenous exposures and an additive error term. Assuming the light-tailedness of the error term, we show that the causal effect can be identified by contrasting extreme conditional quantiles of the outcome given the exposures. Unlike many existing results, our identification approach does not rely on additional parametric assumptions or auxiliary variables. Building on the identification result, we develop a new method that estimates the causal effect using extreme quantile regression. We establish the consistency of the proposed extreme-based estimator under a general additive structural equation and demonstrate its asymptotic normality in the linear model setting. Simulations and data analysis of an automobile sale dataset show the effectiveness of our method in handling endogeneity.
We examine a special case of the multilevel factor model, with covariance given by a multilevel low rank (MLR) matrix \cite{parshakova2023factor}. We develop a novel, fast implementation of the expectation-maximization algorithm, tailored for multilevel factor models, to maximize the likelihood of the observed data. This method accommodates any hierarchical structure and maintains linear time and storage complexities per iteration. This is achieved through a new efficient technique for computing the inverse of a positive definite MLR matrix. We show that the inverse of a positive definite MLR matrix is also an MLR matrix with the same sparsity in factors, and we use the recursive Sherman-Morrison-Woodbury matrix identity to obtain the factors of the inverse. Additionally, we present an algorithm that computes the Cholesky factorization of an expanded matrix with linear time and space complexities, yielding the covariance matrix as its Schur complement. This paper is accompanied by an open-source package that implements the proposed methods.
This paper investigates decision-making in A/B experiments for online platforms and marketplaces. In such settings, due to constraints on inventory, A/B experiments typically lead to biased estimators because of *interference* between treatment and control groups; this phenomenon has been well studied in recent literature. By contrast, there has been relatively little discussion of the impact of interference on decision-making. In this paper, we analyze a benchmark Markovian model of an inventory-constrained platform, where arriving customers book listings that are limited in supply. We focus on the commonly used frequentist hypothesis testing approach for making launch decisions based on data from customer-randomized experiments, and we study the impact of interference on (1) false positive probability and (2) statistical power. We obtain three main findings. First, we show that for *sign-consistent* treatments -- i.e., those where the treatment changes booking probabilities in the same direction relative to control for all states of inventory availability -- the false positive probability of a test statistic using the standard difference-in-means estimator with a corresponding naïve variance estimator is correctly controlled. Second, we demonstrate that for sign-consistent treatments in realistic settings, the statistical power of this naïve approach is higher than that of any similar pipeline using a debiased estimator. Taken together, these two findings suggest that platforms may be better off *not* debiasing when treatments are sign-consistent. Third, using numerics, we investigate false positive probability and statistical power when treatments are sign-inconsistent, and we show that in principle, the performance of the naïve approach can be arbitrarily worse in such cases.
In this paper, we study the estimation of drift and diffusion coefficients in a two-dimensional system of N interacting particles modeled by a degenerate stochastic differential equation. We consider both complete and partial observation cases over a fixed time horizon [0, T] and propose novel contrast functions for parameter estimation. In the partial observation scenario, we tackle the challenge posed by unobserved velocities by introducing a surrogate process based on the increments of the observed positions. This requires a modified contrast function to account for the correlation between successive increments. Our analysis demonstrates that, despite the loss of Markovianity due to the velocity approximation in the partial observation case, the estimators converge to a Gaussian distribution (with a correction factor in the partial observation case). The proofs are based on Itô-like bounds and an adaptation of the Euler scheme. Additionally, we provide insights into Hörmander's condition, which helps establish hypoellipticity in our model within the framework of the stochastic calculus of variations.
This work proposes an adaptive sequential Monte Carlo sampling algorithm to solve Bayesian inverse problems in scenarios where likelihood evaluations are costly but can be approximated using a surrogate model built from previous evaluations of the true likelihood. A rough estimate of the surrogate error is required. The method relies on an adaptive SMC framework that simultaneously adjusts both the likelihood approximations and a standard tempering scheme of the target posterior distribution. This algorithm is well-suited for cases where the posterior is concentrated in a rare and unknown region of the prior. It is also suitable for solving low-temperature and rare-event simulation problems. The main contribution is an entropy criterion that relates the accuracy of the current surrogate to a maximum inverse temperature for the likelihood approximation. The latter is instrumental in sampling a so-called snapshot, on which an exact likelihood evaluation is performed and used to update the surrogate and its error quantification. Some consistency results are presented in an idealized framework for the proposed algorithm. Our numerical experiments use, in particular, a reduced basis approach to construct approximate parametric solutions of a partially observed elliptic partial differential equation. They demonstrate the convergence of the algorithm and show a significant cost reduction (close to a factor of $10$) for comparable accuracy.
The present manuscript is concerned with the component-wise estimation of positive powers of the order-restricted standard deviations of two normal populations, with certain restrictions on the means. We propose several improved estimators under a general scale-invariant bowl-shaped loss function. We also propose a class of improved estimators and show that the boundary estimator of this class is generalized Bayes. As an application, improved estimators are obtained with respect to the quadratic loss, the entropy loss, and a symmetric loss function. We conduct extensive Monte Carlo simulations to study and compare the risk performance of the proposed estimators. Finally, a real-life data analysis is given to illustrate our findings.
Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using non-Gaussian likelihoods, a development that benefits a variety of spatial modeling tasks. Difficulties still arise, however, when we can only access summarized data consisting of representative features, summary statistics, and data point counts. Such situations frequently occur, primarily due to concerns about confidentiality and the management costs associated with spatial data. This study tackles learning and inference using only summarized data within the framework of Gaussian process regression. To address this challenge, we analyze the approximation errors in the marginal likelihood and posterior distribution that arise from utilizing representative features. We also introduce the concept of sample quasi-likelihood, which facilitates learning and inference using only summarized data. Non-Gaussian likelihoods satisfying certain assumptions can be captured by specifying a variance function that characterizes a sample quasi-likelihood function. Theoretical and experimental results demonstrate that the approximation performance is influenced by the granularity of summarized data relative to the length scale of covariance functions. Experiments on a real-world dataset highlight the practicality of our method for spatial modeling.
We propose a novel Bayesian wavelet regression approach using a three-component spike-and-slab prior for wavelet coefficients, combining a point mass at zero, a moment (MOM) prior, and an inverse moment (IMOM) prior. This flexible prior supports small and large coefficients differently, offering advantages for highly dispersed data where wavelet coefficients span multiple scales. The IMOM prior's heavy tails capture large coefficients, while the MOM prior is better suited for smaller non-zero coefficients. Further, our method introduces innovative hyperparameter specifications for mixture probabilities and scale parameters, including generalized logit, hyperbolic secant, and generalized normal decay for probabilities, and double exponential decay for scaling. Hyperparameters are estimated via an empirical Bayes approach, enabling posterior inference tailored to the data. Extensive simulations demonstrate significant performance gains over two-component wavelet methods. Applications to electroencephalography and noisy audio data illustrate the method's utility in capturing complex signal characteristics. We implement our method in an R package, NLPwavelet (>= 1.1).
Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation, causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies -- untargeted and targeted -- which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.
We introduce the Poisson Hierarchical Indian Buffet Process (PHIBP), a new class of species sampling models designed to address the challenges of complex, sparse count data by facilitating information sharing across and within groups. Our theoretical developments enable a tractable Bayesian nonparametric framework with machine learning elements, accommodating a potentially infinite number of species (taxa) whose parameters are learned from data. Focusing on microbiome analysis, we address key gaps by providing a flexible multivariate count model that accounts for overdispersion and robustly handles diverse data types (OTUs, ASVs). We introduce novel parameters reflecting species abundance and diversity. The model borrows strength across groups while explicitly distinguishing between technical and biological zeros to interpret sparse co-occurrence patterns. This results in a framework with tractable posterior inference, exact generative sampling, and a principled solution to the unseen species problem. We describe extensions where domain experts can incorporate knowledge through covariates and structured priors, with potential for strain-level analysis. While motivated by ecology, our work provides a broadly applicable methodology for hierarchical count modeling in genetics, commerce, and text analysis, and has significant implications for the broader theory of species sampling models arising in probability and statistics.
We study the fundamental problem of offline assortment optimization under the Multinomial Logit (MNL) model, where sellers must determine the optimal subset of the products to offer based solely on historical customer choice data. While most existing approaches to learning-based assortment optimization focus on the online learning of the optimal assortment through repeated interactions with customers, such exploration can be costly or even impractical in many real-world settings. In this paper, we consider the offline learning paradigm and investigate the minimal data requirements for efficient offline assortment optimization. To this end, we introduce Pessimistic Rank-Breaking (PRB), an algorithm that combines rank-breaking with pessimistic estimation. We prove that PRB is nearly minimax optimal by establishing the tight suboptimality upper bound and a nearly matching lower bound. This further shows that "optimal item coverage" - where each item in the optimal assortment appears sufficiently often in the historical data - is both sufficient and necessary for efficient offline learning. This significantly relaxes the previous requirement of observing the complete optimal assortment in the data. Our results provide fundamental insights into the data requirements for offline assortment optimization under the MNL model.
Spatio-temporal point processes (STPPs) model discrete events distributed in time and space, with important applications in areas such as criminology, seismology, epidemiology, and social networks. Traditional models often rely on parametric kernels, limiting their ability to capture heterogeneous, nonstationary dynamics. Recent innovations integrate deep neural architectures -- either by modeling the conditional intensity function directly or by learning flexible, data-driven influence kernels, substantially broadening their expressive power. This article reviews the development of the deep influence kernel approach, which enjoys statistical explainability, since the influence kernel remains in the model to capture the spatiotemporal propagation of event influence and its impact on future events, while also possessing strong expressive power, thereby benefiting from both worlds. We explain the main components in developing deep kernel point processes, leveraging tools such as functional basis decomposition and graph neural networks to encode complex spatial or network structures, as well as estimation using both likelihood-based and likelihood-free methods, and address computational scalability for large-scale data. We also discuss the theoretical foundation of kernel identifiability. Simulated and real-data examples highlight applications to crime analysis, earthquake aftershock prediction, and sepsis prediction modeling, and we conclude by discussing promising directions for the field.
Overestimation of turnout has long been an issue in election surveys, with nonresponse bias or voter overrepresentation regarded as one of the major sources of bias. However, the adjustment for nonignorable nonresponse bias is substantially challenging. Based on the ANES Non-Response Follow-Up Study concerning the 2020 U.S. presidential election, we investigate the role of callback data in adjusting for nonresponse bias in the estimation of turnout. Callback data are the records of contact attempts in the survey course, available in many modern large-scale surveys. We propose a stableness of resistance assumption to account for the nonignorable missingness in the outcome, which states that the impact of the missing outcome on the response propensity is stable in the first two call attempts. Under this assumption and by leveraging covariates information from the census data, we establish the identifiability and develop estimation methods for turnout, including a doubly robust estimator. Our methods produce estimates very close to the official turnout and successfully capture the trend of declining willingness to vote as response reluctance increases. This work hints at the importance of adjusting for nonignorable nonresponse bias and exhibits the promise of callback data for political surveys.
We propose a new measure of dependence that quantifies the degree to which a random variable $Y$ depends on a random vector $X$. This measure is zero if and only if $Y$ and $X$ are independent, and equals one if and only if $Y$ is a measurable function of $X$. We introduce a simple and interpretable estimator that is comparable in ease of computation to classical correlation coefficients such as Pearson's, Spearman's, or Chatterjee's. Building on this coefficient, we develop a model-free variable selection algorithm, feature ordering by dependence (FORD), inspired by FOCI. FORD requires no tuning parameters and is provably consistent under suitable sparsity assumptions. We demonstrate its effectiveness and improvements over FOCI through experiments on both synthetic and real datasets.
Classical estimators, the cornerstones of statistical inference, face insurmountable challenges when applied to important emerging classes of Archimedean copulas. These models exhibit pathological properties, including numerically unstable densities, non-monotonic parameter-to-dependence mappings, and vanishingly small likelihood gradients, rendering methods like Maximum Likelihood (MLE) and Method of Moments (MoM) inconsistent or computationally infeasible. We introduce IGNIS, a unified neural estimation framework that sidesteps these barriers by learning a direct, robust mapping from data-driven dependency measures to the underlying copula parameter $\theta$. IGNIS utilizes a multi-input architecture and a theory-guided output layer ($\mathrm{softplus}(z) + 1$) to automatically enforce the domain constraint $\hat{\theta} \ge 1$. Trained and validated on four families (Gumbel, Joe, and the numerically challenging A1/A2), IGNIS delivers accurate and stable estimates for real-world financial and health datasets, demonstrating its necessity for reliable inference in modern, complex dependence models where traditional methods fail.
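The theory-guided output layer is simple to state in code; here is a minimal PyTorch sketch (the surrounding multi-input architecture over dependency measures is not reproduced):

```python
import torch.nn as nn

class ThetaHead(nn.Module):
    """Output layer from the abstract: softplus(z) + 1 guarantees the
    copula-parameter estimate satisfies theta_hat >= 1."""
    def __init__(self, in_features):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)
        self.softplus = nn.Softplus()

    def forward(self, h):
        return self.softplus(self.linear(h)) + 1.0
```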
Fairness is a growing area of machine learning (ML) that focuses on ensuring models do not produce systematically biased outcomes for specific groups, particularly those defined by protected attributes such as race, gender, or age. Evaluating fairness is a critical aspect of ML model development, as biased models can perpetuate structural inequalities. The {fairmetrics} R package offers a user-friendly framework for rigorously evaluating numerous group-based fairness criteria, including metrics based on independence (e.g., statistical parity), separation (e.g., equalized odds), and sufficiency (e.g., predictive parity). Group-based fairness criteria assess whether a model is equally accurate or well-calibrated across a set of predefined groups so that appropriate bias mitigation strategies can be implemented. {fairmetrics} provides both point and interval estimates for multiple metrics through a convenient wrapper function and includes an example dataset derived from the Medical Information Mart for Intensive Care, version II (MIMIC-II) database (Goldberger et al., 2000; Raffa, 2016).
Graphs are a powerful data structure for representing relational data and are widely used to describe complex real-world systems. Probabilistic Graphical Models (PGMs) and Graph Neural Networks (GNNs) can both leverage graph-structured data, but their inherent functioning is different. The question is: how do they compare in capturing the information contained in networked datasets? We address this question by solving a link prediction task, conducting three main experiments on both synthetic and real networks: one focuses on how PGMs and GNNs handle input features, while the other two investigate their robustness to noisy features and to increasing heterophily of the graph. PGMs do not necessarily require features on nodes, while GNNs cannot exploit the network edges alone, and the choice of input features matters. We find that GNNs are outperformed by PGMs when input features are low-dimensional or noisy, mimicking many real scenarios where node attributes might be scalar or noisy. We also find that PGMs are more robust than GNNs when the heterophily of the graph is increased. Finally, to assess performance beyond prediction tasks, we compare the two frameworks in terms of their computational complexity and interpretability.
Recently, a Wasserstein analogue of the Cramer--Rao inequality has been developed using the Wasserstein information matrix (Otto metric). This inequality provides a lower bound on the Wasserstein variance of an estimator, which quantifies its robustness against additive noise. In this study, we investigate conditions for an estimator to attain the Wasserstein--Cramer--Rao lower bound (asymptotically), which we call the (asymptotic) Wasserstein efficiency. We show a condition under which Wasserstein efficient estimators exist for one-parameter statistical models. This condition corresponds to a recently proposed Wasserstein analogue of one-parameter exponential families (e-geodesics). We also show that the Wasserstein estimator, a Wasserstein analogue of the maximum likelihood estimator based on the Wasserstein score function, is asymptotically Wasserstein efficient in location-scale families.
Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization. We extend these guarantees to affine subspace estimation, a setting that lacks prior recovery theory. Additionally, we illustrate the practical benefits of IRLS through an application to low-dimensional neural network training. Our results provide the first global convergence guarantees for IRLS in robust subspace recovery and, more broadly, for nonconvex IRLS on a Riemannian manifold.
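For readers unfamiliar with IRLS in this setting, a standard fixed-smoothing sketch for robust subspace recovery follows; the analyzed variant uses dynamically decreasing smoothing regularization, which is not reproduced here.

```python
import numpy as np

def irls_subspace(X, d, n_iter=50, eps=1e-8):
    # Alternate between (i) a weighted PCA step and (ii) reweighting each
    # point by the inverse of its distance to the current subspace, which
    # targets a least-absolute-deviation subspace fit.
    n, D = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        C = (X * w[:, None]).T @ X              # weighted scatter matrix
        _, eigvecs = np.linalg.eigh(C)
        U = eigvecs[:, -d:]                     # current orthonormal basis
        resid = np.linalg.norm(X - (X @ U) @ U.T, axis=1)
        w = 1.0 / np.maximum(resid, eps)        # fixed smoothing at eps
    return U
```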
Markov-switching models are a powerful tool for modelling time series data that are driven by underlying latent states. As such, they are widely used in behavioural ecology, where discrete states can serve as proxies for behavioural modes and enable inference on the latent behaviour driving, e.g., observed movement. To understand drivers of behavioural changes, it is common to link model parameters to covariates. Over the last decade, nonparametric approaches have gained traction in this context to avoid unrealistic parametric assumptions. Nonetheless, existing methods are largely limited to univariate smooth functions of covariates, based on penalised splines, while real processes are typically complex, requiring consideration of interaction effects. We address this gap by incorporating tensor-product interactions into Markov-switching models, enabling flexible modelling of multidimensional effects in a computationally efficient manner. Based on the extended Fellner-Schall method, we develop an efficient automatic smoothness selection procedure that is robust and scales well with the number of smooth functions in the model. The method builds on a random effects view of the spline coefficients and yields a recursive penalised likelihood procedure. As special cases, this general framework accommodates bivariate smoothing, function-valued random effects, and space-time interactions. We demonstrate its practical utility through three ecological case studies of an African elephant, common fruitflies, and Arctic muskoxen. The methodology is implemented in the LaMa R package, providing applied ecologists with an accessible and flexible tool for semiparametric inference in hidden-state models. The approach has the potential to drastically improve the level of detail in inference, allowing one to fit HMMs with hundreds of parameters and 10-20 (potentially bivariate) smooths to thousands of observations.
In this paper, we derive an analytic formula for the ROC curves of LDA classifiers. We establish elementary properties of these curves (monotonicity and concavity), provide a formula for the area under the curve (AUC), and compute the Youden J-index. Finally, we illustrate the performance of our results on a real-life dataset of Wisconsin breast cancer patients.
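One standard closed form consistent with this setting (two Gaussian classes with a common covariance, so the LDA score is normal with equal variance under each class): writing $\Delta$ for the Mahalanobis distance between the class means and $\Phi$ for the standard normal CDF,

$$\mathrm{ROC}(t) = \Phi\bigl(\Delta + \Phi^{-1}(t)\bigr), \qquad \mathrm{AUC} = \Phi\bigl(\Delta/\sqrt{2}\bigr),$$

from which monotonicity and concavity are immediate; the paper's exact assumptions and formulas may differ.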
High-dimensional Bayesian Optimization (BO) has attracted significant attention in recent research. However, existing methods have mainly focused on optimizing in continuous domains, while combinatorial (ordinal and categorical) and mixed domains still remain challenging. In this paper, we first propose MOCA-HESP, a novel high-dimensional BO method for combinatorial and mixed variables. The key idea is to leverage the hyper-ellipsoid space partitioning (HESP) technique with different categorical encoders to work with high-dimensional, combinatorial and mixed spaces, while adaptively selecting the optimal encoders for HESP using a multi-armed bandit technique. Our method, MOCA-HESP, is designed as a \textit{meta-algorithm} such that it can incorporate other combinatorial and mixed BO optimizers to further enhance the optimizers' performance. Finally, we develop three practical BO methods by integrating MOCA-HESP with state-of-the-art BO optimizers for combinatorial and mixed variables: standard BO, CASMOPOLITAN, and Bounce. Our experimental results on various synthetic and real-world benchmarks show that our methods outperform existing baselines. Our code implementation can be found at this https URL.
We propose a Bayesian forecast combination framework that, for the first time, embeds forward-looking signals, formulated as predictive priors, directly into the time-varying weight-updating process. This approach enables weights to adapt using both historical forecast performance and anticipated future model behavior. We implement the framework with model diversity as the forward-looking signal, yielding the diversity-driven time-varying weights (DTVW) method. Compared with the standard time-varying weights (TVW) approach, DTVW embeds diversity-driven predictive priors that penalize redundancy and encourage informative contributions across constituent models. Simulation experiments, covering both a simple complete model set and a complex misspecified environment, show that DTVW improves forecast accuracy by dynamically focusing on well-performing models. Empirical applications to multi-step-ahead oil price forecasts and bivariate forecasts of U.S. inflation and GDP growth confirm its superiority over benchmarks including Equal weighting, Bayesian Model Averaging, and standard TVW. Beyond accuracy gains, diversity-based predictive priors provide diagnostic insights into model incompleteness and forecast uncertainty, making DTVW both more adaptive and more informative than existing Bayesian combination methods.
Linear birth-and-death processes (LBDPs) are foundational stochastic models in population dynamics, evolutionary biology, and hematopoiesis. Estimating parameters from discretely observed data is computationally demanding due to irregular sampling, noise, and missing values. We propose a novel approximate maximum likelihood estimator (MLE) for LBDPs based on a Gaussian approximation to transition probabilities. The approach transforms estimation into a univariate optimization problem, achieving substantial computational gains without sacrificing accuracy. Through simulations, we show that the approximate MLE outperforms Gaussian and saddlepoint-based estimators in speed and precision under realistic noise and sparsity. Applied to longitudinal clonal hematopoiesis data, the method produces biologically meaningful growth estimates even with noisy, compositional input. Unlike Gaussian and saddlepoint approximations, our estimator is invariant to data scaling, making it ideal for real-world applications such as variant allele frequency analyses.
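A generic two-parameter sketch of such a Gaussian-approximate likelihood, using the standard first two conditional moments of an LBDP (valid for $\lambda \neq \mu$), is given below; the paper's reduction to a univariate optimization is not reproduced.

```python
import numpy as np
from scipy.stats import norm

def lbdp_gaussian_nll(params, times, counts):
    # Gaussian approximation to the transition density of a linear
    # birth-death process, using the standard conditional moments:
    #   E[X_{t+dt} | X_t = x]   = x * exp(r * dt),  r = lam - mu
    #   Var[X_{t+dt} | X_t = x] = x * (lam + mu) / r * exp(r*dt) * (exp(r*dt) - 1)
    lam, mu = params
    if lam <= 0 or mu <= 0 or np.isclose(lam, mu):
        return np.inf
    r = lam - mu
    nll = 0.0
    for (t0, x0), (t1, x1) in zip(zip(times, counts), zip(times[1:], counts[1:])):
        dt = t1 - t0
        m = x0 * np.exp(r * dt)
        v = x0 * (lam + mu) / r * np.exp(r * dt) * (np.exp(r * dt) - 1.0)
        if v <= 0:
            return np.inf
        nll -= norm.logpdf(x1, loc=m, scale=np.sqrt(v))
    return nll

# e.g. scipy.optimize.minimize(lbdp_gaussian_nll, [1.2, 1.0],
#                              args=(times, counts), method="Nelder-Mead")
```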
As a demonstration of concept, quantile gradient boosting is used to forecast diurnal and nocturnal Q(.90) air temperatures for Paris, France during the late spring and summer months of 2020. The data are provided by the Paris-Montsouris weather station. Q(.90) values are estimated because the 90th percentile targets temperatures that are relatively rare and extreme. Predictors include seven routinely collected indicators of weather conditions, lagged by 14 days; the temperature forecasts are produced two weeks in advance. Conformal prediction regions capture forecasting uncertainty with provably valid properties. For both diurnal and nocturnal temperatures, forecasting accuracy is promising, and sound measures of uncertainty are provided.
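A minimal sketch of this kind of pipeline on stand-in data (not the Paris-Montsouris series): fit a 0.90-quantile gradient boosting model, then widen it with a one-sided split-conformal correction so the forecast bound has finite-sample validity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = np.random.rand(500, 7), np.random.rand(500)   # stand-in data
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

# Gradient boosting targeting the conditional 0.90 quantile
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.90).fit(X_tr, y_tr)

# One-sided split-conformal correction on a held-out calibration set
scores = y_cal - q90.predict(X_cal)
k = int(np.ceil(0.90 * (len(y_cal) + 1))) - 1
correction = np.sort(scores)[k]
upper_bound = q90.predict(X) + correction            # calibrated Q(.90) forecasts
```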
Unexpected advertising items in sponsored search may reduce users' reliance on organic search, resulting in hidden cost for the e-commerce platform. To address this problem and promote sustainable growth, we propose a dynamic reserve price design that incorporates the hidden cost into the auction mechanism to determine whether to sell the traffic, thereby ensuring a balanced relationship between revenue and user experience. Our dynamic reserve price design framework optimizes traffic sales by minimizing impacts on user experience while maintaining long-term incentives for advertisers to reveal their valuations truthfully. Furthermore, we introduce a distributed algorithm capable of computing reserve prices with billion-scale data in the production environment. Experiments involving offline evaluations and online A/B testing demonstrate that this method is simple and efficient, making it suitable for use in industrial production. This method has already been fully deployed in the production environment.
In this paper, we present a flexible and probabilistic framework for tracking topological features in time-varying scalar fields using merge trees and partial optimal transport. Merge trees are topological descriptors that record the evolution of connected components in the sublevel sets of scalar fields. We present a new technique for modeling and comparing merge trees using tools from partial optimal transport. In particular, we model a merge tree as a measure network, that is, a network equipped with a probability distribution, and define a notion of distance on the space of merge trees inspired by partial optimal transport. Such a distance offers a new and flexible perspective for encoding intrinsic and extrinsic information in the comparative measures of merge trees. More importantly, it gives rise to a partial matching between topological features in time-varying data, thus enabling flexible topology tracking for scientific simulations. Furthermore, such partial matching may be interpreted as probabilistic coupling between features at adjacent time steps, which gives rise to probabilistic tracking graphs. We derive a stability result for our distance and provide numerous experiments indicating the efficacy of our framework in extracting meaningful feature tracks.
In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the $k$-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level $k$ and scale poorly to high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the $k$-sparse mean without knowing $k$ in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees. At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an $\ell_1$-loss can incrementally learn the $k$ nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.
Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered on robust Markov decision processes (RMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct RMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the structure of information availability (covering history-dependent, Markov, and Markov time-homogeneous dynamics) as well as constraints on the shifts induced by the adversary, with a focus on SA- and S-rectangularity. Within this RMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of the DPP has significant implications, as the vast majority of existing statistically and computationally efficient DRRL algorithms rely on it. To investigate its existence, we systematically analyze various combinations of controller and adversary attributes, presenting streamlined proofs based on a unified methodology. We then construct counterexamples for settings in which a fully general DPP fails to hold and establish asymptotically optimal history-dependent policies for key scenarios where the DPP is absent.
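For orientation, when the DPP holds under SA-rectangularity it reduces to a one-step robust Bellman equation of the standard form (notation assumed here: $\gamma$ the discount factor, $\mathcal{P}_{s,a}$ the adversary's ambiguity set at the state-action pair $(s,a)$):
\[
V(s) \;=\; \max_{a \in \mathcal{A}} \; \inf_{p \,\in\, \mathcal{P}_{s,a}} \Big\{ r(s,a) + \gamma \sum_{s'} p(s')\, V(s') \Big\},
\]
so the adversary may be resolved independently at each $(s,a)$; the counterexamples concern settings where no such one-step reduction exists.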
A large amount of effort has recently been put into understanding the barren plateau phenomenon. In this perspective article, we face the increasingly loud elephant in the room and ask a question that has been hinted at by many but not explicitly addressed: Can the structure that allows one to avoid barren plateaus also be leveraged to efficiently simulate the loss classically? We collect evidence, on a case-by-case basis, that many commonly used models whose loss landscapes avoid barren plateaus can also admit classical simulation, provided that one can collect some classical data from quantum devices during an initial data acquisition phase. This follows from the observation that barren plateaus result from a curse of dimensionality, and that current approaches for avoiding them end up encoding the problem into small, classically simulable subspaces. Thus, while stressing that quantum computers can be essential for collecting data, our analysis casts doubt on the information processing capabilities of many parametrized quantum circuits with provably barren-plateau-free landscapes. We end by discussing the (many) caveats in our arguments, including the limitations of average-case arguments, the role of smart initializations, models that fall outside our assumptions, the potential for provably superpolynomial advantages, and the possibility that, once larger devices become available, parametrized quantum circuits could heuristically outperform our analytic expectations.
Recent work in deep learning has shown strong empirical and theoretical evidence of an implicit low-rank bias: weight matrices in deep networks tend to be approximately low-rank. Moreover, removing relatively small singular values during training, or from available trained models, may significantly reduce model size while maintaining or even improving performance. However, most theoretical investigations of low-rank bias in neural networks deal with oversimplified models, often neglecting the impact of nonlinearity. In this work, we first quantify a link between the phenomenon of deep neural collapse and the emergence of low-rank weight matrices for a general class of feedforward networks with nonlinear activation. In addition, for the general class of nonlinear feedforward and residual networks, we prove the global optimality of deep neural collapsed configurations and the practical absence of a loss barrier between interpolating minima and globally optimal points, offering a possible explanation for the common occurrence of neural collapse. As a byproduct, our theory also allows us to forecast the final global structure of singular values before training. Our theoretical findings are supported by a range of experimental evaluations illustrating the phenomenon.
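The pruning step mentioned above is easy to state concretely. The following sketch (the energy threshold and matrix shapes are illustrative assumptions, not taken from the paper) truncates a weight matrix to the smallest rank capturing a target fraction of its squared spectrum.

```python
# Sketch of the rank-truncation step: drop small singular values of a trained
# weight matrix and report the retained rank and relative reconstruction error.
import numpy as np

def truncate_rank(W, energy=0.99):
    """Keep the smallest rank capturing `energy` of the squared spectrum."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy) + 1
    return (U[:, :keep] * s[:keep]) @ Vt[:keep], keep

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 64)) @ rng.standard_normal((64, 256)) / 64
W_low, r = truncate_rank(W + 0.01 * rng.standard_normal((256, 256)))
print(r, np.linalg.norm(W - W_low) / np.linalg.norm(W))  # low rank, small error
```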
In the realm of multi-arm bandit problems, the Gittins index policy is known to be optimal in maximizing the expected total discounted reward obtained from pulling the Markovian arms. In most realistic scenarios, however, the Markovian state transition probabilities are unknown, and therefore the Gittins indices cannot be computed. One can then resort to reinforcement learning (RL) algorithms that explore the state space to learn these indices while exploiting to maximize the reward collected. In this work, we propose tabular (QGI) and deep RL (DGN) algorithms for learning the Gittins index, based on the retirement formulation of the multi-arm bandit problem. Compared with existing RL algorithms that learn the Gittins index, our algorithms have a lower run time, require less storage space (a small Q-table in QGI and a smaller replay buffer in DGN), and exhibit better empirical convergence to the Gittins index. This makes our algorithms well suited to problems with large state spaces and a viable alternative to existing methods. As a key application, we demonstrate their use in minimizing the mean flowtime in a job scheduling problem where jobs arrive in batches and have an unknown service time distribution.
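To fix ideas about the retirement formulation (in a simplified planning setting with a known toy chain, not the paper's QGI/DGN learning rules): for a retirement value $M$, the arm's value satisfies $V_M(s) = \max\{M,\ r(s) + \beta \sum_{s'} P(s' \mid s)\, V_M(s')\}$, and the index of state $s$ is the $M$ at which continuing and retiring tie (up to the $(1-\beta)$ normalization convention). The sketch below computes this indifference point by value iteration plus bisection.

```python
# Hedged sketch: Gittins index of a single Markovian arm via the retirement
# formulation, assuming a known transition matrix P and reward vector r.
import numpy as np

def continuation_value(P, r, beta, M, iters=2000):
    V = np.zeros(len(r))
    for _ in range(iters):
        V = np.maximum(M, r + beta * P @ V)   # retire (M) or continue
    return V

def gittins_retirement_value(P, r, beta, s, lo=0.0, hi=100.0, tol=1e-6):
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        V = continuation_value(P, r, beta, M)
        cont = r[s] + beta * P[s] @ V          # value of pulling once more
        lo, hi = (M, hi) if cont > M else (lo, M)
    return 0.5 * (lo + hi)

P = np.array([[0.7, 0.3], [0.4, 0.6]]); r = np.array([1.0, 2.0])
print(gittins_retirement_value(P, r, beta=0.9, s=0))
```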
Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. Ideally, the inferred dynamics should align with the true ones. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. One widely used method, $\textit{co-smoothing}$, involves jointly estimating latent variables and predicting observations along held-out channels to assess model performance. In this study, we reveal the limitations of the co-smoothing prediction framework and propose a remedy. Using a student-teacher setup, we demonstrate that models with high co-smoothing can have arbitrary extraneous dynamics in their latent representations. To address this, we introduce a secondary metric, $\textit{few-shot co-smoothing}$, which performs regression from the latent variables to held-out neurons in the data using only a small number of trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in few-shot co-smoothing compared to `minimal' models that are devoid of such dynamics. We provide analytical insights into the origin of this phenomenon and further validate our findings on four standard neural datasets using a state-of-the-art method, STNDT. In the absence of ground truth, we also suggest a novel measure to validate our approach: by cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics, and we find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.
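A minimal sketch of the few-shot idea: map inferred latents to held-out neurons with a regression fitted on only a handful of trials, then score predictions on the rest. The array shapes, the choice of ridge regression, and the $R^2$-style score are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch of few-shot co-smoothing: regress from latents to held-out neurons
# using k_shot trials; models with extraneous latent dynamics should score
# worse in this regime.
import numpy as np
from sklearn.linear_model import Ridge

def few_shot_cosmooth(Z, Y_heldout, k_shot=5, seed=0):
    # Z: (trials, time, latent_dim); Y_heldout: (trials, time, neurons)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    train, test = idx[:k_shot], idx[k_shot:]
    flat = lambda A, sel: A[sel].reshape(-1, A.shape[-1])
    reg = Ridge(alpha=1.0).fit(flat(Z, train), flat(Y_heldout, train))
    resid = reg.predict(flat(Z, test)) - flat(Y_heldout, test)
    return 1 - resid.var() / flat(Y_heldout, test).var()   # R^2-style score

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 100, 8))
Y = Z @ rng.standard_normal((8, 20)) + 0.1 * rng.standard_normal((50, 100, 20))
print(few_shot_cosmooth(Z, Y))
```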
The geotechnical evaluation of seabed sediments is important for engineering projects and naval applications, offering valuable insights into sediment properties, behavior, and strength. Obtaining high-quality seabed samples can be a challenging task, making in situ testing an essential part of site characterization. Free-fall penetrometers (FFPs) are robust tools for rapidly profiling seabed surface sediments, even in energetic nearshore or estuarine conditions and in both shallow and deep water. Although methods for interpreting traditional offshore cone penetration testing (CPT) data are well established, their adaptation to FFP data is still an area of research. This study introduces an innovative approach that uses machine learning algorithms to create a sediment behavior classification system based on portable free-fall penetrometer (PFFP) data. The proposed model leverages PFFP measurements obtained from multiple locations, including Sequim Bay (Washington), the Potomac River, and the York River (Virginia). The results show 91.1% accuracy in class prediction, with the classes representing cohesionless sediment with little to no plasticity (Class 1), cohesionless sediment with some plasticity (Class 2), cohesive sediment with low plasticity (Class 3), and cohesive sediment with high plasticity (Class 4). The model not only predicts classes but also yields an estimate of the inherent uncertainty associated with each prediction, which can provide valuable insight into different sediment behaviors. Lower uncertainties are more common, but uncertainty can increase significantly with variations in sediment composition, environmental conditions, and operational techniques. By quantifying uncertainty, the model offers a more comprehensive and informed approach to sediment classification.
Neural posterior estimation (NPE), a type of amortized variational inference, is a computationally efficient means of constructing probabilistic catalogs of light sources from astronomical images. To date, NPE has not been used to perform inference in models with spatially varying covariates. However, ground-based astronomical images have spatially varying sky backgrounds and point spread functions (PSFs), and accounting for this variation is essential for constructing accurate catalogs of imaged light sources. In this work, we introduce a method of performing NPE with spatially varying backgrounds and PSFs. In this method, we generate synthetic catalogs and semi-synthetic images for these catalogs using randomly sampled PSF and background estimates from existing surveys. Using this data, we train a neural network, which takes an astronomical image and representations of its background and PSF as input, to output a probabilistic catalog. Our experiments with Sloan Digital Sky Survey data demonstrate the effectiveness of NPE in the presence of spatially varying backgrounds and PSFs for light source detection, star/galaxy separation, and flux measurement.
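The conditioning idea can be sketched compactly: the amortized network consumes the image jointly with its background map and a PSF summary, so the learned posterior varies with both. The architecture, channel layout, and the scalar FWHM summary below are illustrative assumptions, not the paper's network.

```python
# Sketch of NPE conditioning on spatially varying nuisance inputs: the
# encoder sees the image, a per-pixel background map, and a broadcast PSF
# width, and outputs parameters of a probabilistic catalog.
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    def __init__(self, n_out=4):
        super().__init__()
        # input channels: image, background map, broadcast PSF width
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, n_out),  # e.g. mean/log-sd of flux and position
        )

    def forward(self, image, background, psf_fwhm):
        psf_map = psf_fwhm.view(-1, 1, 1, 1).expand_as(image)
        x = torch.cat([image, background, psf_map], dim=1)
        return self.net(x)

enc = ConditionedEncoder()
out = enc(torch.randn(2, 1, 32, 32), torch.rand(2, 1, 32, 32),
          torch.tensor([1.3, 2.1]))
print(out.shape)  # (2, 4)
```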
Laplacian-based methods are popular for dimensionality reduction of data lying in $\mathbb{R}^N$. Several theoretical results for these algorithms depend on the fact that the Euclidean distance locally approximates the geodesic distance on the underlying submanifold on which the data are assumed to lie. However, for some applications, other metrics, such as the Wasserstein distance, may provide a more appropriate notion of distance than the Euclidean distance. We provide a framework that generalizes the problem of manifold learning to metric spaces and study when a metric satisfies sufficient conditions for the pointwise convergence of the graph Laplacian.
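The generalization is natural at the level of construction: the graph Laplacian only needs a pairwise distance matrix, whatever metric produced it. A minimal sketch with a Gaussian kernel (kernel choice and bandwidth are illustrative assumptions):

```python
# Sketch: build an unnormalized graph Laplacian from any pairwise distance
# matrix D (Euclidean, Wasserstein, ...) via a Gaussian kernel.
import numpy as np

def graph_laplacian(D, eps):
    W = np.exp(-D**2 / eps)        # kernel weights from the chosen metric
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

# Any metric works; here, ordinary Euclidean distances on random points.
X = np.random.default_rng(2).standard_normal((100, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
L = graph_laplacian(D, eps=0.5)
print(np.linalg.eigvalsh(L)[:4])   # small eigenpairs drive the embedding
```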
Tabular data sets with varying missing values are typically prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns about computational complexity, data quality, and data-driven outcomes. To address these concerns, this article proposes a no-imputation incremental attention learning (NIAL) method for tabular data. A pair of attention masks is derived and retrofitted to a transformer so that it can process tabular data directly, without imputing or initializing missing values. The proposed method incrementally learns partitions of overlapping, fixed-size feature sets to enhance the efficiency and performance of the transformer. The average classification performance rank across 15 diverse tabular data sets highlights the superiority of NIAL over 11 state-of-the-art learning methods with or without missing value imputation. Further experiments substantiate the robustness of NIAL against varying missing value types and rates compared to methods involving imputation. Our analysis reveals that a feature partition size of half the original feature space is the best choice for the proposed incremental learning, both computationally and in terms of accuracy. The proposed method is among the first solutions to enable deep attention learning on tabular data without requiring missing-value imputation.
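One way such a mask can work, sketched below as an illustrative guess at the idea rather than NIAL's exact architecture: observed features attend only to observed features, so missing entries never need to be imputed or initialized.

```python
# Sketch of missingness-derived attention masking for tabular features.
import torch

def masked_attention(q, k, v, observed):
    # q, k, v: (batch, features, dim); observed: (batch, features) bool
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = observed[:, None, :] & observed[:, :, None]   # pairwise observed
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)   # all-missing rows softmax to NaN -> 0
    return attn @ v

x = torch.randn(2, 6, 8)
obs = torch.rand(2, 6) > 0.3        # random missingness pattern
print(masked_attention(x, x, x, obs).shape)  # (2, 6, 8)
```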
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but also continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. However, existing approaches are restricted to monitoring limited hypothesis classes or ``alarm criteria'' (e.g., detecting data shifts that violate certain exchangeability or IID assumptions), do not allow for online adaptation in response to shifts, and/or cannot diagnose the cause of degradation or alarm. In this paper, we address these limitations by proposing a weighted generalization of conformal test martingales (WCTMs), which lays a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution), quickly detect harmful shifts, and diagnose those harmful shifts as either concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot easily be adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
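As background for the unweighted case being generalized: a conformal test martingale bets against the uniformity of conformal p-values and accumulates evidence multiplicatively, so a large running product signals a violation of exchangeability. A minimal sketch with a simple power bet (the betting function is an assumption; any bet integrating to one over $[0,1]$ works):

```python
# Minimal conformal test martingale: under exchangeability the p-values are
# uniform and the running product stays small; under shift it explodes.
import numpy as np

def ctm(p_values, eps=0.5):
    """Betting function f(p) = eps * p**(eps - 1); E[f(p)] = 1 under H0."""
    bets = eps * np.asarray(p_values) ** (eps - 1.0)
    return np.cumprod(bets)

rng = np.random.default_rng(3)
p_null = rng.uniform(size=200)            # exchangeable data: p ~ U(0, 1)
p_shift = rng.uniform(size=200) ** 3      # shifted data: small p-values
print(ctm(p_null)[-1], ctm(p_shift)[-1])  # alarm if the product exceeds 1/alpha
```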
Fairness through Unawareness (FtU) describes the idea that discrimination against demographic groups can be avoided by not considering group membership in decisions or predictions. This idea has long been criticized in the machine learning literature as insufficient to ensure fairness. Moreover, the use of additional features is typically thought to increase the accuracy of predictions for all groups, so that FtU is sometimes thought to be detrimental to all groups. In this paper, we show both theoretically and empirically that FtU can reduce algorithmic discrimination without necessarily reducing accuracy. We connect this insight with the literature on Model Multiplicity, to which we contribute novel theoretical and empirical results. Furthermore, we illustrate how, in a real-life application, FtU can contribute to the deployment of more equitable policies without losing efficacy. Our findings suggest that FtU is worth considering in practical applications, particularly in high-risk scenarios, and that the use of protected attributes such as gender in predictive models should be accompanied by a clear and well-founded justification.
We introduce and develop a novel particle exchange Monte Carlo method. Whereas existing methods apply to eigenfunction problems where the eigenvalue is known (e.g., integrals with respect to a Gibbs measure, which can be interpreted as corresponding to eigenvalue zero), here the focus is on problems where the eigenvalue is not known a priori. To obtain an appropriate particle exchange rule we must consider a pair of processes, with one evolving forward in time and the other backward. Applications to eigenfunction problems corresponding to quasistationary distributions and ergodic stochastic control are discussed.
We develop a unified Data Processing Inequality PAC-Bayesian framework -- abbreviated DPI-PAC-Bayesian -- for deriving generalization error bounds in the supervised learning setting. By embedding the Data Processing Inequality (DPI) into the change-of-measure technique, we obtain explicit bounds on the binary Kullback-Leibler generalization gap for both the Rényi divergence and any $f$-divergence measured between a data-independent prior distribution and an algorithm-dependent posterior distribution. We present three bounds derived under our framework using the Rényi, Hellinger-\(p\), and chi-squared divergences. Our framework also exhibits close connections with other well-known bounds. When the prior distribution is chosen to be uniform, our bounds recover the classical Occam's Razor bound and, crucially, eliminate the extraneous \(\log(2\sqrt{n})/n\) slack present in the PAC-Bayes bound, thereby achieving tighter results. The framework thus bridges the data-processing and PAC-Bayesian perspectives, providing a flexible, information-theoretic tool for constructing generalization guarantees.
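For concreteness, with a uniform prior over a finite hypothesis class $\mathcal{H}$, an Occam-type bound of the kind recovered above reads (notation assumed here: $\hat R_n$ the empirical risk, $R$ the population risk, probability at least $1-\delta$ over the $n$ samples):
\[
\mathrm{kl}\big(\hat R_n(h) \,\|\, R(h)\big) \;\le\; \frac{\log |\mathcal{H}| + \log(1/\delta)}{n},
\]
whereas the classical PAC-Bayes counterpart carries the additional \(\log(2\sqrt{n})/n\) term mentioned above.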
While modern deep networks have demonstrated remarkable versatility, their training dynamics remain poorly understood--often driven more by empirical tweaks than architectural insight. This paper investigates how internal structural choices shape the behavior of learning systems. Building on prior efforts that introduced simple architectural constraints, we explore the broader implications of structure for convergence, generalization, and adaptation. Our approach centers on a family of enriched transformation layers that incorporate constrained pathways and adaptive corrections. We analyze how these structures influence gradient flow, spectral sensitivity, and fixed-point behavior--uncovering mechanisms that contribute to training stability and representational regularity. Theoretical analysis is paired with empirical studies on synthetic and structured tasks, demonstrating improved robustness, smoother optimization, and scalable depth behavior. Rather than prescribing fixed templates, we emphasize principles of tractable design that can steer learning behavior in interpretable ways. Our findings support a growing view that architectural design is not merely a matter of performance tuning, but a critical axis for shaping learning dynamics in scalable and trustworthy neural systems.
In over-identified models, misspecification -- the norm rather than the exception -- fundamentally changes what estimators estimate. Different estimators imply different estimands, rather than different efficiency for the same target. A review of recent applications of the generalized method of moments in the American Economic Review suggests widespread acceptance of this fact: there is little formal specification testing and widespread use of estimators that would be inefficient were the model correct, including the use of "hand-selected" moments and weighting matrices. Motivated by these observations, we review and synthesize recent results on estimation under model misspecification, providing guidelines for transparent and robust empirical research. We also provide a new theoretical result, showing that Hansen's J-statistic measures, asymptotically, the range of estimates achievable at a given standard error. Given the widespread use of inefficient estimators and the resulting researcher degrees of freedom, we particularly recommend the broader reporting of J-statistics.
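For reference, the statistic in question is the standard over-identification statistic
\[
J \;=\; n\, \hat g(\hat\theta)^{\top} \hat W\, \hat g(\hat\theta),
\]
where $\hat g$ stacks the $m$ sample moments evaluated at the estimate $\hat\theta$ and $\hat W$ is the weighting matrix; under correct specification with efficient weighting, $J$ is asymptotically $\chi^2_{m-k}$ for $k$ parameters. The new result above instead reads the magnitude of $J$ under misspecification as the asymptotic range of estimates attainable at a given standard error.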