Algorithmic stability is a central concept in statistics and learning theory that measures how sensitive an algorithm's output is to small changes in the training data. Stability plays a crucial role in understanding generalization, robustness, and replicability, and a variety of stability notions have been proposed in different learning settings. However, while stability entails desirable properties, it is typically not sufficient on its own for statistical learning -- and indeed, it may be at odds with accuracy, since an algorithm that always outputs a constant function is perfectly stable but statistically meaningless. Thus, it is essential to understand the potential statistical cost of stability. In this work, we address this question by adopting a statistical decision-theoretic perspective, treating stability as a constraint in estimation. Focusing on two representative notions, worst-case stability and average-case stability, we first establish general lower bounds on the achievable estimation accuracy under each type of stability constraint. We then develop optimal stable estimators for four canonical estimation problems, including several mean estimation and regression settings. Together, these results characterize the optimal trade-offs between stability and accuracy across these tasks. Our findings formalize the intuition that average-case stability imposes a qualitatively weaker restriction than worst-case stability, and they further reveal that the gap between these two can vary substantially across different estimation problems.
We consider the problem of learning the network of mutual excitations (i.e., the dependency graph) in a non-stationary, multivariate Hawkes process. We work in a general setting where baseline rates at each node are time-varying and delay kernels are not shift-invariant. Our main results show that if the dependency graph of an $n$-variate Hawkes process is sparse (i.e., it has a maximum degree that is bounded with respect to $n$), our algorithm accurately reconstructs it from data after observing the Hawkes process for $T = \mathrm{polylog}(n)$ time, with high probability. Our algorithm is computationally efficient, and provably succeeds in learning dependencies even if only a subset of time series are observed and event times are not precisely known.
Explicit modelling of between-study heterogeneity is essential in network meta-analysis (NMA) to ensure valid inference and avoid overstating precision. While the additive random-effects (RE) model is the conventional approach, the multiplicative-effect (ME) model remains underexplored. The ME model inflates within-study variances by a common factor estimated via weighted least squares, yielding identical point estimates to a fixed-effect model while inflating confidence intervals. We empirically compared RE and ME models across NMAs of two-arm studies with significant heterogeneity from the nmadb database, assessing model fit using the Akaike Information Criterion. The ME model often provided a comparable or better fit than the RE model. Case studies further revealed that RE models are sensitive to extreme and imprecise observations, whereas ME models assign less weight to such observations and hence exhibit greater robustness to publication bias. Our results suggest that the ME model warrants consideration alongside the conventional RE model in NMA practice.
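As a hedged illustration of the multiplicative-effect idea (not the nmadb analysis itself), the sketch below takes hypothetical study-level effect estimates and within-study variances for a single pairwise comparison, computes the fixed-effect estimate, estimates the multiplicative dispersion factor as Cochran's Q over its degrees of freedom (the weighted-least-squares residual mean square), and inflates the confidence interval accordingly while leaving the point estimate unchanged.

```python
import numpy as np
from scipy import stats

# Hypothetical study-level log odds ratios and their within-study variances.
y = np.array([0.30, 0.10, 0.55, -0.05, 0.42])
v = np.array([0.04, 0.09, 0.06, 0.12, 0.05])

w = 1.0 / v                                   # inverse-variance weights
theta_fe = np.sum(w * y) / np.sum(w)          # fixed-effect (common-effect) estimate
se_fe = np.sqrt(1.0 / np.sum(w))              # its standard error

# Multiplicative-effect model: inflate within-study variances by a common factor
# phi, estimated as Cochran's Q over its degrees of freedom (the weighted
# least-squares residual mean square). The point estimate is unchanged.
Q = np.sum(w * (y - theta_fe) ** 2)
phi = max(Q / (len(y) - 1), 1.0)              # often truncated at 1 in practice
se_me = np.sqrt(phi) * se_fe

z = stats.norm.ppf(0.975)
print("FE/ME point estimate:", theta_fe)
print("FE 95% CI:", (theta_fe - z * se_fe, theta_fe + z * se_fe))
print("ME 95% CI:", (theta_fe - z * se_me, theta_fe + z * se_me))
```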
We study nonasymptotic (finite-sample) confidence intervals for treatment effects in randomized experiments. In the existing literature, nonasymptotic confidence intervals tend to have smaller effective sample sizes than the corresponding central-limit-theorem-based confidence intervals, by a factor depending on the square root of the propensity score. We show that this performance gap can be closed, designing nonasymptotic confidence intervals that have the same effective sample size as their asymptotic counterparts. Our approach involves systematic exploitation of negative dependence or variance adaptivity (or both). We also show that the nonasymptotic rates that we achieve are unimprovable in an information-theoretic sense.
Global sensitivity analysis of complex numerical simulators is often limited by the small number of model evaluations that can be afforded. In such settings, surrogate models built from a limited set of simulations can substantially reduce the computational burden, provided that the design of computer experiments is enriched efficiently. In this context, we propose an active learning approach that, for a fixed evaluation budget, targets the most informative regions of the input space to improve sensitivity analysis accuracy. More specifically, our method builds on recent advances in active learning for sensitivity analysis (Sobol' indices and derivative-based global sensitivity measures, DGSM) that exploit derivatives obtained from a Gaussian process (GP) surrogate. By leveraging the joint posterior distribution of the GP gradient, we develop acquisition functions that better account for correlations between partial derivatives and their impact on the response surface, leading to a more comprehensive and robust methodology than existing DGSM-oriented criteria. The proposed approach is first compared to state-of-the-art methods on standard benchmark functions, and is then applied to a real environmental model of pesticide transfers.
Clinical AI systems frequently suffer performance decay post-deployment due to temporal data shifts, such as evolving populations, diagnostic coding updates (e.g., ICD-9 to ICD-10), and systemic shocks like the COVID-19 pandemic. Addressing this ``aging'' effect via frequent retraining is often impractical due to computational costs and privacy constraints. To overcome these hurdles, we introduce Adversarial Drift-Aware Predictive Transfer (ADAPT), a novel framework designed to confer durability against temporal drift with minimal retraining. ADAPT innovatively constructs an uncertainty set of plausible future models by combining historical source models and limited current data. By optimizing worst-case performance over this set, it balances current accuracy with robustness against degradation due to future drifts. Crucially, ADAPT requires only summary-level model estimators from historical periods, preserving data privacy and ensuring operational simplicity. Validated on longitudinal suicide risk prediction using electronic health records from Mass General Brigham (2005--2021) and Duke University Health Systems, ADAPT demonstrated superior stability across coding transitions and pandemic-induced shifts. By minimizing annual performance decay without requiring labels for, or retraining on, future data, ADAPT offers a scalable pathway for sustaining reliable AI in high-stakes healthcare environments.
Extreme weather events are becoming more common, with severe storms, floods, and prolonged precipitation affecting communities worldwide. These shifts in climate patterns pose a direct threat to the insurance industry, which faces growing exposure to weather-related damages. As claims linked to extreme weather rise, insurance companies need reliable tools to assess future risks. This is not only essential for setting premiums and maintaining solvency but also for supporting broader disaster preparedness and resilience efforts. In this study, we propose a two-step method to examine the impact of precipitation on home insurance claims. Our approach combines the predictive power of deep neural networks with the flexibility of copula-based multivariate analysis, enabling a more detailed understanding of how precipitation patterns relate to claim dynamics. We demonstrate this methodology through a case study of the Canadian Prairies, using data from 2002 to 2011.
Semi-implicit variational inference (SIVI) enhances the expressiveness of variational families through hierarchical semi-implicit distributions, but the intractability of their densities makes standard ELBO-based optimization biased. Recent score-matching approaches to SIVI (SIVI-SM) address this issue via a minimax formulation, at the expense of an additional lower-level optimization problem. In this paper, we propose kernel semi-implicit variational inference (KSIVI), a principled and tractable alternative that eliminates the lower-level optimization by leveraging kernel methods. We show that when optimizing over a reproducing kernel Hilbert space, the lower-level problem admits an explicit solution, reducing the objective to the kernel Stein discrepancy (KSD). Exploiting the hierarchical structure of semi-implicit distributions, the resulting KSD objective can be efficiently optimized using stochastic gradient methods. We establish optimization guarantees via variance bounds on Monte Carlo gradient estimators and derive statistical generalization bounds of order $\tilde{\mathcal{O}}(1/\sqrt{n})$. We further introduce a multi-layer hierarchical extension that improves expressiveness while preserving tractability. Empirical results on synthetic and real-world Bayesian inference tasks demonstrate the effectiveness of KSIVI.
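For readers unfamiliar with the objective referenced above, the hedged sketch below evaluates a V-statistic estimate of the squared kernel Stein discrepancy with an RBF kernel for a target with a known score function; the standard-normal target and fixed bandwidth are illustrative assumptions, not the KSIVI setup.

```python
import numpy as np

def ksd_rbf(x, score, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy with an
    RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).  `score(x)` returns the
    target score grad log p evaluated row-wise on an (n, d) array."""
    n, d = x.shape
    s = score(x)                                   # (n, d) score at each sample
    diff = x[:, None, :] - x[None, :, :]           # pairwise differences (n, n, d)
    sqdist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sqdist / (2 * h ** 2))

    term1 = (s @ s.T) * k                                      # s(x)^T s(y) k(x, y)
    term2 = np.einsum('id,ijd->ij', s, diff) / h ** 2 * k      # s(x)^T grad_y k
    term3 = -np.einsum('jd,ijd->ij', s, diff) / h ** 2 * k     # s(y)^T grad_x k
    term4 = (d / h ** 2 - sqdist / h ** 4) * k                 # trace of grad_x grad_y k
    return np.mean(term1 + term2 + term3 + term4)

# Toy check: samples from the target (standard normal) give a small KSD.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
print(ksd_rbf(x, score=lambda x: -x))              # score of N(0, I) is -x
print(ksd_rbf(x + 1.0, score=lambda x: -x))        # shifted samples: larger KSD
```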
The two popular systemic risk measures CoVaR (Conditional Value-at-Risk) and CoES (Conditional Expected Shortfall) have recently been receiving growing attention on applications in economics and finance. In this paper, we study the estimation of extreme CoVaR and CoES when the two random variables are asymptotically independent but positively associated. We propose two types of extrapolative approaches: the first relies on intermediate VaR and extrapolates it to extreme CoVaR/CoES via an adjustment factor; the second directly extrapolates the estimated intermediate CoVaR/CoES to the extreme tails. All estimators, including both intermediate and extreme ones, are shown to be asymptotically normal. Finally, we explore the empirical performance of our methods through a series of Monte Carlo simulations and a real-data analysis of the S&P 500 Index and 12 of its constituent stocks.
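For context on the quantities being extrapolated, the hedged sketch below computes plain empirical (non-extrapolated) CoVaR and CoES at an intermediate level: the VaR and expected shortfall of one series conditional on the other exceeding its own VaR. The simulated Gaussian pair (positively associated but asymptotically tail-independent) and the levels are illustrative; the paper's contribution is the extrapolation of such estimates into the extreme tail.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Positively associated but asymptotically (tail-)independent pair: bivariate normal.
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

def empirical_covar_coes(x, y, alpha=0.95, beta=0.95):
    """CoVaR: the beta-quantile of y given that x exceeds its alpha-level VaR.
       CoES: the mean of y beyond that conditional quantile."""
    var_x = np.quantile(x, alpha)
    y_cond = y[x > var_x]
    covar = np.quantile(y_cond, beta)
    coes = y_cond[y_cond > covar].mean()
    return covar, coes

print(empirical_covar_coes(x, y, alpha=0.95, beta=0.95))
```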
Tail Value-at-Risk (TVaR) is a widely adopted risk measure playing a critically important role in both academic research and industry practice in insurance. In data applications, TVaR is often estimated using the empirical method, owing to its simplicity and nonparametric nature. The empirical TVaR has been explicitly advocated by regulatory authorities as a standard approach for computing TVaR. However, prior literature has pointed out that the empirical TVaR estimator is negatively biased, which can lead to a systematic underestimation of risk in finite-sample applications. This paper aims to deepen the understanding of the bias of the empirical TVaR estimator in two dimensions: its magnitude as well as the key distributional and structural determinants driving the severity of the bias. To this end, we derive a leading-term approximation for the bias based on its asymptotic expansion. The closed-form expression associated with the leading-term approximation enables us to obtain analytical insights into the structural properties governing the bias of the empirical TVaR estimator. To account for the discrepancy between the leading-term approximation and the true bias, we further derive an explicit upper bound for the bias. We validate the proposed bias analysis framework via simulations and demonstrate its practical relevance using real data.
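To make the negative bias concrete, here is a hedged simulation sketch with Exponential(1) losses, for which TVaR at level $p$ has the closed form $\mathrm{VaR}_p + 1$ with $\mathrm{VaR}_p = -\log(1-p)$; the Monte Carlo average of the empirical TVaR falls below the true value, illustrating (though not reproducing) the bias analyzed in the paper. The estimator below is one common empirical version, not necessarily the one advocated by regulators.

```python
import numpy as np

def empirical_tvar(x, p):
    """One common empirical TVaR estimator: the average of the ceil(n(1-p))
    largest observations."""
    n = len(x)
    k = int(np.ceil(n * (1 - p)))
    return np.mean(np.sort(x)[-k:])

rng = np.random.default_rng(0)
p, n, reps = 0.95, 200, 20_000
true_tvar = -np.log(1 - p) + 1.0     # Exp(1): VaR_p = -log(1-p), TVaR_p = VaR_p + 1

estimates = np.array([empirical_tvar(rng.exponential(1.0, n), p) for _ in range(reps)])
print("true TVaR:", true_tvar)
print("mean empirical TVaR:", estimates.mean())     # typically below the true value
print("estimated bias:", estimates.mean() - true_tvar)
```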
Instrumental variable based estimation of a causal effect has emerged as a standard approach to mitigate confounding bias in the social sciences and epidemiology, where conducting randomized experiments can be too costly or impossible. However, justifying the validity of the instrument often poses a significant challenge. In this work, we highlight a problem generally neglected in arguments for instrumental variable validity: the presence of an ``aggregate treatment variable'', where the treatment (e.g., education, GDP, caloric intake) is composed of finer-grained components that each may have a different effect on the outcome. We show that the causal effect of an aggregate treatment is generally ambiguous, as it depends on how interventions on the aggregate are instantiated at the component level, formalized through the aggregate-constrained component intervention distribution. We then characterize conditions on the interventional distribution and the aggregate setting under which standard instrumental variable estimators identify the aggregate effect. The contrived nature of these conditions implies major limitations on the interpretation of instrumental variable estimates based on aggregate treatments and highlights the need for a broader justificatory base for the exclusion restriction in such settings.
Background: Diagnostic test accuracy (DTA) studies, like etiological studies, are susceptible to various biases including reference standard error bias, partial verification bias, spectrum effect, confounding, and bias from misassumption of conditional independence. While directed acyclic graphs (DAGs) are widely used in etiological research to identify and illustrate bias structures, they have not been systematically applied to DTA studies. Methods: We developed DAGs to illustrate the causal structures underlying common biases in DTA studies. For each bias, we present the corresponding DAG structure and demonstrate the parallel with equivalent biases in etiological studies. We use real-world examples to illustrate each bias mechanism. Results: We demonstrate that five major biases in DTA studies can be represented using DAGs with clear structural parallels to etiological studies: reference standard error bias corresponds to exposure misclassification, misassumption of conditional independence creates spurious correlations similar to unmeasured confounding, spectrum effect parallels effect modification, confounding operates through backdoor paths in both settings, and partial verification bias mirrors selection bias. These DAG representations reveal the causal mechanisms underlying each bias and suggest appropriate correction strategies. Conclusions: DAGs provide a valuable framework for understanding bias structures in DTA studies and should complement existing quality assessment tools like STARD and QUADAS-2. We recommend incorporating DAGs during study design to prospectively identify potential biases and during reporting to enhance transparency. DAG construction requires interdisciplinary collaboration and sensitivity analyses under alternative causal structures.
Data-driven damage detection methods achieve damage identification by analyzing changes in damage-sensitive features (DSFs) derived from structural health monitoring (SHM) data. The core reason for their effectiveness lies in the fact that damage or structural state transition can be manifested as changes in the distribution of DSF data. This enables us to reframe the problem of damage detection as one of identifying these distributional changes. Hence, developing automated tools for detecting such changes is pivotal for automated structural health diagnosis. Control charts are extensively utilized in SHM for DSF change detection, owing to their excellent online detection and early warning capabilities. However, conventional methods are primarily designed to detect mean or variance shifts, making it challenging to identify complex shape changes in distributions. This limitation results in insufficient damage detection sensitivity. Moreover, they typically exhibit poor robustness against data contamination. This paper proposes a novel control chart to address these limitations. It employs the probability density functions (PDFs) of subgrouped DSF data as monitoring objects, with shape deformations characterized by warping functions. Furthermore, a nonparametric control chart is specifically constructed for warping function monitoring in the functional data analysis framework. Key advantages of the new method include the ability to detect both shifts and complex shape deformations in distributions, excellent online detection performance, and robustness against data contamination. Extensive simulation studies demonstrate its superiority over competing approaches. Finally, the method is applied to detecting distributional changes in DSF data for cable condition assessment in a long-span cable-stayed bridge, demonstrating its practical utility in engineering.
While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments -- where the data distribution and optimal parameters drift over time -- remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness in varying stepsize regimes. We derive finite-time bounds in expectation and with high probability for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on the tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that using 'stale' gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable 'inertia window' that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.
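A hedged toy simulation of the ``inertia window'' effect described above: a one-dimensional quadratic whose minimizer jumps between two values every 50 steps, tracked by SGD and by Polyak heavy-ball with momentum close to one at the same stepsize. All constants are illustrative choices, not the regimes analyzed in the paper; the sketch merely visualizes why stale momentum hinders adaptation to rapid regime shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
T, eta, beta, sigma = 4_000, 0.1, 0.99, 0.1
period, c = 50, 1.0                       # optimum flips between +c and -c every `period` steps

x_sgd, x_hb, x_hb_prev = 0.0, 0.0, 0.0
err_sgd, err_hb = [], []

for t in range(T):
    theta = c if (t // period) % 2 == 0 else -c          # abrupt regime shifts
    g_sgd = (x_sgd - theta) + sigma * rng.standard_normal()
    g_hb = (x_hb - theta) + sigma * rng.standard_normal()

    x_sgd = x_sgd - eta * g_sgd                                            # plain SGD
    x_hb, x_hb_prev = x_hb - eta * g_hb + beta * (x_hb - x_hb_prev), x_hb  # heavy ball

    err_sgd.append(abs(x_sgd - theta))
    err_hb.append(abs(x_hb - theta))

# With beta close to one, the heavy-ball iterate reacts sluggishly to each
# regime shift and its average tracking error exceeds that of plain SGD.
print("SGD mean tracking error        :", np.mean(err_sgd))
print("Heavy-ball mean tracking error :", np.mean(err_hb))
```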
Surface ozone pollution remains a persistent challenge in many metropolitan regions worldwide, as the nonlinear dependence of ozone formation on nitrogen oxides and volatile organic compounds (VOCs) complicates the design of effective emission control strategies. While chemical transport models provide mechanistic insights, they rely on detailed emission inventories and are computationally expensive. This study develops a machine learning--based surrogate framework inspired by the Empirical Kinetic Modeling Approach (EKMA). Using hourly air quality observations from Los Angeles during 2024--2025, a random forest model is trained to predict surface ozone concentrations based on precursor measurements and spatiotemporal features, including site location and cyclic time encodings. The model achieves strong predictive performance, with permutation importance highlighting the dominant roles of diurnal temporal features and nitrogen dioxide, along with additional contributions from carbon monoxide. Building on the trained surrogate, EKMA-style sensitivity experiments are conducted by perturbing precursor concentrations while holding other covariates fixed. The results indicate that ozone formation in Los Angeles during the study period is predominantly VOC-limited. Overall, the proposed framework offers an efficient and interpretable approach for ozone regime diagnosis in data-rich urban environments.
Interval censored data commonly arise in medical studies when the event time of interest is only known to lie within an interval. In the presence of a cure subgroup, conventional mixture cure models typically assume a logistic model for the uncure probability and a proportional hazards model for the susceptible subjects. However, in practice, the assumptions of parametric form for the uncure probability and the proportional hazards model for the susceptible may not always be satisfied. In this paper, we propose a class of flexible single-index semiparametric transformation cure models for interval-censored data, where a single-index model and a semiparametric transformation model are utilized for the uncure probability and the conditional survival function, respectively, encompassing both the proportional hazards cure and proportional odds cure models as specific cases. We approximate the single-index function and cumulative baseline hazard functions via the kernel technique and splines, respectively, and develop a computationally feasible expectation-maximisation (EM) algorithm, facilitated by a four-layer gamma-frailty Poisson data augmentation. Simulation studies demonstrate the satisfactory performance of our proposed method, compared to the spline-based approach and the classical logistic-based mixture cure models. The application of the proposed methodology is illustrated using the Alzheimer's dataset.
Semi- and non-parametric mixtures of regressions are a flexible and useful class of mixture-of-regressions models in which some or all of the parameters are non-parametric functions of the covariates. These models are, however, based on the Gaussian assumption of the component error distributions. Thus, their estimation is sensitive to outliers and heavy-tailed error distributions. In this paper, we propose semi- and non-parametric contaminated Gaussian mixture of regressions to robustly estimate the parametric and/or non-parametric terms of the models in the presence of mild outliers. The virtue of using a contaminated Gaussian error distribution is that we can simultaneously perform model-based clustering of observations and model-based outlier detection. We propose two algorithms, an expectation-maximization (EM)-type algorithm and an expectation-conditional-maximization (ECM)-type algorithm, to perform maximum likelihood and local-likelihood kernel estimation of the parametric and non-parametric terms of the proposed models, respectively. The robustness of the proposed models is examined using an extensive simulation study. The practical utility of the proposed models is demonstrated using real data.
In contrast to evaluating treatment effects, causal attribution analysis focuses on identifying the key factors responsible for an observed outcome. For two binary exposure variables and a binary outcome variable, researchers need to assess not only the likelihood that an observed outcome was caused by a particular exposure, but also the likelihood that it resulted from the interaction between the two exposures. For example, in the case of a male worker who smoked, was exposed to asbestos, and developed lung cancer, researchers aim to explore whether the cancer resulted from smoking, asbestos exposure, or their interaction. Even in randomized controlled trials, widely regarded as the gold standard for causal inference, identifying and evaluating retrospective causal interactions between two exposures remains challenging. In this paper, we define posterior probabilities to characterize the interactive causes of an observed outcome. We establish the identifiability of posterior probabilities by using a secondary outcome variable that may appear after the primary outcome. We apply the proposed method to the classic case of smoking and asbestos exposure. Our results indicate that for lung cancer patients who smoked and were exposed to asbestos, the disease is primarily attributable to the synergistic effect between smoking and asbestos exposure.
McKean-Vlasov stochastic differential equations (MVSDEs) describe systems whose dynamics depend on both individual states and the population distribution, and they arise widely in neuroscience, finance, and epidemiology. In many applications the system is only partially observed, making inference very challenging when both drift and diffusion coefficients depend on the evolving empirical law. This paper develops a Bayesian framework for latent state inference and parameter estimation in such partially observed MVSDEs. We combine time-discretization with particle-based approximations to construct tractable likelihood estimators, and we design two particle Markov chain Monte Carlo (PMCMC) algorithms: a single-level PMCMC method and a multilevel PMCMC (MLPMCMC) method that couples particle systems across discretization levels. The multilevel construction yields correlated likelihood estimates and achieves mean square error $O(\varepsilon^2)$ at computational cost $O(\varepsilon^{-6})$, improving on the $O(\varepsilon^{-7})$ complexity of single-level schemes. We address the fully law-dependent diffusion setting which is the most general formulation of MVSDEs, and provide theoretical guarantees under standard regularity assumptions. Numerical experiments confirm the efficiency and accuracy of the proposed methodology.
Although complete randomization is widely regarded as the gold standard for causal inference, covariate imbalance can still arise by chance in finite samples. Rerandomization has emerged as an effective tool to improve covariate balance across treatment groups and enhance the precision of causal effect estimation. While existing work focuses on average treatment effects, quantile treatment effects (QTEs) provide a richer characterization of treatment heterogeneity by capturing distributional shifts in outcomes, which is crucial for policy evaluation and equity-oriented research. In this article, we establish the asymptotic properties of the QTE estimator under rerandomization within a finite-population framework, without imposing any distributional or modeling assumptions on the covariates or outcomes. The QTE estimator exhibits a non-Gaussian asymptotic distribution, represented as a linear combination of Gaussian and truncated Gaussian random variables. To facilitate inference, we propose a conservative variance estimator and construct corresponding confidence intervals. Our theoretical analysis demonstrates that rerandomization improves efficiency over complete randomization under mild regularity conditions. Simulation studies further support the theoretical findings and illustrate the practical advantages of rerandomization for QTE estimation.
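As a hedged illustration of the design being analyzed (not the paper's asymptotic theory or variance estimator), the sketch below rerandomizes a balanced treatment assignment until the Mahalanobis distance between covariate means falls below a threshold, and then estimates quantile treatment effects as differences of sample quantiles between arms. The threshold, outcome model, and quantile levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 3
X = rng.standard_normal((n, d))                         # covariates
tau = 1.0                                               # illustrative treatment shift

def mahalanobis_balance(X, z):
    x1, x0 = X[z == 1], X[z == 0]
    diff = x1.mean(axis=0) - x0.mean(axis=0)
    cov = np.cov(X, rowvar=False) * (1 / len(x1) + 1 / len(x0))
    return float(diff @ np.linalg.solve(cov, diff))

def rerandomize(X, threshold=1.0, max_draws=10_000):
    """Redraw a balanced assignment until covariate imbalance is acceptable."""
    n = len(X)
    base = np.array([1] * (n // 2) + [0] * (n - n // 2))
    for _ in range(max_draws):
        z = rng.permutation(base)
        if mahalanobis_balance(X, z) <= threshold:
            return z
    return z                                            # fall back to the last draw

z = rerandomize(X)
y = X @ np.array([0.5, -0.3, 0.2]) + tau * z + rng.standard_normal(n)

def qte(y, z, q=0.5):
    return np.quantile(y[z == 1], q) - np.quantile(y[z == 0], q)

print("median QTE estimate:", qte(y, z, q=0.5))
print("0.9-quantile QTE estimate:", qte(y, z, q=0.9))
```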
Accurately estimating the sensitivity of explosive materials is a potentially life-saving task which requires standardised protocols across nations. One of the most widely applied procedures worldwide is the so-called '1-In-6' test from the United Nations (UN) Manual of Tests and Criteria, which estimates a 'limiting stimulus' for a material. In this paper we demonstrate that, despite their popularity, limiting stimuli are not a well-defined notion of sensitivity and do not provide reliable information about a material's susceptibility to ignition. In particular, they do not permit construction of confidence intervals to quantify estimation uncertainty. We show that continued reliance on limiting stimuli through the 1-In-6 test has caused needless confusion in energetic materials research, both in theoretical studies and practical safety applications. To remedy this problem, we consider three well-founded alternative approaches to sensitivity testing to replace limiting stimulus estimation. We compare their performance in an extensive simulation study and apply the best-performing approach to real data, estimating the friction sensitivity of pentaerythritol tetranitrate (PETN).
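One generic alternative to a single limiting stimulus is to fit a full sensitivity (dose-response) curve. The hedged sketch below fits a probit curve to hypothetical go/no-go friction-test data and reports the stimulus level with a 50% ignition probability together with a bootstrap confidence interval; it is a standard illustration of curve-based sensitivity estimation, not necessarily one of the specific approaches compared in the paper.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical go/no-go friction-test data: stimulus levels and ignition outcomes.
stimulus = np.repeat([60.0, 80.0, 100.0, 120.0, 140.0, 160.0], 10)
ignite = rng.binomial(1, norm.cdf((stimulus - 110.0) / 15.0))

def fit_level(stimulus, ignite, prob=0.5):
    """Probit fit of ignition probability against stimulus; returns the
    stimulus level at which the fitted ignition probability equals `prob`."""
    res = sm.Probit(ignite, sm.add_constant(stimulus)).fit(disp=0)
    b0, b1 = res.params
    return (norm.ppf(prob) - b0) / b1

est = fit_level(stimulus, ignite)

# Nonparametric bootstrap interval for the 50% ignition level.
idx = np.arange(len(stimulus))
boot = []
for _ in range(500):
    b = rng.choice(idx, size=len(idx), replace=True)
    try:
        boot.append(fit_level(stimulus[b], ignite[b]))
    except Exception:
        continue  # skip degenerate resamples (e.g., perfect separation)

print("estimated 50% ignition level:", est)
print("95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```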
We address the following question: given a collection $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$ of independent $d \times d$ random matrices drawn from a common distribution $\mathbb{P}$, what is the probability that the centralizer of $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$ is trivial? We provide lower bounds on this probability in terms of the sample size $N$ and the dimension $d$ for several families of random matrices which arise from the discretization of linear Schrödinger operators with random potentials. When combined with recent work on machine learning theory, our results provide guarantees on the generalization ability of transformer-based neural networks for in-context learning of Schrödinger equations.
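To make the object of study concrete, here is a hedged numerical sketch that checks whether the centralizer of a sample of matrices is trivial: $X$ commutes with every $\mathbf{A}^{(k)}$ iff $\mathrm{vec}(X)$ lies in the common null space of the operators $I \otimes \mathbf{A}^{(k)} - (\mathbf{A}^{(k)})^\top \otimes I$, and the centralizer is trivial exactly when that null space is one-dimensional (spanned by $\mathrm{vec}(I)$). The generic Gaussian matrices below are an illustrative choice, not the discretized Schrödinger operators studied in the paper.

```python
import numpy as np

def centralizer_dim(mats, tol=1e-8):
    """Dimension of {X : A X = X A for every A in mats}, computed as the
    nullity of the stacked commutator operators I (x) A - A^T (x) I."""
    d = mats[0].shape[0]
    blocks = [np.kron(np.eye(d), A) - np.kron(A.T, np.eye(d)) for A in mats]
    M = np.vstack(blocks)
    sv = np.linalg.svd(M, compute_uv=False)
    rank = int(np.sum(sv > tol * sv[0]))
    return d * d - rank

rng = np.random.default_rng(0)
d, N = 8, 3
mats = [rng.standard_normal((d, d)) for _ in range(N)]   # generic Gaussian matrices
print("centralizer dimension:", centralizer_dim(mats))   # 1 means trivial (scalars only)

# A single diagonal matrix with repeated eigenvalues has a larger centralizer:
# multiplicities (2, 1, 3, 1, 1) give 4 + 1 + 9 + 1 + 1 = 16.
D = np.diag([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0])
print("centralizer dimension:", centralizer_dim([D]))
```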
Bayesian inference offers a powerful framework for variable selection by incorporating sparsity through prior beliefs and quantifying uncertainty about parameters, leading to consistent procedures with good finite-sample performance. However, accurately quantifying uncertainty requires a correctly specified model, and there is increasing awareness of the problems that model misspecification causes for variable selection. Current solutions to this problem either require a more complex model, detracting from the interpretability of the original variable selection task, or gain robustness by moving outside of rigorous Bayesian uncertainty quantification. This paper establishes the model quasi-posterior as a principled tool for variable selection. We prove that the model quasi-posterior shares many of the desirable properties of full Bayesian variable selection, but no longer necessitates a full likelihood specification. Instead, the quasi-posterior only requires the specification of mean and variance functions, and as a result, is robust to other aspects of the data. Laplace approximations are used to approximate the quasi-marginal likelihood when it is not available in closed form to provide computational tractability. We demonstrate through extensive simulation studies that the quasi-posterior improves variable selection accuracy across a range of data-generating scenarios, including linear models with heavy-tailed errors and overdispersed count data. We further illustrate the practical relevance of the proposed approach through applications to real datasets from social science and genomics.
In this study, we apply functional regression analysis to identify the specific within-season periods during which temperature and precipitation anomalies most affect crop yields. Using provincial data for Italy from 1952 to 2023, we analyze two major cereals, maize and soft wheat, and quantify how abnormal weather conditions influence yields across the growing cycle. Unlike traditional statistical yield models, which assume additive temperature effects over the season, our approach is capable of capturing the timing and functional shape of weather impacts. In particular, the results show that above-average temperatures reduce maize yields primarily between June and August, while exerting a mild positive effect in April and October. For soft wheat, unusually high temperatures negatively affect yields from late March to early April. Precipitation also exerts season-dependent effects, improving wheat yields early in the season but reducing them later on. These findings highlight the importance of accounting for intra-seasonal weather patterns to provide insights for climate change adaptation strategies, including the timely adjustment of key crop management inputs.
Background: Stepped wedge cluster randomized trials (SW-CRTs) involve sequential measurements within clusters over time. Initially, all clusters start in the control condition before crossing over to the intervention on a staggered schedule. In cohort designs, secular trends, cluster-level changes, and individual-level changes (e.g., aging) must be considered. Methods: We performed a Monte Carlo simulation to analyze the influence of different time effects on the estimation of the intervention effect in cohort SW-CRTs. We compared four linear mixed models with different adjustment strategies, all including random intercepts for clustering and repeated measurements. We recorded the estimated fixed intervention effects and their corresponding model-based standard errors, derived from models both without and with cluster-robust variance estimators (CRVEs). Results: Models incorporating fixed categorical time effects, a fixed intervention effect, and two random intercepts provided unbiased estimates of the intervention effect in both closed and open cohort SW-CRTs. Fixed categorical time effects captured temporal cohort changes, while random individual effects accounted for baseline differences. However, these differences can cause large, non-normally distributed random individual effects. CRVEs provide reliable standard errors for the intervention effect, controlling the Type I error rate. Conclusions: Our simulation study is the first to assess individual-level changes over time in cohort SW-CRTs. Linear mixed models incorporating fixed categorical time effects and random cluster and individual effects yield unbiased intervention effect estimates. However, cluster-robust variance estimation is necessary when time-varying independent variables exhibit nonlinear effects. We recommend always using CRVEs.
We develop a data-driven algorithm for automatically selecting the regularisation parameter in Bayesian inversion under random tree Besov priors. One of the key challenges in Bayesian inversion is the construction of priors that are both expressive and computationally feasible. Random tree Besov priors, introduced in Kekkonen et al. (2023), provide a flexible framework for capturing local regularity properties and sparsity patterns in a wavelet basis. In this paper, we extend this approach by introducing a hierarchical model that enables data-driven selection of the wavelet density parameter, allowing the regularisation strength to adapt across scales while retaining computational efficiency. We focus on nonparametric regression and also present preliminary plug-and-play results for a deconvolution problem.
Full conformal prediction is a framework that implicitly formulates distribution-free confidence prediction regions for a wide range of estimators. However, a classical limitation of the full conformal framework is the computation of the confidence prediction regions, which is usually impossible since it requires training infinitely many estimators (for real-valued prediction for instance). The main purpose of the present work is to describe a generic strategy for designing a tight approximation to the full conformal prediction region that can be efficiently computed. Along with this approximate confidence region, a theoretical quantification of the tightness of this approximation is developed, depending on the smoothness assumptions on the loss and score functions. The new notion of thickness is introduced for quantifying the discrepancy between the approximate confidence region and the full conformal one.
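For context, the hedged sketch below implements the brute-force construction that the paper seeks to approximate: full conformal prediction for ridge regression, refitting on an augmented dataset for each candidate response value on a grid and keeping candidates whose nonconformity score is not too extreme. The grid, ridge penalty, and absolute-residual score are illustrative choices, and the grid itself is the expensive step that motivates efficient approximations.

```python
import numpy as np

def ridge_fit_predict(X, y, lam=1.0):
    d = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return X @ beta

def full_conformal_ridge(X, y, x_new, alpha=0.1, lam=1.0, grid=None):
    """Grid of candidate y_new values kept by full conformal prediction."""
    if grid is None:
        grid = np.linspace(y.min() - 3 * y.std(), y.max() + 3 * y.std(), 200)
    Xa = np.vstack([X, x_new])
    kept = []
    for y_cand in grid:
        ya = np.append(y, y_cand)
        resid = np.abs(ya - ridge_fit_predict(Xa, ya, lam))   # nonconformity scores
        p = np.mean(resid >= resid[-1])    # rank-based conformal p-value
        if p > alpha:
            kept.append(y_cand)
    return np.array(kept)

rng = np.random.default_rng(0)
n, d = 80, 3
X = rng.standard_normal((n, d))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.standard_normal(n)
x_new = rng.standard_normal(d)

region = full_conformal_ridge(X, y, x_new, alpha=0.1)
print("approximate 90% conformal region:", (region.min(), region.max()))
```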
Design-based inference, also known as randomization-based or finite-population inference, provides a principled framework for causal and descriptive analyses that attribute randomness solely to the design mechanism (e.g., treatment assignment, sampling, or missingness) without imposing distributional or modeling assumptions on the outcome data of study units. Despite its conceptual appeal and long history, this framework becomes challenging to apply when the underlying design probabilities (i.e., propensity scores) are unknown, as is common in observational studies, real-world surveys, and missing-data settings. Existing plug-in or matching-based approaches either ignore the uncertainty stemming from estimated propensity scores or rely on the post-matching uniform-propensity condition (an assumption typically violated when there are multiple or continuous covariates), leading to systematic under-coverage. Finite-population M-estimation partially mitigates these issues but remains limited to parametric propensity score models. In this work, we introduce propensity score propagation, a general framework for valid design-based inference with unknown propensity scores. The framework introduces a regeneration-and-union procedure that automatically propagates uncertainty in propensity score estimation into downstream design-based inference. It accommodates both parametric and nonparametric propensity score models, integrates seamlessly with standard tools in design-based inference with known propensity scores, and is universally applicable to various important design-based inference problems, such as observational studies, real-world surveys, and missing-data analyses, among many others. Simulation studies demonstrate that the proposed framework restores nominal coverage levels in settings where conventional methods suffer from severe under-coverage.
In this paper, the solution to the empirical risk minimization problem with $f$-divergence regularization (ERM-$f$DR) is presented and conditions under which the solution also serves as the solution to the minimization of the expected empirical risk subject to an $f$-divergence constraint are established. The proposed approach extends applicability to a broader class of $f$-divergences than previously reported and yields theoretical results that recover previously known results. Additionally, the difference between the expected empirical risk of the ERM-$f$DR solution and that of its reference measure is characterized, providing insights into previously studied cases of $f$-divergences. A central contribution is the introduction of the normalization function, a mathematical object that is critical in both the dual formulation and practical computation of the ERM-$f$DR solution. This work presents an implicit characterization of the normalization function as a nonlinear ordinary differential equation (ODE), establishes its key properties, and subsequently leverages them to construct a numerical algorithm for approximating the normalization factor under mild assumptions. Further analysis demonstrates structural equivalences between ERM-$f$DR problems with different $f$-divergences via transformations of the empirical risk. Finally, the proposed algorithm is used to compute the training and test risks of ERM-$f$DR solutions under different $f$-divergence regularizers. This numerical example highlights the practical implications of choosing different functions $f$ in ERM-$f$DR problems.
We consider non-linear regression models corrupted by generic noise when the regression functions form a non-linear subspace of $L^2$, relevant in non-linear PDE inverse problems and data assimilation. We show that when the score of the model is injective, the Fisher information operator is automatically invertible between well-identified Hilbert spaces, and we provide an operational characterization of these spaces. This allows us to construct in broad generality the efficient Gaussian involved in the classical minimax and convolution theorems to establish information lower bounds, that are typically achieved by Bayesian algorithms thus showing optimality of these methods. We illustrate our results on time-evolution PDE models for reaction-diffusion and Navier-Stokes equations.
This study explores how Bayesian networks (BNs) can improve forecast accuracy compared to logistic regression and recalibration and aggregation methods, using data from the Good Judgment Project. Regularized logistic regression models and a baseline recalibrated aggregate were compared to two types of BNs: structure-learned BNs with arcs between predictors, and naive BNs. Four predictor variables were examined: absolute difference from the aggregate, forecast value, days prior to question close, and mean standardized Brier score. Results indicated the recalibrated aggregate achieved the highest accuracy (AUC = 0.985), followed by both types of BNs, then the logistic regression models. Performance of the BNs was likely harmed by the information lost in the discretization process, while violation of the linearity assumption likely harmed the logistic regression models. Future research should explore hybrid approaches combining BNs with logistic regression, examine additional predictor variables, and account for hierarchical data dependencies.
This paper addresses a long-standing gap in natural hazard modeling by unifying physics-based fragility functions with real-time post-disaster observations. It introduces a Bayesian framework that continuously refines regional vulnerability estimates as new data emerges. The framework reformulates physics-informed fragility estimates into a Probit-Normal (PN) representation that captures aleatory variability and epistemic uncertainty in an analytically tractable form. Stage 1 performs local Bayesian updating by moment-matching PN marginals to Beta surrogates that preserve their probability shapes, enabling conjugate Beta-Bernoulli updates with soft, multi-fidelity observations. Fidelity weights encode source reliability, and the resulting Beta posteriors are re-projected into PN form, producing heteroscedastic fragility estimates whose variances reflect data quality and coverage. Stage 2 assimilates these heteroscedastic observations within a probit-warped Gaussian Process (GP), which propagates information from high-fidelity sites to low-fidelity and unobserved regions through a composite kernel that links space, archetypes, and correlated damage states. The framework is applied to the 2011 Joplin tornado, where wind-field priors and computer-vision damage assessments are fused under varying assumptions about tornado width, sampling strategy, and observation completeness. Results show that the method corrects biased priors, propagates information spatially, and produces uncertainty-aware exceedance probabilities that support real-time situational awareness.
Understanding associations between paired high-dimensional longitudinal datasets is a fundamental yet challenging problem that arises across scientific domains, including longitudinal multi-omic studies. The difficulty stems from the complex, time-varying cross-covariance structure coupled with high dimensionality, which complicates both model formulation and statistical estimation. To address these challenges, we propose a new framework, termed Functional-Aggregated Cross-covariance Decomposition (FACD), tailored for canonical cross-covariance analysis between paired high-dimensional longitudinal datasets through a statistically efficient and theoretically grounded procedure. Unlike existing methods that are often limited to low-dimensional data or rely on explicit parametric modeling of temporal dynamics, FACD adaptively learns temporal structure by aggregating signals across features and naturally accommodates variable selection to identify the most relevant features associated across datasets. We establish statistical guarantees for FACD and demonstrate its advantages over existing approaches through extensive simulation studies. Finally, we apply FACD to a longitudinal multi-omic human study, revealing blood molecules with time-varying associations across omic layers during acute exercise.
Interpreting gene expression data requires methods that can uncover coordinated patterns corresponding to biological pathways. Traditional approaches such as principal component analysis and factor models reduce dimensionality, but latent components may have unclear biological meaning. Current approaches to incorporate pathway annotations impose restrictive assumptions, require extensive hyperparameter tuning, and do not provide principled uncertainty quantification, hindering the robustness and reproducibility of results. Here, we develop Bayesian Analysis with gene-Sets Informed Latent space (BASIL), a scalable Bayesian factor modeling framework that incorporates gene pathway annotations into latent variable analysis for RNA-sequencing data. BASIL places structured priors on factor loadings, shrinking them toward combinations of annotated gene sets, enhancing biological interpretability and stability, while simultaneously learning new unstructured components. BASIL provides accurate covariance estimates and uncertainty quantification, without resorting to computationally expensive Markov chain Monte Carlo sampling. An automatic empirical Bayes procedure eliminates the need for manual hyperparameter tuning, promoting reproducibility and usability in practice. In simulations and large-scale human transcriptomic datasets, BASIL consistently outperforms state-of-the-art approaches, accurately reconstructing gene-gene covariance, selecting the correct latent dimension, and identifying biologically coherent modules.
Cluster-randomized trials (CRTs) are widely used to evaluate group-level interventions and increasingly collect multiple outcomes capturing complementary dimensions of benefit and risk. Investigators often seek a single global summary of treatment effect, yet existing methods largely focus on single-outcome estimands or rely on model-based procedures with unclear causal interpretation or limited robustness. We develop a unified potential outcomes framework for generalized treatment effects with multiple outcomes in CRTs, accommodating both non-prioritized and prioritized outcome settings. The proposed cluster-pair and individual-pair causal estimands are defined through flexible pairwise contrast functions and explicitly account for potentially informative cluster sizes. We establish nonparametric estimation via weighted clustered U-statistics and derive efficient influence functions to construct covariate-adjusted estimators that integrate debiased machine learning with U-statistics. The resulting estimators are consistent and asymptotically normal, attain the semiparametric efficiency bounds under mild regularity conditions, and have analytically tractable variance estimators that are proven to be consistent under cross-fitting. Simulations and an application to a CRT for chronic pain management illustrate the practical utility of the proposed methods.
Linearly parametrized models are widely used in control and signal processing, with the least-squares (LS) estimate being the archetypical solution. When the input is insufficiently exciting, the LS problem may be unsolvable or numerically unstable. This issue can be resolved through regularization, typically with ridge regression. Although regularized estimators reduce the variance error, it remains important to quantify their estimation uncertainty. A possible approach for linear regression is to construct confidence ellipsoids with the Sign-Perturbed Sums (SPS) ellipsoidal outer approximation (EOA) algorithm. The SPS EOA builds non-asymptotic confidence ellipsoids under the assumption that the noises are independent and symmetric about zero. This paper introduces an extension of the SPS EOA algorithm to ridge regression, and derives probably approximately correct (PAC) upper bounds for the resulting region sizes. Compared with previous analyses, our results explicitly show how the regularization parameter affects the region sizes and provide tighter bounds under weaker excitation assumptions. Finally, the practical effect of regularization is also demonstrated via simulation experiments.
Differential item functioning (DIF) is a widely used statistical notion for identifying items that may disadvantage specific groups of test-takers. These groups are often defined by non-manipulable characteristics, e.g., gender, race/ethnicity, or English-language learner (ELL) status. While DIF can be framed as a causal fairness problem by treating group membership as the treatment variable, this invokes the long-standing controversy over the interpretation of causal effects for non-manipulable treatments. To better identify and interpret causal sources of DIF, this study leverages an interventionist approach using treatment decomposition proposed by Robins and Richardson (2010). Under this framework, we can decompose a non-manipulable treatment into intervening variables. For example, ELL status can be decomposed into English vocabulary unfamiliarity and classroom learning barriers, each of which influences the outcome through different causal pathways. We formally define separable DIF effects associated with these decomposed components, depending on the absence or presence of item impact, and provide causal identification strategies for each effect. We then apply the framework to biased test items in the SAT and Regents exams. We also provide formal detection methods using causal machine learning methods, namely causal forests and Bayesian additive regression trees, and demonstrate their performance through a simulation study. Finally, we discuss the implications of adopting interventionist approaches in educational testing practices.
In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach leverages the spacing information between categories, which enhances the performance of distance correlation. Two estimates, the maximum likelihood estimate and a bias-corrected estimate, are given, together with their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings, and illustrate their practical utility using the 2018 General Social Survey data.
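A hedged sketch of the basic quantity involved: sample distance correlation computed from double-centered pairwise distance matrices, with a nominal variable one-hot encoded so that all distinct categories are equidistant. The encoding and toy data below illustrate the construction only; the paper's estimators, bias correction, and limiting distributions are not reproduced.

```python
import numpy as np

def _double_center(D):
    return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_correlation(x, y):
    """Sample distance correlation between row-wise samples x (n, p) and y (n, q)."""
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A, B = _double_center(a), _double_center(b)
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

def one_hot(labels):
    cats = np.unique(labels)
    return (labels[:, None] == cats[None, :]).astype(float)

rng = np.random.default_rng(0)
n = 300
z = rng.integers(0, 3, size=n)                 # nominal variable with 3 categories
x = one_hot(z)                                 # one-hot encoding: categories equidistant
y = (z == 2).astype(float)[:, None] + 0.5 * rng.standard_normal((n, 1))  # dependent response

print("dCor(one-hot z, y):", distance_correlation(x, y))
print("dCor(one-hot z, noise):", distance_correlation(x, rng.standard_normal((n, 1))))
```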
The increasing reliance on human preference feedback to judge AI-generated pseudo labels has created a pressing need for principled, budget-conscious data acquisition strategies. We address the crucial question of how to optimally allocate a fixed annotation budget between ground-truth labels and pairwise preferences in AI. Our solution, grounded in semi-parametric inference, casts the budget allocation problem as a monotone missing data framework. Building on this formulation, we introduce Preference-Calibrated Active Learning (PCAL), a novel method that learns the optimal data acquisition strategy and develops a statistically efficient estimator for functionals of the data distribution. Theoretically, we prove the asymptotic optimality of our PCAL estimator and establish a key robustness guarantee that ensures reliable performance even with poorly estimated nuisance models. Our flexible framework applies to a general class of problems, by directly optimizing the estimator's variance instead of requiring a closed-form solution. This work provides a principled and statistically efficient approach for budget-constrained learning in modern AI. Simulations and real-data analysis demonstrate the practical benefits and superior performance of our proposed method.
In this article, we study nonparametric inference problems in the context of multivariate or functional time series, including testing for goodness-of-fit, the presence of a change point in the marginal distribution, and the independence of two time series, among others. Most methodologies available in the existing literature address these problems by employing a bandwidth-dependent bootstrap or subsampling approach, which can be computationally expensive and/or sensitive to the choice of bandwidth. To address these limitations, we propose a novel class of kernel-based tests by embedding the data into a reproducing kernel Hilbert space, and construct test statistics using sample splitting, projection, and self-normalization (SN) techniques. Through a new conditioning technique, we demonstrate that our test statistics have pivotal limiting null distributions under absolute regularity and mild moment assumptions. We also analyze the limiting power of our tests under local alternatives. Finally, we showcase the superior size accuracy and computational efficiency of our methods as compared to some existing ones.
Clustered data -- where units of observation are nested within higher-level groups, such as repeated measurements on users, or panel data of firms, industries, or geographic regions -- are ubiquitous in business research. When the objective is to estimate the causal effect of a potentially endogenous treatment, a common approach -- which we call the canonical two-stage least squares (2sls) -- is to fit a 2sls regression of the outcome on treatment status with instrumental variables (IVs) for point estimation, and apply cluster-robust standard errors to account for clustering in inference. When both the treatment and IVs vary within clusters, a natural alternative -- which we call the two-stage least squares with fixed effects (2sfe) -- is to include cluster indicators in the 2sls specification, thereby incorporating cluster information in point estimation as well. This paper clarifies the trade-off between these two approaches within the local average treatment effect (LATE) framework, and makes three contributions. First, we establish the validity of both approaches for Wald-type inference of the LATE when clusters are homogeneous, and characterize their relative efficiency. We show that, when the true outcome model includes cluster-specific effects, 2sfe is more efficient than the canonical 2sls only when the variation in cluster-specific effects dominates that in unit-level errors. Second, we show that with heterogeneous clusters, 2sfe recovers a weighted average of cluster-specific LATEs, whereas the canonical 2sls generally does not. Third, to guide empirical choice between the two procedures, we develop a joint asymptotic theory for the two estimators under homogeneous clusters, and propose a Wald-type test for detecting cluster heterogeneity.
We consider inference for M-estimators after model selection using a sparsity-inducing penalty. While existing methods for this task require bespoke inference procedures, we propose a simpler approach, which relies on two insights: (i) adding and subtracting carefully-constructed noise to a Gaussian random variable with unknown mean and known variance leads to two \emph{independent} Gaussian random variables; and (ii) both the selection event resulting from penalized M-estimation, and the event that a standard (non-selective) confidence interval for an M-estimator covers its target, can be characterized in terms of an approximately normal ``score variable''. We combine these insights to show that -- when the noise is chosen carefully -- there is asymptotic independence between the model selected using a noisy penalized M-estimator, and the event that a standard (non-selective) confidence interval on noisy data covers the selected parameter. Therefore, selecting a model via penalized M-estimation (e.g. \verb=glmnet= in \verb=R=) on noisy data, and then conducting \emph{standard} inference on the selected model (e.g. \verb=glm= in \verb=R=) using noisy data, yields valid inference: \emph{no bespoke methods are required}. Our results require independence of the observations, but only weak distributional requirements. We apply the proposed approach to conduct inference on the association between sex and smoking in a social network.
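The following hedged sketch shows one concrete instantiation of the add-and-subtract-noise fact in a Gaussian linear model: a data-fission-style split in which the lasso selects on $Y + W$ and ordinary least squares inference is run on $Y - W$, which are independent when $W$ is drawn with the noise variance. The paper's proposal is more general (penalized M-estimators, the same noisy data reused for both steps, asymptotic rather than exact independence), so this is only meant to illustrate the underlying independence trick.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy import stats

rng = np.random.default_rng(0)
n, p, sigma = 200, 20, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 0.8]                              # sparse truth
y = X @ beta + sigma * rng.standard_normal(n)

# Add and subtract noise with the (known) noise variance: the two copies are independent.
w = sigma * rng.standard_normal(n)
y_select, y_infer = y + w, y - w

# Step 1: model selection by penalized regression on the selection copy.
selected = np.flatnonzero(Lasso(alpha=0.1).fit(X, y_select).coef_ != 0)

# Step 2: standard OLS inference on the inference copy, restricted to the selected model.
Xs = X[:, selected]
beta_hat, _, _, _ = np.linalg.lstsq(Xs, y_infer, rcond=None)
resid = y_infer - Xs @ beta_hat
df = n - len(selected)
se = np.sqrt(np.sum(resid ** 2) / df * np.diag(np.linalg.inv(Xs.T @ Xs)))
tcrit = stats.t.ppf(0.975, df)
for j, b, s in zip(selected, beta_hat, se):
    print(f"X{j}: estimate {b: .3f}, 95% CI ({b - tcrit * s:.3f}, {b + tcrit * s:.3f})")
```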
This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the $G^\star$ regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight $\sum_{t=1}^T \|\nabla \ell(x^\star)\|^2$. We show that the $G^\star$ regret strictly refines the existing $L^\star$ (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the $G^\star$ regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Some experiments validate our theoretical findings.
The growing availability of large health databases has expanded the use of observational studies for comparative effectiveness research. Unlike randomized trials, observational studies must adjust for systematic differences in patient characteristics between treatment groups. Propensity score methods, including matching, weighting, stratification, and regression adjustment, address this issue by creating groups that are comparable with respect to measured covariates. Among these approaches, overlap weighting (OW) has emerged as a principled and efficient method that emphasizes individuals at empirical equipoise, those who could plausibly receive either treatment. By assigning weights proportional to the probability of receiving the opposite treatment, OW targets the Average Treatment Effect in the Overlap population (ATO), achieves exact mean covariate balance under logistic propensity score models, and minimizes asymptotic variance. Over the last decade, the OW method has been recognized as a valuable confounding adjustment tool across the statistical, epidemiologic, and clinical research communities, and is increasingly applied in clinical and health studies. Given the growing interest in using observational data to emulate randomized trials and the capacity of OW to prioritize populations at clinical equipoise while achieving covariate balance (fundamental attributes of randomized studies), this article provides a concise overview of recent methodological developments in OW and practical guidance on when it represents a suitable choice for causal inference.
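A hedged sketch of the core OW computation on simulated data: estimate propensity scores with logistic regression, weight treated units by $1 - e(x)$ and control units by $e(x)$, and compare weighted outcome means to estimate the ATO. Under an (essentially unpenalized) logistic propensity model, the weighted covariate means are balanced up to optimization tolerance. The data-generating choices are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
X = rng.standard_normal((n, 3))
e_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.8 * X[:, 1])))              # true propensity
T = rng.binomial(1, e_true)
y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * T + rng.standard_normal(n)    # true effect = 2

# Estimate propensity scores (large C approximates unpenalized maximum likelihood).
e_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# Overlap weights: treated units get 1 - e(x), controls get e(x).
w = np.where(T == 1, 1 - e_hat, e_hat)

ato = (np.sum(w * T * y) / np.sum(w * T)
       - np.sum(w * (1 - T) * y) / np.sum(w * (1 - T)))
print("ATO estimate:", ato)

# Covariate mean differences after overlap weighting (near zero up to tolerance).
bal = (w * T) @ X / np.sum(w * T) - (w * (1 - T)) @ X / np.sum(w * (1 - T))
print("weighted covariate mean differences:", bal)
```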
Predicting highly-cited papers is a long-standing challenge due to the complex interactions of research content, scholarly communities, and temporal dynamics. Recent advances in large language models (LLMs) raise the question of whether early-stage textual information can provide useful signals of long-term scientific impact. Focusing on statistical publications, we propose a flexible, text-centered framework that leverages LLMs and structured prompt design to predict highly cited papers. Specifically, we utilize information available at the time of publication, including titles, abstracts, keywords, and limited bibliographic metadata. Using a large corpus of statistical papers, we evaluate predictive performance across multiple publication periods and alternative definitions of highly cited papers. The proposed approach achieves stable and competitive performance relative to existing methods and demonstrates strong generalization over time. Textual analysis further reveals that papers predicted as highly cited concentrate on recurring topics such as causal inference and deep learning. To facilitate practical use of the proposed approach, we further develop a WeChat mini program, \textit{Stat Highly Cited Papers}, which provides an accessible interface for early-stage citation impact assessment. Overall, our results provide empirical evidence that LLMs can capture meaningful early signals of long-term citation impact, while also highlighting their limitations as tools for research impact assessment.
Compressed sensing, which involves the reconstruction of sparse signals from an under-determined linear system, has recently been used to solve problems in group testing. In a public health context, group testing aims to determine the health status values of $p$ subjects from $n \ll p$ pooled tests, where a pool is defined as a mixture of small, equal-volume portions of the samples of a subset of subjects. This approach saves on the number of tests administered in pandemics or other resource-constrained scenarios. In practical group testing under time constraints, a technician can inadvertently make a small number of errors during pool preparation, which leads to errors in the pooling matrix; we term these `model mismatch errors' (MMEs). This poses difficulties in determining the health status values of the participating subjects from the results of the $n \ll p$ pooled tests. In this paper, we present an algorithm to correct the MMEs directly from the pooled results and the available (inaccurate) pooling matrix. Our approach then reconstructs the signal vector from the corrected pooling matrix, in order to determine the health status of the subjects. We further provide theoretical guarantees for the correction of the MMEs and for the reconstruction error from the corrected pooling matrix. We also provide several supporting numerical results.
Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
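The abstract does not spell out the update rule, but a generic relative-value-iteration (RVI) style Q-learning update for average-reward MDPs, which the studied algorithm resembles only at a high level, looks roughly as follows. The reference state, step-size schedule, and federated averaging used in the paper are not reproduced here; this is a sketch under those stated assumptions, not the paper's algorithm.

```python
import numpy as np

def rvi_q_learning_step(Q, s, a, r, s_next, alpha, ref_state=0):
    """One relative-value-iteration style Q-learning update for an
    average-reward MDP (a generic sketch, not the paper's exact algorithm).

    Q         : array of shape (num_states, num_actions)
    (s, a)    : observed state-action pair
    r         : observed reward
    s_next    : next state
    alpha     : step size
    ref_state : reference state whose max-Q value anchors the gain estimate
    """
    gain_proxy = np.max(Q[ref_state])             # crude estimate of the average reward
    td_target = r + np.max(Q[s_next]) - gain_proxy
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```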
Modern macroeconomic monetary theory suggests that the labor share of income has effectively become a core macroeconomic parameter anchored by top policymakers through Open Market Operations (OMO). However, the setting of this parameter remains a subject of intense economic debate. This paper provides a detailed summary of these controversies, analyzes the scope of influence exerted by market agents other than the top policymakers on the labor share, and explores the rationality of its setting mechanism.
We develop a statistical proxy framework for retrieval-augmented generation (RAG), designed to formalize how a language model (LM) should balance its own predictions with retrieved evidence. For each query $x$, the system combines a frozen base model $q_0(\cdot \mid x)$ with a $k$-nearest-neighbor retriever $r_k(\cdot \mid x)$ through a measurable per-query gate. A retrieval-trust weight $w_{\mathrm{fact}}(x)$ quantifies the geometric reliability of the retrieved neighborhood and penalizes retrieval in low-trust regions. We derive the Bayes-optimal per-query gate and analyze its effect on a discordance-based hallucination criterion that captures disagreements between LM predictions and retrieved evidence. We further show that this discordance admits a deterministic asymptotic limit governed solely by the structural agreement (or disagreement) between the Bayes rule and the LM. To account for distribution mismatch between queries and memory, we introduce a hybrid geometric-semantic model combining covariate deformation and label corruption. Overall, this note provides a principled statistical foundation for factuality-oriented RAG systems.
Academic Clinical Trial Units frequently face fragmented statistical workflows, leading to duplicated effort, limited collaboration, and inconsistent analytical practices. To address these challenges within an oncology Clinical Trial Unit, we developed grstat, an R package providing a standardised set of tools for routine statistical analyses. Beyond the software itself, the development of grstat is embedded in a structured organisational framework combining formal request tracking, peer-reviewed development, automated testing, and staged validation of new functionalities. The package is intentionally opinionated, reflecting shared practices agreed upon within the unit, and evolves through iterative use in real-world projects. Its development as an open-source project on GitHub supports transparent workflows, collective code ownership, and traceable decision-making. While primarily designed for internal use, this work illustrates a transferable approach to organising, validating, and maintaining a shared analytical toolbox in an academic setting. By coupling technical implementation with governance and validation principles, grstat supports efficiency, reproducibility, and long-term maintainability of biostatistical workflows, and may serve as a source of inspiration for other Clinical Trial Units facing similar organisational challenges.
Tukey's boxplot is widely used for outlier detection; however, its classic fixed-fence rule tends to flag an excessive number of outliers as the sample size grows. To address this limitation, we introduce two new R packages, ChauBoxplot and AdaptiveBoxplot, which implement more robust methods for outlier detection. We also provide practical guidance, drawn from simulation results, to help practitioners choose suitable boxplot methods and balance interpretability with statistical reliability.
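The sample-size sensitivity mentioned above is easy to reproduce: under Tukey's classic fixed fences ($Q_1 - 1.5\,\mathrm{IQR}$, $Q_3 + 1.5\,\mathrm{IQR}$), clean Gaussian samples are flagged at a roughly constant rate (about 0.7%), so the count of flagged points grows with $n$. The Python sketch below illustrates this effect only; it does not reproduce the adjusted rules implemented in the R packages.

```python
import numpy as np

rng = np.random.default_rng(2)

def tukey_outlier_count(x, k=1.5):
    """Count points outside Tukey's classic fixed fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return int(np.sum((x < lo) | (x > hi)))

for n in [100, 1_000, 10_000, 100_000]:
    x = rng.normal(size=n)               # clean data: no "true" outliers
    print(n, tukey_outlier_count(x))     # flagged count grows roughly like 0.007 * n
```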
Local Polynomial Regression (LPR) and Moving Least Squares (MLS) are closely related nonparametric estimation methods, developed independently in statistics and approximation theory. While statistical LPR analysis focuses on overcoming sampling noise under probabilistic assumptions, the deterministic MLS theory studies smoothness properties and convergence rates with respect to the \textit{fill-distance} (a resolution parameter). Despite this similarity, the deterministic assumptions underlying MLS fail to hold under random sampling. We begin by quantifying the probabilistic behavior of the fill-distance $h_n$ and \textit{separation} $\delta_n$ of an i.i.d. random sample: for a distribution satisfying a mild regularity condition, $h_n\propto n^{-1/d}\log^{1/d} (n)$ and $\delta_n \propto n^{-1/d}$. We then prove that, for MLS of degree $k\!-\!1$, the approximation error associated with a differential operator $Q$ of order $|m|\le k-1$ decays as $h_n^{\,k-|m|}$ up to logarithmic factors, establishing stochastic analogues of the classical MLS estimates. Additionally, we show that the MLS approximant is smooth with high probability. Finally, we apply the stochastic MLS theory to manifold estimation. Assuming that the sampled manifold is $k$-times smooth, we show that the Hausdorff distance between the true manifold and its MLS reconstruction decays as $h_n^k$, extending the deterministic Manifold-MLS guarantees to random samples. This work provides the first unified stochastic analysis of MLS, demonstrating that -- despite the failure of deterministic sampling assumptions -- the classical convergence and smoothness properties persist under natural probabilistic models.
We propose a frequentist adaptive phase 2 trial design to evaluate the safety and efficacy of three treatment regimens (doses) compared to placebo for four types of helminth (worm) infections. The trial will be carried out in four sub-Saharan African countries from spring 2025. Since the safety of the highest dose is not yet established, the study begins with the two lower doses and placebo. Based on safety and early efficacy results from an interim analysis, a decision will be made to either continue with the two lower doses or drop one or both and introduce the highest dose instead. This design borrows information across baskets for safety assessment, while efficacy is assessed separately for each basket. The proposed adaptive design addresses several key challenges: (1) the trial must begin with only the two lower doses because reassuring safety data from these doses are required before escalating to a higher dose; (2) due to the expected speed of recruitment, adaptation decisions must rely on an earlier, surrogate endpoint; (3) the primary outcome is a count variable that follows a mixture distribution with an atom at 0. To control the familywise error rate in the strong sense when comparing multiple doses to the control in the adaptive design, we extend the partial conditional error approach to accommodate the inclusion of new hypotheses after the interim analysis. In a comprehensive simulation study, we evaluate various design options and analysis strategies, assessing the robustness of the design under different design assumptions and parameter values. We identify scenarios where the adaptive design improves the trial's ability to identify an optimal dose. Adaptive dose selection enables resource allocation to the most promising treatment arms, increasing the likelihood of selecting the optimal dose while reducing the required overall sample size and trial duration.
The maximum mean discrepancy (MMD) is a kernel-based nonparametric statistic for two-sample testing, whose inferential accuracy depends critically on variance characterization. Existing work provides various finite-sample estimators of the MMD variance, often differing under the null and alternative hypotheses and across balanced or imbalanced sampling schemes. In this paper, we study the variance of the MMD statistic through its U-statistic representation and Hoeffding decomposition, and establish a unified finite-sample characterization covering different hypotheses and sample configurations. Building on this analysis, we propose an exact acceleration method for the univariate case under the Laplacian kernel, which reduces the overall computational complexity from $\mathcal O(n^2)$ to $\mathcal O(n \log n)$.
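For reference, the standard unbiased U-statistic estimator of $\mathrm{MMD}^2$ that underlies this kind of analysis is sketched below for a generic kernel (here a one-dimensional Laplacian kernel); the paper's variance characterization and its $\mathcal O(n \log n)$ acceleration are not reproduced, and the bandwidth and sample sizes are illustrative.

```python
import numpy as np

def laplacian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-|a - b| / bandwidth) for 1-D inputs."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / bandwidth)

def mmd2_u(x, y, kernel=laplacian_kernel):
    """Unbiased U-statistic estimate of MMD^2 between samples x and y."""
    m, n = len(x), len(y)
    Kxx = kernel(x, x); np.fill_diagonal(Kxx, 0.0)
    Kyy = kernel(y, y); np.fill_diagonal(Kyy, 0.0)
    Kxy = kernel(x, y)
    return (Kxx.sum() / (m * (m - 1))
            + Kyy.sum() / (n * (n - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(3)
x, y = rng.normal(0, 1, 500), rng.normal(0.3, 1, 400)   # imbalanced sampling scheme
print("MMD^2 (unbiased):", mmd2_u(x, y))
```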
We evaluate the misclustering probability of a spectral clustering algorithm under a Gaussian mixture model with a general covariance structure. The algorithm partitions the data into two groups based on the sign of the first principal component score. As a corollary of the main result, the clustering procedure is shown to be consistent in a high-dimensional regime.
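The clustering rule itself can be stated in a few lines: project the centered data onto the first principal component and split by the sign of the score. A minimal sketch under a two-component Gaussian mixture follows; the misclustering analysis is the paper's contribution, not this snippet, and the mixture parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 50, 1000
mu = np.zeros(d); mu[0] = 2.0
labels = rng.integers(0, 2, size=n)                       # true component labels
X = rng.normal(size=(n, d)) + np.where(labels[:, None] == 1, mu, -mu)

Xc = X - X.mean(axis=0)                                   # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]                                       # first principal component scores
clusters = (scores > 0).astype(int)                       # partition by sign

# misclustering proportion, up to label switching
err = min(np.mean(clusters != labels), np.mean(clusters == labels))
print("misclustering proportion:", err)
```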
We establish topological necessary and sufficient conditions under which a pair of statistical hypotheses can be consistently distinguished when i.i.d. observations are recorded only to finite precision. Requiring the test's decision regions to be open in the sample-space topology to accommodate finite-precision data, we show that a pair of null and alternative hypotheses $H_0$ and $H_1$ admits a consistent test if and only if they are $F_\sigma$ in the weak topology on the space of probability measures $W := H_0\cup H_1$. Additionally, the hypotheses admit uniform error control under $H_0$ and/or $H_1$ if and only if $H_0$ and/or $H_1$ are closed in $W$. Under compactness assumptions, uniform consistency is characterised by $H_0$ and $H_1$ having disjoint closures in the ambient space of probability measures. These criteria imply that, without regularity assumptions, conditional independence is not consistently testable. We introduce a Lipschitz-continuity assumption on the family of conditional distributions under which we recover testability of conditional independence, with uniform error control under the null and testable smoothness constraints.
This study examines generalized cross-validation for the tuning parameter selection for ridge regression in high-dimensional misspecified linear models. The set of candidates for the tuning parameter includes not only positive values but also zero and negative values. We demonstrate that if the second moment of the specification error converges to zero, generalized cross-validation is still a uniformly consistent estimator of the out-of-sample prediction risk. This implies that generalized cross-validation selects the tuning parameter for which ridge regression asymptotically achieves the smallest prediction risk among the candidates if the degree of misspecification for the regression function is small. Our simulation studies show that ridge regression tuned by generalized cross-validation exhibits a prediction performance similar to that of optimally tuned ridge regression and outperforms the Lasso under correct and incorrect model specifications.
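A minimal illustration of the selection rule (not the paper's theory) is given below: for each candidate $\lambda$, including zero and negative values, compute the ridge fit and the generalized cross-validation score $\mathrm{GCV}(\lambda) = n^{-1}\|y - \hat y_\lambda\|^2 / (1 - \operatorname{tr}(H_\lambda)/n)^2$. Negative candidates are handled the same way as long as $X^\top X + \lambda I$ remains invertible; the simulated misspecification and the candidate grid are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)
# mildly misspecified regression function
y = X @ beta + 0.1 * np.sin(X[:, 0]) ** 2 + rng.normal(scale=0.5, size=n)

def gcv(lmbda):
    """Generalized cross-validation score for ridge with tuning parameter lmbda."""
    H = X @ np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T)   # ridge hat matrix
    resid = y - H @ y
    return np.mean(resid ** 2) / (1 - np.trace(H) / n) ** 2

candidates = [-1.0, 0.0, 1.0, 5.0, 10.0, 50.0]   # negative, zero, and positive values
scores = {lam: gcv(lam) for lam in candidates}
print("GCV-selected lambda:", min(scores, key=scores.get))
```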
Correlation analysis is a fundamental problem in statistics. In this paper, we consider the correlation detection problem between a pair of Erdos-Renyi graphs. Specifically, the problem is formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent; under the alternative hypothesis, the two graphs are edge-correlated through a latent permutation. We focus on the scenario where only two induced subgraphs are sampled, and characterize the sample size threshold for detection. At the information-theoretic level, we establish the sample complexity rates that are optimal up to constant factors over most parameter regimes, and the remaining gap is bounded by a subpolynomial factor. On the algorithmic side, we propose polynomial-time tests based on counting trees and bounded degree motifs, and identify the regimes where they succeed. Moreover, leveraging the low-degree conjecture, we provide evidence of computational hardness that matches our achievable guarantees, showing that the proposed polynomial-time tests are rate-optimal. Together, these results reveal a statistical--computational gap in the sample size required for correlation detection. Finally, we validate the proposed algorithms on synthetic data and a real coauthor network, demonstrating strong empirical performance.
This paper introduces the modeling of circular data with excess zeros under a longitudinal framework, where the response is a circular variable and the covariates can be both linear and circular in nature. In the literature, various circular-circular and circular-linear regression models have been studied and applied to different real-world problems. However, there are no models for addressing zero-inflated circular observations in the context of longitudinal studies. Motivated by a real case study, a mixed-effects two-stage model based on the projected normal distribution is proposed to handle such data. The interpretation of the model parameters is discussed and identifiability conditions are derived. A Bayesian methodology based on Gibbs sampling is developed for estimating the associated model parameters. Simulation results show that the proposed method outperforms its competitors in various situations. A real dataset on post-operative astigmatism is analyzed to demonstrate the practical implementation of the proposed methodology. The proposed method facilitates effective decision-making for treatment choices and in follow-up phases.
Consider testing a zero restriction on the mean of a $d$-dimensional random vector based on an i.i.d. sample of size $n$. Suppose further that the coordinates are only assumed to possess $m>2$ moments. Then, max-tests based on arithmetic means and critical values derived from Gaussian approximations are not guaranteed to be asymptotically valid unless $d$ is relatively small compared to $n$, because said approximation faces a polynomial growth barrier of $d=o(n^{m/2-1})$. We propose a max-test based on winsorized means, and show that it holds the desired asymptotic size even when $d$ grows at an exponential rate in $n$ and the data are adversarially contaminated. Our characterization of its asymptotic power function shows that these benefits do not come at the cost of reduced asymptotic power: the robustified max-test has identical asymptotic power to that based on arithmetic means whenever the stronger assumptions underlying the latter are satisfied. We also investigate when -- and when not -- data-driven (bootstrap) critical values can strictly increase asymptotic power of the robustified max-test.
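As a hedged sketch of the type of statistic involved (the exact winsorization level and critical-value construction in the paper are not reproduced here), one can winsorize each coordinate at a fixed threshold, average, studentize, and take the maximum over the $d$ coordinates. The threshold and simulated heavy-tailed data below are illustrative.

```python
import numpy as np

def winsorized_max_stat(X, tau):
    """Max over coordinates of studentized winsorized means.

    X   : (n, d) data matrix
    tau : winsorization threshold (values are clipped to [-tau, tau])
    """
    n = X.shape[0]
    Xw = np.clip(X, -tau, tau)
    means = Xw.mean(axis=0)
    sds = Xw.std(axis=0, ddof=1)
    return np.max(np.sqrt(n) * np.abs(means) / sds)

rng = np.random.default_rng(6)
X = rng.standard_t(df=3, size=(300, 2000))   # heavy tails, d much larger than n
print("winsorized max statistic:", winsorized_max_stat(X, tau=5.0))
```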
Intermittent time series, characterised by a significant proportion of zeros, constitute a large percentage of inventory items in supply chains. Probabilistic forecasts are needed to plan inventory levels; the predictive distribution should cover non-negative values, have a mass at zero, and a long upper tail. Intermittent time series are commonly forecast using local models, which are trained individually on each time series. In recent years, global models, which are trained on a large collection of time series, have become popular for time series forecasting. Global models are often based on neural networks; however, they have not yet been exhaustively tested on intermittent time series. We carry out the first study comparing state-of-the-art local (iETS, TweedieGP) and global models (D-Linear, DeepAR, Transformers) on intermittent time series. For neural network models we consider three different distribution heads suitable for intermittent time series: negative binomial, hurdle-shifted negative binomial, and Tweedie. We use, for the first time, the last two distribution heads with neural networks. We perform experiments on five large datasets comprising more than 40,000 real-world time series. Among the neural networks, D-Linear provides the best accuracy and consistently outperforms the local models, while also having low computational requirements. Transformer-based architectures are instead much more computationally demanding and less accurate. Among the distribution heads, the Tweedie provides the best estimates of the highest quantiles, while the negative binomial offers the best overall performance.
This paper proposes a Mixture Density Network (MDN) for forecasting time series that exhibit locally explosive behavior. By incorporating skewed t-distributions as mixture components, our approach offers enhanced flexibility in capturing the skewed, heavy-tailed, and potentially multimodal nature of predictive densities associated with bubble dynamics modeled by mixed causal-noncausal ARMA processes. In addition, we implement an adaptive weighting scheme that emphasizes tail observations during training and hence leads to accurate density estimation in the extreme regions most relevant for financial applications. Equally important, once trained, the MDN produces near-instantaneous density forecasts. Through extensive Monte Carlo simulations and an empirical application to natural gas prices, we show that the proposed MDN-based framework delivers superior forecasting performance relative to existing approaches.
Modeling the time-varying covariance structures of high-dimensional variables is critical across diverse scientific and industrial applications; however, existing approaches exhibit notable limitations in either modeling flexibility or inferential efficiency. For instance, change-point modeling fails to account for the continuous time-varying nature of covariance structures, while GARCH and stochastic volatility models suffer from over-parameterization and the risk of overfitting. To address these challenges, we propose a Bayesian factor modeling framework designed to enable simultaneous inference of both the covariance structure of a high-dimensional time series and its time-varying dynamics. The associated Expectation-Maximization (EM) algorithm not only features an exact, closed-form update for the M-step but also is easily generalizable to more complex settings, such as spatiotemporal multivariate factor analysis. We validate our method through simulation studies and real-data experiments using climate and financial datasets.
We introduce a general framework for testing temporal symmetries in time series based on the distribution of ordinal patterns. While previous approaches have focused on specific forms of asymmetry, such as time reversal, our method provides a unified framework applicable to arbitrary symmetry tests. We establish asymptotic results for the resulting test statistics under a broad class of stationary processes. Comprehensive experiments on both synthetic and real data demonstrate that the proposed test achieves high sensitivity to structural asymmetries while remaining fully data-driven and computationally efficient.
The study aimed to evaluate the applicability of environmental indices in the monitoring of smouldering coal-waste dumps. A dump located in the Upper Silesian Coal Basin served as the research site for a multi-method analysis combining remote sensing and field-based data. Two UAV survey campaigns were conducted, capturing RGB, infrared, and multispectral imagery. These were supplemented with direct ground measurements of subsurface temperature and detailed vegetation mapping. Additionally, publicly available satellite data from the Landsat and Sentinel missions were analysed. A range of vegetation and fire-related indices (NDVI, SAVI, EVI, BAI, among others) were calculated to identify thermally active zones and assess vegetation conditions within these degraded areas. The results revealed strong seasonal variability in vegetation indices on thermally active sites, with evidence of disrupted vegetation cycles, including winter greening in moderately heated root zones - a pattern indicative of stress and degradation processes. While satellite data proved useful in reconstructing the fire history of the dump, their spatial resolution was insufficient for detailed monitoring of small-scale thermal anomalies. The study highlights the diagnostic potential of UAV-based remote sensing in post-industrial environments undergoing land degradation but emphasises the importance of field validation for accurate environmental assessment.
Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving differential equations and modeling physical systems by embedding physical laws into the learning process. However, rigorously quantifying how well a PINN captures the complete dynamical behavior of the system, beyond simple trajectory prediction, remains a challenge. This paper proposes a novel experimental framework to address this by employing Fisher information for differentiable dynamical systems, denoted $g_F^C$. This Fisher information, distinct from its statistical counterpart, measures inherent uncertainties in deterministic systems, such as sensitivity to initial conditions, and is related to the phase space curvature and the net stretching action of the state space evolution. We hypothesize that if a PINN accurately learns the underlying dynamics of a physical system, then the Fisher information landscape derived from the PINN's learned equations of motion will closely match that of the original analytical model. This match would signify that the PINN has achieved comprehensive fidelity capturing not only the state evolution but also crucial geometric and stability properties. We outline an experimental methodology using the dynamical model of a car to compute and compare $g_F^C$ for both the analytical model and a trained PINN. The comparison, based on the Jacobians of the respective system dynamics, provides a quantitative measure of the PINN's fidelity in representing the system's intricate dynamical characteristics.
Recent developments in causal machine learning methods have made it easier to estimate flexible relationships between confounders, treatments and outcomes, making unconfoundedness assumptions in causal analysis more palatable. How successful are these approaches in recovering ground truth baselines? In this paper we analyze a new data sample including an experimental rollout of a new feature at a large technology company and a simultaneous sample of users who endogenously opted into the feature. We find that recovering ground truth causal effects is feasible -- but only with careful modeling choices. Our results build on the observational causal literature beginning with LaLonde (1986), offering best practices for more credible treatment effect estimation in modern, high-dimensional datasets.
Fairness-aware machine learning has recently drawn attention from various communities seeking to mitigate discrimination against certain societal groups in data-driven tasks. For fair supervised learning, particularly in pre-processing, two main categories have emerged: data fairness and task-tailored fairness. The former directly finds an intermediate distribution among the groups, independent of the type of the downstream model, so that a learned downstream classification/regression model returns similar predictive scores for individuals with the same covariates, irrespective of their sensitive attributes. The latter explicitly takes the supervised learning task into account when constructing the pre-processing map. In this work, we study algorithmic fairness for supervised learning and argue that data fairness approaches impose overly strong regularization from the perspective of the HGR correlation. This motivates us to devise a novel pre-processing approach tailored to supervised learning. We account for the trade-off between fairness and utility in obtaining the pre-processing map. We then study the behavior of arbitrary downstream supervised models learned on the transformed data to find sufficient conditions guaranteeing their fairness improvement and utility preservation. To our knowledge, no prior work in the branch of task-tailored methods has theoretically investigated downstream guarantees when using pre-processed data. We further evaluate our framework through comparison studies on tabular and image data sets, showing that it preserves consistent trade-offs across multiple downstream models compared to recent competing methods. In particular, for computer vision data, our method alters only the necessary semantic features related to the central machine learning task to achieve fairness.
We introduce a unified framework that seamlessly integrates algorithmic recourse, contextual bandits, and large language models (LLMs) to support sequential decision-making in high-stakes settings such as personalized medicine. We first introduce the recourse bandit problem, where a decision-maker must select both a treatment action and a feasible, minimal modification to mutable patient features. To address this problem, we develop the Generalized Linear Recourse Bandit (GLRB) algorithm. Building on this foundation, we propose LIBRA, a Language Model-Informed Bandit Recourse Algorithm that strategically combines domain knowledge from LLMs with the statistical rigor of bandit learning. LIBRA offers three key guarantees: (i) a warm-start guarantee, showing that LIBRA significantly reduces initial regret when LLM recommendations are near-optimal; (ii) an LLM-effort guarantee, proving that the algorithm consults the LLM only $O(\log^2 T)$ times, where $T$ is the time horizon, ensuring long-term autonomy; and (iii) a robustness guarantee, showing that LIBRA never performs worse than a pure bandit algorithm even when the LLM is unreliable. We further establish matching lower bounds that characterize the fundamental difficulty of the recourse bandit problem and demonstrate the near-optimality of our algorithms. Experiments on synthetic environments and a real hypertension-management case study confirm that GLRB and LIBRA improve regret, treatment quality, and sample efficiency compared with standard contextual bandits and LLM-only benchmarks. Our results highlight the promise of recourse-aware, LLM-assisted bandit algorithms for trustworthy LLM-bandits collaboration in personalized high-stakes decision-making.
Personalized health analytics increasingly rely on population benchmarks to provide contextual insights such as ``How do I compare to others like me?'' However, cohort-based aggregation of health data introduces nontrivial privacy risks, particularly in interactive and longitudinal digital platforms. Existing privacy frameworks such as $k$-anonymity and differential privacy provide essential but largely static guarantees that do not fully capture the cumulative, distributional, and tail-dominated nature of re-identification risk in deployed systems. In this work, we present a privacy-preserving cohort analytics framework that combines deterministic cohort constraints, differential privacy mechanisms, and synthetic baseline generation to enable personalized population comparisons while maintaining strong privacy protections. We further introduce a stochastic risk modeling approach that treats re-identification risk as a random variable evolving over time, enabling distributional evaluation through Monte Carlo simulation. Adapting quantitative risk measures from financial mathematics, we define Privacy Loss at Risk (P-VaR) to characterize worst-case privacy outcomes under realistic cohort dynamics and adversary assumptions. We validate our framework through system-level analysis and simulation experiments, demonstrating how privacy-utility tradeoffs can be operationalized for digital health platforms. Our results suggest that stochastic risk modeling complements formal privacy guarantees by providing interpretable, decision-relevant metrics for platform designers, regulators, and clinical informatics stakeholders.
We analyze daily lead-time distributions for two Airbnb demand metrics, Nights Booked (volume) and Gross Booking Value (revenue), treating each day's allocation across 0-365 days as a compositional vector. The data span 2,557 days from January 2019 through December 2025 in a large North American region. Three findings emerge. First, GBV concentrates more heavily in mid-range horizons: beyond 90 days, GBV tail mass typically exceeds Nights by 20-50%, with ratios reaching 75% at the 180-day threshold during peak seasons. Second, Gamma and Weibull distributions fit comparably well under interval-censored cross-entropy. Gamma wins on 61% of days for Nights and 52% for GBV, with Weibull close behind at 38% and 45%. Lognormal rarely wins (<3%). Nonparametric GAMs achieve 18-80x lower CRPS but sacrifice interpretability. Third, generalized Pareto fits suggest bounded tails for both metrics at thresholds below 150 days, though this may partly reflect right-truncation at 365 days; above 150 days, estimates destabilize. Bai-Perron tests with HAC standard errors identify five structural breaks in the Wasserstein distance series, with early breaks coinciding with COVID-19 disruptions. The results show that volume and revenue lead-time shapes diverge systematically, that simple two-parameter distributions capture daily pmfs adequately, and that tail inference requires care near truncation boundaries.
We propose a federated learning framework for the calibration of parametric insurance indices under heterogeneous renewable energy production losses. Producers locally model their losses using Tweedie generalized linear models and private data, while a common index is learned through federated optimization without sharing raw observations. The approach accommodates heterogeneity in variance and link functions and directly minimizes a global deviance objective in a distributed setting. We implement and compare FedAvg, FedProx and FedOpt, and benchmark them against an existing approximation-based aggregation method. An empirical application to solar power production in Germany shows that federated learning recovers comparable index coefficients under moderate heterogeneity, while providing a more general and scalable framework.
Matrix completion is a classical problem that has received recurring interest across a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown, $n\times d$ matrix $M$ (with $n \ge d$) is observed independently with probability $p = C / d$, for a fixed integer $C \ge 2$. This setting is motivated by applications involving large, sparse panel datasets, where the number of rows far exceeds the number of columns. When each row contains only $C$ entries -- fewer than the rank of $M$ -- accurate imputation of $M$ is impossible. Instead, we estimate the row span of $M$ or the averaged second-moment matrix $T = M^{\top} M / n$. The empirical second-moment matrix computed from observed entries exhibits non-random and sparse missingness. We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, followed by gradient descent to impute the missing entries of $T$. The normalization divides a weighted sum of $n$ binomial random variables by the total number of ones. We show that the estimator is unbiased for any $p$ and enjoys low variance. When the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model satisfying an incoherence condition, we prove that if $n \ge O({d r^5 \epsilon^{-2} C^{-2} \log d})$, any local minimum of the gradient-descent objective is approximately global and recovers $T$ with error at most $\epsilon^2$. Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to baseline estimators. We also empirically validate the linear sampling complexity of $n$ relative to $d$ on synthetic data. On an Amazon reviews dataset with sparsity $10^{-7}$, our method reduces the recovery error of $T$ by $59\%$ and $M$ by $38\%$ compared to baseline methods.
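The first step of the estimator, normalizing each entry of the empirical second-moment matrix by how often that pair of columns was actually observed together, can be sketched as follows; the subsequent gradient-descent imputation of the missing entries of $T$ is omitted, and the simulated factor model and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r, C = 20000, 40, 3, 3
U = rng.normal(size=(n, r)); V = rng.normal(size=(d, r)) / np.sqrt(r)
M = U @ V.T                                   # low-rank panel with n >> d
mask = rng.random((n, d)) < C / d             # each entry observed with probability C/d

M_obs = np.where(mask, M, 0.0)
S = M_obs.T @ M_obs                           # empirical second moment from observed entries
counts = mask.astype(float).T @ mask.astype(float)   # rows observing both columns j and k

# frequency normalization: unbiased for T[j, k] wherever the pair was observed
T_hat = np.divide(S, counts, out=np.zeros_like(S), where=counts > 0)
T_true = M.T @ M / n
rel_err = np.linalg.norm((T_hat - T_true)[counts > 0]) / np.linalg.norm(T_true)
print("relative error on observed entries:", rel_err)
```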
Insider threat detection is a key challenge in enterprise security, relying on user activity logs that capture rich and complex behavioral patterns. These logs are often multi-channel, non-stationary, and anomalies are rare, making anomaly detection challenging. To address these issues, we propose a novel framework that integrates wavelet-aware modulation, multi-resolution wavelet decomposition, and resolution-adaptive attention for robust anomaly detection. Our approach first applies a deviation-aware modulation scheme to suppress routine behaviors while amplifying anomalous deviations. Next, discrete wavelet transform (DWT) decomposes the log signals into multi-resolution representations, capturing both long-term trends and short-term anomalies. Finally, a learnable attention mechanism dynamically reweights the most discriminative frequency bands for detection. On the CERT r4.2 benchmark, our approach consistently outperforms existing baselines in precision, recall, and F1 score across various time granularities and scenarios.
Large language models (LLMs) are increasingly used to predict human behavior. We propose a measure for evaluating how much knowledge a pretrained LLM brings to such a prediction: its equivalent sample size, defined as the amount of task-specific data needed to match the predictive accuracy of the LLM. We estimate this measure by comparing the prediction error of a fixed LLM in a given domain to that of flexible machine learning models trained on increasing samples of domain-specific data. We further provide a statistical inference procedure by developing a new asymptotic theory for cross-validated prediction error. Finally, we apply this method to the Panel Study of Income Dynamics. We find that LLMs encode considerable predictive information for some economic variables but much less for others, suggesting that their value as substitutes for domain-specific data differs markedly across settings.
Real-world tabular databases routinely combine continuous measurements and categorical records, yet missing entries are pervasive and can distort downstream analysis. We propose Statistical-Neural Interaction (SNI), an interpretable mixed-type imputation framework that couples correlation-derived statistical priors with neural feature attention through a Controllable-Prior Feature Attention (CPFA) module. CPFA learns head-wise prior-strength coefficients $\{\lambda_h\}$ that softly regularize attention toward the prior while allowing data-driven deviations when nonlinear patterns appear to be present in the data. Beyond imputation, SNI aggregates attention maps into a directed feature-dependency matrix that summarizes which variables the imputer relied on, without requiring post-hoc explainers. We evaluate SNI against six baselines (Mean/Mode, MICE, KNN, MissForest, GAIN, MIWAE) on six datasets spanning ICU monitoring, population surveys, socio-economic statistics, and engineering applications. Under MCAR/strict-MAR at 30\% missingness, SNI is generally competitive on continuous metrics but is often outperformed by accuracy-first baselines (MissForest, MIWAE) on categorical variables; in return, it provides intrinsic dependency diagnostics and explicit statistical-neural trade-off parameters. We additionally report MNAR stress tests (with a mask-aware variant) and discuss computational cost, limitations -- particularly for severely imbalanced categorical targets -- and deployment scenarios where interpretability may justify the trade-off.
Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handling missing data is importance sampling, in which old data collected under a base policy are reweighted to estimate gradients for the current policy. However, importance sampling quickly becomes unstable when communication is limited (i.e., the probability of missing data is high), since the base policy becomes outdated. To address this issue, we propose a technique called base policy prediction, which uses old gradients to predict policy updates and collects samples for a sequence of base policies, reducing the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples of predicted base policies can be collected within one communication round. Theoretically, we show that our algorithm converges to an $\varepsilon$-Nash equilibrium in potential games with only $O(\varepsilon^{-3/4})$ communication rounds and $O(\mathrm{poly}(\max_i |A_i|)\,\varepsilon^{-11/4})$ samples, improving existing state-of-the-art results in communication cost, as well as sample complexity without the exponential dependence on the joint action space size. We also extend these results to general Markov cooperative games to find an agent-wise local maximum. Empirically, we test the base policy prediction algorithm in both simulated games and MAPPO for complex environments.
This paper develops a unified framework for partial identification and inference in stratified experiments with attrition, accommodating both equal and heterogeneous treatment shares across strata. For equal-share designs, we apply recent theory for finely stratified experiments to Lee bounds, yielding closed-form, design-consistent variance estimators and properly sized confidence intervals. Simulations show that the conventional formula can overstate uncertainty, while our approach delivers tighter intervals. When treatment shares differ across strata, we propose a new strategy, which combines inverse probability weighting and global trimming to construct valid bounds even when strata are small or unbalanced. We establish identification, introduce a moment estimator, and extend existing inference results to stratified designs with heterogeneous shares, covering a broad class of moment-based estimators which includes the one we formulate. We also generalize our results to designs in which strata are defined solely by observed labels.
Computing $\log\det(A)$ for large symmetric positive definite matrices arises in Gaussian process inference and Bayesian model comparison. Standard methods combine matrix-vector products with polynomial approximations. We study a different access model: trace powers $p_k = \operatorname{tr}(A^k)$, natural when matrix powers are available. Classical moment-based approximations Taylor-expand $\log(\lambda)$ around the arithmetic mean $\bar{\lambda}$. This requires $|\lambda - \bar{\lambda}| < \bar{\lambda}$ and diverges when the condition number $\kappa$ exceeds $4$. We work instead with the moment function $M(t) = \mathbb{E}[X^t]$ of the normalized eigenvalues $X = \lambda/\bar{\lambda}$. Since $M'(0) = \mathbb{E}[\log X]$, the log-determinant becomes $\log\det(A) = n(\log \bar{\lambda} + M'(0))$ -- the problem reduces to estimating a derivative at $t = 0$. Trace powers give $M(k)$ at positive integers, but interpolating $M(t)$ directly is ill-conditioned due to exponential growth. The transform $K(t) = \log M(t)$ compresses this range. Normalization by $\bar{\lambda}$ ensures $K(0) = K(1) = 0$. With these anchors fixed, we interpolate $K$ through $m+1$ consecutive integers and differentiate to estimate $K'(0)$. However, this local interpolation cannot capture arbitrary spectral features. We prove a fundamental limit: no continuous estimator using finitely many positive moments can be uniformly accurate over unbounded conditioning. Positive moments downweight the spectral tail, whereas $K'(0) = \mathbb{E}[\log X]$ is tail-sensitive. This motivates guaranteed bounds. From the same traces we derive upper bounds on $(\det A)^{1/n}$. Given a spectral floor $r \leq \lambda_{\min}$, we obtain moment-constrained lower bounds, yielding a provable interval for $\log\det(A)$. A gap diagnostic indicates when to trust the point estimate and when to report bounds. All estimators and bounds cost $O(m)$ operations given the traces, independent of $n$. For $m \in \{4, \ldots, 8\}$, this is effectively constant time.
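A small sketch of the point estimator described above (the guaranteed bounds and the gap diagnostic are not reproduced): normalize by the arithmetic mean, form $K(k) = \log M(k)$ from the trace powers, interpolate through $k = 0,\dots,m$, and differentiate at $0$. The spectrum used for validation below is synthetic and well conditioned; function names and parameters are illustrative.

```python
import numpy as np

def logdet_from_trace_powers(traces, n, m):
    """Point estimate of log det(A) from p_k = tr(A^k), k = 1..m.

    Uses M(k) = p_k / (n * mean^k) for the normalized eigenvalues, K = log M,
    an exact degree-m polynomial interpolant of K through t = 0..m (K(0) = 0),
    and the identity log det(A) ~ n * (log mean + K'(0)).
    """
    mean = traces[0] / n                                   # arithmetic mean of eigenvalues
    ts = np.arange(m + 1)
    K = np.zeros(m + 1)
    K[1:] = np.log(traces[:m] / (n * mean ** ts[1:]))      # K(0) = 0, and K(1) = 0 by construction
    coeffs = np.polyfit(ts, K, deg=m)                      # exact interpolation through m+1 points
    Kprime0 = np.poly1d(coeffs).deriv()(0.0)
    return n * (np.log(mean) + Kprime0)

rng = np.random.default_rng(8)
n, m = 300, 6
eig = rng.uniform(0.5, 3.0, size=n)                        # moderately conditioned spectrum
traces = np.array([np.sum(eig ** k) for k in range(1, m + 1)])
print("estimate:", logdet_from_trace_powers(traces, n, m),
      " exact:", np.sum(np.log(eig)))
```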
Entropic optimal transport problems play an increasingly important role in machine learning and generative modelling. In contrast with optimal transport maps, which often have limited applicability in high dimensions, Schrödinger bridges can be solved using the celebrated Sinkhorn algorithm, a.k.a. the iterative proportional fitting procedure. The stability properties of Sinkhorn bridges as the number of iterations tends to infinity are a very active research area in applied probability and machine learning. Traditional proofs of convergence are mainly based on nonlinear versions of Perron-Frobenius theory and related Hilbert projective metric techniques, gradient descent, Bregman divergence techniques and Hamilton-Jacobi-Bellman equations, including propagation of convexity profiles based on coupling diffusions by reflection methods. The objective of this review article is to present, in a self-contained manner, recently developed Sinkhorn/Gibbs-type semigroup analysis based upon contraction coefficients and Lyapunov-type operator-theoretic techniques. These powerful, off-the-shelf semigroup methods are based upon transportation cost inequalities (e.g. log-Sobolev, Talagrand quadratic inequality, curvature estimates), $\phi$-divergences, Kantorovich-type criteria and Dobrushin contraction-type coefficients on weighted Banach spaces, as well as Wasserstein distances. This novel semigroup analysis allows one to unify and simplify many arguments in the stability analysis of the Sinkhorn algorithm. It also yields new contraction estimates w.r.t. generalized $\phi$-entropies, as well as weighted total variation norms, Kantorovich criteria and Wasserstein distances.
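For readers less familiar with the algorithm whose stability is being reviewed, a minimal Sinkhorn / iterative proportional fitting loop for discrete entropic optimal transport is sketched below; the semigroup and contraction analysis discussed in the article operates at a far more general level, and the regularization strength and toy marginals are illustrative.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.5, iters=500):
    """Entropic OT between discrete marginals mu, nu with cost matrix C.

    Alternately rescales rows and columns of the Gibbs kernel K = exp(-C/eps)
    so that the coupling P = diag(u) K diag(v) matches both marginals.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]          # transport plan

rng = np.random.default_rng(9)
x, y = np.sort(rng.normal(size=30)), np.sort(rng.normal(1.0, 1.0, size=40))
mu, nu = np.full(30, 1 / 30), np.full(40, 1 / 40)
C = (x[:, None] - y[None, :]) ** 2
P = sinkhorn(mu, nu, C)
print("row-marginal error:", np.abs(P.sum(axis=1) - mu).max())
```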
Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
Online continual learning (OCL) methods adapt to changing environments without forgetting past knowledge. Similarly, online time series forecasting (OTSF) is a real-world problem where data evolve in time and success depends on both rapid adaptation and long-term memory. Indeed, time-varying and regime-switching forecasting models have been extensively studied, offering a strong justification for the use of OCL in these settings. Building on recent work that applies OCL to OTSF, this paper aims to strengthen the theoretical and practical connections between time series methods and OCL. First, we reframe neural network optimization as a parameter filtering problem, showing that natural gradient descent is a score-driven method and proving its information-theoretic optimality. Then, we show that using a Student's t likelihood in addition to natural gradient induces a bounded update, which improves robustness to outliers. Finally, we introduce Natural Score-driven Replay (NatSR), which combines our robust optimizer with a replay buffer and a dynamic scale heuristic that improves fast adaptation at regime drifts. Empirical results demonstrate that NatSR achieves stronger forecasting performance than more complex state-of-the-art methods.
Evolutionary accumulation models (EvAMs) are an emerging class of machine learning methods designed to infer the evolutionary pathways by which features are acquired. Applications include cancer evolution (accumulation of mutations), anti-microbial resistance (accumulation of drug resistances), genome evolution (organelle gene transfers), and more diverse themes in biology and beyond. Following these themes, many EvAMs assume that features are gained irreversibly -- no loss of features can occur. Reversible approaches do exist but are often computationally (much) more demanding and statistically less stable. Our goal here is to explore whether useful information about evolutionary dynamics which are in reality reversible can be obtained from modelling approaches with an assumption of irreversibility. We identify, and use simulation studies to quantify, errors involved in neglecting reversible dynamics, and show the situations in which approximate results from tractable models can be informative and reliable. In particular, EvAM inferences about the relative orderings of acquisitions, and the core dynamic structure of evolutionary pathways, are robust to reversibility in many cases, while estimations of uncertainty and feature interactions are more error-prone.
Notions of positive curvature have been shown to imply many remarkable properties for Markov processes, in terms, e.g., of regularization effects, functional inequalities, mixing time bounds and, more recently, the cutoff phenomenon. In this work, we are interested in a relaxed variant of Ollivier's coarse Ricci curvature, where a Markov kernel $P$ satisfies only a weaker Wasserstein bound $W_p(\mu P, \nu P) \leq K W_p(\mu,\nu)+M$ for constants $M\ge 0, K\in [0,1], p \ge 1$. Under appropriate additional assumptions on the one-step transition measures $\delta_x P$, we establish (i) a form of local concentration, given by a defective Talagrand inequality, and (ii) an entropy-transport regularization effect. We consider as illustrative examples the Langevin dynamics and the Proximal Sampler when the target measure is a log-Lipschitz perturbation of a log-concave measure. As an application of the above results, we derive criteria for the occurrence of the cutoff phenomenon in some negatively curved settings.
We develop a multilevel Monte Carlo (MLMC) framework for uncertainty quantification with Monte Carlo dropout. Treating dropout masks as a source of epistemic randomness, we define a fidelity hierarchy by the number of stochastic forward passes used to estimate predictive moments. We construct coupled coarse--fine estimators by reusing dropout masks across fidelities, yielding telescoping MLMC estimators for both predictive means and predictive variances that remain unbiased for the corresponding dropout-induced quantities while reducing sampling variance at fixed evaluation budget. We derive explicit bias, variance and effective cost expressions, together with sample-allocation rules across levels. Numerical experiments on forward and inverse PINNs--Uzawa benchmarks confirm the predicted variance rates and demonstrate efficiency gains over single-level MC-dropout at matched cost.
We introduce a novel model for time-varying, asymmetric, tail-dependent copulas in high dimensions that incorporates both spectral dynamics and regularization. The dynamics of the eigenvalues of the dependence matrix are modeled in a score-driven way, while biases in the unconditional eigenvalue spectrum are resolved by non-linear shrinkage. The dynamic parameterization of the copula dependence matrix ensures that it satisfies the appropriate restrictions at all times and for any dimension. The model is parsimonious, computationally efficient, easily scalable to high dimensions, and performs well for both simulated and empirical data. In an empirical application to financial market dynamics using 100 stocks from 10 different countries and 10 different industry sectors, we find that our copula model captures both geographic and industry-related co-movements and outperforms recent, computationally more intensive clustering-based factor copula alternatives. Both the spectral dynamics and the regularization contribute to the new model's performance. During periods of market stress, we find that the spectral dynamics reveal strong increases in international stock market dependence, which cause reductions in diversification potential and increases in systemic risk.
Large-scale dynamic inverse problems are often ill-posed due to model complexity and the high dimensionality of the unknown parameters. Regularization is commonly employed to mitigate ill-posedness by incorporating prior information and structural constraints. However, classical regularization formulations are frequently infeasible in this setting due to prohibitive memory requirements, necessitating sequential methods that process data and state information online, eliminating the need to form the full space-time problem. In this work, we propose a memory-efficient framework for reconstructing dynamic sequences of undersampled images from computerized tomography data that requires minimal hyperparameter tuning. The approach is based on a prior-informed, dimension-reduced Kalman filter with smoothing. While well suited for dynamic image reconstruction, practical deployment is challenging when the state transition model and covariance parameters must be initialized without prior knowledge and estimated in a single pass. To address these limitations, we integrate regularized motion models with expectation-maximization strategies for the estimation of state transition dynamics and error covariances within the Kalman filtering framework. We demonstrate the effectiveness of the proposed method through numerical experiments on limited-angle and single-shot computerized tomography problems, highlighting improvements in reconstruction accuracy, memory efficiency, and computational cost.
Despite their promise, fair machine learning methods often yield Pareto-inefficient models, in which the performance of certain groups can be improved without degrading that of others. This issue arises frequently in traditional in-processing approaches such as fairness-through-regularization. In contrast, existing Pareto-efficient approaches are biased towards a certain perspective on fairness and fail to adapt to the broad range of fairness metrics studied in the literature. In this paper, we present BADR, a simple framework to recover the optimal Pareto-efficient model for any fairness metric. Our framework recovers its models through a Bilevel Adaptive Rescalarisation procedure. The lower level is a weighted empirical risk minimization task where the weights are a convex combination of the groups, while the upper level optimizes the chosen fairness objective. We equip our framework with two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD, and establish their convergence guarantees. We release badr, an open-source Python toolbox implementing our framework for a variety of learning tasks and fairness metrics. Finally, we conduct extensive numerical experiments demonstrating the advantages of BADR over existing Pareto-efficient approaches to fairness.
The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the underlying mechanisms of Muon -- particularly the role of gradient orthogonalization -- remain poorly understood, with very few works providing end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization, and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with iteration complexities independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon's effectiveness in these matrix optimization problems and potentially beyond.
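The core operation analyzed, replacing the gradient by its spectral orthogonalization before the update, can be written in a few lines. The sketch below uses an SVD to compute the orthogonal factor $UV^\top$; practical Muon implementations typically use a Newton-Schulz iteration instead, and momentum and the other details of the actual optimizer (and of the paper's case studies) are omitted. The toy objective is illustrative only.

```python
import numpy as np

def orthogonalized_gradient_step(W, grad, lr=0.1):
    """One step of a simplified Muon-style update: move along U V^T,
    where grad = U diag(s) V^T, i.e. the gradient with all nonzero
    singular values flattened to 1 (spectral orthogonalization)."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

# Toy objective 0.5 * ||W - A||_F^2, just to exercise the update; with a fixed
# step size the iterate settles within roughly lr of the target A.
rng = np.random.default_rng(10)
A = rng.normal(size=(8, 5))
W = np.zeros_like(A)
for _ in range(200):
    W = orthogonalized_gradient_step(W, W - A, lr=0.05)
print("final error:", np.linalg.norm(W - A))
```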
This paper constructs a multilayer recursive game model to demonstrate that, in a rule-vacuum environment, hierarchical predatory structures inevitably collapse into a monolithic political strongman system due to the conflict between exponentially growing rent dissipation and the rigidity of bottom-level survival constraints. We propose that the rise of a monolithic political strongman is essentially an ``algorithmic entropy reduction'' achieved through forceful means by the system to counteract the ``informational entropy increase'' generated by multilayer agency. However, the order gained at the expense of social complexity results in the stagnation of social evolutionary functions.
We present a unified theoretical framework for effective transport in periodic and tilted periodic potentials based on additive functionals of stochastic processes. By systematically combining the Poisson equation, corrector construction, and martingale decomposition, we show that both the long-time drift and diffusion of overdamped Brownian motion can be derived within a single and transparent scheme. In the absence of external tilt, the formalism naturally recovers the classical Lifson-Jackson formula for the effective diffusion coefficient. When a constant bias is applied, breaking detailed balance and inducing a finite stationary current, the same approach yields the Stratonovich expressions for the effective drift and diffusion in tilted periodic potentials. Beyond one dimension, we demonstrate that the same additive-functional structure extends directly to two-dimensional and general $N$-dimensional periodic diffusions, leading to the standard homogenized drift and diffusion tensor expressed in terms of vector-valued correctors. Our derivation highlights the central role of additive functionals in separating bounded microscopic corrections from unbounded macroscopic transport and clarifies the connection between reversible and nonequilibrium steady states. This work provides a conceptually unified and mathematically controlled route to transport coefficients in periodic media, with direct relevance to stochastic transport, soft matter, and nonequilibrium statistical physics.
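For reference, the Lifson-Jackson formula recovered in the untilted case expresses the effective diffusion coefficient of an overdamped particle in an $L$-periodic potential $V$ as

\[
D_{\mathrm{eff}} \;=\; \frac{D_0}{\big\langle e^{V/k_BT}\big\rangle_L\,\big\langle e^{-V/k_BT}\big\rangle_L},
\qquad
\langle g\rangle_L \;=\; \frac{1}{L}\int_0^L g(x)\,dx,
\]

where $D_0$ is the bare diffusion coefficient and $\langle\cdot\rangle_L$ denotes a spatial average over one period.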
Power system outages expose market participants to significant financial risk unless promptly detected and hedged. We develop an outage identification method from public market signals grounded in parametric quickest change detection (QCD) theory. Parametric QCD operates on stochastic data streams, distinguishing pre- and post-change regimes using the ratio of their respective probability density functions. To derive the density functions for normal and post-outage market signals, we exploit multi-parametric programming to decompose complex market signals into parametric random variables with a known density. These densities are then used to construct a QCD-based statistic that triggers an alarm as soon as the statistic exceeds an appropriate threshold. Numerical experiments on a stylized PJM testbed demonstrate rapid line outage identification from public streams of electricity demand and price data.
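For orientation, the canonical parametric QCD statistic built from the ratio of pre- and post-change densities $f_0$ and $f_1$ is the CUSUM recursion; whether this exact rule or a related likelihood-ratio statistic is used in the paper, the alarm structure is the same:

\[
W_n \;=\; \max\!\Big(0,\; W_{n-1} + \log\frac{f_1(X_n)}{f_0(X_n)}\Big), \qquad W_0 = 0,
\qquad
\tau \;=\; \inf\{\, n : W_n \ge h \,\},
\]

where $h$ is a threshold chosen to control the false alarm rate.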
Fairness and privacy are two vital pillars of trustworthy machine learning. Despite extensive research on these individual topics, the relationship between fairness and privacy has received significantly less attention. In this paper, we utilize the information-theoretic measure Chernoff Information to highlight the data-dependent nature of the relationship among the triad of fairness, privacy, and accuracy. We first define Noisy Chernoff Difference, a tool that allows us to analyze the relationship among the triad simultaneously. We then show that for synthetic data, this value behaves in three distinct ways (depending on the distribution of the data). We highlight the data distributions involved in these cases and explore their fairness and privacy implications. Additionally, we show that Noisy Chernoff Difference acts as a proxy for the steepness of the fairness-accuracy curves. Finally, we propose a method for estimating Chernoff Information on data from unknown distributions and utilize this framework to examine the triad dynamic on real datasets. This work builds towards a unified understanding of the fairness-privacy-accuracy relationship and highlights its data-dependent nature.
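For context, the Chernoff Information between two distributions $P_0$ and $P_1$ with densities $p_0$ and $p_1$ is

\[
C(P_0, P_1) \;=\; -\min_{\lambda \in [0,1]} \log \int p_0^{\lambda}(x)\, p_1^{1-\lambda}(x)\, d\mu(x);
\]

the Noisy Chernoff Difference introduced above builds on this quantity, but its precise definition is specific to the paper and is not reproduced here.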
Orthogonal and 1-Lipschitz neural network layers are essential building blocks in robust deep learning architectures, crucial for certified adversarial robustness, stable generative models, and reliable recurrent networks. Despite significant advancements, existing implementations remain fragmented, limited, and computationally demanding. To address these issues, we introduce Orthogonium, a unified, efficient, and comprehensive PyTorch library providing orthogonal and 1-Lipschitz layers. Orthogonium provides access to standard convolution features -- including support for strides, dilation, grouping, and transposed convolutions -- while maintaining strict mathematical guarantees. Its optimized implementations reduce overhead on large-scale benchmarks such as ImageNet. Moreover, rigorous testing within the library has uncovered critical errors in existing implementations, emphasizing the importance of standardized and reliable tools. Orthogonium thus significantly lowers adoption barriers, enabling scalable experimentation and integration across diverse applications requiring orthogonality and robust Lipschitz constraints. Orthogonium is available at this https URL.
Self-Organizing Maps provide topology-preserving projections of high-dimensional data and have been widely used for visualization, clustering, and vector quantization. In this work, we show that the activation pattern of a SOM -- the squared distances to its prototypes -- can be inverted to recover the exact input under mild geometric conditions. This follows from a classical fact in Euclidean distance geometry: a point in $D$ dimensions is uniquely determined by its distances to $D{+}1$ affinely independent references. We derive the corresponding linear system and characterize the conditions under which the inversion is well-posed. Building upon this mechanism, we introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule, which enables controlled, semantically meaningful trajectories in latent space. MUSIC modifies squared distances to selected prototypes while preserving others, resulting in a deterministic geometric flow aligned with the SOM's piecewise-linear structure. Tikhonov regularization stabilizes the update rule and ensures smooth motion on high-dimensional datasets. Unlike variational or probabilistic generative models, MUSIC does not rely on sampling, latent priors, or encoder-decoder architectures. If no perturbation is applied, inversion recovers the exact input; when a target cluster or prototype is specified, MUSIC produces coherent semantic variations while remaining on the data manifold. This leads to a new perspective on data augmentation and controllable latent exploration based solely on prototype geometry. We validate the approach using synthetic Gaussian mixtures and the MNIST and Faces in the Wild datasets. Across all settings, MUSIC produces smooth, interpretable trajectories that reveal the underlying geometry of the learned manifold, illustrating the advantages of SOM-based inversion over unsupervised clustering.
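A minimal sketch of the inversion mechanism, assuming exact (noise-free) squared distances and at least $D{+}1$ affinely independent prototypes: subtracting the equation for a reference prototype eliminates the quadratic term $\|x\|^2$ and leaves a linear system in $x$. The function name and the plain least-squares solve are illustrative; the paper stabilizes the corresponding update with Tikhonov regularization.

```python
import numpy as np

def invert_som_activation(prototypes, sq_dists):
    """Recover x from its squared distances d_k = ||x - w_k||^2 to SOM prototypes.

    prototypes: (K, D) array of prototype vectors w_k (K >= D + 1, with D + 1 of
    them affinely independent for a unique solution).
    sq_dists: (K,) array of squared distances (the SOM activation pattern).
    """
    w0, d0 = prototypes[0], sq_dists[0]
    # Subtracting the equation for w0 cancels ||x||^2 and yields
    #   2 (w_k - w0)^T x = ||w_k||^2 - ||w0||^2 - (d_k - d0).
    A = 2.0 * (prototypes[1:] - w0)
    b = (np.sum(prototypes[1:] ** 2, axis=1) - np.sum(w0 ** 2)
         - (sq_dists[1:] - d0))
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x
```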
Accurate, low-latency estimates of the instantaneous phase of oscillations are essential for closed-loop sensing and actuation, including (but not limited to) phase-locked neurostimulation and other real-time applications. The endpoint-corrected Hilbert transform (ecHT) reduces boundary artefacts of the Hilbert transform by applying a causal narrow-band filter to the analytic spectrum. This improves the phase estimate at the most recent sample. Despite its widespread empirical use, the systematic endpoint distortions of ecHT have lacked a principled, closed-form analysis. In this study, we derive the ecHT endpoint operator analytically and demonstrate that its output can be decomposed into a desired positive-frequency term (a deterministic complex gain that induces a calibratable amplitude/phase bias) and a residual leakage term setting an irreducible variance floor. This yields (i) an explicit characterisation and bounds for endpoint phase/amplitude error, (ii) a mean-squared-error-optimal scalar calibration (c-ecHT), and (iii) practical design rules relating window length, bandwidth/order, and centre-frequency mismatch to residual bias via an endpoint group delay. The resulting calibrated ecHT achieves near-zero mean phase error and remains computationally compatible with real-time pipelines. Code and analyses are provided at this https URL.
Healthcare sector indices consolidate the economic health of pharmaceutical, biotechnology, and healthcare service firms. The short-term movements in these indices are closely intertwined with capital allocation decisions affecting research and development investment, drug availability, and long-term health outcomes. This research investigates whether historical open-high-low-close (OHLC) index data contain sufficient information for predicting the directional movement of the opening index on the subsequent trading day. The problem is formulated as a supervised classification task involving a one-step-ahead rolling window. A diverse feature set is constructed, comprising original prices, volatility-based technical indicators, and a novel class of nowcasting features derived from mutual OHLC ratios. The framework is evaluated on data from healthcare indices in the U.S. and Indian markets over a five-year period spanning multiple economic phases, including the COVID-19 pandemic. The results demonstrate robust predictive performance, with accuracy exceeding 0.8 and Matthews correlation coefficients above 0.6. Notably, the proposed nowcasting features emerge as a key determinant of market movement. We employ the Shapley-based explainability paradigm to further elucidate the contribution of the features: the results reveal the dominant role of the nowcasting features, followed by a more moderate contribution of the original prices. This research offers societal utility: the proposed features and model for short-term forecasting of healthcare indices can reduce information asymmetry and support a more stable and equitable health economy.
We investigate the use of Wasserstein gradient flows for finding an $E$-optimal design for a regression model. Unlike the commonly used $D$- and $L$-optimality criteria, the $E$-criterion finds a design that maximizes the smallest eigenvalue of the information matrix, and so it is a non-differentiable criterion unless the minimum eigenvalue has geometric multiplicity equal to one. Such maximin design problems abound in statistical applications and present unique theoretical and computational challenges. Building on the differential structure of the $2$-Wasserstein space, we derive explicit formulas for the Wasserstein gradient of the $E$-optimality criterion in the simple-eigenvalue case. For higher multiplicities, we propose a Wasserstein steepest ascent direction and show that it can be computed exactly via a semidefinite programming (SDP) relaxation. We develop particle approximations that connect infinite-dimensional flows with finite-dimensional optimization, and provide approximation guarantees for empirical measures. Our framework extends naturally to constrained designs via projected Wasserstein gradient flows. Numerical experiments demonstrate that the proposed methods successfully recover $E$-optimal designs for both linear and nonlinear regression models, with competitive accuracy and scalability compared to existing heuristic approaches. This work highlights the potential of optimal transport-based dynamics as a unifying tool for studying challenging optimal design problems.
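For concreteness, with regression functions $f(x)$ and a design measure $\xi$ on the design space $\mathcal{X}$, the information matrix and the $E$-optimal design problem read

\[
M(\xi) \;=\; \int_{\mathcal{X}} f(x)\, f(x)^\top \, \xi(dx),
\qquad
\xi^{*} \;\in\; \arg\max_{\xi}\; \lambda_{\min}\big(M(\xi)\big),
\]

which is differentiable in the usual sense only when the minimizing eigenvalue is simple.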
Ferromagnetic exponential random graph models (ERGMs) are random graph models under which the presence of certain small structures (such as triangles) is encouraged; they can be constructed by tilting an Erdős--Rényi model by the exponential of a particular nonlinear Hamiltonian. These models are mixtures of metastable wells which each behave macroscopically like an Erdős--Rényi model, exhibiting the same laws of large numbers for subgraph counts [CD13]. However, on the microscopic scale these metastable wells are very different from Erdős--Rényi models, with the total variation distance between the two measures tending to 1 [MX23]. In this article we clarify this situation by providing a sharp (up to constants) bound on the Hamming-Wasserstein distance between the two models, which is the average number of edges at which they differ, under the coupling which minimizes this average. In particular, we show that this distance is $\Theta(n^{3/2})$, quantifying exactly how these models differ. An upper bound of this form has appeared in the past [RR19], but this was restricted to the subcritical (high-temperature) regime of parameters. We extend this bound, using a new proof technique, to the supercritical (low-temperature) regime, and prove a matching lower bound which has only previously appeared in the subcritical regime of special cases of ERGMs satisfying a "triangle-free" condition [DF25]. To prove the lower bound in the presence of triangles, we introduce an approximation of the discrete derivative of the Hamiltonian, which controls the dynamical properties of the ERGM, in terms of local counts of triangles and wedges (two-stars) near an edge. This approximation is the main technical and conceptual contribution of the article, and we expect it will be useful in a variety of other contexts as well. Along the way, we also prove a bound on the marginal edge probability under the ERGM via a new bootstrapping argument. Such a bound has already appeared [FLSW25], but again only in the subcritical regime and using a different proof strategy.
We study low-rank tensor-product B-spline (TPBS) models for regression tasks and investigate Dirichlet energy as a measure of smoothness. We show that TPBS models admit a closed-form expression for the Dirichlet energy, and reveal scenarios where perfect interpolation is possible with exponentially small Dirichlet energy. This renders global Dirichlet energy-based regularization ineffective. To address this limitation, we propose a novel regularization strategy based on local Dirichlet energies defined on small hypercubes centered at the training points. Leveraging pretrained TPBS models, we also introduce two estimators for inference from incomplete samples. Comparative experiments with neural networks demonstrate that TPBS models outperform neural networks in the overfitting regime for most datasets, and maintain competitive performance otherwise. Overall, TPBS models exhibit greater robustness to overfitting and consistently benefit from regularization, while neural networks are more sensitive to overfitting and less effective in leveraging regularization.
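For reference, the (global) Dirichlet energy of a fitted model $f$ over a domain $\Omega$ is

\[
\mathcal{E}_{\Omega}(f) \;=\; \int_{\Omega} \big\|\nabla f(x)\big\|_2^2 \, dx,
\]

and the local variant described above replaces $\Omega$ with small hypercubes centered at the training points.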
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) will produce unprecedented volumes of heterogeneous astronomical data (images, catalogs, and alerts) that challenge traditional analysis pipelines. The LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter from these data, requiring methods that are statistically powerful, scalable, and operationally reliable. Artificial intelligence and machine learning (AI/ML) are already embedded across DESC science workflows, from photometric redshifts and transient classification to weak lensing inference and cosmological simulations. Yet their utility for precision cosmology hinges on trustworthy uncertainty quantification, robustness to covariate shift and model misspecification, and reproducible integration within scientific pipelines. This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses, revealing that the same core methodologies and fundamental challenges recur across disparate science cases. Since progress on these cross-cutting challenges would benefit multiple probes simultaneously, we identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery. With an eye on emerging techniques, we also explore the potential of the latest foundation model methodologies and LLM-driven agentic AI systems to reshape DESC workflows, provided their deployment is coupled with rigorous evaluation and governance. Finally, we discuss critical software, computing, data infrastructure, and human capital requirements for the successful deployment of these new methodologies, and consider associated risks and opportunities for broader coordination with external actors.
In the usual Bayesian setting, a full probabilistic model is required to link the data and parameters, and the form of this model and the inference and prediction mechanisms are specified via de Finetti's representation. In general, such a formulation is not robust to model misspecification of its component parts. An alternative approach is to draw inference based on loss functions, where the quantity of interest is defined as a minimizer of some expected loss, and to construct posterior distributions based on the loss-based formulation; this strategy underpins the construction of the Gibbs posterior. We develop a Bayesian non-parametric approach; specifically, we generalize the Bayesian bootstrap, and specify a Dirichlet process model for the distribution of the observables. We implement this using direct prior-to-posterior calculations, but also using predictive sampling. We also study the assessment of posterior validity for non-standard Bayesian calculations. We show that the developed non-standard Bayesian updating procedures yield valid posterior distributions in terms of consistency and asymptotic normality under model misspecification. Simulation studies show that the proposed methods can recover the true value of the parameter under misspecification.
Bayes Factors, the Bayesian tool for hypothesis testing, are receiving increasing attention in the literature. Compared to their frequentist rivals ($p$-values or test statistics), Bayes Factors have the conceptual advantage of providing evidence both for and against a null hypothesis, and they can be calibrated so that they do not depend so heavily on the sample size. Research on the synthesis of Bayes Factors arising from individual studies has also been growing, mostly for the fixed effects model for meta-analysis. In this work, we review and propose methods for combining Bayes Factors from multiple studies, depending on the level of information available, focusing on the common effect model. In the process, we provide insights with respect to the interplay between frequentist and Bayesian evidence. We assess the performance of the methods discussed via a simulation study and apply the methods in an example from the field of positive psychology.
We study the classification problem for high-dimensional data with $n$ observations on $p$ features where the $p \times p$ covariance matrix $\Sigma$ exhibits a spiked eigenvalue structure and the vector $\zeta$, given by the difference between the {\em whitened} mean vectors, is sparse. We analyze an adaptive classifier (adaptive with respect to the sparsity $s$) that first performs dimension reduction on the feature vectors prior to classification in the dimensionally reduced space, i.e., the classifier whitens the data, then screens the features by keeping only those corresponding to the $s$ largest coordinates of $\zeta$ and finally applies Fisher linear discriminant on the selected features. Leveraging recent results on entrywise matrix perturbation bounds for covariance matrices, we show that the resulting classifier is Bayes optimal whenever $n \rightarrow \infty$ and $s \sqrt{n^{-1} \ln p} \rightarrow 0$. Notably, our theory also guarantees Bayes optimality for the corresponding quadratic discriminant analysis (QDA). Experimental results on real and synthetic data further indicate that the proposed approach is competitive with state-of-the-art methods while operating on a substantially lower-dimensional representation.
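A schematic of the three-stage classifier described above (whiten, screen the $s$ largest coordinates of $\zeta$, then apply Fisher linear discriminant in the reduced space). This is a simplified sketch: the plug-in covariance, equal class priors, and function names are assumptions, and the paper's estimators for the spiked structure are not reproduced.

```python
import numpy as np

def whiten_screen_lda(X, y, Sigma, s):
    """Sketch: whiten with Sigma^{-1/2}, keep the s largest coordinates of the
    whitened mean difference, then classify with Fisher LDA on those features."""
    evals, evecs = np.linalg.eigh(Sigma)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T    # whitening transform
    Z = X @ W
    mu1, mu0 = Z[y == 1].mean(axis=0), Z[y == 0].mean(axis=0)
    zeta = mu1 - mu0                                 # whitened mean difference
    keep = np.argsort(np.abs(zeta))[-s:]             # screening step
    direction = zeta[keep]                           # LDA direction (identity covariance after whitening)
    threshold = direction @ (mu1 + mu0)[keep] / 2.0
    return lambda Xnew: ((Xnew @ W)[:, keep] @ direction > threshold).astype(int)
```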
We address the problem of prediction for extreme observations by proposing an extremal linear prediction method. We construct an inner product space of nonnegative random variables derived from transformed-linear combinations of independent regularly varying random variables. Under a reasonable modeling assumption, the matrix of inner products corresponds to the tail pairwise dependence matrix, which can be easily estimated. We derive the optimal transformed-linear predictor via the projection theorem, which yields a predictor with the same form as the best linear unbiased predictor in non-extreme settings. We quantify uncertainty for prediction errors by constructing prediction intervals based on the geometry of regular variation. We demonstrate the effectiveness of our method through a simulation study and through applications to predicting high pollution levels and extreme precipitation.
dynamite is an R package for Bayesian inference of intensive panel (time series) data comprising multiple measurements on multiple individuals over time. The package supports joint modeling of multiple response variables, time-varying and time-invariant effects, a wide range of discrete and continuous distributions, group-specific random effects, latent factors, and customization of prior distributions of the model parameters. Models in the package are defined via a user-friendly formula interface, and estimation of the posterior distribution of the model parameters takes advantage of state-of-the-art Markov chain Monte Carlo methods. The package enables efficient computation of both individual-level and aggregated predictions and offers a comprehensive suite of tools for visualization and model diagnostics.
The Ridgeless minimum $\ell_2$-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years in both machine learning and statistics communities. While it seems to defy conventional wisdom that overfitting leads to poor prediction, recent theoretical research on its $\ell_2$-type risks reveals that its norm minimizing property induces an `implicit regularization' that helps prediction in spite of interpolation. This paper takes a further step that aims at understanding its precise stochastic behavior as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which provides a precise quantification of the prescribed implicit regularization in the most general distributional sense. Our distributional characterizations hold for general non-Gaussian random designs and extend uniformly to positively regularized Ridge estimators. As a direct application, we obtain a complete characterization for a general class of weighted $\ell_q$ risks of the Ridge(less) estimators that are previously only known for $q=2$ by random matrix methods. These weighted $\ell_q$ risks not only include the standard prediction and estimation errors, but also include the non-standard covariate shift settings. Our uniform characterizations further reveal a surprising feature of the commonly used generalized and $k$-fold cross-validation schemes: tuning the estimated $\ell_2$ prediction risk by these methods alone leads to simultaneously optimal $\ell_2$ in-sample, prediction, and estimation risks, as well as the optimal length of debiased confidence intervals.
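For reference, in the overparametrized regime where interpolation is feasible, the Ridgeless interpolator is the minimum-norm solution and the limit of Ridge estimators as the regularization vanishes:

\[
\hat\beta_{\mathrm{ridgeless}}
\;=\; \arg\min\big\{\, \|\beta\|_2 \;:\; X\beta = y \,\big\}
\;=\; \lim_{\lambda \downarrow 0}\, \big(X^\top X + \lambda I\big)^{-1} X^\top y .
\]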
We study batched bandit experiments and consider the problem of inference conditional on the realized stopping time, assignment probabilities, and target parameter, where all of these may be chosen adaptively using information up to the last batch of the experiment. Absent further restrictions on the experiment, we show that inference using only the results of the last batch is optimal. When the adaptive aspects of the experiment are known to be location-invariant, in the sense that they are unchanged when we shift all batch-arm means by a constant, we show that there is additional information in the data, captured by one additional linear function of the batch-arm means. In the more restrictive case where the stopping time, assignment probabilities, and target parameter are known to depend on the data only through a collection of polyhedral events, we derive computationally tractable and optimal conditional inference procedures.
Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that have been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that build on one of the basic differentially private (DP) mechanisms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the direct use of this algorithm and the stability-based histogram in our setting and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
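A minimal sketch of the perturbed-histogram mechanism that the three algorithms build on: Laplace noise is added to histogram counts, with scale calibrated to the count sensitivity. The noise scale below assumes add/remove neighbouring datasets with $L_1$ sensitivity $1$; the paper's adjustments and the stability-based variant are not shown, and the function name is illustrative.

```python
import numpy as np

def perturbed_histogram(values, bins, epsilon, rng=None):
    """epsilon-DP perturbed histogram: add Laplace noise to bin counts."""
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Post-processing (clipping at zero) preserves differential privacy.
    return np.clip(noisy, 0.0, None), edges
```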
Scientists regularly pose questions about treatment effects on outcomes conditional on a post-treatment event. However, causal inference in such settings requires care, even in perfectly executed randomized experiments. Recently, the conditional separable effect (CSE) was proposed as an interventionist estimand that corresponds to scientifically meaningful questions in these settings. However, existing results for the CSE require no unmeasured confounding between the outcome and post-treatment event, an assumption frequently violated in practice. In this work, we address this concern by developing new identification and estimation results for the CSE that allow for unmeasured confounding. We establish nonparametric identification of the CSE in observational and experimental settings with time-varying confounders, provided that certain proxy variables for hidden common causes of the post-treatment event and outcome are available. For inference, we characterize an influence function for the CSE under a semiparametric model where nuisance functions are a priori unrestricted. Using modern machine learning methods, we construct nonparametric nuisance function estimators and establish convergence rates that improve upon existing results. Moreover, we develop a consistent, asymptotically linear, and locally semiparametric efficient estimator of the CSE. We illustrate our framework with simulation studies and a real-world cancer therapy trial.
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first investigate whether the widely used fully conditional specification (FCS) approach indeed identifies the correct conditional distributions. Based on this analysis, we propose three essential properties an ideal imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We also discuss ways to compare imputation methods, based on distributional distances. Finally, numerical experiments illustrate the points made in this discussion.
We consider a Bayesian framework for estimating the sample size of a clinical trial. The new approach, called BESS, is built upon three pillars: Sample size of the trial, Evidence from the observed data, and Confidence of the final decision in the posterior inference. It uses a simple logic of "given the evidence from data, a specific sample size can achieve a degree of confidence in trial success." The key distinction between BESS and standard sample size estimation (SSE) is that SSE, typically based on Frequentist inference, specifies the true parameter values in its calculation to achieve properties under repeated sampling, while BESS assumes a possible outcome of the observed data to achieve high posterior probabilities of trial success. As a result, the calibration of the sample size is directly based on the probability of making a correct decision rather than type I or type II error rates. We demonstrate that BESS leads to a more interpretable statement for investigators, and can easily accommodate prior information as well as sample size re-estimation. We explore its performance in comparison to the standard SSE and demonstrate its usage through a case study of an oncology optimization trial. An R tool is available at this https URL.
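A hypothetical single-arm, binary-endpoint illustration of the BESS logic described above: condition on a plausible outcome of the data and find the smallest sample size at which the posterior probability of trial success reaches the desired confidence. The response rates, prior, and threshold here are invented for illustration and are not taken from the paper.

```python
from scipy.stats import beta

def smallest_n(assumed_rate=0.4, null_rate=0.25, confidence=0.9,
               a0=1.0, b0=1.0, max_n=200):
    """Smallest n such that, if the observed responses match the assumed rate,
    the posterior probability that the true rate exceeds the null rate is at
    least the required confidence (Beta-Binomial model, Beta(a0, b0) prior)."""
    for n in range(10, max_n + 1):
        responses = round(assumed_rate * n)          # assumed outcome of the data
        posterior = beta(a0 + responses, b0 + n - responses)
        if posterior.sf(null_rate) >= confidence:    # P(p > null_rate | data)
            return n
    return None
```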
In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture.
Most of the literature on differential privacy considers the item-level case where each user has a single observation, but a growing field of interest is that of user-level privacy where each of the $n$ users holds $T$ observations and wishes to maintain the privacy of their entire collection. In this paper, we derive a general minimax lower bound, which shows that, for locally private user-level estimation problems, the risk cannot, in general, be made to vanish for a fixed number of users even when each user holds an arbitrarily large number of observations. We then derive matching, up to logarithmic factors, lower and upper bounds for univariate and multidimensional mean estimation, sparse mean estimation and non-parametric density estimation. In particular, with other model parameters held fixed, we observe phase transition phenomena in the minimax rates as $T$, the number of observations each user holds, varies. In the case of (non-sparse) mean estimation and density estimation, we see that, for $T$ below a phase transition boundary, the rate is the same as having $nT$ users in the item-level setting. Different behaviour is however observed in the case of $s$-sparse $d$-dimensional mean estimation, wherein consistent estimation is impossible when $d$ exceeds the number of observations in the item-level setting, but is possible in the user-level setting when $T \gtrsim s \log (d)$, up to logarithmic factors. This may be of independent interest for applications as an example of a high-dimensional problem that is feasible under local privacy constraints.
The present work aims at proving mathematically that a biologically inspired neural network can learn a classification task through local transformations alone. To this end, we propose a spiking neural network named CHANI (Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration), whose neurons' activity is modeled by Hawkes processes. Synaptic weights are updated via an expert aggregation algorithm, providing a local and simple learning rule. We prove that our network can learn on average and asymptotically. Moreover, we demonstrate that it automatically produces neuronal assemblies, in the sense that the network can encode several classes and that the same neuron in the intermediate layers may be activated by more than one class, and we provide numerical simulations on a synthetic dataset. This theoretical approach contrasts with the traditional empirical validation of biologically inspired networks and paves the way for understanding how local learning rules enable neurons to form assemblies able to represent complex concepts.
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
Many biological objects possess bilateral symmetry about a midline or midplane, up to a ``noise'' term. This paper uses landmark-based methods to measure departures from bilateral symmetry, especially for the two-group problem where one group is more asymmetric than the other. We formulate our work in the framework of size-and-shape analysis, including registration via rigid body motion. Our starting point is a vector of elementary asymmetry features defined at the individual landmark coordinates for each object. We introduce two approaches for testing. In the first, the elementary features are combined into a scalar composite asymmetry measure for each object. Then standard univariate tests can be used to compare the two groups. In the second approach, a univariate test statistic is constructed for each elementary feature. The maximum of these statistics leads to an overall test statistic to compare the two groups, and we then provide a technique to extract the important features from the landmark data. Our methodology is illustrated on a pre-registered smile dataset collected to assess the success of cleft lip surgery on human subjects. The asymmetry in a group of cleft lip subjects is compared to that in a group of normal subjects, and statistically significant differences are found by univariate tests in the first approach. Further, our feature extraction method leads to an anatomically plausible set of landmarks for medical applications.
We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop an asymptotically normal test statistic, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape. Numerical experiments and data examples demonstrate the ability of the proposed method to achieve a favorable bias-variance trade-off in practical scenarios.
Despite linear regression being the most popular statistical modelling technique, in real life we often need to deal with situations where the true relationship between the response and the covariates is nonlinear in the parameters. In such cases one needs to adopt an appropriate non-linear regression (NLR) analysis, which has wide applications in biochemical and medical studies, among many others. In this paper we propose new, improved robust estimation and testing methodologies for general NLR models based on the minimum density power divergence approach and apply our proposal to analyze the widely popular Michaelis-Menten (MM) model in enzyme kinetics. We establish the asymptotic properties of our proposed estimator and tests, along with their theoretical robustness characteristics through influence function analysis. For the particular MM model, we further empirically justify the robustness and efficiency of our proposed estimator and testing procedure through extensive simulation studies and several interesting real data examples of enzyme-catalyzed (biochemical) reactions.
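For reference, the Michaelis-Menten model relates reaction velocity $v$ to substrate concentration $[S]$ through the form

\[
v([S]) \;=\; \frac{V_{\max}\,[S]}{K_M + [S]},
\]

which is nonlinear in the parameters $(V_{\max}, K_M)$ and hence calls for the NLR methodology developed here.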
We study experimentation under endogenous network interference. Interference patterns are mediated by an endogenous graph, where edges can be formed or eliminated as a result of treatment. We show that conventional estimators are biased in these circumstances, and present a class of unbiased, consistent and asymptotically normal estimators of total treatment effects in the presence of such interference. We show via simulation that our estimator outperforms existing estimators in the literature. Our results apply both to bipartite experimentation, in which the units of analysis and measurement differ, and the standard network experimentation case, in which they are the same.
We study simultaneous inference for multiple matrix-variate Gaussian graphical models in high-dimensional settings. Such models arise when spatiotemporal data are collected across multiple sample groups or experimental sessions, where each group is characterized by its own graphical structure but shares common sparsity patterns. A central challenge is to conduct valid inference on collections of graph edges while efficiently borrowing strength across groups under both high-dimensionality and temporal dependence. We propose a unified framework that combines joint estimation via group penalized regression with a high-dimensional Gaussian approximation bootstrap to enable global testing of edge subsets of arbitrary size. The proposed procedure accommodates temporally dependent observations and avoids naive pooling across heterogeneous groups. We establish theoretical guarantees for the validity of the simultaneous tests under mild conditions on sample size, dimensionality, and non-stationary autoregressive temporal dependence, and show that the resulting tests are nearly optimal in terms of the testable region boundary. The method relies only on convex optimization and parametric bootstrap, making it computationally tractable. Simulation studies and a neural recording example illustrate the efficacy of the proposed approach.
The classification of different patterns of network evolution, for example in brain connectomes or social networks, is a key problem in network inference and modern data science. Building on the notion of a network's Euclidean mirror, which captures its evolution as a curve in Euclidean space, we develop the Dynamic Network Clustering through Mirror Distance (DNCMD), an algorithm for clustering dynamic networks based on a distance measure between their associated mirrors. We provide theoretical guarantees for DNCMD to achieve exact recovery of distinct evolutionary patterns for latent position random networks both when underlying vertex features change deterministically and when they follow a stochastic process. We validate our theoretical results through numerical simulations and demonstrate the application of DNCMD to understand edge functions in Drosophila larval connectome data, as well as to analyze temporal patterns in dynamic trade networks.
Analyzing crime events is crucial to understanding crime dynamics and is helpful for constructing prevention policies. Point processes specified on linear networks can provide a more accurate description of crime incidents by considering the geometry of the city. We propose a spatio-temporal Dirichlet process mixture model on a linear network to analyze crime events in Valencia, Spain. Specifically, we develop a Bayesian hierarchical model with a Dirichlet process prior to automatically detect space-time clusters of the events and adopt a convolution kernel estimator to account for the network structure in the city. From the fitted model, we provide crime hotspot visualizations that can inform social interventions to prevent crime incidents. Furthermore, we study the relationships between the detected cluster centers and the city's amenities, which provides an intuitive explanation of criminal contagion.
Statistical models for multivariate data often include a semi-orthogonal matrix parameter. In many applications, there is reason to expect that the semi-orthogonal matrix parameter satisfies a structural assumption such as sparsity or smoothness. From a Bayesian perspective, these structural assumptions should be incorporated into an analysis through the prior distribution. In this work, we introduce a general approach to constructing prior distributions for structured semi-orthogonal matrices that leads to tractable posterior inference via parameter-expanded Markov chain Monte Carlo. We draw on recent results from random matrix theory to establish a theoretical basis for the proposed approach. We then introduce specific prior distributions for incorporating sparsity or smoothness and illustrate their use through applications to biological and oceanographic data.
As Federated Learning (FL) expands, the challenge of non-independent and identically distributed (non-IID) data becomes critical. Clustered Federated Learning (CFL) addresses this by training multiple specialized models, each representing a group of clients with similar data distributions. However, the term ''CFL'' has increasingly been applied to operational strategies unrelated to data heterogeneity, creating significant ambiguity. This survey provides a systematic review of the CFL literature and introduces a principled taxonomy that classifies algorithms into Server-side, Client-side, and Metadata-based approaches. Our analysis reveals a distinct dichotomy: while theoretical research prioritizes privacy-preserving Server/Client-side methods, real-world applications in IoT, Mobility, and Energy overwhelmingly favor Metadata-based efficiency. Furthermore, we explicitly distinguish ''Core CFL'' (grouping clients for non-IID data) from ''Clustered X FL'' (operational variants for system heterogeneity). Finally, we outline lessons learned and future directions to bridge the gap between theoretical privacy and practical efficiency.
We introduce a framework for defining and interpreting collective mobility measures from spatially and temporally aggregated origin--destination (OD) data. Rather than characterizing individual behavior, these measures describe properties of the mobility system itself: how network organization, spatial structure, and routing constraints shape and channel population movement. In this view, aggregate mobility flows reveal aspects of connectivity, functional organization, and large-scale daily activity patterns encoded in the underlying transport and spatial network. To support interpretation and provide a controlled reference for the proposed time-elapsed calculations, we first employ an independent, network-driven synthetic data generator in which trajectories arise from prescribed system structure rather than observed data. This controlled setting provides a concrete reference for understanding how the proposed measures reflect network organization and flow constraints. We then apply the measures to fully anonymized data from the NetMob 2024 Data Challenge, examining their behavior under realistic limitations of spatial and temporal aggregation. While such data constraints restrict dynamical resolution, the resulting metrics still exhibit interpretable large-scale structure and temporal variation at the city scale.
The empirical use of variable transformations within (strictly) consistent loss functions is widespread, yet a theoretical understanding is lacking. To address this gap, we develop a theoretical framework that establishes formal characterizations of (strict) consistency for such transformed loss functions. Our analysis focuses on two interrelated cases: (a) transformations applied solely to the realization variable and (b) bijective transformations applied jointly to both the realization and prediction variables. These cases extend the well-established framework of transformations applied exclusively to the prediction variable, as formalized by Osband's revelation principle. We further develop analogous characterizations for (strict) identification functions. The resulting theoretical framework is broadly applicable to statistical and machine learning methodologies. For instance, we apply the framework to Bregman and expectile loss functions to interpret empirical findings from models trained with transformed loss functions and systematically construct new identifiable and elicitable functionals, which we term respectively $g$-transformed expectation and $g$-transformed expectile. Applications of the framework to simulated and real-world data illustrate its practical utility in diverse settings. By unifying theoretical insights with practical applications, this work advances principled methodologies for designing loss functions in complex predictive tasks.
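As a concrete instance, the Bregman losses mentioned above take the form, for a strictly convex and differentiable $\phi$,

\[
L_{\phi}(z, y) \;=\; \phi(y) - \phi(z) - \phi'(z)\,(y - z),
\]

with prediction $z$ and realization $y$; these are strictly consistent for the mean, and the framework characterizes when composing such losses with transformations of the realization (or with bijective joint transformations) preserves (strict) consistency.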
The kidney paired donation (KPD) program provides an innovative solution to overcome incompatibility challenges in kidney transplants by matching incompatible donor-patient pairs and facilitating kidney exchanges. To address unequal access to transplant opportunities, there are two widely used fairness criteria: group fairness and individual fairness. However, these criteria do not consider protected patient features, which refer to characteristics legally or ethically recognized as needing protection from discrimination, such as race and gender. Motivated by the calibration principle in machine learning, we introduce a new fairness criterion: the matching outcome should be conditionally independent of the protected feature, given the sensitization level. We integrate this fairness criterion as a constraint within the KPD optimization framework and propose a computationally efficient solution using linearization strategies and column-generation methods. Theoretically, we analyze the associated price of fairness using random graph models. Empirically, we compare our fairness criterion with group fairness and individual fairness through both simulations and a real-data example.
We introduce a new global sensitivity measure, the global activity score. We establish its theoretical connection with Sobol' sensitivity indices and demonstrate its performance through numerical examples. In these examples, we compare global activity scores with Sobol' sensitivity indices, derivative-based sensitivity measures, and activity scores. The results show that in the presence of noise or high variability, global activity scores outperform derivative-based measures and activity scores, while in noiseless settings the three approaches yield similar results.
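For context, the first-order Sobol' index of an input $X_i$ and the derivative-based activity score from the active-subspace literature, against which the new measure is compared, can be written as

\[
S_i \;=\; \frac{\operatorname{Var}\big(\mathbb{E}[Y \mid X_i]\big)}{\operatorname{Var}(Y)},
\qquad
\alpha_i(k) \;=\; \sum_{j=1}^{k} \lambda_j\, w_{i,j}^2,
\]

where $(\lambda_j, w_j)$ are eigenpairs of $C = \mathbb{E}\big[\nabla f(X)\,\nabla f(X)^\top\big]$; the global activity score itself is defined in the paper and is not reproduced here.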
In limited-overs cricket, the team batting first posts a target score for the team batting second to achieve in order to win the match. The team batting second is constrained by decreasing resources in terms of number of balls left and number of wickets in hand in the process of reaching the target as the second innings progresses. The Pressure Index, a measure created by researchers in the past, serves as a tool for quantifying the level of pressure that a team batting second encounters in limited-overs cricket. Through a ball-by-ball analysis of the second innings, it reveals how effectively the team batting second in a limited-overs game proceeds towards their target. This research employs higher-order Markov chains to examine the strategies employed by successful teams during run chases in Twenty20 matches. By studying the trends in successful run chases spanning over 16 years and utilizing a significant dataset of 6537 Twenty20 matches, specific strategies are identified. Consequently, an efficient approach to successful run chases in Twenty20 cricket is formulated, effectively keeping the Pressure Index within [0.5, 3.5], or even pushing it below 0.5, as early as possible. The innovative methodology adopted in this research offers valuable insights for cricket teams looking to enhance their performance in run chases.
Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize the existence of the efficient influence function and show that non-regularity arises under policy non-uniqueness. Motivated by this analysis, we propose a novel \textit{N}onparametric \textit{S}equenti\textit{A}l \textit{V}alue \textit{E}valuation (NSAVE) method, which achieves semiparametric efficiency and retains the double robustness property when the optimal policy is unique, and remains stable in degenerate regimes beyond the scope of existing asymptotic theory. We further develop a smoothing-based approach for valid inference under non-unique optimal policies, and a post-selection procedure with uniform coverage for data-selected optimal policies. Simulation studies support the theoretical results. An application to the OhioT1DM mobile health dataset provides patient-specific confidence intervals for optimal policy values and their improvement over observed treatment policies.
Predicting potential and counterfactual outcomes from observational data is central to individualized decision-making, particularly in clinical settings where treatment choices must be tailored to each patient rather than guided solely by population averages. We propose PO-Flow, a continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcome distributions and factual-conditioned counterfactual outcomes. Trained via flow matching, PO-Flow provides a unified approach to individualized potential outcome prediction, conditional average treatment effect estimation, and counterfactual prediction. By encoding an observed factual outcome into a shared latent representation and decoding it under an alternative treatment, PO-Flow relates factual and counterfactual realizations at the individual level, rather than generating counterfactuals independently from marginal conditional distributions. In addition, PO-Flow supports likelihood-based evaluation of potential outcomes, enabling uncertainty-aware assessment of predictions. A supporting recovery guarantee is established under certain assumptions, and empirical results on benchmark datasets demonstrate strong performance across a range of causal inference tasks within the potential outcomes framework.
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods have been proposed to extend PCA to the union of subspace (UoS) setting for clustering data that comes from multiple subspaces like K-Subspaces (KSS). However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic-based subspace clustering method, named ALPCAHUS, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on K-Subspaces (KSS) principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, for clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code available at this https URL.
Existing causal methods for time-varying exposure and time-varying confounding focus on estimating the average causal effect of a time-varying binary treatment on an end-of-study outcome, offering limited tools for characterizing marginal causal dose-response relationships under continuous exposures. We propose a scalable, nonparametric Bayesian framework for estimating marginal longitudinal causal dose-response functions with repeated outcome measurements. Our approach targets the average potential outcome at any fixed dose level and accommodates time-varying confounding through the generalized propensity score. The proposed approach embeds a Dirichlet process specification within a generalized estimating equations structure, capturing temporal correlation while making minimal assumptions about the functional form of the continuous exposure. We apply the proposed methods to monthly metro ridership and COVID-19 case data from major international cities, identifying causal relationships and the dose-response patterns between higher ridership and increased case counts.
In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable classification performance. While calibration techniques have been proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM's decision boundary without having the ability to alter its orientation. This proves inadequate when biases cause the LLM to be severely misdirected. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization based framework which learns an optimal, per-class affine transformation of the LLM's predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only subsumes many existing calibration methods in ICL as special cases, but also makes it possible to alter and even completely reverse the orientation of the LLM's decision boundary. Furthermore, SC's loss-based nature facilitates the seamless integration of two purpose-built regularization techniques: context-invariance and directional trust-region. The former is designed to tackle the instability issue in ICL, while the latter controls the degree of calibration. Finally, SC delivers state-of-the-art performance over calibration baselines in the 4-shot, 8-shot, and 16-shot settings across all nine datasets for Mistral-7B-Instruct-v0.3, LLaMA-2-7B-chat, and Qwen2-7B-Instruct.
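Concretely, writing $z \in \mathbb{R}^C$ for the LLM's class log-probabilities, shift-only calibration corresponds to predicting with $z + b$, while one natural reading of the per-class affine map learned by SC is

\[
z'_c \;=\; a_c\, z_c + b_c, \qquad c = 1, \dots, C,
\]

with prediction $\arg\max_c z'_c$; allowing $a_c \neq 1$ is what permits reorienting, and even reversing, the decision boundary. The exact parameterization and loss used by SC may differ from this sketch.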
This paper proposes a sequential ensemble methodology for epidemic modeling that integrates discrete-time Hawkes processes (DTHP) and Susceptible-Exposed-Infectious-Removed (SEIR) models. Motivated by the need for accurate and reliable epidemic forecasts to inform timely public health interventions, we develop a flexible model averaging (MA) framework using Sequential Monte Carlo Squared. While generating estimates from each model individually, our approach dynamically assigns them weights based on their incrementally estimated marginal likelihoods, accounting for both model and parameter uncertainty, to produce a single ensemble estimate. We assess the methodology through simulation studies mimicking abrupt changes in epidemic dynamics, followed by an application to the Irish influenza and COVID-19 epidemics. Our results show that combining the two models can improve both estimates of the infection trajectory and reproduction number compared to using either model alone. Moreover, the MA consistently produces more stable and informative estimates of the time-varying reproduction number, with credible intervals that provide a realistic assessment of uncertainty. These features are particularly useful when epidemic dynamics change rapidly, enabling more reliable short-term forecasts and timely public health decisions. This research contributes to pandemic preparedness by enhancing forecast reliability and supporting more informed public health responses.
We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that queries to the operator (such as running a high-fidelity simulator or physical experiment) are costly, while functional evaluations on the operator's output are inexpensive. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process (GP) posterior. We derive regret bounds and theoretical results connecting neural operators with GPs in infinite-dimensional settings. Experiments benchmark our method against other Bayesian optimization baselines on functional optimization tasks involving partial differential equations of physical systems, demonstrating better sample efficiency and significant performance gains.
Power and sample size calculations for Wald tests in generalized linear models (GLMs) are often limited to specific cases like logistic regression. More general methods typically require detailed study parameters that are difficult to obtain during planning. We introduce two new effect size measures for estimating power and sample size in studies using Wald tests across any GLM. These measures accommodate any number of predictors or adjusters and require only basic study information. We provide practical guidance for interpreting and applying these measures to approximate a key parameter in power calculations. We also derive asymptotic bounds on the relative error of these approximations, showing that accuracy depends on features of the GLM such as the nonlinearity of the link function. To complement this analysis, we conduct simulation studies across common model specifications, identifying best use cases and opportunities for improvement. Finally, we test the methods in finite samples to confirm their practical utility, using a case study on the relationship between education and receipt of mental health treatment.
This work develops algorithms for non-parametric confidence regions for samples from a univariate distribution whose support is a discrete mesh bounded on the left. We generalize the theory of Learned-Miller to preorders over the sample space. In this context, we show that the lexicographic low and lexicographic high orders are in some way extremal in the class of monotone preorders. From this theory we derive several approximation algorithms: 1) Closed form approximations for the lexicographic low and high orders with error tending to zero in the mesh size; 2) A polynomial-time approximation scheme for quantile orders with error tending to zero in the mesh size; 3) Monte Carlo methods for calculating quantile and lexicographic low orders applicable to any mesh size.
A common object to describe the extremal dependence of a $d$-variate random vector $X$ is the stable tail dependence function $L$. Various parametric models have emerged, with a popular subclass consisting of those stable tail dependence functions that arise for linear and max-linear factor models with heavy-tailed factors. The stable tail dependence function is then parameterized by a $d \times K$ matrix $A$, where $K$ is the number of factors and where $A$ can be interpreted as a factor loading matrix. We study estimation of $L$ under an additional assumption on $A$ called the `pure variable assumption'. Both $K \in \{1, \dots, d\}$ and $A \in [0, \infty)^{d \times K}$ are treated as unknown, which constitutes an unconventional parameter space that does not fit into common estimation frameworks. We suggest two algorithms for estimating $K$ and $A$, and provide finite-sample guarantees for both algorithms. Remarkably, the guarantees allow for the case where the dimension $d$ is larger than the sample size $n$. The results are illustrated with numerical experiments and two case studies.
Predicting Parkinson's Disease (PD) progression is crucial, and voice biomarkers offer a non-invasive method for tracking symptom severity (UPDRS scores) through telemonitoring. Analyzing this longitudinal data is challenging due to within-subject correlations and complex, nonlinear patient-specific progression patterns. This study benchmarks linear mixed models (LMMs) against two advanced hybrid approaches: the Generalized Neural Network Mixed Model (GNMM) (Mandel 2021), which embeds a neural network within a GLMM structure, and the Neural Mixed Effects (NME) model (Wortwein 2023), which allows nonlinear subject-specific parameters throughout the network. Using the Oxford Parkinson's telemonitoring voice dataset, we evaluate these models' performance in predicting Total UPDRS to offer practical guidance for PD research and clinical applications.
The validity of classical hypothesis testing requires the significance level $\alpha$ be fixed before any statistical analysis takes place. This is a stringent requirement. For instance, it prohibits updating $\alpha$ during (or after) an experiment due to changing concern about the cost of false positives, or to reflect unexpectedly strong evidence against the null. Perhaps most disturbingly, witnessing a p-value $p\ll\alpha$ vs $p= \alpha- \epsilon$ for tiny $\epsilon > 0$ has no (statistical) relevance for any downstream decision-making. Following recent work of Grünwald (2024), we develop a theory of post-hoc hypothesis testing, enabling $\alpha$ to be chosen after seeing and analyzing the data. To study "good" post-hoc tests we introduce $\Gamma$-admissibility, where $\Gamma$ is a set of adversaries which map the data to a significance level. We classify the set of $\Gamma$-admissible rules for various sets $\Gamma$, showing they must be based on e-values, and recover the Neyman-Pearson lemma when $\Gamma$ is the constant map.
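For intuition on why e-value-based rules support a post-hoc choice of level, the following standard calculation (not specific to the paper's $\Gamma$-admissibility results) shows validity both at every fixed level and at a data-dependent level $\hat\alpha$: for any e-value $E \ge 0$ with $\mathbb{E}_0[E] \le 1$ under the null,
$$
\mathbb{P}_0\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha\,\mathbb{E}_0[E] \le \alpha \quad \text{for every fixed } \alpha \in (0,1],
\qquad
\mathbb{E}_0\!\left[\frac{\mathbf{1}\{E \ge 1/\hat\alpha\}}{\hat\alpha}\right] \le \mathbb{E}_0[E] \le 1,
$$
where the first bound is Markov's inequality and the second follows because $1/\hat\alpha \le E$ on the rejection event.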
The Komlós$\unicode{x2013}$Major$\unicode{x2013}$Tusnády (KMT) inequality for partial sums is one of the most celebrated results in probability theory. Yet its practical application has been hindered by a lack of explicit constants. This paper addresses this limitation for bounded i.i.d. random variables. At the cost of an additional logarithmic factor, we propose a computable version of the KMT inequality that depends only on the variables' range and standard deviation. We also derive an empirical version of the inequality that achieves nominal coverage even when the standard deviation is unknown. We then demonstrate the practicality of our bounds through applications to online change point detection and first hitting time probabilities. As a byproduct of our analysis, we obtain a Cramér-type moderate deviation bound for normalized centered partial sums.
This paper presents several situations in which multiple correlated copies of a drifted process are observed, and then establishes non-asymptotic risk bounds on nonparametric estimators of the drift function $b_0$ and its derivative. For drifted Gaussian processes with a regular enough covariance function, a sharper risk bound is established on the estimator of $b_0'$, and a model selection procedure is provided with theoretical guarantees.
Azadkia and Chatterjee (2021) recently introduced a simple nearest neighbor (NN) graph-based correlation coefficient that consistently detects both independence and functional dependence. Specifically, it approximates a measure of dependence that equals 0 if and only if the variables are independent, and 1 if and only if they are functionally dependent. However, this NN estimator includes a bias term that may vanish at a rate slower than root-$n$, preventing root-$n$ consistency in general. In this article, we (i) analyze this bias term closely and show that it could become asymptotically negligible when the dimension is smaller than four; and (ii) propose a bias-correction procedure for more general settings. In both regimes, we obtain estimators (either the original or the bias-corrected version) that are root-$n$ consistent and asymptotically normal.
Parameter calibration is essential for reducing uncertainty and improving predictive fidelity in physics-based models, yet it is often limited by the high computational cost of model evaluations. Bayesian calibration methods provide a principled framework for combining prior information with data while rigorously quantifying uncertainty. In this work, we compare four emulator-based Bayesian calibration strategies, Calibrate-Emulate-Sample (CES), History Matching (HM), Bayesian Optimal Experimental Design (BOED), and a goal-oriented extension of BOED (GBOED). The proposed GBOED formulation explicitly targets information gain with respect to the calibration posterior, aligning design decisions with downstream inference. We assess methods using accuracy and uncertainty quantification metrics, convergence behavior under increasing computational budgets, and practical considerations such as implementation complexity and robustness. For the Lorenz '96 system, CES, HM, and GBOED all yield strong calibration performance, even with limited numbers of model evaluations, while standard BOED generally underperforms in this setting. Differences among the strongest methods are modest, particularly as computational budgets increase. For the two-layer quasi-geostrophic system, all methods produce reasonable posterior estimates, and convergence behavior is more consistent. Overall, our results indicate that multiple emulator-based calibration strategies can perform comparably well when applied appropriately, with method selection often guided more by computational and practical considerations than by accuracy alone. These findings highlight both the limitations of standard BOED for calibration and the promise of goal-oriented and iterative approaches for efficient Bayesian inference in complex dynamical systems.
The Wasserstein distance is a metric for assessing distributional differences. The measure originates in optimal transport theory and can be interpreted as the minimal cost of transforming one distribution into another. In this paper, the Wasserstein distance is applied to life table age-at-death distributions. The main finding is that, under certain conditions, the Wasserstein distance between two age-at-death distributions equals the corresponding gap in life expectancy at birth ($e_0$). More specifically, the paper shows mathematically and empirically that this equivalence holds whenever the survivorship functions do not cross. For example, this applies when comparing mortality between women and men from 1990 to 2020 using data from the Human Mortality Database. In such cases, the gap in $e_0$ reflects not only a difference in mean ages at death but can also be interpreted directly as a measure of distributional difference.
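A sketch of the argument under the stated non-crossing condition: writing $F$ and $S = 1-F$ for the distribution and survivorship functions of age at death, the univariate identity for the 1-Wasserstein distance gives
$$
W_1(A,B) = \int_0^\infty \lvert F_A(x) - F_B(x)\rvert \,dx = \int_0^\infty \lvert S_A(x) - S_B(x)\rvert \,dx ,
\qquad
e_0 = \int_0^\infty S(x)\,dx ,
$$
so whenever $S_A(x) \ge S_B(x)$ for all $x$, the absolute value can be dropped and $W_1(A,B) = e_0^A - e_0^B$.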
We show how intensive, large and accurate time series can allow us to see through time. Many phenomena have aperiodic and periodic components. An ideal time series analysis method would detect such trend and signal(-s) combinations. The widely used Discrete Fourier Transform (DFT) and other frequency-domain parametric time series analysis methods have many application limitations constraining the trend and signal(-s) detection. We show that none of those limitations constrains our Discrete Chi-square Method (DCM), which can detect signal(-s) superimposed on an unknown trend. Our simulated time series analyses ascertain the revolutionary Window Dimension Effect (WDE): ``For any sample window $\Delta T$, DCM inevitably detects the correct $p(t)$ trend and $h(t)$ signal(-s) when the sample size $n$ and/or data accuracy $\sigma$ increase.'' The simulations also expose the DFT's weaknesses and the DCM's efficiency. The DCM's backbone is the Gauss-Markov theorem, which states that Least Squares (LS) is the best unbiased estimator for linear regression models. DCM cannot fail because this simple method is based on the computation of a massive number of linear-model LS fits. The Fisher-test gives the signal significance estimates and identifies the best DCM model among all tested alternative DCM models. The analytical solution of the non-linear DCM model is an ill-posed problem. We present a well-posed computational solution. The DCM can forecast complex time series. The best DCM model must be correct if it passes our Forecast-test. Our DCM is ideal for forecasting because its WDE spearhead is robust against short sample windows and complex time series.
Monte Carlo simulation studies are at the core of the modern applied, computational, and theoretical statistical literature. Simulation is a broadly applicable research tool, used to collect data on the relative performance of methods or data analysis approaches under a well-defined data-generating process. However, extant literature focuses largely on design aspects of simulation, rather than implementation strategies aligned with the current state of (statistical) programming languages, portable data formats, and multi-node cluster computing. In this work, I propose tidy simulation: a simple, language-agnostic, yet flexible functional framework for designing, writing, and running simulation studies. It has four components: a tidy simulation grid, a data generation function, an analysis function, and a results table. Using this structure, even the smallest simulations can be written in a consistent, modular way, yet they can be readily scaled to thousands of nodes in a computer cluster should the need arise. Tidy simulation also supports the iterative, sometimes exploratory nature of simulation-based experiments. By adopting the tidy simulation approach, researchers can implement their simulations in a robust, reproducible, and scalable way, which contributes to high-quality statistical science.
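A minimal sketch of the four-component structure in Python (the framework itself is language-agnostic; the function names and toy data-generating process below are illustrative, not the author's reference implementation):

```python
# Tidy-simulation sketch: a simulation grid, a data generation function,
# an analysis function, and a tidy results table.
import itertools
import numpy as np
import pandas as pd

# 1) Tidy simulation grid: one row per condition x repetition.
grid = pd.DataFrame(
    [{"n": n, "effect": effect, "rep": rep}
     for n, effect, rep in itertools.product([50, 200], [0.0, 0.5], range(100))]
)

# 2) Data generation function: condition -> data set.
def generate_data(n, effect, rng):
    x = rng.normal(size=n)
    y = effect * x + rng.normal(size=n)
    return x, y

# 3) Analysis function: data set -> one row of results.
def analyse(x, y):
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
    return {"slope_hat": slope}

# 4) Results table: map the analysis over the grid and bind rows.
rng = np.random.default_rng(1)
results = pd.concat(
    [pd.DataFrame([{**row, **analyse(*generate_data(row["n"], row["effect"], rng))}])
     for row in grid.to_dict("records")],
    ignore_index=True,
)
print(results.groupby(["n", "effect"])["slope_hat"].mean())
```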
Composite indices like the Gender Equality Index (GEI) are widely used to monitor gender disparities and guide evidence-based policy. However, their original design is often limited when applied to subnational contexts. Building on the GEI framework and the WeWorld Index Italia, this study proposes a composite indicator tailored to measure gender disparities across Italian regions. The methodology, based on a variation of the Mazziotta-Pareto Index, introduces a novel aggregation approach that penalizes uneven performances across domains. Indicators cover employment, economic resources, education, use of time, political participation, and health, reflecting multidimensional gender inequality. Using open regional data for 2024, the proposed Italian Gender Equality Index (IGEI) provides a comparable and robust measure across regions, highlighting both high-performing and lagging areas. The approach addresses compensatory limitations of traditional aggregation and offers a practical tool for regional monitoring and targeted interventions, benefiting from the fact that the IGEI is specifically tailored to the GEI framework.
Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002-2013), representative of ~88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery often shrinks effects relative to tabular-only models. On average, both donors raise wealth, with larger and more consistent gains for China; sector extremes in our sample include Trade and Tourism (330) for the World Bank (+12.29 IWI points), and Emergency Response (700) for China (+15.15). Assignment-mechanism analyses also show World Bank placement is often more predictable from imagery alone (as well as from tabular covariates). This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 67 times finer than prior fixed-effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but, for Chinese projects, directionally consistent effects.
Improvements in statistical forecasts of Tropical Cyclone (TC) intensity are limited by complex nonlinear interactions and the difficulty of identifying relevant predictors. Conventional methods prioritize correlation or fit, often overlooking confounding variables and limiting generalizability to unseen TCs. To address this, we leverage a multidata causal discovery framework with a replicated dataset based on the Statistical Hurricane Intensity Prediction Scheme (SHIPS) using ERA5 meteorological reanalysis. We conduct multiple experiments to identify and select predictors causally linked to TC intensity changes. We then train multiple linear regression models to compare causal feature selection with no selection, correlation, and random forest feature importance across five forecast lead times from 1 to 5 days (24 to 120 hours). Causal feature selection consistently outperforms on unseen test cases, especially for lead times shorter than 3 days. The causal features primarily include vertical shear, mid-tropospheric potential vorticity, and surface moisture conditions, which are physically significant yet often underutilized in TC intensity predictions. We build an extended predictor set (SHIPS plus) by adding selected features to the standard SHIPS predictors. SHIPS plus yields increased short-term predictive skill at lead times of 24, 48, and 72 hours. Adding nonlinearity using a multilayer perceptron further extends skill to longer lead times, despite our framework being purely regional and not requiring global forecast data. Operational SHIPS tests confirm that three of the six added causally discovered predictors improve forecast skill, with the largest gains at longer lead times. Our results demonstrate that causal discovery improves TC intensity prediction and pave the way toward more empirical forecasts.
Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, through the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type of performative effect, yet is an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.
Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution such as class probabilities deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying calibration constraints. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to one billion parameters, spanning applications in protein design, image generation, and language modeling.
We introduce a novel high-frequency daily panel dataset of both markets and news-based indicators -- including Geopolitical Risk, Economic Policy Uncertainty, Trade Policy Uncertainty, and Political Sentiment -- for 42 countries across both emerging and developed markets. Using this dataset, we study how sentiment dynamics shape sovereign risk, measured by Credit Default Swap (CDS) spreads, and evaluate their forecasting value relative to traditional drivers such as global monetary policy and market volatility. Our horse-race analysis of forecasting models demonstrates that incorporating news-based indicators significantly enhances predictive accuracy and enriches the analysis, with non-linear machine learning methods -- particularly Random Forests -- delivering the largest gains. Our analysis reveals that while global financial variables remain the dominant drivers of sovereign risk, geopolitical risk and economic policy uncertainty also play a meaningful role. Crucially, their effects are amplified through non-linear interactions with global financial conditions. Finally, we document pronounced regional heterogeneity, as certain asset classes and emerging markets exhibit heightened sensitivity to shocks in policy rates, global financial volatility, and geopolitical risk.
The difference-in-differences (DID) research design is a key identification strategy which allows researchers to estimate causal effects under the parallel trends assumption. While the parallel trends assumption is counterfactual and cannot be tested directly, researchers often examine pre-treatment periods to check whether the time trends are parallel before treatment is administered. Recently, researchers have been cautioned against using preliminary tests which aim to detect violations of parallel trends in the pre-treatment period. In this paper, we argue that preliminary testing can -- and should -- play an important role within the DID research design. We propose a new and more substantively appropriate conditional extrapolation assumption, which requires an analyst to conduct a preliminary test to determine whether the severity of pre-treatment parallel trend violations falls below an acceptable level before extrapolation to the post-treatment period is justified. This stands in contrast to prior work which can be interpreted as either setting the acceptable level to be exactly zero (in which case preliminary tests lack power) or assuming that extrapolation is always justified (in which case preliminary tests are not required). Under mild assumptions on how close the actual violation is to the acceptable level, we provide a consistent preliminary test as well as confidence intervals that are valid when conditioned on the result of the test. The conditional coverage of these intervals overcomes a common critique made against the use of preliminary testing within the DID research design. To illustrate the performance of the proposed methods, we use synthetic data as well as data on recentralization of public services in Vietnam and right-to-carry laws in Virginia.
Fréchet regression is a useful method for modeling random objects in a general metric space given Euclidean covariates. However, the conventional approach can be sensitive to outlying objects, in the sense that their distance from the regression surface is large compared to the other objects. In this study, we develop a robust version of global Fréchet regression by incorporating weight parameters into the objective function. We then introduce elastic net regularization, favoring a sparse vector of robust parameters to control the influence of outlying objects. We provide a computational algorithm that iteratively estimates the regression function and weight parameters, and establish a linear convergence property. We also propose a Bayesian information criterion to select the tuning parameters for regularization, which adapts the degree of robustness to the observed data. The finite-sample performance of the proposed method is demonstrated through numerical studies on matrix and distribution responses.
We introduce two families of stochastic interventions with discrete treatments that connect causal modeling to cost-sensitive decision making. The interventions arise from a cost-penalized information projection of the independent product of the organic propensity scores and a reference policy, yielding closed-form Boltzmann-Gibbs couplings. The induced marginals define modified stochastic policies that interpolate smoothly, via a tilt parameter, from the organic law or from the reference law toward a product-of-experts limit when all destination costs are strictly positive. The first family recovers and extends incremental propensity score interventions, retaining identification without global positivity. For inference on the expected outcomes after these policies, we derive the efficient influence functions under a nonparametric model and construct one-step estimators. In simulations, the proposed estimators improve stability and robustness to nuisance misspecification relative to plug-in baselines. The framework can operationalize graded scientific hypotheses under realistic constraints. Because inputs are modular, analysts can sweep feasible policy spaces, prototype candidates, and align interventions with budgets and logistics before committing experimental resources.
Instrumental variable regression is a foundational tool for causal analysis across the social and biomedical sciences. Recent advances use kernel methods to estimate nonparametric causal relationships, with general data types, while retaining a simple closed-form expression. Empirical researchers ultimately need reliable inference on causal estimates; however, uniform confidence sets for the method remain unavailable. To fill this gap, we develop valid and sharp confidence sets for kernel instrumental variable regression, allowing general nonlinearities and data types. Computationally, our bootstrap procedure requires only a single run of the kernel instrumental variable regression estimator. Theoretically, it relies on the same key assumptions. Overall, we provide a practical procedure for inference that substantially increases the value of kernel methods for causal analysis.
Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.
We consider the design of smoothings of the (coordinate-wise) max function in $\mathbb{R}^d$ in the infinity norm. The LogSumExp function $f(x)=\ln\left(\sum^d_{i=1}\exp(x_i)\right)$ provides a classical smoothing, differing from the max function in value by at most $\ln(d)$. We provide an elementary construction of a lower bound, establishing that every overestimating smoothing of the max function must differ by at least $\sim 0.8145\ln(d)$. Hence, LogSumExp is optimal up to small constant factors. However, in small dimensions, we provide stronger, exactly optimal smoothings attaining our lower bound, showing that the entropy-based LogSumExp approach to smoothing is not exactly optimal.
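A quick numerical illustration of the two-sided bound $\max_i x_i \le f(x) \le \max_i x_i + \ln(d)$ that underlies the discussion above (a toy check, not part of the paper's construction):

```python
# Numerical check of max(x) <= LSE(x) <= max(x) + ln(d)
# for the LogSumExp smoothing f(x) = ln(sum_i exp(x_i)).
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())   # numerically stable form

rng = np.random.default_rng(0)
for d in (2, 10, 1000):
    x = rng.normal(scale=5.0, size=d)
    gap = logsumexp(x) - x.max()
    assert 0.0 <= gap <= np.log(d) + 1e-12
    print(f"d={d:5d}  LSE - max = {gap:.4f}  (ln d = {np.log(d):.4f})")
```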
Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (the generative scope of the statistical models), neural surrogates can behave unpredictably. This is especially challenging in model comparison settings, where multiple statistical models are considered and at least some of them are misspecified. Recent work on self-consistency (SC) provides a promising remedy to this issue, accessible even for empirical data (without ground-truth labels). In this work, we investigate how SC can improve amortized model comparison, conceptualized in four different ways. Across two synthetic and two real-world case studies, we find that approaches for model comparison that estimate marginal likelihoods through approximate parameter posteriors consistently outperform methods that directly approximate model evidence or posterior model probabilities. SC training improves robustness when the likelihood is available, even under severe model misspecification. The benefits of SC for methods without access to analytic likelihoods are more limited and inconsistent. Our results suggest practical guidance for reliable amortized Bayesian model comparison: prefer parameter-posterior-based methods and augment them with SC training on empirical datasets to mitigate extrapolation bias under model misspecification.
We study the problem of coincidence detection in time series data, where we aim to determine whether the appearance of simultaneous or near-simultaneous events in two time series is indicative of some shared underlying signal or synchronicity, or might simply be due to random chance. This problem arises across many applications, such as astrophysics (e.g., detecting astrophysical events such as gravitational waves, with two or more detectors) and neuroscience (e.g., detecting synchronous firing patterns between two or more neurons). In this work, we consider methods based on time-shifting, where the timeline of one data stream is randomly shifted relative to another, to mimic the types of coincidences that could occur by random chance. Our theoretical results establish rigorous finite-sample guarantees controlling the probability of false positives, under weak assumptions that allow for dependence within the time series data, providing reassurance that time-shifting methods are a reliable tool for inference in this setting. Empirical results with simulated and real data validate the strong performance of time-shifting methods in dependent-data settings.
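The following is a hedged sketch of a circular time-shifting test for coincidences between two event streams; the coincidence statistic, window, and p-value construction are illustrative choices, not the authors' exact procedure or guarantees.

```python
# Time-shifting null: circularly shift one event stream by random offsets to
# build a reference distribution for the observed coincidence count.
import numpy as np

def coincidence_count(t1, t2, window):
    """Number of events in t1 with at least one event of t2 within +/- window."""
    t2 = np.sort(t2)
    idx = np.searchsorted(t2, t1)
    left = np.abs(t1 - t2[np.clip(idx - 1, 0, len(t2) - 1)])
    right = np.abs(t1 - t2[np.clip(idx, 0, len(t2) - 1)])
    return int(np.sum(np.minimum(left, right) <= window))

def time_shift_pvalue(t1, t2, horizon, window=0.5, n_shifts=999, rng=None):
    rng = rng or np.random.default_rng(0)
    observed = coincidence_count(t1, t2, window)
    null = [coincidence_count(t1, (t2 + rng.uniform(0, horizon)) % horizon, window)
            for _ in range(n_shifts)]
    # permutation-style p-value with the +1 correction
    return (1 + sum(c >= observed for c in null)) / (n_shifts + 1)

rng = np.random.default_rng(42)
horizon = 1000.0
t1 = np.sort(rng.uniform(0, horizon, size=200))
t2 = np.sort((t1[:100] + rng.normal(0, 0.2, size=100)) % horizon)  # half the events are synced
print("p-value:", time_shift_pvalue(t1, t2, horizon, rng=rng))
```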
Understanding how much each variable contributes to an outcome is a central question across disciplines. A causal view of explainability is attractive for its ability to uncover underlying mechanisms and generalize to new contexts. Based on a family of causal explainability quantities, we develop methods for their estimation and inference. In particular, we construct a one-step correction estimator using semi-parametric efficiency theory, which explicitly leverages the independence structure of the variables to reduce the asymptotic variance. For a null hypothesis on the boundary, i.e., zero explainability, we show its equivalence to Fisher's sharp null, which motivates a randomization-based inference procedure. Finally, we illustrate the empirical efficacy of our approach through simulations as well as an immigration experiment dataset, where we investigate how features and their interactions shape public opinion toward admitting immigrants.
Randomness (in the sense of being generated in an IID fashion) and exchangeability are standard assumptions in nonparametric statistics and machine learning, and relations between them have been a popular topic of research. This short paper draws the reader's attention to the fact that, while for infinite sequences of observations the two assumptions are almost indistinguishable, the difference between them becomes very significant for finite sequences of a given length.
Discrete random probability measures are central to Bayesian inference, particularly as priors for mixture modeling and clustering. A broad and unifying class is that of proper species sampling processes (SSPs), encompassing many Bayesian nonparametric priors. We show that any proper SSP admits an exact conditional finite-mixture representation by augmenting the model with a latent truncation index and a simple reweighting of the atoms, which yields a conditional random finite-atom measure whose marginalized distribution matches the original SSP. This yields at least two consequences: (i) distributionally exact simulation for arbitrary SSPs, without user-chosen truncation levels; and (ii) posterior inference in SSP mixture models via standard finite-mixture machinery, leading to tractable MCMC algorithms without ad hoc truncations. We explore these consequences by deriving explicit total-variation bounds for the conditional approximation error when this truncation is fixed, and by studying practical performance in mixture modeling, with emphasis on Dirichlet and geometric SSPs.
A targeted learning (TL) framework is developed to estimate the difference in the restricted mean survival time (RMST) for a clinical trial with time-to-event outcomes. The approach starts by defining the target estimand as the RMST difference between investigational and control treatments. Next, an efficient estimation method is introduced: a targeted minimum loss estimator (TMLE) utilizing pseudo-observations. Moreover, a version of the copy reference (CR) approach is developed to perform a sensitivity analysis for right-censoring. The proposed TL framework is demonstrated using a real data application.
We introduce the setting of continuous index learning, in which a function of many variables varies only along a small number of directions at each point. For efficient estimation, it is beneficial for a learning algorithm to adapt, near each point $x$, to the subspace that captures the local variability of the function $f$. We pose this task as kernel adaptation along a manifold with noise, and introduce Local EGOP learning, a recursive algorithm that utilizes the Expected Gradient Outer Product (EGOP) quadratic form as both a metric and inverse-covariance of our target distribution. We prove that Local EGOP learning adapts to the regularity of the function of interest, showing that under a supervised noisy manifold hypothesis, intrinsic dimensional learning rates are achieved for arbitrarily high-dimensional noise. Empirically, we compare our algorithm to the feature learning capabilities of deep learning. Additionally, we demonstrate improved regression quality compared to two-layer neural networks in the continuous single-index setting.
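For reference, the EGOP quadratic form used above is the standard object
$$
\mathrm{EGOP}(f) \;=\; \mathbb{E}_{x \sim \rho}\!\left[\nabla f(x)\, \nabla f(x)^{\top}\right],
$$
whose leading eigenvectors span the directions along which $f$ varies most on average; Local EGOP learning, as described above, estimates such a form locally and recursively rather than globally.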
Forecasting fails not because models are weak, but because effort is wasted on series whose futures are fundamentally unknowable. We propose a pre-modelling diagnostic framework that assesses horizon-specific forecastability before model selection begins, enabling practitioners to allocate effort where it can yield returns. We operationalise forecastability as auto-mutual information (AMI) at lag h, measuring the reduction in uncertainty about future values provided by the past. Using a k-nearest-neighbour estimator on training data only, we validate AMI against realised out-of-sample error (sMAPE) across 1,350 M4 series spanning six frequencies, with Seasonal Naive, ETS, and N-BEATS as probe models under a rolling-origin protocol. The central finding is that the AMI-sMAPE relationship is strongly frequency-conditional. For Weekly, Hourly, Quarterly, and Yearly series, AMI exhibits consistent negative rank association with realised error (rho ranging from -0.52 to -0.66 for higher-capacity probes), supporting its use as a triage signal. Monthly shows moderate association; Daily exhibits weaker discrimination despite measurable dependence. Across all frequencies, median forecast error decreases monotonically from low to high AMI terciles, confirming decision-relevant separation. These results establish AMI as a practical screening tool for forecasting portfolios: identifying series where sophisticated modelling is warranted, where baselines suffice, and where effort should shift from accuracy improvement to robust decision design.
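A hedged sketch of the AMI-at-lag-h computation using a k-nearest-neighbour mutual information estimator; here scikit-learn's Kraskov-style estimator stands in for the paper's specific estimator and tuning choices.

```python
# Auto-mutual information at lag h: MI between the series and its h-step-ahead values,
# estimated on training data only with a kNN-based estimator.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def ami_at_lag(y, h, n_neighbors=3, random_state=0):
    past, future = y[:-h].reshape(-1, 1), y[h:]
    return mutual_info_regression(past, future,
                                  n_neighbors=n_neighbors,
                                  random_state=random_state)[0]

rng = np.random.default_rng(0)
t = np.arange(400)
seasonal = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=t.size)  # forecastable
noise = rng.normal(size=t.size)                                        # unforecastable
for name, series in [("seasonal", seasonal), ("white noise", noise)]:
    print(name, "AMI at h=12:", round(ami_at_lag(series, h=12), 3))
```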
Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning typical patterns of normal data, or inliers. Performing OD in an unsupervised regime, without any information about anomalous instances in the training data, is challenging. A recently observed phenomenon, known as the inlier-memorization (IM) effect, where deep generative models (DGMs) tend to memorize inlier patterns during early training, provides a promising signal for distinguishing outliers. However, existing unsupervised approaches that rely solely on the IM effect still struggle when inliers and outliers are not well separated or when outliers form dense clusters. To address these limitations, we incorporate active learning to selectively acquire informative labels, and propose IMBoost, a novel framework that explicitly reinforces the IM effect to improve outlier detection. Our method consists of two stages: 1) a warm-up phase that induces and promotes the IM effect, and 2) a polarization phase in which actively queried samples are used to maximize the discrepancy between inlier and outlier scores. In particular, we propose a novel query strategy and a tailored loss function for the polarization phase to effectively identify informative samples and fully leverage the limited labeling budget. We provide a theoretical analysis showing that IMBoost consistently decreases inlier risk while increasing outlier risk throughout training, thereby amplifying their separation. Extensive experiments on diverse benchmark datasets demonstrate that IMBoost not only significantly outperforms state-of-the-art active OD methods but also requires substantially less computational cost.
Qualitative Comparative Analysis (QCA) requires researchers to choose calibration and dichotomization thresholds, and these choices can substantially affect truth tables, minimization, and resulting solution formulas. Despite this dependency, threshold sensitivity is often examined only in an ad hoc manner because repeated analyses are time-intensive and error-prone. We present TSQCA, an R package that automates threshold-sweep analyses by treating thresholds as explicit analytical variables. It provides four sweep functions (otSweep, ctSweepS, ctSweepM, dtSweep) to explore outcome thresholds, single-condition thresholds, multi-condition threshold grids, and joint outcome-condition threshold spaces, respectively. TSQCA integrates with the established CRAN package QCA for truth table construction and Boolean minimization, while returning structured S3 objects with consistent print/summary methods and optional detailed results. The package also supports automated Markdown report generation and configuration-chart output to facilitate reproducible documentation of cross-threshold results.
This paper is concerned with the approximation of a function $u$ in a given approximation space $V_m$ of dimension $m$ from evaluations of the function at $n$ suitably chosen points. The aim is to construct an approximation of $u$ in $V_m$ which yields an error close to the best approximation error in $V_m$ while using as few evaluations as possible. Classical least-squares regression, which defines a projection in $V_m$ from $n$ random points, usually requires a large $n$ to guarantee a stable approximation and an error close to the best approximation error. This is a major drawback for applications where $u$ is expensive to evaluate. One remedy is to use a weighted least-squares projection using $n$ samples drawn from a properly selected distribution. In this paper, we introduce a boosted weighted least-squares method which ensures, almost surely, the stability of the weighted least-squares projection with a sample size close to the interpolation regime $n=m$. It consists in sampling according to a measure associated with the optimization of a stability criterion over a collection of independent $n$-samples, and resampling according to this measure until a stability condition is satisfied. A greedy method is then proposed to remove points from the obtained sample. Quasi-optimality properties are obtained for the weighted least-squares projection, with or without the greedy procedure. The proposed method is validated on numerical examples and compared to state-of-the-art interpolation and weighted least-squares methods.
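For context, the weighted least-squares projection and the sampling density commonly used in this literature (stated here as background, not taken from the paper) are as follows: given an orthonormal basis $(\varphi_j)_{j=1}^m$ of $V_m$ in $L^2(\mu)$, draw $x_1,\dots,x_n$ i.i.d. from $d\rho = w^{-1}\, d\mu$ and compute
$$
w(x)^{-1} = \frac{1}{m}\sum_{j=1}^m \lvert \varphi_j(x)\rvert^2,
\qquad
u_W = \arg\min_{v \in V_m} \frac{1}{n}\sum_{i=1}^n w(x_i)\,\lvert u(x_i) - v(x_i)\rvert^2 ;
$$
the boosting step described above then resamples such $n$-point sets to optimize a stability criterion before the greedy point-removal stage.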
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear if the risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexity of the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
We construct multiperiodic processes -- a simple example of stationary ergodic (but not mixing) processes over the natural numbers that have a vanishing entropy rate under a mild condition. Multiperiodic processes are supported on randomly shifted deterministic sequences called multiperiodic sequences, which can be efficiently generated using an algorithm called the Infinite Clock. Under a suitable parameterization, multiperiodic sequences exhibit relative frequencies of particular numbers given by Zipf's law. In exactly the same setting, the respective multiperiodic processes satisfy an asymptotic power-law growth of block entropy, called Hilberg's law. Hilberg's law is deemed to hold for statistical language models in particular.
Stochastic simulation models effectively capture complex system dynamics but are often too slow for real-time decision-making. Traditional metamodeling techniques learn relationships between simulator inputs and a single output summary statistic, such as the mean or median. These techniques enable real-time predictions without additional simulations. However, they require prior selection of one appropriate output summary statistic, limiting their flexibility in practical applications. We propose a new concept: generative metamodeling. It aims to construct a "fast simulator of the simulator," generating random outputs significantly faster than the original simulator while preserving approximately equal conditional distributions. Generative metamodels enable rapid generation of numerous random outputs upon input specification, facilitating immediate computation of any summary statistic for real-time decision-making. We introduce a new algorithm, quantile-regression-based generative metamodeling (QRGMM), and establish its distributional convergence and convergence rate. Extensive numerical experiments demonstrate QRGMM's efficacy compared to other state-of-the-art generative algorithms in practical real-time decision-making scenarios.
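A hedged sketch of the underlying idea (illustrative only; QRGMM's actual construction, convergence theory, and implementation are in the paper): fit quantile regressions of the simulator output on the input at a grid of levels, then generate by drawing a uniform variate and interpolating the predicted quantile curve.

```python
# Quantile-regression-based generation: learn conditional quantiles on a grid of
# levels, then sample via inverse-CDF interpolation at a new input.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)

# "Simulator" with heteroscedastic noise, so the whole conditional distribution matters.
def simulator(x, rng):
    return 2.0 * x + (0.5 + x) * rng.normal(size=x.shape)

X = rng.uniform(0, 2, size=2000)
Y = simulator(X, rng)

taus = np.linspace(0.02, 0.98, 49)
models = [QuantileRegressor(quantile=t, alpha=0.0, solver="highs").fit(X.reshape(-1, 1), Y)
          for t in taus]

def generate(x_new, n_draws, rng):
    q = np.array([m.predict(np.array([[x_new]]))[0] for m in models])  # quantile curve at x_new
    u = rng.uniform(taus[0], taus[-1], size=n_draws)
    return np.interp(u, taus, q)     # inverse-CDF sampling via interpolation (tails truncated)

draws = generate(x_new=1.5, n_draws=5000, rng=rng)
ref = simulator(np.full(5000, 1.5), rng)
for name, s in [("generated", draws), ("simulator", ref)]:
    q25, q75 = np.percentile(s, [25, 75])
    print(f"{name}: mean={s.mean():.2f}  IQR=({q25:.2f}, {q75:.2f})")
```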
We consider the optimization problem associated with training two-layer ReLU networks with \(d\) inputs under the squared loss, where the labels are generated by a target network. Recent work has identified two distinct classes of infinite families of minima: one whose training loss vanishes in the high-dimensional limit, and another whose loss remains bounded away from zero. The latter family is empirically avoided by stochastic gradient descent, hence \emph{hidden}, motivating the search for analytic criteria that distinguish hidden from non-hidden minima. A key challenge is that prior analyses have shown the Hessian spectra at hidden and non-hidden minima to coincide up to terms of order \(O(d^{-1/2})\), seemingly limiting the discriminative power of spectral methods. We therefore take a different route, studying instead certain curves along which the loss is locally minimized. Our main result shows that arcs emanating from hidden minima exhibit distinctive structural and symmetry properties, arising precisely from \(\Omega(d^{-1/2})\) eigenvalue contributions that are absent from earlier analyses.
This note discusses the interpretation of event-study plots produced by recent difference-in-differences methods. I show that even when specialized to the case of non-staggered treatment timing, the default plots produced by software for several of the most popular recent methods do not match those of traditional two-way fixed effects (TWFE) event-studies. The plots produced by the new methods may show a kink or jump at the time of treatment even when the TWFE event-study shows a straight line. This difference stems from the fact that the new methods construct the pre-treatment coefficients asymmetrically from the post-treatment coefficients. As a result, visual heuristics for evaluating violations of parallel trends using TWFE event-study plots should not be immediately applied to those from these methods. I conclude with practical recommendations for constructing and interpreting event-study plots when using these methods.
Constructing the architecture of a neural network is a challenging pursuit for the machine learning community, and the dilemma of whether to go deeper or wider remains a persistent question. This paper explores a comparison between deeper neural networks (DeNNs) with a flexible number of layers and wider neural networks (WeNNs) with limited hidden layers, focusing on their optimal generalization error in Sobolev losses. Analytical investigations reveal that the architecture of a neural network can be significantly influenced by various factors, including the number of sample points, parameters within the neural networks, and the regularity of the loss function. Specifically, a higher number of parameters tends to favor WeNNs, while an increased number of sample points and greater regularity in the loss function lean towards the adoption of DeNNs. We ultimately apply this theory to address partial differential equations using deep Ritz and physics-informed neural network (PINN) methods, guiding the design of neural networks.
In multiclass classification over $n$ outcomes, the outcomes must be embedded into the reals with dimension at least $n-1$ in order to design a consistent surrogate loss that leads to the "correct" classification, regardless of the data distribution. For large $n$, such as in information retrieval and structured prediction tasks, optimizing a surrogate in $n-1$ dimensions is often intractable. We investigate ways to trade off surrogate loss dimension, the number of problem instances, and restricting the region of consistency in the simplex for multiclass classification. Following past work, we examine an intuitive embedding procedure that maps outcomes into the vertices of convex polytopes in a low-dimensional surrogate space. We show that full-dimensional subsets of the simplex exist around each point mass distribution for which consistency holds, but also, with less than $n-1$ dimensions, there exist distributions for which a phenomenon called hallucination occurs, which is when the optimal report under the surrogate loss is an outcome with zero probability. Looking towards application, we derive a result to check if consistency holds under a given polytope embedding and low-noise assumption, providing insight into when to use a particular embedding. We provide examples of embedding $n = 2^{d}$ outcomes into the $d$-dimensional unit cube and $n = d!$ outcomes into the $d$-dimensional permutahedron under low-noise assumptions. Finally, we demonstrate that with multiple problem instances, we can learn the mode with $\frac{n}{2}$ dimensions over the whole simplex.
Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.
When we interpret linear regression as estimating causal effects justified by quasi-experimental treatment variation, what do we mean? This paper formalizes a minimal criterion for quasi-experimental interpretation and characterizes its necessary implications. A minimal requirement is that the regression always estimates some contrast of potential outcomes under the true treatment assignment process. This requirement implies linear restrictions on the true distribution of treatment. If the regression were to be interpreted quasi-experimentally, these restrictions imply candidates for the true distribution of treatment, which we call implicit designs. Regression estimators are numerically equivalent to augmented inverse propensity weighting (AIPW) estimators using an implicit design. Implicit designs serve as a framework that unifies and extends existing theoretical results on causal interpretation of regression across starkly distinct settings (including multiple treatment, panel, and instrumental variables). They lead to new theoretical insights for widely used but less understood specifications.
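For reference, in the binary-treatment case the AIPW form to which the regression estimator is numerically equivalent (with the propensities $\hat e$ supplied by an implicit design) is the standard
$$
\hat\tau_{\mathrm{AIPW}}
= \frac{1}{n}\sum_{i=1}^n\left[
\hat\mu_1(X_i)-\hat\mu_0(X_i)
+\frac{D_i\,\{Y_i-\hat\mu_1(X_i)\}}{\hat e(X_i)}
-\frac{(1-D_i)\,\{Y_i-\hat\mu_0(X_i)\}}{1-\hat e(X_i)}
\right],
$$
where $\hat\mu_0,\hat\mu_1$ are outcome regressions; the paper's equivalence extends beyond this binary setting to multiple treatments, panels, and instrumental variables.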
Existing works on the expressive power of neural networks typically assume real parameters and exact operations. In this work, we study the expressive power of quantized networks under discrete fixed-point parameters and inexact fixed-point operations with round-off errors. We first provide a necessary condition and a sufficient condition on fixed-point arithmetic and activation functions for quantized networks to represent all fixed-point functions from fixed-point vectors to fixed-point numbers. Then, we show that various popular activation functions satisfy our sufficient condition, e.g., Sigmoid, ReLU, ELU, SoftPlus, SiLU, Mish, and GELU. In other words, networks using those activation functions are capable of representing all fixed-point functions. We further show that our necessary condition and sufficient condition coincide under a mild condition on activation functions: e.g., for an activation function $\sigma$, there exists a fixed-point number $x$ such that $\sigma(x)=0$. Namely, we find a necessary and sufficient condition for a large class of activation functions. We lastly show that even quantized networks using binary weights in $\{-1,1\}$ can also represent all fixed-point functions for practical activation functions.
Neural activity fluctuates over a wide range of timescales within and across brain areas. Experimental observations suggest that diverse neural timescales reflect information in dynamic environments. However, how timescales are defined and measured from brain recordings vary across the literature. Moreover, these observations do not specify the mechanisms underlying timescale variations, nor whether specific timescales are necessary for neural computation and brain function. Here, we synthesize three directions where computational approaches can distill the broad set of empirical observations into quantitative and testable theories: We review (i) how different data analysis methods quantify timescales across distinct behavioral states and recording modalities, (ii) how biophysical models provide mechanistic explanations for the emergence of diverse timescales, and (iii) how task-performing networks and machine learning models uncover the functional relevance of neural timescales. This integrative computational perspective thus complements experimental investigations, providing a holistic view on how neural timescales reflect the relationship between brain structure, dynamics, and behavior.
The Gibbs sampler (a.k.a. Glauber dynamics and heat-bath algorithm) is a popular Markov Chain Monte Carlo algorithm which iteratively samples from the conditional distributions of a probability measure $\pi$ of interest. Under the assumption that $\pi$ is strongly log-concave, we show that the random scan Gibbs sampler contracts in relative entropy and provide a sharp characterization of the associated contraction rate. Assuming that evaluating conditionals is cheap compared to evaluating the joint density, our results imply that the number of full evaluations of $\pi$ needed for the Gibbs sampler to mix grows linearly with the condition number and is independent of the dimension. If $\pi$ is non-strongly log-concave, the convergence rate in entropy degrades from exponential to polynomial. Our techniques are versatile and extend to Metropolis-within-Gibbs schemes and the Hit-and-Run algorithm. A comparison with gradient-based schemes and the connection with the optimization literature are also discussed.
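A minimal sketch of a random-scan Gibbs sampler on a strongly log-concave target (a bivariate Gaussian, where both conditionals are available in closed form); this illustrates the algorithm being analyzed, not the paper's proof techniques.

```python
# Random-scan Gibbs sampler for a bivariate Gaussian N(0, [[1, rho], [rho, 1]]):
# at each step, pick a coordinate uniformly at random and resample it from its
# exact conditional distribution.
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)

def gibbs_random_scan(n_iter, rng):
    x = np.zeros(2)
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        i = rng.integers(2)                    # random coordinate
        j = 1 - i
        # conditional of x_i given x_j: N(rho * x_j, 1 - rho^2)
        x[i] = rho * x[j] + np.sqrt(1 - rho**2) * rng.normal()
        samples[t] = x
    return samples

s = gibbs_random_scan(50_000, rng)[5_000:]     # drop burn-in
print("empirical correlation:", np.corrcoef(s.T)[0, 1].round(3), "target:", rho)
```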
Tabular data is one of the most ubiquitous sources of information worldwide, spanning a wide variety of domains. This inherent heterogeneity has slowed the development of Tabular Foundation Models (TFMs) capable of fast generalization to unseen datasets. In-Context Learning (ICL) has recently emerged as a promising solution for TFMs, enabling dynamic adaptation to new tasks without additional tuning. While many studies have attempted to re-purpose large language models for tabular ICL, they have had limited success, so recent works have focused on developing tabular-specific foundation models. In this work, we propose an approach that combines ICL-based retrieval with self-supervised learning to train tabular foundation models. We also investigate the utility of real vs. synthetic data for model pre-training, and show that real data can contain useful signal not easily captured in synthetic training. Specifically, we show that incorporating real data during the pre-training phase can lead to significantly faster training and better downstream generalization to unseen data. Our resulting model, TabDPT, achieves strong performance on both regression (CTR23) and classification (CC18) benchmarks. Importantly, we also demonstrate that with our pre-training procedure, scaling both model and data size leads to consistent performance improvements that follow power laws. This echoes scaling laws in LLMs and other foundation models, and suggests that large-scale TFMs are achievable. We open-source our full pipeline: inference code including trained model weights can be found at this http URL, and the training code to reproduce experiments can be found at this http URL.
Deep learning has led to tremendous success in computer vision, largely due to Convolutional Neural Networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations. This vulnerability to adversarial examples has motivated research into improving model robustness through adversarial detection and defense methods. In this paper, we address the adversarial robustness of CNNs through causal reasoning. We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns both causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. We then perform a statistical analysis of the filters' CI across clean and adversarial samples, demonstrating that adversarial examples exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances adversarial detection without the need to train a separate detector. Moreover, we illustrate the usefulness of causal explanations as a detection tool by visualizing the extracted causal features.
Data assimilation (DA) combines partial observations with dynamical models to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the utilization of observations only within a nearby window, thus reducing computational complexity and storage needs. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, the advantage of reduced computational storage expenditure is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence.
Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces mirror-flow dynamics in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Lojasiewicz condition, and clarifies how the parametrization and initialization scale govern the training speed. Overall, our results demonstrate that deep diagonal overparameterizations, despite their apparent complexity, can endow standard gradient methods with well-behaved and interpretable optimization dynamics.
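As a reminder of the object referred to above, a mirror flow on the effective parameter $w$ with potential $\Phi$ and loss $L$ is defined by
$$
\frac{d}{dt}\,\nabla\Phi(w_t) \;=\; -\,\nabla L(w_t),
\qquad\text{equivalently}\qquad
\dot w_t \;=\; -\bigl(\nabla^2\Phi(w_t)\bigr)^{-1}\nabla L(w_t).
$$
This is only the generic definition; the specific potential induced by the diagonal layers and the initialization scale is the paper's contribution and is not reproduced here.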
We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.
This paper introduces a rule for policy selection in the presence of estimation uncertainty, explicitly accounting for estimation risk. The rule belongs to the class of risk-aware rules on the efficient decision frontier, characterized as policies offering maximal estimated welfare for a given level of estimation risk. Among this class, the proposed rule is chosen to provide a reporting guarantee, ensuring that the welfare delivered exceeds a threshold with a pre-specified confidence level. We apply this approach to the allocation of a limited budget among social programs using estimates of their marginal value of public funds and associated standard errors.
While electric vehicle (EV) adoption has been widely studied, most research focuses on the average effects of predictors on purchase intent, overlooking variation across the distribution of EV purchase intent. This paper makes a threefold contribution by analyzing four unique explanatory variables, leveraging large-scale US survey data from 2021 to 2023, and employing Bayesian ordinal probit and Bayesian ordinal quantile modeling to evaluate the effects of these variables, while controlling for other commonly used covariates, on EV purchase intent, both on average and across its full distribution. By modeling purchase intent as an ordered outcome, ranging from "not at all likely" to "very likely", we reveal how covariate effects differ across levels of interest. This is the first application of ordinal quantile modeling in the EV adoption literature, uncovering heterogeneity in how potential buyers respond to key factors. For instance, confidence in the development of charging infrastructure and belief in environmental benefits are linked not only to higher interest among likely adopters but also to reduced resistance among more skeptical respondents. Notably, we identify a gap between the prevalence and influence of key predictors: although few respondents report strong infrastructure confidence or frequent EV information exposure, both factors are strongly associated with increased intent across the spectrum. These findings suggest clear opportunities for targeted communication and outreach, alongside infrastructure investment, to support widespread EV adoption.
Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. Our framework defines hierarchical distances and kernels to enable surrogate modeling and optimization on hierarchical domains. We demonstrate its effectiveness on complex system design problems, including a neural network and a green-aircraft case study. Our methods are available in the open-source Surrogate Modeling Toolbox (SMT 2.0).
Accurately forecasting Climate Policy Uncertainty (CPU) is essential for designing climate strategies that balance economic growth with environmental objectives. Elevated CPU levels can delay regulatory implementation, hinder investment in green technologies, and amplify public resistance to policy reforms, particularly during periods of economic stress. Despite the growing literature documenting the economic relevance of CPU, forecasting its evolution and understanding the role of macro-financial drivers in shaping its fluctuations have not been explored. This study addresses this gap by presenting the first effort to forecast CPU and identify its key drivers. We employ various statistical tools to identify macro-financial exogenous drivers, alongside Google search data to capture early public attention to climate policy. Local projection impulse response analysis quantifies the dynamic effects of these variables, revealing that household financial vulnerability, housing market activity, business confidence, credit conditions, and financial market sentiment exert the most substantial impacts. These predictors are incorporated into a Bayesian Structural Time Series (BSTS) framework to produce probabilistic forecasts for both US and Global CPU indices. Extensive experiments and statistical validation demonstrate that BSTS with time-invariant regression coefficients achieves superior forecasting performance. We demonstrate that this performance stems from its variable selection mechanism, which identifies exogenous predictors that are empirically significant and theoretically grounded, as confirmed by the feature importance analysis. From a policy perspective, the findings underscore the importance of adaptive climate policies that remain effective across shifting economic conditions while supporting long-term environmental and growth objectives.
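The local projection impulse responses mentioned above can be sketched as a generic Jordà-style implementation: for each horizon h, regress the outcome h periods ahead on the driver today plus lagged controls, and read the impulse response off the driver's coefficient. The snippet below uses synthetic data, OLS with HAC standard errors, and an illustrative lag structure; it is not the study's specification.

```python
import numpy as np
import statsmodels.api as sm

def local_projection_irf(y, x, horizons=12, lags=2):
    """Jorda-style local projections: beta_h from regressing y_{t+h} on x_t and lagged controls."""
    irf, se = [], []
    for h in range(horizons + 1):
        dep = y[lags + h:]
        rows = []
        for t in range(lags, len(y) - h):
            rows.append([x[t]] + list(y[t - lags:t]) + list(x[t - lags:t]))  # regressors dated t
        X = sm.add_constant(np.asarray(rows))
        fit = sm.OLS(dep, X).fit(cov_type="HAC", cov_kwds={"maxlags": h + 1})
        irf.append(fit.params[1])   # column 0 is the constant, column 1 is x_t
        se.append(fit.bse[1])
    return np.array(irf), np.array(se)

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.convolve(x, [0.0, 0.8, 0.5, 0.2], mode="full")[: len(x)] + rng.normal(scale=0.3, size=300)
irf, se = local_projection_irf(y, x)
print(np.round(irf[:4], 2))   # approximately traces the convolution kernel [0, 0.8, 0.5, 0.2]
```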
In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
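A minimal matrix-free sketch of the idea: since $\|A\|_{2\to\infty}$ is the largest row 2-norm, i.e. $\sqrt{\max_i (AA^\top)_{ii}}$, a plain Hutchinson diagonal estimator applied to $AA^\top$ needs only products with $A$ and $A^\top$. This is the basic Hutchinson variant, not the Hutch++-based method analyzed in the paper, and the probe count is illustrative.

```python
import numpy as np

def two_to_inf_estimate(matvec, rmatvec, n_rows, n_probes=64, seed=0):
    """Estimate ||A||_{2->inf} = max row 2-norm using only products with A and A^T.

    Hutchinson's diagonal estimator applied to M = A A^T:
        diag(M) ~ mean_k  g_k * (A (A^T g_k)),  with Rademacher probes g_k.
    """
    rng = np.random.default_rng(seed)
    diag_est = np.zeros(n_rows)
    for _ in range(n_probes):
        g = rng.choice([-1.0, 1.0], size=n_rows)
        diag_est += g * matvec(rmatvec(g))   # one product with A^T, one with A
    diag_est /= n_probes
    return np.sqrt(np.maximum(diag_est, 0.0).max())

A = np.random.default_rng(1).normal(size=(500, 200))
est = two_to_inf_estimate(lambda v: A @ v, lambda v: A.T @ v, A.shape[0])
print(est, np.linalg.norm(A, axis=1).max())   # estimate vs exact max row norm
```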
Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on a real-world 3D-printed material dataset and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: this https URL
The process of discovering equations from data lies at the heart of physics and many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression have emerged as a way to automate this task. This study presents an overview of the current literature on symbolic regression and compares the effectiveness of five state-of-the-art methods in recovering the governing equations of nine processes, including chaotic dynamics and epidemic models. Benchmark results show that PySR is the most suitable method for inferring equations, with some estimates indistinguishable from the original analytical forms. These results highlight the potential of symbolic regression as a robust tool for inferring and modeling real-world phenomena.
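A minimal usage sketch of PySR on synthetic data generated from a known law; the operator set, iteration count, and target function are illustrative and not the benchmark configuration used in the study.

```python
import numpy as np
from pysr import PySRRegressor

# Synthetic data from a known law: y = 2.5 * x0^2 - x1
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 2.5 * X[:, 0] ** 2 - X[:, 1]

model = PySRRegressor(
    niterations=40,                          # illustrative budget
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "exp"],
)
model.fit(X, y)
print(model)   # prints the Pareto front of discovered equations
```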
The Waymo Open Motion Dataset (WOMD) has become a popular resource for data-driven modeling of autonomous vehicles (AVs) behavior. However, its validity for behavioral analysis remains uncertain due to proprietary post-processing, the absence of error quantification, and the segmentation of trajectories into 20-second clips. This study examines whether WOMD accurately captures the dynamics and interactions observed in real-world AV operations. Leveraging an independently collected naturalistic dataset from Level 4 AV operations in Phoenix, Arizona (PHX), we perform comparative analyses across three representative urban driving scenarios: discharging at signalized intersections, car-following, and lane-changing behaviors. For the discharging analysis, headways are manually extracted from aerial video to ensure negligible measurement error. For the car-following and lane-changing cases, we apply the Simulation-Extrapolation (SIMEX) method to account for empirically estimated error in the PHX data and use Dynamic Time Warping (DTW) distances to quantify behavioral differences. Results across all scenarios consistently show that behavior in PHX falls outside the behavioral envelope of WOMD. Notably, WOMD underrepresents short headways and abrupt decelerations. These findings suggest that behavioral models calibrated solely on WOMD may systematically underestimate the variability, risk, and complexity of naturalistic driving. Caution is therefore warranted when using WOMD for behavior modeling without proper validation against independently collected data.
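The DTW distances used to quantify behavioral differences can be computed with the textbook dynamic program below. This is a generic implementation on synthetic speed profiles; the SIMEX measurement-error correction applied in the study is not shown.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# e.g. compare a car-following speed profile against a time-shifted, noisier copy
t = np.linspace(0, 10, 200)
profile = 15 + 5 * np.sin(t)
shifted = 15 + 5 * np.sin(t - 0.5) + np.random.default_rng(0).normal(0, 0.2, t.size)
print(dtw_distance(profile, shifted))
```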
We present the design of an autoregressive active inference agent in the form of message passing on a factor graph. Expected free energy is derived and distributed across a planning graph. The proposed agent is validated on a robot navigation task, demonstrating exploration and exploitation in a continuous-valued observation space with bounded continuous-valued actions. Compared to a classical optimal controller, the agent modulates action based on predictive uncertainty, arriving later but with a better model of the robot's dynamics.
We survey different perspectives on the stochastic localization process of Eldan, a powerful construction that has had many exciting recent applications in high-dimensional probability and algorithm design. Unlike prior surveys on this topic, our focus is on giving a self-contained presentation of all known alternative constructions of Eldan's stochastic localization, with an emphasis on connections between different constructions. Our hope is that by collecting these perspectives, some of which had primarily arisen within a particular community (e.g., probability theory, theoretical computer science, information theory, or machine learning), we can broaden the accessibility of stochastic localization, and ease its future use.
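One commonly used construction, stated here only as a reminder of the object (the Gaussian-channel or "observation process" form, which is one of several equivalent constructions the survey collects): given a target measure $\mu$ on $\mathbb{R}^n$, define
$$
\mu_t(dx) \;\propto\; \exp\!\Bigl(\langle y_t, x\rangle - \tfrac{t}{2}\|x\|^2\Bigr)\,\mu(dx),
\qquad
dy_t = a_t\,dt + dB_t,\quad y_0 = 0,
$$
where $a_t = \int x\,\mu_t(dx)$ is the barycenter of $\mu_t$ and $B_t$ is a standard Brownian motion. The process $(\mu_t)_{t\ge 0}$ is a measure-valued martingale that localizes, as $t\to\infty$, to a point mass at a sample from $\mu$.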
Modeling clinical time-series data is hampered by the challenge of capturing latent, time-varying dependencies among features. State-of-the-art approaches often rely on black-box mechanisms or simple aggregation, failing to explicitly model how the influence of one clinical variable propagates through others over time. We propose $\textbf{Chain-of-Influence (CoI)}$, an interpretable deep learning framework that constructs an explicit, time-unfolded graph of feature interactions. CoI enables the tracing of influence pathways, providing a granular audit trail that shows how any feature at any time contributes to the final prediction, both directly and through its influence on other variables. We evaluate CoI on mortality and disease progression tasks using the MIMIC-IV dataset and a chronic kidney disease (CKD) cohort. Our framework achieves state-of-the-art predictive performance (AUROC of 0.960 on CKD progression and 0.950 on ICU mortality), with deletion-based sensitivity analyses confirming that CoI's learned attributions faithfully reflect its decision process. Through case studies, we demonstrate that CoI uncovers clinically meaningful, patient-specific patterns of disease progression, offering enhanced transparency into the temporal and cross-feature dependencies that inform clinical decision-making.
Effective assessment of mobile network coverage and the precise identification of service weak spots are paramount for network operators striving to enhance user Quality of Experience (QoE). This paper presents a novel framework for mobile coverage and weak-spot analysis utilising crowdsourced QoE data. The core of our methodology involves coverage analysis at the individual cell (antenna) level, subsequently aggregated to the site level, using empirical geolocation data. A key contribution of this research is the application of the One-Class Support Vector Machine (OC-SVM) algorithm for calculating mobile network coverage. This approach models the decision hyperplane as the effective coverage contour, facilitating robust calculation of coverage areas for individual cells and entire sites. The same methodology is extended to analyse crowdsourced service-loss reports, thereby identifying and quantifying geographically localised weak spots. Our findings demonstrate the efficacy of this novel framework in accurately mapping mobile coverage and, crucially, in highlighting granular areas of signal deficiency, particularly within complex urban environments.
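A minimal sketch of fitting an OC-SVM coverage contour to geolocated measurements with scikit-learn; the synthetic data and the `nu`/`gamma` choices are illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Illustrative geolocated measurements (lon/lat) for one cell; all values are synthetic
rng = np.random.default_rng(0)
points = rng.normal(loc=[-0.12, 51.50], scale=[0.010, 0.006], size=(500, 2))

# Standardise coordinates, then fit the one-class boundary.
# nu bounds the fraction of measurements left outside the contour (illustrative value).
scaler = StandardScaler().fit(points)
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(scaler.transform(points))

# decision_function >= 0 marks locations inside the fitted coverage contour
grid_lon, grid_lat = np.meshgrid(
    np.linspace(points[:, 0].min(), points[:, 0].max(), 200),
    np.linspace(points[:, 1].min(), points[:, 1].max(), 200),
)
grid = np.column_stack([grid_lon.ravel(), grid_lat.ravel()])
inside = oc_svm.decision_function(scaler.transform(grid)) >= 0
print("fraction of bounding box inside the coverage contour:", inside.mean())
```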
Risk assessment tools in healthcare commonly employ point-based scoring systems that map patients to ordinal risk categories via thresholds. While electronic health record (EHR) data presents opportunities for data-driven optimization of these tools, two fundamental challenges impede standard supervised learning: (1) labels are often available only for extreme risk categories due to intervention-censored outcomes, and (2) misclassification cost is asymmetric and increases with ordinal distance. We propose a mixed-integer programming (MIP) framework that jointly optimizes scoring weights and category thresholds in the face of these challenges. Our approach prevents label-scarce category collapse via threshold constraints, and utilizes an asymmetric, distance-aware objective. The MIP framework supports governance constraints, including sign restrictions, sparsity, and minimal modifications to incumbent tools, ensuring practical deployability in clinical workflows. We further develop a continuous relaxation of the MIP problem to provide warm-start solutions for more efficient MIP optimization. We apply the proposed score optimization framework to a case study of inpatient falls risk assessment using the Johns Hopkins Fall Risk Assessment Tool.
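A deliberately simplified sketch of the kind of joint weight-and-threshold MIP described above, written with PuLP: two extreme risk categories, integer point values, a single threshold, big-M misclassification indicators, and an asymmetric cost. The data, bounds, and costs are all illustrative, and the paper's full ordinal, governance-constrained formulation is not reproduced.

```python
import numpy as np
import pulp

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.integers(0, 2, size=(n, p))   # binary risk factors
y = rng.integers(0, 2, size=n)        # labels available only for the two extreme categories

M = 100                               # big-M constant (scores are bounded by 5 * p = 25)
prob = pulp.LpProblem("score_design", pulp.LpMinimize)

# integer point values in [0, 5] and one threshold separating the two categories
w = [pulp.LpVariable(f"w{j}", lowBound=0, upBound=5, cat="Integer") for j in range(p)]
t = pulp.LpVariable("t", lowBound=0, upBound=5 * p, cat="Integer")
z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(n)]   # misclassification indicators

for i in range(n):
    score_i = pulp.lpSum(int(X[i, j]) * w[j] for j in range(p))
    if y[i] == 1:   # high risk: want score >= t + 1 unless flagged as an error
        prob += score_i >= t + 1 - M * z[i]
    else:           # low risk: want score <= t unless flagged as an error
        prob += score_i <= t + M * z[i]

# asymmetric cost: missing a high-risk patient costs more than over-flagging a low-risk one
prob += pulp.lpSum((3 if y[i] == 1 else 1) * z[i] for i in range(n))
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(v.value()) for v in w], int(t.value()))
```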
Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable's mechanism.
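One natural way to formalize such interventions (an illustrative definition consistent with the description above, not necessarily the paper's exact notation): in an SCM with mechanisms $X_i = f_i(\mathrm{pa}(X_i), U_i)$, a shift-scale intervention with parameters $(a, b)$ on $X_i$ replaces that single mechanism by
$$
X_i \;=\; a\, f_i\bigl(\mathrm{pa}(X_i), U_i\bigr) + b,
$$
leaving all other mechanisms and the joint noise distribution unchanged; setting $a = 0$ recovers a hard intervention $X_i = b$.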
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See this http URL for further details.
Finding cause-effect relationships is of key importance in science. Causal discovery aims to recover a graph from data that succinctly describes these cause-effect relationships. However, current methods face several challenges, especially when dealing with high-dimensional data and complex dependencies. Incorporating prior knowledge about the system can aid causal discovery. In this work, we leverage Cluster-DAGs as a prior knowledge framework to warm-start causal discovery. We show that Cluster-DAGs offer greater flexibility than existing approaches based on tiered background knowledge and introduce two modified constraint-based algorithms, Cluster-PC and Cluster-FCI, for causal discovery in the fully and partially observed setting, respectively. Empirical evaluation on simulated data demonstrates that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.
Model combination is a powerful approach for achieving superior performance compared to selecting a single model. We study both theoretically and empirically the effectiveness of ensembles of Multi-Frequency Echo State Networks (MFESNs), which have been shown to achieve state-of-the-art macroeconomic time series forecasting results (Ballarin et al., 2024a). The Hedge and Follow-the-Leader schemes are discussed, and their online learning guarantees are extended to settings with dependent data. In empirical applications, the proposed Ensemble Echo State Networks demonstrate significantly improved predictive performance relative to individual MFESN models.
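The Hedge scheme mentioned above maintains weights that are exponential in cumulative loss. A minimal sketch on synthetic losses follows; the learning rate and loss model are illustrative, and the paper's extension of the guarantees to dependent data is not reproduced here.

```python
import numpy as np

def hedge_weights(losses, eta=0.1):
    """Hedge / exponential-weights combination of expert forecasts.

    losses: (T, K) array of per-period losses for K models (e.g. individual MFESNs).
    Returns the (T, K) weight sequence; weights at time t use only losses before t.
    """
    T, K = losses.shape
    cum = np.zeros(K)
    weights = np.empty((T, K))
    for t in range(T):
        w = np.exp(-eta * cum)
        weights[t] = w / w.sum()
        cum += losses[t]
    return weights

rng = np.random.default_rng(0)
losses = rng.gamma(shape=2.0, scale=[0.5, 1.0, 1.5], size=(200, 3))  # model 0 is best on average
w = hedge_weights(losses)
ensemble_loss = (w * losses).sum(axis=1).mean()
print(w[-1].round(2), ensemble_loss)   # weight mass concentrates on the best model
```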
Simultaneous load forecasting across multiple entities (e.g., regions, buildings) is crucial for the efficient, reliable, and cost-effective operation of power systems. Accurate load forecasting is a challenging problem due to the inherent uncertainties in load demand, dynamic changes in consumption patterns, and correlations among entities. Multi-task learning has emerged as a powerful machine learning approach that enables simultaneous learning across multiple related problems. However, its application to load forecasting remains underexplored and is limited to offline learning methods, which cannot capture changes in consumption patterns. This paper presents an adaptive multi-task learning method for probabilistic load forecasting. The proposed method can dynamically adapt to changes in consumption patterns and correlations among entities. In addition, the techniques presented provide reliable probabilistic predictions for the loads of multiple entities and assess load uncertainties. Specifically, the method is based on vector-valued hidden Markov models and uses a recursive process to update the model parameters and provide predictions with the most recent parameters. The performance of the proposed method is evaluated using datasets that contain the load demand of multiple entities and exhibit diverse and dynamic consumption patterns. The experimental results show that the presented techniques outperform existing methods both in terms of forecasting performance and uncertainty assessment.
Machine learning systems appear stochastic but are in fact deterministic given their seeds: seeded pseudorandom number generators produce identical realisations across repeated executions. Standard evaluation practice typically treats runs across alternatives as independent and does not exploit shared sources of randomness. This paper analyses the statistical structure of comparative evaluation under shared random seeds. Under this design, competing systems are evaluated using identical seeds, inducing matched stochastic realisations and yielding strict variance reduction whenever outcomes are positively correlated at the seed level. We demonstrate these effects using an extended learning-based multi-agent economic simulator, where paired evaluation exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.
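A toy numerical illustration of the variance-reduction mechanism: when two systems' outcomes are positively correlated at the seed level, the paired (shared-seed) standard error of the difference is strictly smaller than the independent-runs standard error. The numbers below are synthetic and unrelated to the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds = 200
seed_effect = rng.normal(0, 1.0, n_seeds)        # shared randomness induced by the common seed

# Two systems whose outcomes share the seed-level noise; B is better by 0.1 on average
a = 1.0 + seed_effect + rng.normal(0, 0.2, n_seeds)
b = 1.1 + seed_effect + rng.normal(0, 0.2, n_seeds)

paired_se = np.std(b - a, ddof=1) / np.sqrt(n_seeds)    # shared seeds: seed noise cancels in the difference
indep_se = np.sqrt(np.var(a, ddof=1) / n_seeds + np.var(b, ddof=1) / n_seeds)  # as if run on independent seeds
print(f"paired SE ~ {paired_se:.3f}, independent SE ~ {indep_se:.3f}")
```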
Identifying high-crash-risk road segments and accurately predicting crash incidence are fundamental to implementing effective safety countermeasures. While collision data inherently reflects risk, the infrequency and inconsistent reporting of crashes present a major challenge to robust risk prediction models. The proliferation of connected vehicle technology offers a promising avenue to leverage high-density safety metrics for enhanced crash forecasting. A Hard-Braking Event (HBE), interpreted as an evasive maneuver, functions as a potent proxy for elevated driving risk due to its demonstrable correlation with underlying crash causal factors. Crucially, HBE data is significantly more readily available across the entire road network than conventional collision records. This study systematically evaluated the correlation, at the individual road-segment level, between police-reported collisions and aggregated, anonymized HBEs identified via the Google Android Auto platform, utilizing datasets from California and Virginia. Empirical evidence revealed that HBEs occur at a rate orders of magnitude higher than traffic crashes. Employing state-of-the-practice Negative Binomial regression models, the analysis established a statistically significant positive correlation between the HBE rate and the crash rate: road segments exhibiting a higher frequency of HBEs were consistently associated with a greater incidence of crashes. The models incorporated and controlled for various confounding factors, including road type, speed profile, proximity to ramps, and road segment slope. HBEs derived from connected vehicle technology thus provide a scalable, high-density surrogate safety metric for network-wide traffic safety assessment, with the potential to inform safer routing recommendations and the strategic deployment of active safety countermeasures.
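A minimal sketch of the kind of Negative Binomial crash-rate regression described above, fit with statsmodels on synthetic segment-level data with an exposure offset; the variable names, coefficients, and dispersion value are illustrative and not the study's.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic segment-level data (all names and coefficients are illustrative)
rng = np.random.default_rng(0)
n = 1000
hbe_rate = rng.gamma(2.0, 2.0, n)                 # hard-braking events per unit exposure
speed = rng.normal(45, 10, n)
near_ramp = rng.integers(0, 2, n)
exposure = rng.uniform(0.5, 5.0, n)               # e.g. segment length times observation years

lam = np.exp(-2.0 + 0.15 * hbe_rate + 0.01 * speed + 0.3 * near_ramp) * exposure
crashes = rng.negative_binomial(n=2.0, p=2.0 / (2.0 + lam))   # NB counts with mean lam

X = sm.add_constant(np.column_stack([hbe_rate, speed, near_ramp]))
model = sm.GLM(crashes, X, family=sm.families.NegativeBinomial(alpha=0.5),
               offset=np.log(exposure)).fit()
print(model.params.round(3))    # positive coefficient on the HBE rate
```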
Forecasting agricultural markets remains challenging due to nonlinear dynamics, structural breaks, and sparse data. A long-standing belief holds that simple time-series methods outperform more advanced alternatives. This paper provides the first systematic evidence that this belief no longer holds with modern time-series foundation models (TSFMs). Using USDA ERS monthly commodity price data from 1997-2025, we evaluate 17 forecasting approaches across four model classes: traditional time-series models, machine learning, deep learning, and five state-of-the-art TSFMs (Chronos, Chronos-2, TimesFM 2.5, Time-MoE, Moirai-2). We construct annual marketing-year price predictions to compare with USDA's futures-based season-average price (SAP) forecasts. We show that zero-shot foundation models consistently outperform traditional time-series methods, machine learning, and deep learning architectures trained from scratch in both monthly and annual forecasting. Remarkably, foundation models also outperform USDA's futures-based forecasts on three of four major commodities despite USDA's information advantage from forward-looking futures markets. Time-MoE delivers the largest accuracy gains, achieving 54.9% improvement on wheat and 18.5% improvement on corn relative to USDA ERS benchmarks on recent data (2017-2024 excluding COVID). These results point to a paradigm shift in agricultural forecasting.