Droughts and flash droughts (rapidly developing droughts; FDs) remain impactful events that are known to desiccate landscape and destroy crops. In particular, droughts in Africa are often more impactful than in other locations, such as the United States or Europe, due to many regions in Africa heavily depending on local agriculture for sustenance. In recent years, large machine learning (ML) models, such as GraphCast and AIFS, have emerged as effective tools for global weather prediction. However, sparse data observations and few ML studies in Africa have left it unclear if these ML models retain their skill when focused on Africa. As such, this project seeks to examine the predictability of drought and FD in Africa using a CrossFormer model based on the Community Research Earth Digital Intelligence Twin (CREDIT) framework developed by NSF NCAR. Our CrossFormer model, termed DroughtFormer, incorporates variables from the ERA5 and GLDAS2 reanalyses and the IMERG and MODIS satellite observations, and employs dry air mass and moisture conservation, to predict soil moisture, vegetation health, and other drought-related surface variables. While DroughtFormer displayed lower accuracy in predicting precipitation and FD indices, it showed significant skill in predicting the remaining variables, delivering stable and skillful forecasts out to 90-day lead times (either beating out or having comparable skill to climatology). In particular, DroughtFormer skillfully represented climate anomalies for key variables, such as soil moisture (though it struggled with the magnitude of the anomalies). Thus, DroughtFormer showed significant promise in representing and predicting agricultural level drought in a region that is heavily impacted by drought events.
We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts regardless of the budget. Motivated by this, we recast intermediate state selection as a monotone submodular maximization problem, where a greedy one-step selector enjoys a 1 minus 1/e approximation guarantee. Our Uncertainty-aware Upper Confidence Bound (UUCB) terms arise as closed-form marginal gains of this objective. This turns the token-level entropy bonus from an empirical trick into an analytic consequence of the formulation. We present InfoTree, a training-time tree-search framework coupling UUCB with a learned Adaptive Budget Allocator (ABA) and an asynchronous Speculative Expansion scheme. ABA rescues prompts whose initial tree is wasted on uniform outcomes, lifting the mixed-outcome ratio from 58.1 percent to 76.3 percent with less than 5 percent budget overhead. Speculative Expansion reduces wall-clock overhead from 14.3 percent to 4.8 percent by tolerating bounded staleness in UUCB scores. Across nine benchmarks spanning math reasoning (AIME 2024 and 2025, MATH-500, OlympiadBench, USAMO), web-search agents (GAIA, HLE-100, BrowseComp-lite), and tool-rich coding and OS agents (APPS-verified, AgentBench-OS), InfoTree outperforms flat GRPO, DeepSearch, Tree-GRPO, AT2PO, CW-GRPO, and RC-GRPO. Head-to-head compositions with Tree-GRPO prefix sharing and CW-GRPO contribution weights deliver further gains, confirming that our selector operates orthogonally to rollout reuse and trajectory re-weighting. A 5 by 5 by 5 robustness grid reveals that over three quarters of the hyperparameter space lies on a performance plateau, confirming UUCB robustness.
Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior to the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceara, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results showed that the proposed model outperforms competing methods in trend detection accuracy, with gains in terms of percentage of correct direction of 38.25% in relation to the second best approach in some cases.
Advances in sensing technology have made it possible to collect large volumes of high-dimensional time-series data. In fields like genetics and neuroscience, key questions concern whether directed relationships between variables can be learned from these data. To this end, graphical vector autoregressions are a popular tool because zeros among the autoregressive coefficients and error precision matrix have natural interpretations in terms of Granger non-causality and contemporaneous conditional independence. In applications where system dynamics are subject to functional or structural constraints, assuming the process is stable can be advantageous. However, enforcing stability demands restricting the autoregressive coefficients to lie in a constrained space with a complex geometry called the stationary region. The resulting inferential challenges are compounded when sparsity is also a requirement. Working in the Bayesian paradigm, we tackle the problem of developing a prior that simultaneously enforces stationarity and sparsity through parameter expansion, constructing a spike-and-slab prior with support constrained to the stationary region. A mixture of G-Wishart distributions provides a sparse prior for the error precision matrix. Computational inference is carried out using Metropolis-within-Gibbs, exploiting the No-U-Turn Sampler and reversible-jump steps. We demonstrate the inferential and predictive benefits of our approach through simulations and applications in macroeconomics and neuroscience.
Covariance matrix outcomes arise naturally in neuroimaging experiments to study brain functional connectivity. It is also of interest to understand how brain network organization varies with subject-level covariates. Existing covariance regression methods operate in a single-level framework and do not accommodate the hierarchically nested data structure in which subjects are grouped into clusters, such as age cohorts in lifespan studies. A Multilevel Covariate-Assisted Principal Regression (MCAP) framework is introduced, which identifies, for each cluster, a linear projection such that a generalized linear mixed effects model can be formulated with the covariates. The cluster-specific projections are modeled on the unit sphere via a von Mises-Fisher distribution, enabling principled borrowing of information across clusters. Model parameters are estimated by maximizing a hierarchical likelihood. For inference, a two-stage bootstrap procedure is proposed. Asymptotic properties of the estimators are established. Simulation studies demonstrate that MCAP substantially outperforms single-level competitors in estimating regression coefficients. Applied to the Human Connectome Project Lifespan Study spanning ages from five to ninety, MCAP identifies a dominant spectral brain network capturing age and sex effects on functional connectivity, and reveals findings including the convergence of neural reorganization patterns in late adulthood and the coordinated lifespan modulation of cross-network regions linked to language and executive function.
Sampling geographically dispersed minority populations poses substantial challenges when individual group membership cannot be directly observed. Although stratified sampling can offer efficiency gains, these gains are typically modest unless the minority population is highly concentrated within a small number of strata. In this paper, we propose using Bayesian Improved Surname Geocoding (BISG) to enhance the efficiency of minority population sampling. BISG generates individual-level probabilities of minority group membership based on names and residential addresses. We incorporate these probabilities into a stratified Poisson probability sampling design. Applying the proposed approach to a national survey of Jewish Americans, we find that our estimates closely align with those from a large-scale Pew Research Center survey of the same population, which relied on a substantially more expensive sampling strategy involving geographic stratification and screening. At a fraction of the cost, our survey reproduces nearly identical patterns observed by Pew, including estimates of religious denominations and participation in specific religious activities.
High-dimensional spatially correlated covariates are common in regression models encountered in environmental sciences and other fields. In such models, the regression coefficients often exhibit a sparse structure with spatial dependence. Although standard variable selection approaches can help detect the sparse structure, incorporating the dependence into variable selection helps recover spatially contiguous signals and improves prediction accuracy. Motivated by a real-world challenge in hurricane count prediction, we propose a novel neighborhood-structured global-local shrinkage prior for prediction and region selection in Poisson regression with spatial covariates. The proposed prior combines the Conditional Auto-Regressive (CAR) prior with a Super Heavy-tailed prior to introduce spatial dependence among the coefficients while ensuring appropriate shrinkage effects for covariate selection. We develop an efficient Metropolis-within-Gibbs sampler for computation that accommodates the count data. Extensive simulation studies demonstrate that the proposed model excels when signals are weak and adjacent and the spatial dependence in covariates is strong. In the application of hurricane prediction from the north Atlantic, our method outperforms traditional regression-based approaches and rivals the benchmark oracle model.
In large observational studies, the case-cohort design is commonly used to reduce the cost associated with covariate measurement. For survival outcomes, literature has suggested that the restricted mean survival time (RMST) be a more appropriate marginal causal effect measure than the hazard ratio. In this paper, we develop a marginal causal effect estimation method for RMST difference under the stratified case-cohort design. We adjust for measured confounders using an innovative template matching design. Compared with conventional matching designs, template matching allows greater flexibility in the sample sizes of the exposed and unexposed groups. We establish the asymptotic properties of the proposed causal effect estimators and develop a bootstrap procedure to estimate their variances. By conducting comprehensive simulation studies, we evaluate the finite sample performance of the proposed estimators, demonstrate the advantage of template matching over conventional matching, and compare between matching on propensity score and matching on covariates. Finally, we apply the proposed methods to the Atherosclerosis Risk in Communities (ARIC) Study to estimate the marginal causal effect of serum hs-CRP level on the coronary heart disease-free survival.
Kappa distributions are widely used in space plasma physics to model velocity distribution functions with extended tails. However, parameter estimation in these distributions presents a fundamental challenge: the kappa distribution does not belong to the exponential family, which prevents the direct application of analytically tractable maximum likelihood methods. In this work we propose a solution to this problem based on data augmentation: we introduce the inverse temperature $\beta$ as a gamma-distributed latent variable, thereby recovering the exponential family structure in the complete-data likelihood. This enables an implementation of the expectation-maximization (EM) algorithm in analytically closed form, with E-step and M-step derived from sufficient statistics. Our approach is agnostic with respect to the underlying physical mechanism generating fluctuations of $\beta$, which is a central aspect of the Beck-Cohen superstatistics framework, allowing a statistically rigorous treatment without compromising physical interpretability. We demonstrate that the method converges to the usual maximum likelihood estimators by applying it to synthetic data. Our results suggest that EM offers a computationally efficient and conceptually clear alternative for inference in superstatistical systems with temperature fluctuations.
We study nonparametric estimation of Schrödinger bridge (SB) drifts from i.i.d.\ data observed on a single time interval. Starting from the conditional-ratio form of the Schrödinger bridge time-series (SBTS) drift formula, we analyze a direct Nadaraya--Watson plug-in estimator built from kernelized numerator and denominator terms. Unlike recent SB analyses based on entropic-OT potentials, Sinkhorn iterations, or iterative bridge solvers, our approach works directly at the drift level and isolates \emph{statistical error} from optimization, approximation, and discretization error. Under Hölder regularity, a marginal-density floor, and bounded support, we prove a uniform non-asymptotic bound for admissible bandwidth pairs, a pointwise CLT under genuine undersmoothing, and an adaptive bandwidth selector satisfying an oracle inequality. We also prove a pivot-local minimax lower bound which, through an explicit uniform pivot, yields a global minimax lower bound under transparent compatibility conditions; hence the adaptive selector is minimax-rate optimal up to logarithmic factors. Synthetic experiments provide theorem-targeted diagnostics for finite-sample scaling, Gaussian approximation, and adaptive behavior.
Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.
Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a {\it benign regularizer} that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.
High-dimensional functional data are becoming increasingly common in fields such as environmental monitoring and neuroimaging. This paper studies high-dimensional functional linear regression models that relate a scalar response to ultra-high-dimensional functional predictors, where each predictor is treated as a random element in an infinite-dimensional functional space. To address the dual challenges of high-dimensionality and model interpretability, we propose MoFI-FLR, a novel two-step estimation framework rooted in reproducing kernel Hilbert space (RKHS) theory. The first step employs a functional elastic-net penalty to screen out irrelevant covariates, while the second step decomposes each selected predictor's functional coefficient into an interpretable finite-dimensional simple component and an infinite-dimensional complementary complement. By penalizing only the complementary component, our method automatically distinguishes simple effects, which consist only of the simple component, from complex effects, which also include complementary deviations. Under mild regularity conditions, we establish non-asymptotic theoretical guarantees, demonstrating that MoFI-FLR consistently recovers the active covariates and accurately identifies their true functional forms. We develop a computationally efficient algorithm to implement the proposed method and evaluate its performance through comprehensive simulation studies and an application to Psychomotor Vigilance Task EEG data.
We formally introduce a class of models inspired by renormalization group (RG) theory, built on additive hierarchical expansions analogous to those appearing in functional ANOVA and mixed-effects models. Like ReLU convolutional neural networks, they are almost everywhere locally linear; unlike ReLU networks, their partition structure is explicit, interpretable, and easy to modify or constrain. In these models, one defines a multidimensional lattice partition of the input space and uses it to scaffold variations in regression parameters. Each dimension of the lattice corresponds to an attribute by which the statistics of the problem may vary. The parameters are themselves expressed in the form of an expansion, where each term captures variations relative to a lower (coarser) interaction scale. These models admit multiple equivalent interpretations: as piecewise GLMs, as hierarchical mixed-effects regressions, or as regression trees with structured parameter sharing. Since RG motivates the design of these models, we use techniques from statistical physics -- specifically replica analysis -- to study their generalization properties. Specifically, we analyze the behavior of the Watanabe-Akaike Information Criterion (WAIC) as a proxy for generalization loss. This analysis yields two practical results: (i) guidance on the lattice design as a function of dataset size and predictor dimensionality; and (ii) a principled scaling law for the regularization prior when adding higher-order terms to the expansion so that one can increase model complexity without an expected increase in generalization loss. We evaluate the methodology on public datasets and find performance competitive against both blackbox methods and other intrinsically interpretable approaches.
We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes (GPs) by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations, using deep neural architectures. Specifically, we model kriging coefficients and conditional standard deviations, deterministic quantities that uniquely characterize the covariance, providing stable and informative learning targets. Exploiting the permutation-equivariant structure of conditioning sets in the Vecchia factorization, we derive a universal representation for permutation-preserving functions and design neural architectures that respect this symmetry, leading to improved training stability and data efficiency. The proposed approach enables expressive, non-stationary kernel learning while maintaining computational scalability, thereby bridging classical GP methodology with modern deep learning.
Sparse regression based on global-local shrinkage priors are increasingly used for Bayesian modeling of modern high-dimensional data, but scaling up the Gibbs sampler for posterior inference remains a challenge. While much effort has gone into speeding up the high-dimensional coefficient update step, insufficient attention has been given to the potential poor mixing of the global scale parameter $\tau$ and of the overall sampler. One proposed remedy has been to marginalize out the coefficients when updating $\tau$. Here we show that, while this collapsed update was previously thought to require a Metropolis step, we can in fact sample directly and efficiently from the collapsed density. This is made possible by careful linear algebraic manipulations and a strategic per-Gibbs-scan spectral decomposition, allowing subsequent evaluations of the collapsed density across hundreds of values of $\tau$ at negligible cost. We combine this computational trick with adaptive numerical integration and inverse transform sampling to construct a direct sampler. This eliminates the need to tune Metropolis proposals and yields faster convergence and improved mixing. We demonstrate our method on two big data applications, fitting logistic regression under the horseshoe prior to datasets with design matrices of size 120,000 x 1,379 and 1,980 x 17,848.
This paper introduces the Statverse, a Metaverse framework designed to revolutionize statistical education in the digital age. Our key goal is to report our progress and encourage others to integrate similar strategies into their programs. The proposed framework seamlessly integrates the physical and digital realms to provide an immersive environment for the nuanced representation of complex statistical concepts. Finally, we discuss the potential impact of Statverse on advancing Statistical Education, offering a transformative approach to teaching and learning in the digital age. Statverse is the outcome of an academic partnership between Universidad Técnica Federico Santa María (UTFSM) and the University of Edinburgh (UoE).
Machine-learning systems used in survey-based social measurement require uncertainty estimates that are reliable across population subgroups, not merely valid in aggregate. We study ordinal conformal prediction for five-level AI-attitude forecasting on the Pew American Trends Panel (Wave 152; n=4,591; 12 race x education subgroups), comparing standard split conformal, Mondrian (group-specific) conformal, and a regularized Mondrian comparator across 100 respondent-disjoint splits with survey-weighted evaluation. Standard conformal achieves nominal marginal coverage for all four base predictors but leaves weighted subgroup gaps of ~13 percentage points. For the strongest predictor (XGBoost), Mondrian worsens the fairness-efficiency trade-off: weighted set size rises by +0.036 (dz =1.66) while the weighted subgroup gap grows by +0.013 (dz =0.30). A regularized comparator that shrinks group thresholds toward the global quantile mitigates this instability (Delta gap = -0.001, Delta size = +0.012) but does not yield a decisive fairness gain. Failure analysis traces the mechanism to calibration-cell fragmentation interacting with group-specific confidence mismatch. The negative result persists across alternate outcome codings and subgroup granularities, demonstrating that nominal marginal validity is insufficient for subgroup reliability and that naive group-specific calibration is not a dependable fairness remedy in complex survey settings.
Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary for structure recovery. This observation motivates a support-level relaxation that searches for sparse triangular factors over a precision-support screening graph. The relaxed formulation can be efficiently evaluated via masked zero-fill incomplete Cholesky factorization, enabling scalable comparison of candidate orderings. At the population level, we establish soundness for Markov equivalence class (MEC) recovery under no-cancellation and sparsest Markov representation assumptions, as well as robustness to ordering misspecification. Motivated by these guarantees, we introduce SCOPE, a sparse-Cholesky pipeline that provides a scalable implementation of the relaxed formulation. Experiments on synthetic and real datasets demonstrate that SCOPE matches the MEC recovery accuracy of substantially slower baselines, while achieving significantly reduced runtime and scaling to 10k variables.
Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
Express transportation network design is uncertain because origin--destination demand, travel time, operating cost, hub congestion, and realized sorting productivity vary over time. Existing multi-topology express network models usually optimize cost and maximum arrival time under fixed input data, which may produce designs that are efficient nominally but fragile under demand surges, route disruptions, and hub productivity losses. This paper develops a Bayesian posterior-predictive framework for multi-topology express transportation network design. The model learns demand, travel-time, cost, and hub-reliability uncertainty from historical or benchmark-calibrated data and propagates them through posterior predictive scenarios. For fully connected, hub-and-spoke, restricted-allocation, and direct-link hybrid topologies, candidate designs are evaluated using posterior expected cost, conditional value-at-risk of maximum arrival time, service reliability, hub hold-time reliability, and emission-aware penalties. A Bayesian multi-structure design methodology is proposed using posterior simulation, sample-average approximation, topology-wise optimization, and Bayes-risk selection. Theoretical results establish existence of a Bayes-optimal design, convergence of posterior scenario risks, and stability of topology selection. Simulation and CAB benchmark experiments show that the Bayesian design can trade modest additional cost for substantial reductions in tail delivery risk and improved hub reliability.
Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, noisy observations. Classical smoothing methods for this problem are often limited by path degeneracy and poor scalability. In this work, we developed a novel method based on characterization of the posterior SDE in terms of conditional backward-in-time score defined as the gradient of a function solving a Kolmogorov backward equation with multiplicative updates at observation times. We learn this conditional score using neural networks trained to satisfy both the governing PDE and the observation-induced jump conditions, thereby integrating continuous-time dynamics with discrete Bayesian updates. The resulting score induces a posterior SDE with the same diffusion coefficient but a modified drift, enabling efficient posterior trajectory sampling. We further derive a likelihood-based objective for learning the SDE parameters, yielding an evidence lower bound (ELBO) for joint state smoothing and parameter estimation. This leads to a variational EM-style procedure, where the neural conditional score is optimized to approximate the smoothing distribution, followed by a maximization step over the SDE parameters using samples from the induced posterior. Experiments on nonlinear systems demonstrate accurate and stable inference with a very few observations demonstrating significant improved scalability compared to classical MCMC methods.
We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
Standard cardiovascular risk calculators, including the Framingham Risk Score and the ACC/AHA Pooled Cohort Equations, estimate the conditional probability P(CHD | SysBP = s) rather than the interventional quantity P(CHD | do(SysBP = s)). When confounding is present, this distinction has direct clinical consequences: observational estimates may systematically overstate the absolute benefit of antihypertensive treatment. We applied Pearl's do-calculus to the Framingham Heart Study Offspring Cohort (n = 4,240; primary analysis on 3,776 complete cases; 574 ten-year coronary heart disease events). A structurally corrected directed acyclic graph (DAG) was specified and evaluated using conditional independence testing. The average causal effect (ACE) of a 20 mmHg systolic blood pressure reduction was estimated by g-computation with bootstrap confidence intervals, corroborated by propensity score matching and inverse probability weighting. G-computation yielded an ACE of 3.40 percent absolute risk reduction (95 percent CI: 2.64 to 4.14), compared with a naive observational estimate of 4.14 percent, corresponding to an approximate 21.8 percent relative overestimation. Conditional average treatment effects were estimated using R-Learner and T-Learner metalearners. These findings suggest that observational cardiovascular risk tools may overestimate the absolute benefit of blood pressure reduction, with implications for clinical risk stratification and prescribing thresholds.
Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.
Differential item functioning (DIF) arises alongside latent population heterogeneity in many applications, and both must be accounted for when assessing measurement invariance. In many practical settings, however, the comparison groups are unobserved and anchor items are unknown. A further challenge is that item response theory models traditionally assume symmetric link functions, yet empirical response processes may exhibit substantial asymmetry. This paper proposes a general framework for jointly analysing impact and DIF under asymmetric item response models. Unobserved group differences are represented by latent classes within a mixture item response model, while item-specific shifts capture DIF effects. Assuming the number of DIF items is relatively small, an $\ell_1$-regularised estimator is used to simultaneously identify the latent classes and select DIF items without requiring observed group labels or pre-specified anchor items. A simulation study evaluates recovery of impact, item parameters, and DIF effects across a range of configurations. The method is illustrated using two empirical applications from educational testing. In one dataset, the selected model reveals both impact and item-level DIF, whereas in the other, the results indicate substantial impact but little evidence of item-level DIF.
Gaussian process marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present two complementary RFF-based methods forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery. The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the n x n kernel Gram matrix with a finite-dimensional feature representation, reducing cost to O(nm^2 + m^3) while retaining the probabilistic interpretation and automatic complexity penalty of the exact score. FFML extends to mixed (continuous + discrete) parent sets via a product-kernel construction, with a Kronecker path for small discrete parent sets and a Hadamard-product path otherwise. The Fourier Feature Conditional Independence (FFCI) test is a fast nonparametric CI test for mixed data. Each variable is featurized individually: continuous variables via RFF or Orthogonal Random Features (ORF), discrete variables via a Cholesky-factored categorical feature map, with blocks concatenated. Conditioning uses ridge residualization in feature space; the test statistic is a Frobenius norm of the residualized cross-covariance, approximated as a weighted sum of chi-squared variables. Although FFML and FFCI share the same RFF/ORF machinery, they differ architecturally: FFML builds a joint kernel over a parent set for scoring, while FFCI featurizes variables individually for testing. We compare FFML to TRFF, a penalized Student-t regression alternative. Empirically, BOSS+FFML outperforms linear and kernel-ridge baselines on nonlinear data. When run through the same PC-Max implementation, FFCI and RCIT exhibit complementary precision-recall profiles: RCIT is more precise while FFCI achieves better recall and lower SHD, and runs in one third the time.
The discrete Pareto (or Zeta, Zipf) distribution, arises naturally in modeling rank-frequency data across diverse fields such as linguistics, demography, biology, and computer science. Despite its widespread applicability, goodness-of-fit testing for the discrete Pareto distribution remains underdeveloped, particularly in the presence of heavy tails and infinite support. This article introduces a novel goodness-of-fit test based on a new Stein-type characterization of the discrete Pareto distribution, formulated using its probability generating function. The proposed method is applicable even when the shape parameter is unknown and avoids binning or smoothing techniques. We study the asymptotic properties of the test and assess its empirical size and power through extensive simulation experiments. The results show that the proposed test either outperforms or matches the performance of existing method across various alternatives. Applications to real datasets are provided to demonstrate its practical relevance and robustness.
The role of AI-generated synthetic data has recently been expanded to support realistic Monte Carlo simulations. However, guidance is limited on generating data with multilevel structures and designing simulations based on such data. This study proposes a general framework for AI-based simulation studies to evaluate the predictive performance and parameter recovery of quantitative methods, specifically using multilevel data commonly observed in the social sciences. Our proposed six-stage workflow consists of (i) specifying a method and real data, (ii) training Generative AI with real data, (iii) assessing synthetic data quality, (iv) designing and conducting simulations, (v) evaluating method performance, and (vi) checking robustness. To enhance fidelity in multilevel data generation, we also introduce targeted modifications to diffusion models and Generative Adversarial Networks (GANs). Furthermore, we develop a systematic quality evaluation framework that assesses both within-table and between-table fidelity, and discuss how AI-based simulation designs should differ depending on whether the simulation's objective is predictive performance or parameter recovery. Finally, using empirical multilevel data and multilevel modeling methods, we demonstrate the utility of the proposed AI-based simulation framework. This approach leads to more accurate and honest evaluations of quantitative methods in the real world, unlike traditional simulation studies based on arbitrary simulated scenarios.
We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.
In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum norm generalization error of both continuous and discrete kernel gradient flows under the source condition $s>\alpha_0$, where $\alpha_0\in(0,1)$ denotes the embedding index of the kernel function. Moreover, we show that these rates match the minimax optimal rates. Building on this result, we then construct simultaneous confidence bands for both continuous and discrete kernel gradient flows. Notably, the widths of the proposed confidence bands are also optimal, in the sense that their shrinkage rates are greater than, while can be arbitrarily close to, the minimax optimal rates.
Double machine learning (DML) delivers valid inference on low-dimensional causal parameters while permitting flexible nuisance estimation, but its computational cost becomes prohibitive once cross-fitted learners must be trained on massive observational data. Applying DML to a uniformly drawn subsample alleviates this burden, yet such a reduction disregards the geometry of the covariate space and can exacerbate treated-control imbalance as well as overlap deficiency. We propose Uniform Design Double Machine Learning (UD-DML), a design-based subsampling strategy for average treatment effect (ATE) estimation. UD-DML first constructs a low-discrepancy skeleton in a PCA-rotated covariate space under the mixture-discrepancy criterion, and then assigns, to each skeleton point, the nearest treated and control units via KD-tree search. The resulting matched subsample is, by construction, both representative of the full covariate distribution and balanced across treatment arms; cross-fitted DML is subsequently applied to it. We establish discrepancy-based guarantees for representativeness and balance, and prove that the UD-DML estimator is $\sqrt{r}$-asymptotically normal under mild conditions, where the selected subsample size $r \ll n$. The dominant nuisance-fitting cost is thereby reduced from the $n$-scale to the $r$-scale. Monte Carlo experiments show that UD-DML attains lower RMSE, narrower confidence intervals and more reliable coverage than uniform subsampling, with the largest gains in low-overlap and misspecified regimes. An application to a large observational dataset further demonstrates its practical feasibility.
We propose a new constrained EM algorithm that is applicable to general constrained estimation problems. The proposed method is based on a novel framework, the `dual-homotopy framework,' which combines deterministic annealing EM with a barrier-based optimization, enabling stable estimation under parameter constraints. Building on this framework, we further introduce an adaptive constrained EM algorithm that preserves likelihood monotonicity, regardless of the underlying distributional form or the specific structure of the constraints. Through simulation studies and a real-data analysis, both under parameter constraints, we demonstrate that the proposed algorithm yields more stable and accurate estimates than existing methods, including the standard EM algorithm.
Algorithms in machine learning and AI do critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures, which are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks depending on the product of the output values $y_i$ and the predictions $f(x_i)$ as well as distance-based loss functions depending on the difference of $y_i$ and $f(x_i)$ for regression are common. Distance-based loss functions are in particular useful, if an additive model assumption seems plausible, i.e. the common signal plus noise assumption. However, in the literature, several loss functions proposed for regression purposes have a multiplicative error structure in mind and pay attention to relative errors, i.e. to the ratio of $y_i$ and $f(x_i)$. In this survey article, we systematically investigate such ratio-based loss functions and propose a few new losses, which may be interesting for future research. We concentrate on investigating general properties of ratio-based loss functions like continuity, Lipschitz-continuity, convexity, and differentiability, because these properties play a central role in most machine learning algorithms. Therefore, we do not focus on some specific machine learning algorithm to derive universal consistency, learning rates, or stability results. Instead, we want to enable future research in this direction.
We propose a framework for determining whether the causal dependence of an outcome $Y$ on a covariate $X$ changes at a given time point, given confounders $\boldsymbol{Z}$. For instance, in financial markets, the effect of a market indicator on asset returns may causally change over time. While many existing measures of association can be used to detect changes in joint and marginal distributions, in the absence of strong assumptions on the data generating process none are suitable for detecting changes in the causal mechanism or in the strength of causal relationship. In this work we approach the problem from a fully non-parametric perspective, and treat the causal mechanism as well as the distribution of the data as unknown. We introduce a quantity based on the integrated difference between kernel mean embeddings of certain conditionals copula, which is provably equal to zero if the causal dependence does not change and strictly positive else. A near-linear time estimator for the quantity is proposed, with rates of convergence explicitly spelled out. Extensive experiments demonstrate that the proposed statistic achieves high accuracy on multiple synthetic and real-world datasets. We additionally show how the proposed statistic can be used for change point detection when the goal is to detect changes in causal dependence occurring at an unknown times.
In placebo-controlled randomized trials, the post-randomization use of concomitant medications may be higher in the placebo arm than in the treatment arm. This may dilute the full benefits of the randomized drug as estimated by the intention-to-treat analysis. We focus on cardiovascular outcomes trials in type-2 diabetes patients of glucose-lowering treatments where patients in the placebo arm are more likely to add other glucose-lowering agents with established cardio-protective properties. As a supplement to the intention-to-treat analysis, we propose a class of estimands within a causal framework that isolates the specific impact of the treatment being studied from that of concomitant treatment use. These estimands are defined under time-dependent treatment interventions to balance exposure to additional medications across intervention arms. We advocate for specific stochastic interventions to achieve this balance while minimizing positivity violations, which arise when certain treatment combinations or characteristics are not sufficiently represented in the data. We employ targeted minimum loss-based estimation (TMLE) to optimize the estimation procedure for our estimands while allowing for flexible adjustments for time-dependent covariates from follow-up visits. Finally, we demonstrate the application of the methods through a simulation study and a real-world example from the LEADER cardiovascular outcomes trial, which assessed cardiovascular risk for liraglutide versus placebo.
Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove an category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.
Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that predictions be independent of the protected attributes, but are overly restrictive when these attributes influence mediating variables that are considered business necessities. Recent causal formulations relax SP by distinguishing allowed from not-allowed causal paths and by complementing SP with Predictive Parity (PP), requiring the predictor to replicate the legitimate influence of business-necessities. Existing path-based definitions are mainly practical when applied to categorical attributes. This paper introduces a new framework for fairness in structural causal models that is tailored to continuous protected attributes. We formalize SP and PP through path-specific partial derivatives, establish conditions under which these criteria coincide with prior causal definitions, and characterize when a fair predictor, one that satisfies SP along not-allowed paths while achieving PP along allowed paths, exists. Building on this theory, we propose a fair tuning algorithm that either constructs such a predictor or, when not possible, allows for a trade-off between SP and PP. We present experiments on simulated and real data to evaluate our proposal, compare it with previously proposed methods, and show that it performs better when PP is considered.
Increasing evidence suggests that variability in longitudinal biomarkers, in addition to their mean trajectory, carries prognostic information for time-to-event outcomes. However, standard joint models typically capture only the expected value of the biomarker process, assuming constant residual variability across individuals and time. Fully joint extensions that model within-subject variability exist but are computationally demanding and require dedicated software packages. We propose a flexible two-step approach for incorporating biomarker variability into joint models. First, residuals (or their transformations) from a mixed-effects model are used to derive subject- and time-specific measures of variability. Second, these variability measures are included in a standard joint model, allowing their association with survival to be estimated alongside the mean biomarker trajectory. Our approach can also accommodate multiple biomarkers simultaneously and is readily implemented using existing joint modeling software without custom extensions. Through simulations, we show that our method provides reasonable performance for variability effects across a range of scenarios. We further illustrate our approach using longitudinal data of white blood cell counts from a large phase III glioblastoma trial, demonstrating that both mean levels and variability of hematological markers carry prognostic information for overall survival.
Integrating non-probability samples into finite-population inference typically requires modeling unknown selection probabilities under a missing-at-random (MAR) assumption that is difficult to verify. We propose a design-based alternative in which the non-probability sample is treated as a fully observed certainty stratum and a probability sample is drawn only from the complementary, previously unsampled units. Within this sequential framework, we develop two generalized regression estimators: one fitting the outcome model separately in the complementary stratum, the other pooling both samples; we make two distinct contributions. First, both estimators are design-consistent and admit consistent variance estimators with no assumption whatsoever on the non-probability selection mechanism, including under not-missing-at-random (NMAR) selection. Second, under a working superpopulation model that holds in both strata, the pilot non-probability sample can be used to construct second-stage inclusion probabilities that achieve Isaki-Fuller asymptotic optimality for the separate estimator; this optimality claim relies on assumptions strictly stronger than MAR, but its failure does not invalidate the consistency results above. A diagnostic test for coefficient homogeneity is proposed to guide the choice between the two estimators. Simulations confirm that the sequential estimators remain essentially unbiased under both MAR and NMAR, while propensity-adjusted competitors can be severely biased under NMAR. Two applications from Lithuanian official statistics illustrate that separate regression is preferable when the pilot stratum and its complement are strongly heterogeneous, whereas combined regression offers a modest efficiency gain when the two strata are similar.
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
Robins and Richardson (2010) reformulated mediation analysis by decomposing treatments into multiple components and examining separable effects of each component. While this approach is increasingly popular, existing work has analyzed ``two-arm'' data, where components are strictly bundled and manipulated simultaneously. However, in practice, four-arm data where components are assigned independently are often available. For example, testing accommodations might strictly bundle extra time with a separate session or allow them to be assigned separately. To address this distinction, we propose a general framework for analyzing separable effects in four-arm and two-arm designs. This framework provides distinct identification and estimation strategies for each design. For estimation, we utilize efficient influence function estimators coupled with machine learning and cross-fitting techniques. Additionally, we introduce two falsification tests for key identification assumptions required in the two-arm design by leveraging four-arm data. We investigate the performance of the proposed estimators via a simulation study and demonstrate their application by studying the effect of extended time accommodations using data from the National Assessment of Educational Progress. Ultimately, this separable effects analysis enables practitioners to clearly communicate underlying mechanisms and derive informative policy recommendations.
Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at this https URL.
Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.
In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
Langevin sampling from distributions of the form $p(x) \propto \exp(-\Psi(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $\Psi$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of $\Psi$. However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose the TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients, and only locally Lipschitz drifts, which has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.
We propose a joint individualized hurdle-ordinal regression model for paired zero-inflated ordinal outcomes with subject-specific, spatially varying, and time-varying covariate effects, motivated by the Iowa Fluoride Study (IFS). The two outcomes, dental caries and dental fluorosis, are measured repeatedly across ages at fine spatial resolution, yielding nested longitudinal data with substantial zero inflation, ordinality, and heterogeneity across individuals and locations. For each outcome, a hurdle component models disease presence, while a proportional-odds component models severity among positive observations. To parsimoniously represent the high-dimensional coefficient arrays, we introduce a linked Tucker tensor factorization. Shared subject-mode factors induce dependence between the caries and fluorosis coefficient tensors, while separate spatial factors accommodate the distinct measurement grids of tooth surfaces and tooth zones. A horseshoe prior on the core tensor elements encourages sparsity, and posterior computation is performed using the No-U-Turn Sampler in NumPyro. Population-level effect summaries are obtained by projecting individualized posterior linear predictors onto the design space, and Wasserstein barycenters aggregate these summaries across tooth locations and anatomical classes. Applied to the IFS, the model reveals spatially heterogeneous associations between early-life fluoride and dietary exposures and both outcomes. Fluoride exposure is associated with increased odds and severity of fluorosis, while soda intake consistently increases caries risk. These associations differ between presence and severity components and vary across tooth locations, ages, and subpopulations defined by prior caries status, highlighting the importance of the joint hurdle-ordinal framework for disentangling disease occurrence from disease progression in multilevel dental data.
Weekly healthcare activity data are typically non-negative counts with temporal dependence and occasional system-wide disruptions, settings in which Gaussian time-series models may be inadequate. Solid organ transplant (SOT) activity provides a representative case study of a count process affected by a large external shock. We analyse weekly SOT counts in the USA and Italy from 2014 to October 2024, stratified by donor type (deceased vs living) and organ (kidney and liver). We fit Poisson and negative-binomial count time-series models incorporating short-term dynamics, calendar effects (holiday weeks), and pre-specified pandemic-period level and/or slope indicators. Candidate specifications are screened within a pre-defined portfolio and selected using BIC within each training window. Forecasting performance is evaluated with an expanding-window design at horizons $h\in\{4,8,12\}$ weeks. Alongside RMSE, we report empirical coverage of nominal $95\%$ predictive intervals and interval widths to summarise calibration and forecast uncertainty. Across strata, selected models capture substantial pandemic-period deviations and varying post-period trajectories. Deceased-donor series are broadly consistent with a return towards pre-pandemic baselines in both countries, whereas the US living-donor series shows a more gradual convergence in this application. Within the explored model class and validation protocol, auxiliary covariates representing COVID burden and mortality add limited incremental predictive contribution beyond autoregressive and calendar components. Our analysis shows that donation time series represent an unconditional phenomenon, with auxiliary variables having a statistically negligible impact on donations, thus allowing a focus on more practical aspects related to ongoing challenges in the post-pandemic era, such as hospital overloads and changes in public perception.
Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.
Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model's own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.
The Plackett-Luce model is widely used to deal with probabilities in discrete choice settings. This paper introduces a novel two-level Plackett-Luce model combined with a multinomial logistic scheme that provides the basis for the route choice module in a smart mobility platform. For this, we develop Bayesian inference and prediction mechanisms to capture consumers' preferences for personalized route recommendations. The model is empirically tested, allowing for refinements and discussion of its applicability. We also illustrate its practical relevance through several use cases, including relevant route selection, coordinated car pooling, incentive design and synthetic data generation.
The problem of optimal dosage estimation arises in diverse scientific domains, from pharmacology and toxicology to aquaculture and environmental studies. Statistical modeling of nonlinear dose-response relationships is essential to quantify biological effects and determine response-optimal levels. This paper introduces a flexible Bayesian fractional polynomial (BFP) framework for modeling such relationships, allowing for model uncertainty quantification and robust prediction through Bayesian model averaging. Extensive simulation results demonstrate that the proposed BFP approach yields accurate estimation of optimal dose levels, outperforming benchmarks significantly. The approach is demonstrated on real data from fish nutrient requirement experiments.
Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of \textbf{con}volution-smoothed \textbf{qu}antil\textbf{e} \textbf{R}eLU neural \textbf{net}works, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.
Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.
When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student's t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using $\gamma$-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $\Omega$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $\Omega$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.
Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models-and potentially exacerbate disparities-remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order-consistently favoring higher-variance classes-while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.
Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $\tau$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $\tau$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $\tau$-mixing data. Moreover, we derive the sample complexity of DQN under $tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
We study the minimax estimation of covariance eigenfunctions and eigenvalues in functional principal component analysis when $n$ trajectories are observed at $p$ common grid points with additive noise. We consider covariance kernels with arbitrary Hölder smoothness and no prescribed parametric decay of the eigenvalues. In this setting, kernel smoothness and local spectral separation play distinct roles: a minimax inconsistency result over the smoothness-only class shows that kernel regularity alone is not sufficient for minimax-consistent eigenfunction estimation. To capture this interplay, we introduce a class of processes that jointly controls the Hölder smoothness of the covariance kernel and a local relative inverse eigengap quantity at the target index $\ell$. Over this class, we derive non-asymptotic minimax lower bounds for eigenfunction estimation that disentangle sampling variability, discretization and spectral effects, revealing rates of order $\delta_\ell n^{-1}+p^{-2\alpha}$, where $\delta_\ell$ quantifies the spectral difficulty. We also obtain non-asymptotic lower bounds for eigenvalue estimation under a relative squared-error loss. We then construct a computable wavelet projection estimator based on Coiflet scaling functions and a quadrature scheme designed to accommodate arbitrary Hölder smoothness. For eigenfunction estimation, this estimator matches the minimax dependence on the sample size and grid resolution, up to the natural spectral factor, for any Hölder index $\alpha>0$. Finally, we show that the proposed framework covers several classical Gaussian processes and Karhunen--Loève constructions. In particular, a Karhunen--Loève based criterion links spectral decay, eigenfunction regularity and covariance-kernel smoothness, and yields controlled simulation settings illustrating the predicted phase transitions and least-favourable discretization effects.
Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neural-actuarial framework that combines Hierarchical LSTM networks with a Mean-Bias Correction (MBC) anchoring mechanism. Positioned as a governance-friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out-of-sample validation (2012-2020): it outperforms Li-Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near-linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP-based cross-country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.
Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where switching to the oracle fallback policy yields a large reduction in conditional violation risk. At the finite-sample level, given a fitted fallback policy and score, we develop a post-processing algorithm that uses calibration data to select a threshold. Leveraging tools from algorithmic stability and stochastic processes, we show that under regularity conditions, in the i.i.d. setting, the expected excess risk of the post-processed policy is $O(\log n/n)$. In the special case when an exact-safe fallback policy is available, the algorithm achieves precise expected risk control under exchangeability. In this setting, we also give high-probability near-optimality guarantees on the post-processed policy. Experiments on a COVID-19 radiograph diagnosis task, an LLM routing problem, and a synthetic multiclass decision task show that targeted post-processing can meet or nearly meet risk budgets while preserving substantially more agreement with the baseline than score-blind random mixing.
In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistical inference methods often depend on strict identifying assumptions (such as surrogacy, covariate/label shift, or missingness assumptions). These assumptions can be difficult to validate and may be violated by various additional sources of distribution shift, potentially leading to biased parameter estimates and miscalibrated uncertainty quantification. We introduce an estimate-level framework, inspired by domain adaptation techniques, to empirically calibrate proxy-based inference. This framework models the proxy-primary metric discrepancy as a random effect at the parameter level, estimating its distribution from aggregated historical observations across past domains (e.g., experiments, time periods, or distinct segments). This method avoids the requirement for retaining individual-level response data. Additionally, this adjustment can be layered on top of existing proxy-correction methods (such as prediction-powered inference or importance weighting) to account for additional biases not addressed by those corrections. To manage uncertainty when the number of historical domains is limited, we provide both a method-of-moments estimator and a domain bootstrap procedure. We further validate this approach using publicly available datasets and real-world experiments.
This work has two major parts. First, we extend the recent study of Pham et al. (2025) on point estimation of the association parameter of a bivariate Frank copula. We investigate two Bayes estimators under the generalized flat prior and the Jeffreys prior, and compare them with the maximum likelihood estimator (MLE). Simulation results show that, for small sample sizes (n <= 25), the Bayes estimator under the Jeffreys prior uniformly outperforms both the generalized flat prior estimator and the MLE in terms of mean squared error (MSE). For moderate and large sample sizes, all estimators have very similar performances in terms of bias and MSE. We also discuss computational issues in the R package implementation that may significantly affect the computation of the MLE for very small samples. In the second part, we apply the Frank copula to analyze the association between groundwater arsenic concentration and other hydrochemical variables using a recent dataset from Vietnam. We revisit the goodness-of-fit tests proposed by Genest et al. (2006), investigate several non-intuitive behaviors of the test statistics, and provide extensive simulated critical value tables. Our results complement and refine the computational findings reported in the earlier literature.
Anytime-valid tests allow evidence to be checked during data collection: one can either continue testing or stop and reject the null while still controlling type-I error. Yet, in many applications rejection is useful only if it comes soon enough. We introduce a time-sensitive testing-by-betting framework that favours early rejection by assigning rewards to rejection times and maximising their expected value under a given alternative. This encompasses hard deadlines and softer time preferences. The resulting optimal control problem admits a Bellman representation in terms only of time and evidence against the null, rather than the full history. For hard deadlines, the simple-vs-simple case reduces to a finite-horizon Neyman--Pearson problem and identify the corresponding optimal e-process. Furthermore, we show that exponentially decaying rewards admit a stationary approximation, yielding the exponential-decay-optimal (EDO) criterion: a finite-time-scale counterpart to the classical growth-rate-optimal (GRO) viewpoint in anytime-valid statistics, with the GRO criterion recovered in the large-time-scale limit.
Tree-based regression models are widely used in supervised learning, with the Classification and Regression Tree (CART) algorithm serving as a standard reference. CART construction involves solving a sequence of split-selection optimization problems. For categorical predictors, this problem can be formulated as a combinatorial fractional optimization problem. This structure makes the exact optimization computationally challenging and leads to standard implementations that rely on greedy heuristics, which may result in suboptimal splits. In this work, we reformulate this fractional problem and apply Dinkelbach (1967) algorithm to convert it into a Quadratic Unconstrained Binary Optimization (QUBO) problem. Using state-of-the-art QUBO solvers, we obtain QUBO-based regression trees with predictive performance comparable to standard CART while yielding higher-quality split solutions. These results highlight the potential of QUBO formulations for improving tree-based learning methods and open perspectives for future hybrid classical--quantum implementations.
In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.
Since its introduction by Fisher, the method of hypothesis testing that relies on computing error probabilities has witnessed several developments. Perhaps the most significant development was the seminal contributions of Neyman and Pearson who brought in the concept of the alternative hypothesis with its corresponding error of the second kind. Significance tests have played a major role in various scientific and technological developments, but not without controversies. Although originally cast as frequentist approaches, Bayesian ideas have been incorporated into significance tests, widening access to them. The quantities central to computations of error probabilities are the sampling distributions, which can be computed even without thresholds or alternative hypotheses. Even though Fisher used the significance threshold of 0.05 in his calculations, he cautioned against prescribing any specific threshold. Recently, there have been calls for reformation in practice with regard to the almost standard use of the significance threshold of 0.05, prepublication confirmatory studies, the dichotomous consideration of the null and alternative hypothesis and abandoning significance tests altogether in favour of other approaches such as confidence intervals and Bayesian decision theory. In this paper, we examine these calls for reform and unearth their strengths and short comings.
Existing conformal prediction methods for time-to-event outcomes leverage only baseline covariates, producing prediction intervals that are insufficiently informative to facilitate decision making. We propose History-Aware Prediction Sets (HAPS), a conformal framework that constructs prediction sets for individual event times using covariate histories observed up to a decision time, targeting coverage among individuals who have survived to this time. HAPS handles right censoring adjusted for time-varying confounders via inverse probability of censoring weighting. When the censoring weights are consistently estimated, it achieves PAAC (probably asymptotically approximately correct) coverage among survivors. We further propose two doubly robust extensions of HAPS to weaken reliance on consistent estimation of the censoring distribution. In simulations, HAPS and its extensions reduce median prediction interval length by up to 75\% relative to baseline comparators while maintaining close to nominal coverage. On two public benchmark data sets, HAPS reduces the median interval length by up to 60\% for predictions at year 5, compared to the baseline comparators.
Recent advances in biomedical research have identified an increasing number of biomarkers associated with heterogeneity in patient responses to medical treatments. When a treatment is suspected to benefit certain patient subpopulations, adaptive enrichment designs may be more efficient and ethical. In such designs, an interim analysis is incorporated during the trial to select patient subpopulations for which the experimental treatment appears promising, according to predefined subpopulation selection rules. However, data-dependent selection can induce selection bias, causing conventional maximum likelihood estimators (MLEs) to overestimate the treatment effect in the selected patient subgroup. Existing inference methods for addressing this bias are typically rule-specific, highlighting the need for an estimation framework that accommodate a broader class of subpopulation selection rules. In this work, we define a general class of subpopulation selection rules based on the sample space partition condition and provide a systematic derivation that yields a unified formula for the Uniformly Minimum Variance Conditional Unbiased Estimator (UMVCUE). This generality allows our formulation to encompass a wide spectrum of adaptive enrichment designs, eliminating the necessity for case-specific derivations for each new design. Extensive simulations confirm the unbiasedness of the proposed UMVCUE, ensuring that therapeutic benefits are not overestimated. By bridging the gap between flexible interim subpopulation selection and rigorous statistical inference, our framework has the potential to facilitate the implementation of diverse subpopulation selection rules with greater ease in real-world trials and promote more efficient and ethical drug development.
Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sampler learns which covariates are most prognostic across successive batches; selected covariates then drive rerandomization and regression adjustment to reduce batch-level average treatment effect variance. Our primary theoretical contribution is a decoupling result: adaptive covariate selection based on past batches preserves batch-level randomization validity, and the cumulative inverse-variance weighted estimator achieves at least nominal asymptotic coverage. We further derive a Bayes risk bound for the acquisition layer that matches the minimax lower bound up to logarithmic factors. Empirically, DARTS systematically concentrates the budget on informative features, significantly closing the efficiency gap to oracle designs while maintaining strict inferential validity.
Covariate adjustment is a general method for improving precision when estimating treatment effects in randomized trials and is recommended by the FDA in its 2023 guidance when baseline variables are prognostic for the primary outcome. We focus on a method highlighted in that guidance called ``standardization" (or ``g-computation") for estimating the marginal treatment effect. We address the question of how to reliably estimate variance for binary outcomes when marginal outcome probabilities are close to 0 or 1. We propose an influence function-based leave-one-out cross-validated (IF-LOO) variance estimator for the standardized difference-in-means average treatment effect. Through simulation studies, we show that this estimator provides appropriate type-I error control and performs reliably in challenging settings where existing methods can yield inflated type-I error or fail entirely, such as when outcome events are rare or sample sizes are small. In addition to having desirable statistical properties, we derive a closed-form expression for the proposed estimator, enabling straightforward and reliable implementation by study statisticians. The robust finite-sample performance and ease of implementation suggest the IF-LOO variance estimator is a prudent default choice for standardization in clinical trials.
Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
The main XAI attribution methods for deep neural networks -- GradCAM, SHAP, LIME, Integrated Gradients -- operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q,mu) admits a unique canonical representation (Q, w, Delta), proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/sqrt(m))+O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Mobius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762+/-0.109 and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.
Active feature acquisition (AFA) considers prediction problems in which features are costly to obtain and the learner adaptively decides which feature values to acquire for each instance and when to stop and predict. AFA can be formulated as a partially observable Markov decision process (POMDP), which naturally admits a sequential decision-making perspective. In this paper, we present non-myopic pathwise policy gradients (NM-PPG), a new AFA method built around this formulation. We introduce a continuous relaxation of the acquisition process that enables pathwise gradients through the full acquisition trajectory, avoiding the high variance of standard score-function policy gradients while allowing end-to-end optimization of a non-myopic acquisition policy. To better align training with deployment, we further develop a straight-through rollout scheme that follows hard feature acquisitions in the forward pass while backpropagating through the corresponding soft relaxation in the backward pass. We stabilize optimization with entropy regularization and staged temperature sharpening. Experiments on both synthetic and real-world datasets demonstrate that NM-PPG yields superior performance relative to state-of-the-art AFA baselines.
Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.
Counterfactual utilities evaluate decisions not only by the realized outcome under a given decision, but also by the counterfactual outcomes that would arise under alternative decisions. By generalizing standard utility frameworks, they allow decision-makers to encode asymmetric criteria, such as avoiding harm and anticipating regret. Recent work, however, has raised fundamental concerns about the coherence and transitivity of counterfactual utilities. We address these concerns by extending the von Neumann-Morgenstern (vNM) framework to preferences defined on the extended space of all potential outcomes rather than realized outcomes alone. We show that expected counterfactual utility satisfies the vNM axioms on this extended domain, thereby admitting a coherent preference representation. We further examine how counterfactual preferences map onto the realized outcome space through menu-dependent and context-dependent projections. This axiomatic framework reconciles apparent inconsistencies highlighted by the Russian roulette example in the statistics literature and resolves the well-known Allais paradox from behavioral economics. We also derive an additional axiom required to reduce counterfactual utilities to standard utilities on the same potential outcome space, and establish an axiomatic foundation for additive counterfactual utilities, which satisfy a necessary and sufficient condition for point identification. Finally, we show that our results hold regardless of whether individual potential outcomes are deterministic or stochastic.
We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.
These notes provide a pedagogical introduction to the role of transversality theory in the analysis of statistical degeneracies within the framework of distributional statistical models. The classical question of when a statistical model is well-behaved - in the sense of being identifiable, having non-singular Fisher information, and admitting robust estimation - is reformulated as a question about the geometry of a kernel-induced feature map. Statistical pathologies correspond to geometric degeneracies of this map, and transversality theory provides a precise language for understanding when and why such degeneracies are non-generic. The exposition is organised in three parts. Part I surveys the statistical phenomena that motivate the geometric treatment: representation failure, non-identifiability, moment indeterminacy, singular information, nuisance parameters, and the Behrens-Fisher problem. Part II develops the necessary geometric toolkit - smooth maps, Sard's theorem, transversality, jets, stratifications, and the parametric transversality theorem - at a level accessible to students with a background in analysis and linear algebra but no prior exposure to differential topology. Part~III returns to the statistical problems of Part~I and shows how each one admits a unified geometric interpretation as a transversality condition on the feature map. These notes are a pedagogical companion to the research paper Labouriau (2026) "Transversality and Geometric Regularisation in Distributional Statistical Models" (arXiv:2605.04536 [math.ST]), expanding its arguments with motivating examples, geometric intuition, and exercises aimed at advanced Master's and PhD students with a background in mathematical statistics and measure theory. They are designed to support seminars or reading groups.
Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge functions from latent visualizations into faithful, temporally grounded explanations. Built on a gated residual KAN that decomposes forecasts into a linear base and a sparsely activated KAN correction, the framework (i) maps each edge to input lags via output-aware attribution, (ii) ranks edges by learned activation range, and (iii) validates faithfulness through edge-level interventions including zeroing and spline removal. Removing the learned B-spline component while retaining the base SiLU term degrades forecasts, providing evidence that the spline shape itself carries predictive value beyond the base activation. On four synthetic regimes of increasing complexity, the learned gate opens progressively wider as signal complexity grows. On regime-switching signals, gated KAN achieves 59% lower MSE than linear-only models. Across eight benchmarks, the gated architecture is competitive with linear, attention, and MLP alternatives, while providing interpretable edge functions that MLP-based corrections cannot offer.
Kernel quadrature can exploit RKHS spectral structure and outperform Monte Carlo on smooth integrands, but optimized quadrature weights are generally signed and may be numerically unstable. We study whether spectral acceleration remains possible when the weights are constrained to be positive, i.e., simplex weights. In the exact-target fixed-pool setting, an evaluated i.i.d. candidate pool of size $N$ is already available and the task is to reweight it so as to approximate the kernel mean embedding. We show that this positive reweighting problem is governed not by the equal-weight empirical average, but by the random convex hull generated by the pool. Our main geometric result shows that the mean of a bounded $d$-dimensional random vector can be approximated by a convex combination of $N$ i.i.d. samples at accuracy $O(d/N)$ with high probability, sharper than equal-weight averaging in the fixed-dimensional regime. We transfer this $d$-dimensional convex-hull approximation to full RKHS worst-case error through an augmented Mercer-truncation argument. The resulting positive-weight KQ bounds consist of a spectral tail term and a finite-sample convex-hull term, yielding Monte-Carlo-beating rates in favorable spectral regimes, including near-$O(1/N)$ rates up to logarithmic factors under exponential spectral decay. We also provide a constructive Frank--Wolfe algorithm that operates directly on the pool atoms, maintains simplex weights, and admits an explicit optimization-error bound.
Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bias. Existing methods are largely limited to point estimates and lack the capacity for distribution modeling. In this work, we propose RepFlow, a novel framework that formulates causal effect estimation as a joint optimization problem integrating representation learning with Conditional Flow Matching (CFM). RepFlow mitigates selection bias by minimizing the entropically regularized Wasserstein distance between treated and control representations. To enhance numerical stability, we further introduce an $L_2$ normalization constraint on latent representations. This balanced representation enables the flow model to accurately capture the distribution of potential outcomes. Extensive experiments across a wide range of benchmarks demonstrate that RepFlow consistently outperforms existing methods in both point and distributional causal effect estimation.
Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dimension, while in online regret bounds, the corresponding penalty is $\sqrt{\gamma_n}\,n\varepsilon$, where $\gamma_n$ is the maximum information gain after $n$ rounds of interaction. In this work, we show that, for a large class of kernels, the misspecification amplification can be reduced to logarithmic or polylogarithmic growth. In the offline setting, we first prove high-probability simple-regret bounds whose misspecification term is governed by a spectral Lebesgue constant. This yields logarithmic amplification for one-dimensional monotone spectra and polylogarithmic amplification for multivariate Fourier-diagonal product kernels. In the online setting, we modify a domain-splitting algorithm and prove a cumulative regret bound of $\widetilde{\mathcal O}(\sqrt{\gamma_n n}+n\varepsilon)$ under mild localized eigendecay assumptions, removing the extra $\sqrt{\gamma_n}$ factor from the misspecification term. The common principle is localization: spectral localization controls the Lebesgue constant of the offline approximation operator, while domain splitting implements the spatial analogue of this mechanism in the online setting, preventing local misspecification errors from being amplified globally.
We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $\Theta(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{\tau\ln(1/\tau)}$ at true error $\tau$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures the impact of new agents on the system, and \emph{global dynamic regret} compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non-iterative estimators via projection, for the classification. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi-DMR. All data and codes are available here: this https URL
We propose a scalable and theoretically grounded low-rank conditional expectation model for recursive Monte Carlo optimal stopping problems, in particular American option pricing. Our method reformulates the estimation of continuation values as a learning problem in a reproducing kernel Hilbert space, in which the conditional expectation is represented as a linear operator acting on future payoffs. This perspective yields an offline-online decomposition: the operator is learned once from simulated data and subsequently reused across all exercise dates, eliminating the need to recompute regression models at each step of the backward recursion. We establish convergence guarantees and derive bounds quantifying the approximation errors across exercise dates. Numerical experiments demonstrate the speed and accuracy of the proposed approach relative to extant methods.
We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.
Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.
Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants--including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF,UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error generally contains treatment-specific components because the outcome regression is a function of the full regressor $X=(D,Z)$. In that case, balancing common functions of $Z$ can leave the treatment-specific component unbalanced. We therefore advocate regressor balancing, implemented by Riesz regression with basis functions of $X$, as the general balancing principle for DML. The position is not that covariate balancing is invalid, but that covariate balancing should be understood as the special case that is appropriate when the score-relevant regression error is a function of covariates alone.
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of `moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than $35$$\%$ relative to standard, non-sequential protocols.
We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol, a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7\% relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. Moreover, MELO requires only lightweight per-step recursive updates without model retraining.
This paper proposes a hybrid methodology to improve the approximation of SABR (Stochastic Alpha Beta Rho) implied volatility by combining analytical structure with machine learning. The approach augments the neural-network input representation with geometric features derived from the stochastic differential equations of the SABR model. Unlike approaches that fully replace analytical formulas with black-box models, the proposed framework preserves the analytical backbone of the model. The hybridization operates along two complementary dimensions. First, geometry-aware variables reflecting intrinsic properties of the SABR dynamics are used as structured inputs to the network. Second, the neural network is trained to learn the residual error relative to Hagan's closed-form approximation rather than implied volatility directly. The resulting model acts as a structured residual correction to the analytical formula, retaining interpretability while capturing higher-order effects that are not included in the asymptotic expansion. Numerical experiments conducted over realistic parameter domains, as well as stressed environments, show that the method improves accuracy and robustness compared with both analytical approximations and standard neural-network approaches. Because the correction remains lightweight and structurally consistent with the underlying model, the framework is well suited for real-time pricing and calibration in practical trading environments.
Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose \textit{head-wise RMSNorm}, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent parameters and a discrepancy function to model bias, but suffers from parameter--discrepancy confounding and is typically formulated as an offline procedure under a stationary data-generating assumption. These limitations are restrictive in modern digital twin applications, where systems evolve over time and may exhibit gradual drift and abrupt regime shifts. While data assimilation methods enable sequential updates, they generally do not explicitly model systematic bias and are less effective under abrupt changes. We propose Bayesian Recursive Projected Calibration (BRPC), an online Bayesian calibration framework for streaming data under simulator mismatch and nonstationarity. BRPC extends projected calibration to the online setting by separating a discrepancy-free particle update for calibration parameters from a conditional Gaussian process update for discrepancy, preserving identifiability while enabling bias-aware adaptation under gradual system evolution. To handle abrupt changes, BRPC is integrated with restart mechanisms that detect regime shifts and reset the calibration process. We establish theoretical guarantees for both components, including tracking performance under gradual evolution and false-alarm and detection behavior for restart mechanisms. Empirical studies on synthetic and plant-simulation benchmarks show that BRPC improves calibration accuracy under gradual changes, while restart-augmented BRPC further improves robustness and predictive performance under abrupt regime shifts compared to sliding-window Bayesian calibration and data assimilation baselines.
We contribute to an uncertainty quantification problem in imaging that evaluates a hypothesis test questioning the existence of local "artefacts" appearing in the maximum a posteriori (MAP) estimate (obtained from standard numerical tools). Such a method, called Bayesian uncertainty quantification by optimization (BUQO), was introduced a few years ago as an efficient and scalable alternative to sampling methods when per-pixel error-bars are not needed. BUQO formulates a hypothesis test for probing the existence of local structures in the MAP estimate as a minimization problem, that can be solved efficiently with standard optimization algorithms. In this context, BUQO requires a "mathematical" definition of the "local artefact". This definition can be interpreted as an inpainting of the structure. However, only simple hand-crafted techniques have been proposed so far due to the complexity of the problem. In this work, we propose a data-driven alternative to BUQO where the inpainting procedure in the algorithm is performed using a convolutional inpainting neural network (NN). This results in a plug-and-play algorithm, based on the primal-dual Condat-Vu iterations,where the inpainting procedure is performed with a NN. The proposed approach is assessed on two image reconstruction problems inspired by medicine. We specifically perform simulations on two Fourier undersampling problems (discrete and non-uniform) encountered in magnetic resonance imaging, as well as a computed tomography problem using the Radon measurement operator.
This paper examines whether repeated payday loan use, commonly known as the debt trap, harms borrowers' financial wellbeing. Using Open Banking data from 1,815 UK borrowers observed between 2017 and 2018, we model borrowing intensity using a two-state hidden Markov model (HMM). The HMM outperforms single-regime alternatives and identifies two distinct borrowing patterns: occasional (low-intensity) and persistent (high-intensity) use. Each regime exhibits a characteristic relationship between borrowing intensity and wider transaction behaviour. We translate the decoded state sequence into a practical monitoring rule based on sustained high-intensity exposure. Defining a trigger event as 12 consecutive weeks in the high-intensity regime, we find that 36.4% of borrowers experience at least one such event. Among those who do, high-intensity weeks represent 17.8% of all borrower-week observations on average. Together, these results provide evidence for a persistent high-intensity borrowing pattern and demonstrate that it can serve as a simple, interpretable rule for monitoring prolonged reliance on payday loans.
We introduce CatNet, an algorithm that effectively controls False Discovery Rate (FDR) and selects significant features in LSTM. CatNet employs the derivative of SHAP values to quantify the feature importance, and constructs a vector-formed mirror statistic for FDR control with the Gaussian Mirror algorithm. To avoid instability due to nonlinear or temporal correlations among features, we also propose a new kernel-based independence measure. CatNet performs robustly on different model settings with both simulated and real-world data, which reduces overfitting and improves interpretability of the model. Our framework that introduces SHAP for feature importance in FDR control algorithms and improves Gaussian Mirror can be naturally extended to other time-series or sequential deep learning models.
We investigate the monotone representation and measurability of generalized $\psi$-estimators introduced by the authors in 2022. Our first main result, applying the unique existence of a generalized $\psi$-estimator, allows us to construct this estimator in terms of a function $\psi$, which is decreasing in its second variable. We then interpret this result as a bridge from a nonconvex optimization problem to a convex one. Further, supposing that the underlying measurable space (sample space) has a measurable diagonal and some additional assumptions on $\psi$, we show that the measurability of a generalized $\psi$-estimator is equivalent to the measurability of the corresponding function $\psi$ in its first variable.
Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. It is typically the case that patients with incomplete eligibility information are excluded from analysis without consideration of (implicit) assumptions that are being made, leaving study conclusions subject to potential selection bias. In an effort to ascertain eligibility for more patients, researchers may look back further in time prior to study baseline, and in using outdated values of eligibility-defining covariates may inappropriately be including individuals who, unbeknownst to the researcher, fail to meet eligibility at baseline. To the best of our knowledge, however, very little work has been done to mitigate these concerns. We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference. This method is directly motivated by, and applied throughout to EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.
Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implications. First, we prove a Berry-Esseen theorem for the final local SGD iterates, enabling valid multiplier bootstrap procedures. Second, motivated by robustness considerations, we introduce two distinct time-uniform Gaussian approximations for the entire trajectory of local SGD. The time-uniform approximations support Gaussian bootstrap-based tests for detecting adversarial attacks. Extensive simulations are provided to support our theoretical results.
Least absolute shrinkage and selection operator or Lasso is one of the widely used regularization methods in regression. Statisticians usually implement Lasso in practice by choosing the penalty parameter in a data-dependent way, the most popular being the $K-$fold cross-validation (or $K-$fold CV). However, inferential properties, such as the variable selection consistency and $n^{1/2}-$consistency, of the $K-$fold CV based Lasso estimator and validity of the Bootstrap approximation are still unknown. In this paper, we consider the heteroscedastic linear regression model and show only under some moment type conditions that the Lasso estimator with $K$-fold CV based penalty is $n^{1/2}-$consistent, but not variable selection consistent. Additionally, we establish the validity of Bootstrap in approximating the distribution of the $K-$fold CV based Lasso estimator. Therefore, our results theoretically justify the use of $K-$fold CV based Lasso estimator to perform statistical inference in linear regression. We validate our Bootstrap method for the $K-$fold CV based Lasso estimator in finite samples based on simulations. We also implement our Bootstrap based inference on a real data set.
While split conformal prediction guarantees marginal coverage, approaching the stronger property of conditional coverage is essential for reliable uncertainty quantification. Naive conformal scores, however, suffer from poor conditional coverage in heteroskedastic settings. In univariate regression, this is commonly addressed by normalizing non-conformity scores using an estimated local score variance. In this work, we propose a natural extension of this normalization to the multivariate setting, effectively whitening the residuals to decouple output correlations and standardize local variance. Furthermore, we derive a sufficient condition characterizing a broad class of distributions for which standardized residuals yield asymptotic conditional coverage. We demonstrate that using the Mahalanobis distance induced by a learned local covariance as a non-conformity score provides a closed-form, computationally efficient mechanism for capturing inter-output correlations and heteroskedasticity, avoiding the expensive sampling required by previous methods based on cumulative distribution functions. This structure unlocks several practical extensions, including the handling of missing output values, the refinement of conformal sets when partial information is revealed, and the construction of valid conformal sets for transformations of the output. Finally, we provide extensive empirical evidence on both synthetic and real-world datasets showing that our approach yields conformal sets that improve upon the conditional coverage of existing multivariate baselines.
We develop a Bayesian tree ensemble model to estimate heterogeneous treatment effects in censored survival data with high-dimensional covariates. Instead of imposing sparsity through the tree structure, we place a horseshoe prior directly on the step heights to achieve adaptive global-local shrinkage. This strategy allows flexible regularisation and reduces noise. We develop a reversible jump Gibbs sampler to accommodate the non-conjugate horseshoe prior within the tree ensemble framework. We show through extensive simulations that the method accurately estimates treatment effects in high-dimensional covariate spaces, at various sparsity levels, and under non-linear treatment effect functions. We further illustrate the practical utility of the proposed approach by a re-analysis of pancreatic ductal adenocarcinoma (PDAC) survival data from The Cancer Genome Atlas.
Ecological Momentary Assessment (EMA) studies enable the collection of high-frequency self-reports of suicidal thoughts and behaviors (STBs) via smartphones. Latent stochastic differential equations (SDEs) are a promising model class for EMA data, as it is irregularly sampled, noisy, and partially observed. But SDE-based models suffer from two key limitations. (a) These models often violate domain constraints, undermining scientific validity and clinical trust of the model. (b) Training is numerically unstable without ad hoc fixes (e.g. oversimplified dynamics) that are ill-suited for high-stakes applications. Here, we develop a novel class of expressive SDEs whose solutions are provably confined to a prescribed compact polyhedral state space, matching the domains of EMA data. In this work, (1) we show why chain-rule based constructions of SDEs on compact domains fail, theoretically and empirically; (2) we derive constraints on drift and diffusion for general and stationary SDEs so their solutions remain in the desired state space; and (3), we introduce a parameterization that maps arbitrary (neural or expert-given) dynamics into constraint-satisfying SDEs. On several real EMA datasets, including a large suicide-risk study, our parameterization improves forecasts and optimization dynamics over standard latent neural SDE baselines. These contributions pave the way for principled, trustworthy continuous-time models of suicide risk and other clinical time series and extend applications of SDE-based methods (e.g. diffusion models) to domains with hard state constraints.
When solving PDEs, classical numerical solvers are often computationally expensive, while machine learning methods can suffer from spectral bias, failing to capture high-frequency components. Designing an optimal hybrid iterative solver--where, at each iteration, a solver is selected from an ensemble of solvers to leverage their complementary strengths--poses a challenging combinatorial problem. While greedy selection is desirable for its constant-factor approximation guarantee to the optimal solution under Lipschitz assumptions, it requires knowledge of the true error at each step, which is unavailable in practice. We address this by proposing an approximate greedy router that efficiently mimics a greedy approach to solver selection. Empirical results on the Poisson and convection-diffusion equations show that our method consistently reduces final error and area-under-the-curve (AUC) of the error trajectory relative to single-solver baselines and existing hybrid approaches such as HINTS. In particular, our method reaches comparable error levels in substantially fewer iterations while exhibiting more stable error decay.
Simulating the kinetic Langevin dynamics is a popular approach for sampling from distributions, where only their unnormalized densities are available. Various discretizations of the kinetic Langevin dynamics have been considered, where the resulting algorithm is collectively referred to as the kinetic Langevin Monte Carlo (KLMC) or underdamped Langevin Monte Carlo. Specifically, the stochastic exponential Euler discretization, or exponential integrator for short, has previously been studied under strongly log-concave and log-Lipschitz smooth potentials via the synchronous Wasserstein coupling strategy. Existing analyses, however, impose restrictions on the parameters that do not explain the behavior of KLMC under various choices of parameters. In particular, all known results fail to hold in the overdamped regime, suggesting that the exponential integrator degenerates in the overdamped limit. In this work, we revisit the synchronous Wasserstein coupling analysis of KLMC with the exponential integrator. Our refined analysis results in Wasserstein contractions and bounds on the asymptotic bias that hold under weaker restrictions on the parameters, which assert that the exponential integrator is capable of stably simulating the kinetic Langevin dynamics in the overdamped regime, as long as proper time acceleration is applied.
Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparametrized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.
We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}} \lambda_{\alpha} \pi_{\alpha}$, called the pretraining prior, in which each mixture component $\pi_{\alpha}$ is a distribution on tasks of a specific difficulty level indexed by $\alpha$. Our goal is to understand the performance of the pretrained Transformer when evaluated on a different test distribution $\mu$, consisting of tasks of fixed difficulty $\beta\in\mathcal{A}$, and with potential distribution shift relative to $\pi_\beta$, subject to the chi-squared divergence $\chi^2(\mu,\pi_{\beta})$ being at most $\kappa$. In particular, we consider nonparametric regression problems with random smoothness, and multi-index models with both random smoothness and random effective dimension. We prove that a large Transformer pretrained on sufficient data achieves the optimal rate of convergence corresponding to the difficulty level $\beta$, uniformly over test distributions $\mu$ in the chi-squared divergence ball. Thus, the pretrained Transformer is able to achieve faster rates of convergence on easier tasks and is robust to distribution shift at test time. Finally, we prove that even if an estimator had access to the test distribution $\mu$, the convergence rate of its expected risk over $\mu$ could not be faster than that of our pretrained Transformers, thereby providing a more appropriate optimality guarantee than minimax lower bounds.
We investigate the problem of density estimation on the unit circle and the unit sphere from a computational perspective. Our primary goal is to develop new density estimators that are both rate-optimal and computationally efficient for direct implementation. After establishing these estimators, we derive closed-form expressions for probability estimates over regions of the circle and the sphere. Then, the proposed theories are supported by extensive simulation studies. The considered settings naturally arise when analyzing phenomena on the Earth's surface or in the sky (sphere), as well as directional or periodic phenomena (circle). The proposed approaches are broadly applicable, and we illustrate their usefulness through case studies in zoology, climatology, geophysics, and astronomy, which may be of independent interest. The methodologies developed here can be readily applied across a wide range of scientific domains.
Reducing the number of experimental units is one of the three pillars of the 3R principles (Replace, Reduce, Refine) in animal research. At the same time, statistical error rates need to be controlled to enable reliable inferences and decisions. This paper proposes to adopt diagnostic likelihood ratios and the diagnostic odds ratio to statistical hypothesis tests and to adjust it for sample size to obtain a novel measure to quantify for the evidentiary value of one experimental unit. The experimental unit information index (EUII) is based on power, Type-I error and sample size, and has attractive interpretations both in terms of frequentist error rates and Bayesian posterior odds. We introduce the EUII in simple statistical test settings and show that its asymptotic value depends only on the assumed relative effect size under the alternative. We then extend the definition to adaptive designs where early stopping for efficacy or futility may cause reductions in sample size. Application to group-sequential designs show the usefulness of the approach when the goal is to maximize the evidentiary value of one experimental unit. A reanalysis of 2738 animal experiments with simulated results from (post-hoc) interim analyses illustrates the possible savings in sample size.
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
We consider statistical inference for errors-in-variables regression models with dependent observations under the high dimensionality of the error covariance matrix. It is tempting to prewhiten the model and data that had led to efficient weighted least squares estimation in the presence of the measurement errors, as being practised in the optimal fingerprinting approach in climate change studies. However, it is unclear to what extent the prewhitened estimator can improve the estimation efficiency of the unprewhitened estimator for errors-in-variables regression. We compare the prewhitening and unprewhitening estimators in terms of their estimation efficiency and computational cost. It shows that while the prewhitening operation does not necessarily improve the estimation efficiency of its unprewhitening counterpart, it demands more on the ensemble size needed in the error-covariance matrix estimation to ensure the asymptotic normality, and hence it would requires much more computationally resource.
We propose a geometric latent-subspace framework for generative modeling of discrete data. Specifically, we introduce latent subspaces in the exponential parameter space of product manifolds of categorical distributions as a novel method for learning generative models of discrete data. The resulting low-dimensional latent space encodes statistical dependencies and removes redundant degrees of freedom among the categorical variables. We equip the parameter domain with a Riemannian geometry such that the latent subspace and induced data manifold are related by isometries enabling consistent flow matching. Exploiting this structure, we propose a geometry-aware dimensionality reduction objective, called geometric PCA (GPCA), which we formulate as a regularized cross-entropy minimization that encourages small Riemannian distances between the data and their reconstructions. In particular, under the induced geometry, geodesics become straight lines in the latent parameter space which makes model training by flow matching effective. Empirical results show that low-dimensional latent representations suffice to accurately model high-dimensional discrete data.
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
Flow matching (FM) is increasingly used in scientific domains for time series generation and forecasting, where data often arise from underlying dynamical systems. However, it is not well-understood whether it learns transferable dynamical structure or simply performs an effective "trajectory replay". We study this question by deriving the velocity field targeted by the empirical FM objective on sequential data in the limit of perfect function approximation. For the Gaussian conditional paths commonly used in practice, we show that the implied sampler is an ODE whose dynamics constitutes a nonparametric, memory-augmented continuous-time dynamical system. The optimal field admits a closed-form expression as a similarity-weighted mixture of instantaneous velocities induced by observed transitions, making the dataset dependence explicit and interpretable. This characterization positions neural FM models as parametric surrogates of an ideal nonparametric solution and suggests practical approximation schemes for robust ODE-based generation. As a byproduct of our analysis, the resulting closed-form sampler, FreeFM, provides strong probabilistic forecasts on nonlinear dynamical system benchmarks directly from historical transitions, without training.
Given $n$ i.i.d. samples from an unknown discrete distribution over an unknown set, the unseen species problem is to predict how many new outcomes would be observed in $m$ additional samples. For small $m$ we show that the Good-Toulmin estimator is the unique estimator which both respects the symmetries of the problem and has non-trivial rate. We resolve the open problem of constructing principled prediction intervals for it. For intermediate $m$ we propose a new estimator which has a vastly improved worst case MSE compared to competing methods and we expect that our method can be applied to other species sampling problems. For large $m$ we follow previous authors in assuming a power law tail and show that a simple estimator achieves the same rate and better empirical performance than a recent sophisticated method. Moreover, we give pre-asymptotic guarantees. We extend the rate guarantees to incidence data, without further independence assumptions, provided that the sets are of bounded size. In the process we use Stein's method to obtain concentration inequalities for some natural functionals of sequences of i.i.d. discrete-set-valued random variables which are of independent interest.
Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape. The key measure is the local learning coefficient (LLC) which quantifies the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging SLT, we develop a basin-selection perspective on grokking in quadratic networks: LLC ranks competing near-zero-loss basins by statistical preference, while the training-time transition between them is governed by optimisation dynamics. In this view, grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin that dominates the posterior. To support this, we derive analytic formulas for the LLC in shallow quadratic networks under both lazy and feature learning regimes. Empirically, we demonstrate that LLC trajectories estimated from training data track the onset of generalisation and provide an informative probe of the optimisation path.
Gradient descent on overparameterized neural networks typically operates at the Edge of Stability (EoS), where the largest Hessian eigenvalue hovers around a step-size-dependent threshold. We study how sparse connectivity changes generalization below this threshold in two-layer ReLU networks. Prior results have shown that for fully-connected networks (FCNs), generalization guarantees in this regime degrade and become vacuous on high-dimensional spherical inputs. Our analysis reveals that sparse connectivity fundamentally alters this picture. Under sparse connectivity, the network processes a collection of low-dimensional patches rather than the full input vector, so the effective constraint imposed by the stability condition is governed by the geometry of the training patch collection. We prove that when the receptive fields are small relative to the ambient dimension, the effective constraint yields non-vacuous generalization bounds in precisely the spherical regime where FCNs provably fail. The same framework also reveals a contrasting failure mode: if the patch collection lacks geometric structure, the constraint becomes unable to prevent overfitting. We corroborate this theory by analyzing the patch geometry of natural images, showing that standard convolutional designs produce patch multiset with low-dimensional structure that facilitates generalization. This provides a principled explanation for the generalization advantage of convolutional networks. Thus, our analysis yields a unified framework that identifies how architecture, data geometry, and gradient descent jointly govern generalization performance.
Quantum principal component analysis (qPCA) is commonly formulated as the extraction of eigenvalues and eigenvectors of a covariance-encoded density operator. Yet in many qPCA settings the practical goal is simpler: projection onto the dominant spectral subspace. Here we introduce a projection-first framework, the Filtered Spectral Projection Algorithm (FSPA), which bypasses explicit eigenvalue estimation while preserving the relevant spectral structure. FSPA amplifies any nonzero warm-start overlap with the leading subspace and remains robust in small-gap and near-degenerate regimes, without artificial symmetry breaking in the absence of bias. We show that FSPA achieves an oracle complexity $\mathcal{O}((\log(1/\epsilon)+\log(1/|a_1|^2))/\log(\lambda_1/\lambda_2))$,which is tight by a matching lower bound, establishing it as an\emph{optimal} projection primitive. We derive a convergence rate for degenerate spectra, give a circuit resource analysis with $n+\mathcal{O}(1)$ qubit overhead independent of system dimension, and extend the method to threshold spectral projection, Threshold-FSPA, which converges in $\mathcal{O}(\log(1/\epsilon))$ calls when the threshold lies between eigenvalues. In the density matrix exponentiation access model, FSPA gives an exponential copy-complexity advantage over classical methods. For classical datasets, we show that for amplitude-encoded centered data the ensemble density matrix $\rho=\sum_i p_i|\psi_i\rangle\langle\psi_i|$ equals the covariance matrix. Numerical tests on chemistry density matrices, noisy circuit outputs, Breast Cancer Wisconsin, handwritten Digits, and 1--4-qubit scalability confirm the theory. A minimal Qiskit implementation validates magnitude invariance, signal amplification, and no spurious symmetry breaking. These results establish FSPA as an optimal and deployable quantum spectral projection primitive.
Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring the reliable estimate of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and we establish explicit conditions under which it is provably more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance--reliability trade-offs.
Strict minimum message length (SMML) estimation is an information-theoretic coding principle that represents a continuous statistical model by a finite set of assertions and a partition of the sample space. We formulate SMML as an entropy-cross-entropy optimisation problem, in which the assertion cost is the Shannon entropy of the assertion distribution and the detail cost is a conditional cross-entropy between the cellwise predictive distribution and the assigned model. For a fixed partition, we show that SMML codepoints are Kullback-Leibler projections of normalised cellwise predictive distributions onto the model family. In regular parametric models, the local quadratic approximation of Kullback-Leibler divergence induces Fisher-Rao geometry. Under high-resolution regularity conditions, optimal SMML partitions are asymptotically the pullback, through the maximum likelihood estimator, of weighted Fisher-Rao Voronoi tessellations in parameter space, with assertion probabilities appearing as additive weights. For regular exponential families, SMML codepoints satisfy a moment-matching condition and admit an interpretation as KL/Bregman centroids, while exact SMML cells are pullbacks of convex polyhedra in sufficient-statistic space. Together, these results show that SMML induces a natural information-geometric quantisation linking entropy-based coding, KL projection, and divergence-based Voronoi geometry.
We propose the covariate-balanced-and-adjusted response-adaptive randomization (CBARA) procedure for adaptive design in clinical trials, which integrates the complementary strengths of covariate-adjusted response-adaptive randomization (CARA) and covariate-adaptive randomization (CAR). The CBARA procedure updates the target allocation ratio according to observed responses and patient covariate profiles without requiring a correctly specified model, thereby retaining CARA's ethical and efficiency considerations while improving robustness. In addition, the CBARA procedure extends the CAR principle from fixed target allocation ratios to covariate-adjusted adaptive target allocation ratios, yet still pursues balance in treatment allocation with respect to covariate features. This integration is enabled by a newly defined imbalance vector and three interrelated components: the allocation function, parameter estimation and update mechanism. We establish the asymptotic properties of covariate imbalance and the estimators under the CBARA procedure. The results demonstrate that the CBARA procedure can improve balance for both observed and unobserved covariates while preserving the consistency of the allocation ratio. The theoretical analysis is developed through a pseudo-Markov chain framework, where a new discrepancy measure for transition kernels is introduced to handle the continuity of Poisson equation solutions with respect to parameters.
Causal inference is essential for data-driven decision-making, as it aims to uncover causal relationships from observational data. However, identifying causality remains challenging due to the potential for confounding and the distinction between correlation and causation. While recent advances in causal machine learning and matching algorithms have improved estimation accuracy, these methods often face trade-offs between interpretability and computational efficiency. This paper proposes a novel approach that combines a tree-based discretization technique, tailored for causal inference, with an integer linear programming-based matching algorithm. The discretization ensures approximately linear relationships for control datasets within strata, enabling effective matching, while the optimization framework optimizes for global balance. The resulting algorithm yields computational efficiency and less biased ATT estimates compared to state-of-the-art algorithms. Empirical evaluations demonstrate the proposed method's practical advantages over existing techniques in causal inference scenarios.
Log-logistic distribution is a flexible distribution that can model a wide range of failure patterns in the field of electrical, electronic and mechanical engineering and is often used in reliability inference. However, the inference of the parameters and reliability function of the log-logistic distribution can be challenging, especially in small sample scenarios. In this paper, we propose a new inference framework based on the least squares estimator-based generalized pivotal quantities (LSE-GPQ) for the parameters and reliability functions of the log-logistic distribution, which can provide better coverage in small sample scenarios. We will compare the performance of our proposed method with traditional methods such as the MLE and parametric bootstrapping through simulation studies and real data applications.
External priors of unknown reliability create a brittle trade-off in causal discovery: blind trust amplifies errors, blind rejection wastes signal. Real priors are also heterogeneously reliable -- physical laws are trustworthy, LLM-suggested edges are speculative -- yet existing methods either ignore priors or impose them through globally uniform trust. We propose PRCD-MAP, a soft prior-consumption layer that assigns per-edge trust to an imperfect prior and uses it to modulate a prior-aware $\ell_1$ and prior-weighted $\ell_2$ regularizer in a MAP objective. Trust is calibrated by empirical Bayes on a Laplace-approximated marginal likelihood and propagated along the prior graph by an MLP, so data-confirmed neighborhoods boost trust and contradictions suppress it. PRCD-MAP enjoys a population-level safety guarantee: it is $\varepsilon$-safe in expectation over the prior-generation distribution, with $\varepsilon\leq C\cdot\mathrm{acc}(1{-}\mathrm{acc})\cdot d^2/T$ at the parametric $T^{-1}$ rate and vanishing at the prior-quality endpoints. When the prior is uninformative, learned trust provably collapses to its floor and the method recovers a no-prior baseline. Empirically, on real CausalTime data PRCD-MAP exploits informative LLM priors (LLM-prior gain $+0.067/+0.089$ AUROC on AQI/Medical over a no-prior PRCD-MAP backbone; combined backbone+prior lead $+0.123/+0.043$ over PCMCI+), auto-attenuates on the anonymous-variable Traffic stress test, and retains a lead at $d{=}300$; against BayesDAG, the closest soft-Bayesian baseline, PRCD-MAP wins on every CausalTime dataset under a matched $W_0$-only protocol. A four-way ablation isolates each component: EB calibration and MLP trust propagation jointly carry the plurality of the gain, with positive sign on every dataset. Extensions to nonlinear (NAM) and cross-sectional settings show the calibrated-trust principle is setting-agnostic.
Time-varying dependence is often modeled with dynamic correlations or Gaussian graphical models, but multivariate systems can change through tail behavior, asymmetry, or conditional structure even when correlations are nearly stable. We introduce Dynamic Vine Copulas (DVC), a temporal vine-copula framework for estimating and diagnosing sequence-wide non-Gaussian dependence. DVC fixes a chosen vine factorization for comparability; the framework applies to C-, D-, and R-vines, and our experiments use fixed-root-order C-vines. Pair-copula states evolve through smooth parameter trajectories or temporally regularized family-switching paths. The main diagnostic is a held-out comparison between a full vine and its matched 1-truncated version, which separates flexible first-tree pairwise dependence from evidence contributed by higher-tree conditional terms. At the population level, under a correct fixed vine and the simplifying assumption, this contrast equals the higher-tree component of a vine total-correlation decomposition; in finite samples, it is a predictive diagnostic. In controlled benchmarks, DVC detects Student-t degrees-of-freedom changes, Clayton-to-Gumbel switches, and recurrent conditional-interaction episodes missed or conflated by Gaussian dynamic baselines. The higher-tree score remains near zero in pairwise-only regimes and rises during conditional-interaction regimes. On Allen Visual Behavior Neuropixels data, DVC identifies a reproducible time-indexed higher-tree signal that is positive across held-out splits and vanishes under a decorrelated null, indicating simultaneous cross-area dependence. DVC therefore provides a flexible temporal copula model and an interpretable test of whether temporal dependence changes are pairwise or conditional.
Compositional data, which are vectors of proportions constrained to the probability simplex, arise frequently in modern scientific applications, including microbiome relative abundances across body sites and cell-type mixture weights derived from single-cell genomics. While regression methods for compositional data are well developed, no existing graphical model framework addresses the problem of learning conditional dependence structures among multiple compositional vectors. This paper introduces a novel framework for directed tree structure learning over compositional nodes. We employ the Kullback-Leibler divergence as the scoring function and model the conditional expectation of each child composition as a mixture of a baseline composition and a parent-driven component parameterized by a column-stochastic transition matrix. This formulation respects the simplex geometry, handles zero-inflated compositions gracefully, and, combined with a non-degeneracy condition on the transition matrix, ensures identifiability of edge directions from observational data. We prove consistency of structure recovery and derive finite-sample guarantees that characterize the required sample size in terms of the signal gap, node dimension, and penalty level. The efficacy of our approach is demonstrated through simulations and applications to multi-site microbiome data and single-cell data, yielding interpretable directed structures that align with known biological mechanisms.
Model selection plays an important role in longitudinal data analysis, especially when models are estimated using the generalized method of moments (GMM) in the presence of time-dependent covariates. In this setting, the number of valid moment conditions can grow quickly and may lead to over-parameterized models. The Kullback--Leibler Information Criterion (KLIC) has been proposed as a model-selection tool for this framework; however, the original KLIC criterion may favor overly complex models when the number of parameters or valid moment conditions increases. To address this limitation, this study proposes two penalized versions of KLIC that incorporate penalties based on both the number of model parameters and the number of valid moment conditions. The proposed criteria are referred to as the Moment--Parameter Product Penalty KLIC (MPPP--KLIC) and the Logarithmic Penalty KLIC (LP--KLIC). These criteria provide a theoretically motivated mechanism for balancing model fit and model complexity in GMM-based longitudinal models. Through an extensive simulation study involving both binary and continuous response settings, the proposed criteria are shown to improve the ability of KLIC to distinguish among competing models and to reduce the selection of over-parameterized models. The performance of the proposed methods is further illustrated using the Filipino Child Morbidity dataset, a longitudinal study of child health in the Philippines. The results show that the proposed penalized criteria provide stable and interpretable model rankings and consistently identify age as the most important predictor of child morbidity. Overall, the proposed penalized KLIC criteria offer practical and theoretically grounded tools for model selection in GMM-based longitudinal data analysis with time-dependent covariates.
Pairwise comparisons from multiple judges are central to large language model evaluation and preference modeling, yet standard ranking pipelines often pool judgments into a single score vector, treating systematic judge disagreement as noise. We propose Heterogeneous Judge-Aware (HJA) ranking, a structured multi-judge ranking framework that separates consensus ranking, judge-specific sensitivity to consensus, and residual preference disagreement. HJA thereby treats ranking, judge sensitivity, and structured disagreement as separate inferential targets. We establish conditions under which this decomposition is identifiable and develop an anchored alternating algorithm that preserves the identifying geometry. For confidence quantification, we study a fixed-panel repeated-comparison regime in which the judge panel may remain fixed or modest while information grows through repeated judgments. This yields uncertainty statements for consensus and judge-specific ranking contrasts, sensitivity parameters, pairwise probabilities, and summaries of residual this http URL on synthetic and real multi-judge comparison data show that HJA improves recovery, robustness, uncertainty calibration, and near-tie performance relative to pooled and sensitivity-only baselines. The fitted model also provides diagnostics for judge disagreement and model-affinity patterns, giving a statistically grounded framework for ranking under heterogeneous comparative judgments.
We propose a new estimator for the Generalised Dynamic Factor Model (GDFM) that simplifies estimation by avoiding frequency-domain methods. Our key theoretical insight shows that under reasonable conditions the dynamic common component can be represented in terms of a finite number of lags of contemporaneously pervasive factors. In this case the dynamic factor decomposition of the GDFM reduces to the OLS regression of observed variables on estimated factors and their lags, with factors obtained via static principal components. The approach naturally accommodates weak (non-pervasive) factors within the dynamic common space addressing an important limitation of existing methods. We establish consistency and asymptotic normality for both the dynamic and weak common components. An application to a large European macroeconomic dataset demonstrates strong empirical performance and uncovers a sizeable weak common component - particularly in sentiment indicators and several other variables - revealing dynamics that standard methods overlook.
A new approximate Bayesian inferential framework is proposed that exploits multiple information sources -- daily spot returns, high-frequency spot data and option prices -- and enables fast calculation of probabilistic predictions of future option prices. This approach operates directly from the theoretical option pricing model, and does not require an explicit statistical model, or likelihood, for the observed option prices. We demonstrate that our approach produces accurate probabilistic option-price predictions in realistic scenarios and, despite not explicitly modelling option-pricing errors via a statistical model, the method is shown to be robust to the presence of such errors. Predictive accuracy based on the Heston option pricing model is illustrated empirically for short-maturity options, with the rapidity of real-time updates of the predictive distributions highlighted.
Compositional graphoids are fundamental discrete structures which appear in probabilistic reasoning, particularly in the area of graphical models. They are semigraphoids which satisfy the Intersection and Composition properties. These important properties, however, are not enjoyed by general probability distributions. This paper surveys what is known about them, providing systematic constructions of examples and counterexamples as well as necessary and sufficient conditions. Novel sufficient conditions for both properties are derived in the context of discrete random variables via information-theoretic tools.
Why and when does depth improve generalization? We study this question in an implementation-agnostic state-transition model, where a depth-$k$ predictor is a readout class $H$ composed with the word ball $B(k,F)$ generated by hidden state transitions. Generalization bounds separate implementation error, approximation error, and statistical complexity, and upper bound the depth-dependent variance term by a Dudley entropy integral over $B(k,F)$, with a conditional lower-bound diagnostic under readout separation. We identify geometric and semigroup mechanisms that keep this entropy contribution saturated or polynomial, and contrast them with separation mechanisms that recover the classical exponential-growth obstruction. Coupling these variance upper bounds with approximation rates gives typical depth trade-off patterns, clarifying that depth is statistically favorable when approximation improves rapidly while the transition semigroup remains geometrically tame.
This article studies the fundamental problem of using i.i.d. coin tosses from an entropy source to efficiently generate random variables $X_i \sim P_i$ $(i \ge 1)$, where $(P_1, P_2, \dots)$ is a random sequence of rational discrete probability distributions subject to an \textit{arbitrary} stochastic process. Our method achieves an amortized expected entropy cost within $\varepsilon > 0$ bits of the information-theoretically optimal Shannon lower bound using $O(\log(1/\varepsilon))$ space. This result holds both pointwise in terms of the Shannon information content conditioned on $X_i$ and $P_i$, and in expectation to obtain a rate of $\mathbb{E}[H(P_1) + \dots + H(P_n)]/n + \varepsilon$ bits per sample as $n \to \infty$ (where $H$ is the Shannon entropy). The combination of space, time, and entropy properties of our method improves upon the Knuth and Yao (1976) entropy-optimal algorithm and Han and Hoshi (1997) interval algorithm for online sampling, which require unbounded space. It also uses exponentially less space than the more specialized methods of Kozen and Soloviev (2022) and Shao and Wang (2025) that generate i.i.d. samples from a fixed distribution. Our online sampling algorithm rests on a powerful algorithmic technique called \textit{randomness recycling}, which reuses a fraction of the random information consumed by a probabilistic algorithm to reduce its amortized entropy cost. On the practical side, we develop randomness recycling techniques to accelerate a variety of prominent sampling algorithms. We show that randomness recycling enables state-of-the-art runtime performance on the Fisher-Yates shuffle when using a cryptographically secure pseudorandom number generator, and that it reduces the entropy cost of discrete Gaussian sampling. Accompanying the manuscript is a performant software library in the C programming language.
Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. In this work, we reformulate constrained black-box optimization as posterior inference, and perform this inference in the latent space of generative models. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. Concretely, we utilize outsourced diffusion models to amortize the sampling from the posterior distribution in the latent space of flow-based models, which can bypass the issue of mode collapse. We empirically demonstrate that our method achieves superior performance across synthetic and real-world tasks. Our code is available \href{this https URL}{here}.
Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and a speech dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
This article introduces a leave-one-out regression adjustment (LOORA) for estimating average treatment effects in randomized controlled trials. In finite samples, LOORA removes the bias of conventional regression adjustment and yields exact variance formulas for regression-adjusted Horvitz-Thompson and difference-in-means estimators. Ridge regularization curbs the influence of high-leverage observations, improving stability and precision in small samples. In large samples, LOORA matches the variance of the regression-adjusted estimator in Lin (2013) while remaining exactly unbiased. Two within-subject experimental applications, each providing a realistic joint distribution of potential outcomes as ground truth, show that LOORA removes substantial bias and achieves confidence interval coverage close to the nominal level.
Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic as a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing in momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across image classification, language model fine-tuning, and vision transformer fine-tuning show that the resulting dynamic momentum rules match or improve upon the best fixed hyperparameters found via manual sweeps, reducing the need for exhaustive momentum sweeps. Code is available at this https URL
As artificial intelligence increasingly drives critical decisions, the ability to genuinely explain how neural networks make predictions is essential for trust. Yet, most current explanation methods offer post-hoc rationalizations rather than guaranteeing a true reflection of the model's reasoning. We introduce the notion of explanatory alignment, a requirement that explanations directly construct predictions rather than rationalize them. To achieve this in complex data domains, we present Pointwise-interpretable Networks (PiNets), a pseudo-linear architecture that forms linear models instance-wise. Evaluated on image classification and segmentation tasks, PiNets demonstrate that their explanations are deeply faithful across four criteria: meaningfulness, alignment, robustness, and sufficiency (MARS). Our contributions pave the way for promising avenues: by reconciling the predictive power of deep learning with the interpretability of linear models, PiNets provide a principled foundation for trustworthy AI and data-driven scientific discovery.
Decentralized learning on resource-constrained edge devices demands algorithms that are communication-efficient, robust to data corruption, and lightweight in memory. State-of-the-art gossip-based methods address communication efficiency, but achieving robustness remains challenging. Methods for robust estimation and optimization typically rely on non-smooth objectives (\textit{e.g.}, pinball loss, $\ell_1$ loss), yet standard gossip methods are primarily designed for smooth losses. Asynchronous decentralized ADMM-based methods have been proposed to handle such non-smooth objectives; however, existing approaches require memory that scales with node degree, making them impractical when memory is limited. We propose AsylADMM, a novel asynchronous gossip algorithm for decentralized non-smooth optimization requiring only two variables per node. We provide a new theoretical analysis for the synchronous variant and leverage it to prove convergence of AsylADMM in a simplified setting based on the squared loss. Empirically, AsylADMM converges faster than existing baselines on challenging non-smooth problems, including quantile and geometric median estimation, lasso regression, and robust regression. More broadly, our novel gossip framework opens a practical pathway toward robust and non-smooth decentralized learning.
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(\epsilon^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.
Short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud masking, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework for field-level NDVI prediction under sparse, irregular clear-sky acquisitions. The architecture separates the encoding of historical NDVI and meteorological observations from future exogenous covariates, fusing both representations for multi-step quantile prediction. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to capture delayed meteorological effects relevant to vegetation response. Experiments on European satellite data show that the proposed approach outperforms statistical, deep learning, and time-series baselines on both pointwise and probabilistic evaluation metrics. Ablation studies confirm that target history is the primary driver of performance, with meteorological covariates providing additional gains in the full multimodal setting. The code is available at this https URL.
Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion for distinguishing logic internalization from statistical interpolation. Empirically, across scaling ranges spanning up to 2.5 orders of magnitude, we observe evidence of capture and non-capture. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the combinatorial tasks these networks can capture. We show that, despite their universal expressivity, transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class, consistent with successful capture on simpler combinatorial tasks such as induction heads, sort, and string matching.
Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa_d(\mathbf{G})^6 n\log n)$ for NTK convergence, where $\kappa_d(\mathbf{G})$ is the effective condition number of the rank-$\min(n,d)$ truncation of the input Gram matrix; for natural image datasets this threshold is physically infeasible ($m \gg 10^{24}$ for MNIST and $m \gg 10^{29}$ for CIFAR-10, 12--17 orders of magnitude beyond the largest known architectures). \emph{Influence malleability} is introduced to characterize this non-convergence: linearized attention exhibits 2--9$\times$ higher malleability than ReLU networks under adversarial data perturbation, with the gap depending on dataset condition number and task setting. A dual implication is established: the same data-dependent kernel is shown theoretically to reduce approximation error when targets align with the data geometry, while, empirically, creating vulnerability to adversarial manipulation of the training data. The structural argument extends to trainable QKV attention under standard initialization, with direct consequences for influence methods applied to deployed transformer architectures.
Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without tools to detect such violations. This paper introduces a prediction-based Markov Violation Score (MVS) that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and MVS (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing MVS to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that MVS correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is available at this https URL.
Multi-objective bandits have attracted increasing attention for their broad applicability, with \(d\)-dimensional reward vectors inducing Pareto regret. There has been a subtle debate over whether this added structure makes the problem fundamentally harder than single-objective bandits. We answer this by showing that, in terms of Pareto regret, it is surprisingly no harder: Pareto regret scales inversely with \(g^\dagger\), the largest objective-wise suboptimality gap, and thus matches the smallest objective-wise classical regret. We formalize this idea via a novel method with upper and lower confidence-bound estimators for every arm-objective pair. It uses top-two races to compare arms within each objective and an uncertainty-greedy rule to allocate exploration toward the largest objective-wise gap \(g^\dagger\), until the corresponding Pareto-optimal arm is committed to. We prove that it achieves Pareto regret of \(O(\nicefrac{\log T}{g^\dagger})\), where \(T\) is the horizon, with \emph{no dependence on \(d\)}. A matching lower bound of \(\Omega(\nicefrac{\log T}{g^\dagger})\) implies optimality. We evaluate the method on synthetic and real-world datasets, confirming the theory and achieving order-of-magnitude reductions in Pareto regret over baselines. Real-world results further show that our method commits to a Pareto optimal arm, possibly at the cost of empirical fairness, suggesting a potential hardness absent in single-objective bandits.
We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
Modeling high-dimensional dependencies while keeping likelihoods tractable remains challenging. Classical vine-copula pipelines are interpretable but can be expensive, while many neural estimators are flexible but less structured. In this work, we propose Vine Denoising Copula (VDC), an amortized vine-copula pipeline for continuous-data, simplified-vine dependence modeling. VDC trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a piecewise-constant density grid. We then apply an IPFP/Sinkhorn projection that normalizes mass and drives the marginals to uniformity. This preserves the tractable vine-likelihood structure and the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. Across synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive MI/TC estimation, and faster high-dimensional vine fitting. These gains make explicit information estimation and dependence decomposition feasible when repeated vine fitting would otherwise be costly, while conditional downstream tasks remain a limitation.
LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.
Pairwise Fisher graphs capture local covariance information, but they cannot distinguish an irreducible multi-observable radiation pattern from a collection of ordinary pairwise correlations. We show that this missing structure is naturally supplied by higher-order Fisher tensors. In a finite basis of binned EECs, ECFs, or EFPs, and in the natural exponential-family coordinates generated by that basis, the same local tensor has three equivalent interpretations: a coefficient in the local Kullback-Leibler expansion, a connected cumulant of the chosen correlator observables, and a signed weight on a hyperedge linking those observables. This gives an exact Fisher-correlator-hypergraph triality in the local exponential-family embedding. The triality provides a direct construction of physics-informed hypergraphs from correlator data. Extending the quadratic Fisher matrix to the first non-trivial higher tensor identifies genuinely connected multi-observable radiation patterns, supplies hyperedge weights for higher-order Laplacians and message passing, and gives a principled criterion for compressing observable bases beyond pairwise information. We develop these constructions and spell out why the exact cumulant interpretation is special to natural exponential-family coordinates. We illustrate the framework in four applications. In a minimal local-KL study, the cubic Fisher tensor reduces the KL truncation error and isolates the dominant triplet structure. In a two-versus-three prong jet substructure benchmark, the hypergraph selector improves compressed-basis classification. In a 33-observable basis-design problem, the Fisher hypergraph retains more third-order local response at twelve observables. A low-capacity learning benchmark then shows how the same Fisher hyperedges can be used as an interpretable inductive bias for message passing on correlator observables.
Activation-alignment measures such as Representational Similarity Analysis (RSA), Canonical Correlation Analysis (CCA), and Centered Kernel Alignment (CKA) are widely used to compare biological and artificial neural representations. Recent theoretical work interprets many of these methods as assessing agreement between optimal linear readouts over broad families of global tasks. However, agreement at the level of global readouts does not determine how a system uses local stimulus evidence. Specifically, representations may align in activation space yet differ in their sensitivity to small perturbations. To address this challenge, we introduce a complementary framework based on local decodable information, which focuses on a representation's ability, under noise, to discriminate small perturbations within a specified stimulus-coordinate subspace. Building on Fisher information and local representation geometry, we summarize each representation using the expected projected pullback/Fisher metric over that subspace. This formulation induces a second-moment family of local discrimination tasks, for which the resulting operator provides a minimal, complete dataset-level summary of expected discriminability. We compare these regularized signatures using a log-spectral distance on the manifold of symmetric positive definite (SPD) matrices, yielding the Spectral Riemannian Alignment Score (S-RAS) and a uniform multiplicative certificate over the corresponding family of lifted task values. Empirically, this framework enables the recovery of corresponding layers across independently trained artificial neural networks, supports transferable class-conditional probes, reveals controlled dissociations between standard and robust training, and uncovers stimulus-coordinate family effects across mouse visual cortex using the Allen Brain Observatory static gratings dataset.
This paper develops efficient GMM estimation when the moment conditions are misspecified. We observe that the influence function of the standard GMM estimator under misspecification depends on both the original moment conditions and their Jacobian, motivating a new class of estimators based on augmented moment conditions with recentering. The standard GMM estimator is a special case within this class, and generally suboptimal. By optimally weighting the augmented system, we obtain a misspecification-efficient (ME) estimator with the smallest asymptotic variance for the same GMM pseudo-true value. In linear models, the asymptotic variance of ME estimator reduces to the textbook efficient-GMM variance formula $(G'W^{*}G)^{-1}$, where $W^{*}$ is the inverse of the variance of residualized moments after projection on the Jacobian $G$. We consider a feasible double-recentered bootstrap estimator, which can be considered as a misspecification-robust and efficient version of Hall and Horowitz (1996) recentered bootstrap GMM estimator, and also consider a split-sample ME estimator. Finally, we establish uniform local asymptotic minimax bounds over a class of weighting matrices. We illustrate the proposed methods in simulation and empirical examples.