Mechanistic interpretability methods summarize a transformer component by a single importance score, conflating two distinct roles: a component may matter because it transports task-relevant content, or because the forward computation degrades when its contribution is removed. We introduce \emph{Interchange-Group Sobol Decomposition} (IGSD), a paired-intervention framework that compares matched activation replacement with zero ablation on the same component, estimates two Sobol-style variance indices, and uses their signed difference to separate the two roles, with intervention validity monitored by a symmetric off-manifold diagnostic $\widehat{\mathrm{ST}}>1$. In factual recall, IGSD identifies an early-layer content channel in both GPT-2 small and Qwen2.5-1.5B that standard importance methods underestimate. A controlled subject and relation donor design shows that the early channel transports relation-frame content while late attention transports subject-retrieval content, refining at head granularity to the known $\mathrm{Attn}_{L9H8}$ head. Late-layer clamping confirms that the early signal is expressed through downstream transformations rather than residual pass-through. These results show that replacement and deletion are not interchangeable controls and their divergence provides a practical statistical diagnostic for content transport in transformer components.
We present a family of conformal test martingales based on shifted Legendre polynomials, which extends the Simple Jumper martingale. The Simple Legendre Jumper substitutes the linear betting function with a polynomial of arbitrary degree, thereby facilitating the detection of variance, skewness, and higher-order deviations from uniformity; the standard Simple Jumper is a specific instance of degree one. The Product Legendre Jumper integrates multiple polynomial degrees into a unified betting function, although its state space expands exponentially, a cost we refer to as the jumping tax. To address this issue, we introduce the Variational Legendre Jumper, which factorises the joint adaptation through a mean-field approximation, thereby reducing exponential scaling to linear time with minimal loss in power. Lastly, the Composite Legendre Jumper incorporates several jumping rates, ensuring a wealth floor under exchangeability and automatic adaptation to the shift's timescale. Empirical results from a real-world classification task demonstrate that the combined methods consistently surpass any single-degree martingale under distributional shift, and the composite variant is recommended as the default when the shift timescale is unknown.
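As a rough illustration of the polynomial betting idea, the sketch below builds a bet of the form $f(p) = 1 + \epsilon P_k(2p-1)$, where $P_k$ is the degree-$k$ Legendre polynomial. This parametrisation, the function names, and the fixed (non-jumping) $\epsilon$ are assumptions made for illustration and are not taken from the abstract. For degree one it gives a linear bet, matching the Simple Jumper form up to the scaling of $\epsilon$. Because $P_k$ integrates to zero on $[-1,1]$ for $k \ge 1$ and is bounded by one in absolute value, the bet integrates to one and stays nonnegative whenever $|\epsilon| \le 1$.

```python
import numpy as np
from numpy.polynomial import legendre


def shifted_legendre(k, p):
    """Evaluate the degree-k Legendre polynomial shifted to [0, 1], i.e. P_k(2p - 1)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0  # select the single degree-k basis polynomial
    return legendre.legval(2.0 * np.asarray(p, dtype=float) - 1.0, coeffs)


def betting_function(p, k, eps):
    """Hypothetical degree-k bet: f(p) = 1 + eps * P_k(2p - 1), assuming |eps| <= 1."""
    return 1.0 + eps * shifted_legendre(k, p)


def wealth_path(p_values, k, eps):
    """Running product of bets for a fixed eps (no jumping between states)."""
    return np.cumprod(betting_function(p_values, k, eps))


# Midpoint grid on [0, 1] used to check the two defining properties numerically.
grid = (np.arange(10_000) + 0.5) / 10_000
integral_deg3 = betting_function(grid, k=3, eps=0.8).mean()  # should be close to 1
minimum_deg3 = betting_function(grid, k=3, eps=0.8).min()    # should be nonnegative
```

Under uniform p-values each factor has expectation one, so `wealth_path` is a nonnegative martingale in this simplified setting; the jumping mechanism over several values of `eps` is deliberately left out.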
Decision-making under partial or adversarial observability requires accurate inference of the environment's latent state and its associated uncertainty. This work analyses adversarial attacks on linear probabilistic state-space models, commonly integrated within reinforcement learning architectures, where the attacker alters observations under likelihood constraints that ensure the perturbations remain consistent. We analyze how such adversarial yet realistic observation shifts affect the latent state and influence policy decisions. This perspective provides a principled pathway toward building more robust reinforcement learning systems, with direct relevance to safety-critical domains such as robotics, where reliable operation under sensor noise, partial failures, and adversarial conditions is essential.
This paper considers a partial linear regression model with scalar response missing at random, one finite-dimensional covariate (a vector, $X$) and one infinite-dimensional covariate (a functional variable, $\mathcal{X}$). While the effect of $X$ on the response is linear, the effect of $\mathcal{X}$ is nonparametric. Three $k$NN-based estimators are proposed for both the vector parameter and the nonparametric operator, and some first asymptotic results are obtained.
Summary: This paper presents MBRarefy, an R package that provides a reproducible workflow for alpha diversity analysis under confounding from heterogeneous library sizes. Building on the multi-bin rarefying approach in Li et al. (2024), MBRarefy supports alpha diversity association analysis with repeated rarefying, bin-wise testing, and cross-bin meta-analysis. A key new feature is automated, data-adaptive selection of library size bin thresholds via a genetic algorithm (GA), which replaces ad hoc cutpoints with an objective optimization procedure based on the rarefying-derived profiles. The package also supports routine data-management tasks, including file-based sample-wise processing and standardized output generation, enabling users to execute the full analysis pipeline from raw count files to combined inferential results. Availability and implementation: The R package MBRarefy is freely available on GitHub at this https URL.
Mutual anticipation of teammates' actions enables efficient interactions in team coordination that achieves a common goal and high performance. In team sports involving direct competition, such implicit and non-verbal interactions within short periods are required. If players begin moving only after observing their teammates, gaps may emerge, allowing opponents to interfere. When mutual anticipation functions properly, players' interactions are smooth without gaps, and their movements are expected to become synchronized. Synchronization represents a temporally stable structure in interactions and its mechanisms have been examined in previous studies. However, few studies have investigated synchronization in real-world coordination involving heterogeneous roles and interactions evolving over time, or quantitatively examined how temporally stable structures differ from a baseline. In our approach, we utilized team sports and introduced a statistical method to probabilistically examine these differences. The purpose of this study was to extract the temporal components of movement using 3-on-3 basketball. We calculated the relative phases in which players approached or moved away from their teammates during mini-games in a field experiment that investigated the effects of advice on offensive coordination. These frequency distributions were estimated using Bayesian inference and were compared before and after advice. The results showed that the probability of a synchronization trend among the offensive players after advice compared with before advice reached 70\% or higher. This may be a typical case that is related to mutual anticipation based on the team strategy established through coaching. These findings contribute to a quantitative understanding of coordination processes.
In many domains, practitioners seek models that produce accurate forecasts while faithfully capturing latent system dynamics. Existing approaches typically sacrifice one of these goals: deep state space models often assume Gaussian latent transitions, limiting fit and forecasting, while diffusion models are highly expressive but lack principled inference for the underlying dynamics. To combine the strengths of both, we introduce the Diffusion-Driven State Space Model (DDSSM), which replaces the conventional Gaussian transition distribution with a diffusion model. Our DDSSM resolves the open problem of how to jointly train an autoencoder and a diffusion model on sequential data, thereby extending the literature on latent diffusion models for time series. Moreover, we find that the DDSSM empirically outperforms a state-of-the-art deep SSM at fitting and forecasting a simulated time series with multimodal transitions.
In simulation studies evaluating asymptotic approximations it is common practice to report averages and standard deviations over repeated simulations. We argue that quantile-based summaries are more appropriate from both a theoretical and practical point of view. Theoretically, convergence of moments -- or even existence of moments -- is not guaranteed by convergence in distribution, so sample moments are not ideal for assessing the accuracy of a distributional approximation. In practice, means and variances are not good summaries of approximately-Normal distributions that may have occasional outliers. We suggest the median and median absolute deviation, and empirical confidence interval coverage, as better general summaries, and argue that moments should be reserved for simulation settings where they are of substantive interest.
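The contrast between moment-based and quantile-based summaries can be sketched in a few lines. The example below is illustrative only: the function names, the contaminated sample, and the scaling of the median absolute deviation by 1.4826 (which makes it comparable to a standard deviation under Normality) are assumptions added here, not details from the abstract.

```python
import numpy as np


def moment_summary(estimates):
    """Mean and sample standard deviation over simulation replicates."""
    x = np.asarray(estimates, dtype=float)
    return x.mean(), x.std(ddof=1)


def quantile_summary(estimates):
    """Median and scaled median absolute deviation over simulation replicates."""
    x = np.asarray(estimates, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return med, 1.4826 * mad  # 1.4826 rescales the MAD to the Normal SD scale


# Hypothetical simulation output: approximately standard Normal estimates
# with a handful of extreme replicates.
rng = np.random.default_rng(0)
draws = rng.standard_normal(1000)
draws[:5] = 1e6

mean_est, sd_est = moment_summary(draws)
median_est, mad_est = quantile_summary(draws)
```

In this contaminated sample the mean and standard deviation are dominated by the five extreme replicates, while the median and scaled MAD stay close to 0 and 1.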
Risk assessment instruments, also known as "risk scores," are widely used in high-stakes decision-making settings such as medicine and the criminal justice system. A risk score predicts the likelihood of an undesired outcome if no intervention is made. Thus, a sufficiently high score is often interpreted as a recommendation to intervene. However, risk scores fail to account for what would happen if a decision-maker does intervene. This failure is problematic because effective decision making requires consideration of both or multiple potential outcomes. We propose "triage scores," which are based on additive counterfactual utilities and include risk scores as a special case. Unlike risk scores, triage scores can incorporate counterfactual outcomes under alternative decisions, enabling decision makers to incorporate a wide range of ethical and practical factors. We illustrate the use of triage scores with an application to our own randomized controlled trial evaluating a pretrial risk score. Our analysis demonstrates that triage scores are able to capture rich utility structures and yield substantively distinct results regarding policy evaluation and learning.
Bayesian model averaging in support-indexed regression induces a posterior distribution over active predictor supports. Under predictor redundancy, posterior mass can spread across many nearly interchangeable supports, making exact-support summaries unstable or hard to interpret even when prediction is stable. We study how to report an already fitted Bayesian model averaging posterior without changing the Bayesian target. A report uses hard or soft regions of support space, and its compressed reporting law is compared with the reference posterior through an explicit density ratio. This ratio gives computable total-variation and Kullback--Leibler distortion, bounds for bounded predictive summaries, retained-mass diagnostics, and fallback-weight diagnostics. The framework covers fixed hard regions, metric-ball regions, posterior-cluster regions, and pooled-pruned region dictionaries. We prove exact error formulas and validation bounds for these region reports, and give conditions under which a few regions can replace a long list of individual supports. In simulations, our region reports often give shorter and clearer summaries while preserving the main posterior information, and the density-ratio diagnostics show when too much information has been lost.
We study whether deterministic AI weather prediction (AIWP) models issue more informative forecasts for extreme weather events than deterministic numerical weather prediction (NWP) models. The deterministic model output is subjected to statistical post-processing via isotonic distributional regression (IDR), or EasyUQ, before the resulting probabilistic forecasts are assessed using weighted versions of the continuous ranked probability score (CRPS). This extends the Potential CRPS (PCRPS) measure proposed by Gneiting et al. (2026) to focus on extreme outcomes. Since IDR exhibits optimality properties with respect to weighted versions of the CRPS, the proposed approach inherits desirable properties of the PCRPS, and, in particular, facilitates fair comparisons between data-driven and physics-based models when forecasting extreme weather events. We apply this evaluation framework to forecasts in the WeatherBench 2 dataset issued by the AIWP models GraphCast, Pangu-Weather, and FuXi, with the ECMWF's high-resolution NWP model serving as a physics-based reference. The forecast models are compared when predicting mean sea level pressure, temperature, wind speed, and precipitation extremes, defined as exceedances or non-exceedances of thresholds obtained from historical observation data. We additionally study forecast performance when predicting record-breaking events, though the ordering of the different methods is largely insensitive to the thresholds on which emphasis is placed. We find that AIWP models, particularly FuXi, result in the most informative forecasts for extreme weather events across most settings, suggesting that AIWP models have the potential to outperform NWP models when forecasting extremes.
There is a precise sense in which drawing causal inferences from observational data is hard, even when identifiability is assumed. In particular, Robins and Ritov (1997) and Robins et al. (2003) showed that causal effects can be discontinuous as a function of the data distribution: two arbitrarily close data distributions might correspond to different causal effects. This is a fact independent of the choice of estimator; however, not all estimators are equally unstable. Our contribution is to surface a second layer of instability that depends on the choice of estimator. We show that many standard point estimates can be read as point summaries of multimodal distributions over the space of structural causal models. As such, estimators can jump discontinuously in the data distribution. This defines a taxonomy of estimators that admits a decision-theoretic reading: stability depends on whether the implicit loss function an estimator optimizes is aligned with the causal effect itself. Specifically, inverse propensity weighted estimators and regression estimators are examples of discontinuous summaries, while explicit posterior means and medians are shown to be continuous.
We introduce a semi-parametric framework for nonlinear system identification, which decouples discrepancy functions from physics-based components. Orthogonal Gaussian process regression balances sparse parameter selection (the white box) with discrepancy learning (the black box) to produce interpretable models from incomplete physics.
Prediction-powered inference integrates a small gold-standard dataset with large pseudo-labeled data, whose labels are generated by machine learning methods, to enhance statistical inference. In modern applications, multiple data sources and diverse machine learning methods often give rise to multiple pseudo-labeled datasets, each encoding potentially different aspects of the underlying information. However, how to optimally combine multiple data sources and machine learning methods for statistical inference remains unclear. To address this problem, we propose a multi-source prediction-powered inference (MPPI) method that aggregates multiple pseudo-labeled datasets, where the aggregation weights are estimated by minimizing the asymptotic volume of the resulting confidence region. We study both homogeneous settings, where the source and target distributions coincide, and heterogeneous settings, where distributional discrepancies arise between source and target distributions, including covariate shift and domain shift. Theoretically, we establish the asymptotic normality of the proposed estimator and show that the resulting confidence-region volume is asymptotically equivalent to the oracle optimal volume within the proposed weighting class. We further characterize when our method yields smaller confidence regions compared with both classical target-only inference and single-source prediction-powered inference. Simulation studies and a real-data application on the prevalence of high body fat measured by dual-energy X-ray absorptiometry show that MPPI can reduce confidence-region volume while maintaining inferential validity in the settings considered.
In the era of big data, subsampling has become a common practice in statistical learning. By selecting a subgroup of individuals on which the learner is trained, subsampling aims at reducing the computational cost and time of the estimation step, and ideally leads to a decrease in its energy consumption and carbon footprint. This work focuses on a nonparametric setting, in which the hypothesis set lies in a reproducing kernel Hilbert space, and the estimator is a minimizer of an empirical risk reweighted à la Horvitz-Thompson. By studying the asymptotic properties of this estimator, we reveal an optimal subsampling scheme (regarding the trace of the covariance operator) and show that it can be used via plug-in. A numerical study on synthetic and real-world datasets shows the practicability and the benefit of the proposed approach.
Snowball sampling is a widely used design for collecting network data from large or hard-to-reach populations, yet naive inference that ignores the sampling mechanism produces systematically biased parameter estimates. We derive the exact likelihood of a multi-wave snowball sample for the class of continuous latent space (CLS) models, in which edges form independently conditional on latent vertex-level quantities, and show that conditional edge independence reduces the marginalization over unobserved network configurations to a closed-form expression portable across the entire CLS class. We develop a stochastic Expectation-Maximization algorithm for the Euclidean latent distance model as a concrete implementation, and apply the framework to the large-scale co-inventor network of German semiconductor patent applicants by drawing multiple snowball samples. We find that the naive procedure severely underestimates latent space variance, produces networks with nearly twice the observed edge count, and achieves a spectral goodness-of-fit nine times worse than the corrected model, which directly affects the quantitative interpretation of covariate effects.
The growing popularity of vine copulas in multivariate statistical analysis is largely driven by their ability to capture complex dependence structures. However, this flexibility comes at a cost, as the number of possible vine models grows rapidly and becomes intractable even in moderately low-dimensional settings. These limitations affect the practical applicability of current Bayesian inference and model selection approaches, effectively restricting them to problems of relatively small dimension because of their high computational cost. This paper addresses the still open challenge of efficient model selection and estimation in Bayesian vine methodology. We propose a novel framework for Bayesian vine copula model selection that combines loss-based model priors with the shotgun stochastic search strategy. The strength of the proposed approach is twofold: it promotes sparsity and enables fast and effective structure selection. Furthermore, our comprehensive framework jointly identifies the vine structure, selects the copula families, and estimates the model parameters. The power of the proposed approach is demonstrated via simulation studies and an application to a real dataset of EFT portfolio asset returns.
We introduce the zero-one censored transformed normal (ZOC-TN) model for proportional responses with potential probability mass at the boundaries 0 and 1. The model combines a censored Gaussian variable with a two-parameter affine-logit transformation on the interior (0,1). We characterize the transformation parameters, establish large-sample properties, and relate the affine-logit specification to broader classes of interior distributions. Theoretical and experimental results demonstrate that the proposed model can capture a wider range of qualitative density shapes than several benchmark models while remaining parsimonious, computationally efficient, and numerically stable. Furthermore, the ZOC-TN model can be extended (i) to account for nonlinearities and interactions in a tree-boosting machine learning framework and (ii) to explicitly model residual spatio-temporal variability. We apply the ZOC-TN model to loss given default (LGD) modeling for a large dataset of U.S. residential mortgages and compare it to multiple benchmark models. We find that a tree-boosted ZOC-TN model with a spatio-temporal frailty Gaussian process delivers the strongest out-of-sample performance, indicating that mortgage losses are shaped by nonlinear covariate effects and by unaccounted-for space-time variation.
Differential Item Functioning (DIF) analysis is used to identify potentially biased items in multi-item measurements. In addition to testing the statistical significance, it is essential to evaluate the practical significance of DIF through effect-size measures. We review existing DIF effect-size measures and cut-off values used to classify the effect-size magnitudes for the Mantel-Haenszel test, SIBTEST, and model-based methods for binary items, and introduce a refinement of area-based effect-size measures. A simulation study is conducted to investigate the properties of these effect-size measures and existing classification guidelines, and to assess their comparative performance. The results indicate that some commonly used effect-size measures exhibit undesirable properties, including inconsistent classifications, systematic underestimation of the magnitude of the underlying DIF, and strong dependence on design factors. To address these issues, we introduce usage restrictions for some effect-size measures, revise cut-off values that unify results across different methods, and propose new cut-off values for area-based effect-size measures. The methods are demonstrated using two real data examples. Implementation is provided in the R software.
Maximum entropy, Bayesian updating, and exponential-family estimation are all instances of a common inference principle: selecting the measure or distribution that minimizes a divergence subject to the available constraints. Which divergence to use is usually decided by analytic convenience, by empirical performance, or by a set of axioms chosen to single it out, leaving open a basic question: why one divergence and not another? We answer it from a single requirement: an inference method should return the same answer whenever the same problem is presented in an equivalent form, for instance, after simply renaming its parts. This requirement alone forces inference to be the minimisation of a classical divergence, and each further reformulation it must respect tightens the admissible family one notch, narrowing the broad $f$-divergences to the $\alpha$-divergences and finally to the single Kullback-Leibler (KL) divergence. Mathematically, inference is recast from minimising a numerical functional to selecting a least element under a preorder on positive measures, a divergence being merely one numerical scale that reproduces that preorder. The reformulations are the morphisms of a category of inference problems, and the invariance requirement says the inference operator is a covariant functor into the category of statistical models of Cencov, mirroring his characterisation of the Fisher metric. The representation is proved on finite spaces and lifted to general measurable spaces by an elementary closure, covering discrete and continuous spaces alike. Earlier axiomatisations, such as those of Shore-Johnson and Csiszar, postulate their consistency axioms directly and only on finite alphabets; here the axioms follow from reformulation invariance alone.
Recently [LT; Theorem 3.1] showed an extension of Walker's inequality [W] based on $N=3$ random variables. In this note we prove that this extension is just a particular three-dimensional instance of a general family of second order mixed moment inequalities based on Gram matrices of arbitrary random vectors. We also discuss some implications of these inequalities for the Cramer-Rao lower bound for biased estimators.
We consider the parameter estimation problem in logistic regression with Gaussian design: the estimation of a fixed unknown parameter $\theta^*\in \mathbb{R}^d$ ($\|\theta^*\|_2\ge 1$) from $n$ i.i.d. samples $\{(x_i,y_i)\}_{i=1}^n$, where $x_i\sim N(0,I_d)$ and $y_i|x_i \sim {\rm Bernoulli}(1/(1+\exp(-x_i^\top \theta^*)))$. Our main aim is to characterize the finite-sample estimation performance and convergence behavior of gradient descent (GD) on the maximum likelihood objective (i.e., the logistic loss). Under small $O(1)$ stepsize and $0$ initialization, we show that GD linearly converges to a small neighborhood of $\theta^*$ achieving an $\ell_2$ error of order $O(\sqrt{\|\theta^*\|_2^5d/n})$. This substantially goes beyond existing theoretical results that lack a non-asymptotic estimation error rate and exhibit much slower parameter convergence. We also establish a faster local linear convergence to the same statistical error under a large $\Theta(\|\theta^*\|_2)$ stepsize. The main technical component is to show that the gradient of the logistic loss satisfies a certain approximate invertibility condition (AIC). To that end, we uniformly control the deviation of the gradient from its population counterpart by covering and peeling arguments, and then show that the population GD is a contraction by a delicate analysis based on the eigenvalues of population Hessian matrices. Finally, we build upon the recent work of Matsumoto and Mazumdar (2025) and devise a novel efficient estimator that attains a sharper rate in high dimensions. This indicates that the existing non-asymptotic guarantees exhibit sub-optimal dependence on $\|\theta^*\|_2$, and that in many regimes $\Theta(\sqrt{\|\theta^*\|_2d/n})$ is the tight estimation error rate. Numerical examples are provided to corroborate our theoretical results.
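The setting described above, plain gradient descent on the logistic loss with Gaussian design and zero initialization, can be simulated with a minimal sketch. The stepsize of 1, the iteration count, the sample size, and the particular $\theta^*$ below are illustrative assumptions and are not values from the abstract.

```python
import numpy as np


def logistic_gd(X, y, stepsize=1.0, n_iters=500):
    """Gradient descent on the average logistic loss, started from theta = 0."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        probs = 1.0 / (1.0 + np.exp(-X @ theta))  # fitted P(y = 1 | x)
        grad = X.T @ (probs - y) / n              # gradient of the average loss
        theta -= stepsize * grad
    return theta


# Hypothetical problem instance with ||theta_star||_2 >= 1.
rng = np.random.default_rng(0)
n, d = 20_000, 5
theta_star = np.array([1.0, -0.5, 0.5, 0.0, 1.0])
X = rng.standard_normal((n, d))                   # x_i ~ N(0, I_d)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))

theta_hat = logistic_gd(X, y)
l2_error = np.linalg.norm(theta_hat - theta_star)
```

Rerunning the sketch over a range of `n` or of $\|\theta^*\|_2$ is one way to inspect empirically how `l2_error` scales, although such a simulation does not establish the stated rates.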
The estimation of sums of functions of observable and unobservable variables is a long-standing problem in statistics with applications across many domains. Empirical Bayes methods provide a natural framework for this task under mixture models, but existing approaches often rely on restrictive parametric assumptions or apply only to limited classes of functionals in nonparametric settings. We propose a nonparametric methodology, referred to as quasi-Bayes empirical Bayes, that addresses these limitations through a recursive estimation of the mixing distribution based on Newton's algorithm. The resulting plug-in estimate of the target sum is computationally efficient, scalable, and applicable to a broad class of utility functions, while enabling uncertainty quantification via asymptotic credible intervals derived from a Gaussian central limit theorem. We establish large sample asymptotic theoretical guarantees by proving a merging between the quasi-Bayes and Bayes estimates and by showing consistency under a correctly specified frequentist model. Synthetic-data and real-data analyses demonstrate the practical accuracy and stability of the method, with performance comparable to, and in some cases better than, existing empirical Bayes procedures.
Individualized treatment rules (ITRs) map an individual patient's characteristics to their recommended treatment value. Typically, the optimal ITR is defined as the rule which maximizes a mean counterfactual outcome; the resulting ITR maximizes the effect of treatment along all causal pathways to the outcome, including indirect pathways through mediating variables. Although maximizing the total effect is often sufficient, explicitly incorporating causal mediation in an ITR analysis has several potential benefits, such as enhanced interpretability and additional flexibility in targeting specific causal pathways. For this purpose, we introduce novel Bayesian semiparametric and nonparametric estimators for conditional mediation effects in the presence of multiple mediators and show how they can be used to estimate optimal ITRs. We demonstrate the proposed methodology via an application to optimal kidney allocation with hepatitis C positive donors.
A single geometric operation -- projecting a reference onto a constrained family under a Bregman divergence -- underlies a striking range of statistical methods. This tutorial develops the operation first as pure convex geometry, with no statistics attached. A strictly convex generator $G$ and its conjugate $F$ furnish two coordinate systems, a projection theorem with existence and uniqueness, and a Pythagorean theorem; the Pythagorean theorem itself produces two dual projections -- the information (e-) projection onto moment-constrained families and the moment (m-) projection onto exponential families -- exchanged by the conjugacy $G\leftrightarrow F$, so a single theorem governs both. Part~II reads off the statistics. The generalized linear model is treated in detail as the concrete carrier of the two projections: under the canonical link, the score equation is exactly the Pythagorean orthogonality, and the fit is simultaneously an e-projection in the natural coordinate and an m-projection in the mean coordinate. Maximum entropy, survey calibration, over-identified moment models, the EM algorithm, variational inference, autoencoders, and expectation propagation then fall into place as instances of the same construction -- exactly where the underlying families are flat, and as controlled approximations or neighboring-divergence analogies where they are not. The mathematics of Part~I is self-contained; the statistical sections presume only familiarity with the methods being unified.
Proxy metrics are widely used to improve the precision and velocity of online experimentation (aka A/B testing). Although proxies are often motivated by long-term outcomes that the experimenter does not observe, in many settings they are used alongside a contemporaneous but statistically insensitive north star. This can lead to a practical dilemma: when should experimenters trust the proxy metric, and when should they trust the north star? In this paper, I propose an optimal blending approach that smoothly guides decision-making towards the north star as the power of the experiment increases and away from the north star as the quality of the proxy metric improves. I study the implications of this decision-making framework for the design of experiments and of experimentation programs. Equipped with better (worse) proxy metrics, experimenters should run smaller and more (larger and fewer) experiments. I show how to leverage past experiments to estimate optimal blending weights and experiment sizes. Lastly, I describe the real-world application of the methodology to an experimentation program at Netflix.
We develop a novel distributional Difference-in-Differences (DiD) framework to capture treatment heterogeneity across outcome distributions. By leveraging optimal transport, we use the control group to estimate the untreated distributional drift from the pre- to post-treatment period and apply it to the treated group's pre-treatment baseline, constructing a counterfactual distribution under the assumption of no treatment effect. We frame the null hypothesis as a distributional equality between the transported counterfactual distribution and the observed treated post-treatment distribution, and test it using a maximum mean discrepancy statistic in a reproducing kernel Hilbert space (RKHS). The resulting nonparametric omnibus test is sensitive to changes in location, scale, shape, and tail behavior. Under the null, we derive the asymptotic Gaussian quadratic-form limit of the test statistic, while under local alternatives, we provide a unified characterization of power that establishes its Pitman local power and moderate-deviation consistency. Our theory reveals how detectability is shaped by the interaction between transport-induced drift and RKHS geometry. Simulations and an application to the Card--Krueger minimum-wage data demonstrate that the proposed method identifies key distributional treatment effects missed by classical mean-based DiD.
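The maximum mean discrepancy step can be sketched in isolation. The code below computes the biased (V-statistic) estimate of squared MMD with a Gaussian kernel between two samples. The kernel choice, the bandwidth, and the simulated samples standing in for the transported counterfactual and the observed treated outcomes are assumptions for illustration. The optimal transport step and the calibration of the test against its null distribution are not shown.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


# Hypothetical samples: a stand-in counterfactual sample, one observed sample
# from the same distribution, and one with a location shift.
rng = np.random.default_rng(0)
counterfactual = rng.standard_normal(500)
observed_null = rng.standard_normal(500)
observed_shifted = rng.standard_normal(500) + 1.0

mmd_null = mmd_squared(counterfactual, observed_null)
mmd_shifted = mmd_squared(counterfactual, observed_shifted)
```

The statistic stays near zero when the two samples share a distribution and grows under the location shift; changes in scale or shape would be picked up in the same way, to a degree that depends on the kernel and bandwidth.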
Modern data analysis usually gives a prediction without showing whether the evidence behind it is clear, conflicting, or stable. Two cases can have the same fitted confidence even when one has mostly agreeing evidence and the other has strong support and strong opposition. We propose Signed Evidence Flow (SEF), which combines a fitted prediction rule with signed feature attributions to measure support, opposition, conflict, and perturbation stability. We prove that confidence determines conflict exactly when it also determines total evidence mass, derive the remaining conditional variance, and state when conflict can improve loss prediction beyond confidence and other audit variables. We also connect conflict to geometric decision fragility. Across healthcare, Covertype, black-box, finance, and ten external data sets, conflict sometimes separates risk among predictions that already appear confident. Cross-fitted tests show added error-ranking information beyond confidence and attribution entropy on several data sets, including two large finance tasks. The direction is not universal: in some tasks, low-conflict cases are riskier. We therefore introduce ScopeGate, a held-out permutation diagnostic that checks the direction before SEF is used for review triage. SEF is consequently an audit tool rather than a universal risk score: it describes evidence structure, while an independent calibration sample determines whether that structure is useful in the target population.
Latent signals are often obscured by measurement noise, yet encode the underlying laws and dynamics of complex systems; learning both the signals and their distributions remains a central challenge in scientific inference. The noise is often non-negligible, and the likelihoods for expressive generative models are often intractable. We utilize a convolutional maximum mean discrepancy (convMMD) loss and propose a likelihood-free framework for nonparametric density deconvolution and empirical Bayes denoising under additive measurement error. Our method learns a latent generative model by matching the observed data distribution to the noise-convolved model distribution. This yields a differentiable, simulation-based objective for multivariate homoscedastic or heteroscedastic noise, compatible with expressive sieve classes such as Gaussian mixtures and normalizing flows. The learned density then serves as an empirical prior for posterior denoising of individual latent values. Theoretically, we extend convMMD from parametric to nonparametric estimation, proving finite-sample bounds for empirical sieve minimizers and $L_2$ convergence rates under Sobolev smoothness. These rates recover the classical inverse-problem dependence: polynomial for ordinary-smooth and logarithmic for super-smooth noises. Our method provides a practical, theoretically grounded approach to deconvolution and denoising under generative latent distribution models.
Optimal transport methods have recently attracted considerable attention in statistics. Their appeal lies in providing a geometric framework for comparing probability measures, leading to new perspectives on classical problems. A central problem in statistics is the construction of valid confidence sets, which are fundamental inferential tools in practice. It is well known that for complex problems or relatively small samples, asymptotic approximations often perform poorly. This suggests applying optimal transport methods when constructing confidence sets for hard problems, in order to improve their coverage properties. We introduce such a procedure and derive a theoretical framework that studies consistency and error bounds for the coverage probability of the resulting intervals. To guarantee feasibility in practice, we propose data-driven choices for the hyperparameters. This approach extends classical quantile-based confidence intervals by leveraging optimal couplings to minimize coverage deviations. Simulations demonstrate strong performance in different estimation problems, outperforming standard methods in accuracy and robustness.
Estimating the average causal effect (ACE) using observational data is a key focus in causal inference for which missing data present an important challenge. Multiple imputation (MI) is a widely used method for handling missing data and can yield unbiased estimates when the imputation is compatible with the substantive analysis. One of the advantages of MI is its scope to include so-called "auxiliary variables", defined as variables associated with incomplete variables that are excluded from the substantive analysis. Although many studies have looked at the use of auxiliary variables in MI for improving precision, the study of auxiliary variables that are necessary for the identifiability (or "recoverability") of the ACE in the presence of missing data has been scant. In this work, we investigate the use of auxiliary variables, both mediators and non-mediators, across a range of typical univariable and multivariable missingness mechanisms depicted by missingness directed acyclic graphs (m-DAGs). For each setting, we derive recoverability results, then evaluate MI-based and complete-case methods for estimating the ACE using correctly specified g-computation, considering different strategies for incorporating auxiliary variables and varying degrees of compatibility for MI models. Based on findings from the simulation studies, we provide practical guidance, highlighting that distinguishing appropriately between mediator and non-mediator auxiliary variables is important to avoid bias as is the use of compatible and flexible (non-parametric) MI methods that incorporate these variables.
We compare Chatterjee's rank correlation $\xi$ with Kendall's $\tau$ and Spearman's $\rho$ under positive-dependence assumptions on bivariate copulas. Our main technical contribution is a sharp order-violation bound for two stochastically ordered distribution functions. This local inequality controls each conditional order-violation probability appearing in Kendall's tau by the cross-rank variance functionals that determine Chatterjee's rank correlation. As a consequence, we prove the sharp Kendall bound $\xi(C)\leq \tau(C)$ for every stochastically increasing copula $C$. The bound is best possible: ordinal sums of product copulas attain equality. We also prove that the weaker left-tail decreasing (LTD) and right-tail increasing (RTI) conditions jointly imply the Spearman bound $\xi(C)\leq \rho(C)$, with equality if and only if $C$ is either the independence or comonotonicity copula. Finally, checkerboard examples show that LTD or RTI alone does not imply $\xi(C)\leq\rho(C)$, that LTD and RTI together do not imply $\xi(C)\leq\tau(C)$, and that both bounds are directional for $\xi$.
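The ordering of the three coefficients can be illustrated numerically. The sketch below computes sample versions of $\xi$, $\tau$, and $\rho$ with NumPy on simulated bivariate normal data with positive correlation, a standard example of a stochastically increasing dependence structure. The sample size, the correlation 0.7, and the function names are assumptions chosen for illustration. The no-ties formula for Chatterjee's coefficient is used, which is appropriate only for continuous data. The sample values are estimates, whereas the inequalities in the abstract are population statements.

```python
import numpy as np

def ranks(v):
    """Ranks 1..n, assuming no ties (continuous data)."""
    r = np.empty(len(v), dtype=float)
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation, no-ties formula."""
    n = len(x)
    r = ranks(y)[np.argsort(x)]  # ranks of y, ordered by increasing x
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

def kendall_tau(x, y):
    """Kendall's tau via all ordered pairs (O(n^2) memory, fine for small n)."""
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    return (sx * sy).sum() / (n * (n - 1))

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

rng = np.random.default_rng(0)
n, corr = 2000, 0.7
x = rng.standard_normal(n)
y = corr * x + np.sqrt(1.0 - corr**2) * rng.standard_normal(n)

xi, tau, rho = chatterjee_xi(x, y), kendall_tau(x, y), spearman_rho(x, y)
print(f"xi = {xi:.3f}, tau = {tau:.3f}, rho = {rho:.3f}")
```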
We propose a graph generative model for sequences of extremely sparse, edge-exchangeable networks. Models for sparse graphs often face a trade-off between desirable properties like exchangeability and the ability to capture the sparsity observed in real-world networks. While models based on vertex or edge exchangeability have successfully generated sparse graphs, achieving the "extremely sparse" regime, where the number of edges scales near-linearly with the number of nodes, has remained a challenge. Recently, a novel Completely Random Measure (CRM) was introduced, demonstrating that this rate could be achieved within the vertex-exchangeable framework of Caron and Fox. This paper extends this work by demonstrating that this new CRM can be integrated into the alternative edge-exchangeable framework to achieve extreme sparsity.
In directed networks, nodes may form groups with similar interaction patterns, while these groups may themselves follow an ordered structure. Existing methods typically treat these features separately, either clustering nodes without enforcing a coherent block order, or ranking individual nodes without allowing for structurally equivalent groups. We introduce the Transitive Stochastic Block Model (TSBM), a Bayesian model for directed weighted networks that uses transitivity-inducing priors to infer ordered blocks. The model separates the total volume of interaction between two nodes from the direction of interaction conditional on interaction occurring, so that hierarchy is imposed on directional imbalance rather than interaction frequency. We consider two order-restricted specifications: a flexible weak-stochastic-transitivity version, which excludes cyclic dominance patterns while allowing heterogeneous block-pair strengths, and a Toeplitz strong-stochastic-transitivity version, in which directional advantage increases with rank separation. Posterior inference is performed through a Gibbs sampler using Pólya-Gamma data augmentation. Since ordered block labels are not exchangeable, we introduce an age-ordered partition prior to infer the number of blocks jointly with node allocation. Simulation studies show that order-constrained priors improve prediction and partition recovery, especially in sparse networks. Across six empirical directed networks, the TSBM improves predictive performance in four cases and yields partitions with clearer ordered structure. The results also identify cases, such as nearly deterministic dominance networks or non-transitive citation networks, where imposing ordered blocks can harm prediction. The TSBM therefore provides a probabilistic framework for estimating ordered groups and assessing when a transitive block structure is supported by the data.
Causal mediation analysis provides a fundamental framework for quantifying the contributions of different pathways from a treatment $X$ to an outcome $Y$ through a mediator. The natural direct and indirect effects (NDE and NIE) are widely used to decompose the total effect. In this paper, we observe that NDE and NIE can give rise to paradoxical interpretations due to their failure to satisfy two desirable properties of interpretable causal effects: skew-symmetry and additivity. To address these limitations, we introduce new measures of direct and indirect effects for continuous treatments, termed the cumulative natural direct and indirect effects (CNDE and CNIE), constructed by decomposing local causal effects $\mathbb{E}[\partial_xY_{x}]$ into local direct and indirect effects. CNDE and CNIE yield a decomposition of the total effect that preserves both skew-symmetry and additivity. We further extend this framework to ordinal treatments by defining discrete analogues of the cumulative effects over ordered treatment levels that preserve these structural properties. We establish decomposition and identification results for the proposed measures under standard causal assumptions. We illustrate their behavior, in comparison with NDE and NIE, using canonical linear mediation models with interaction and a real-world dataset.
Diffusion models are typically sampled independently, even when the downstream objective is to obtain a diverse set of candidates. We introduce a variance-weighted batch distribution that favours collections of samples with large empirical spread after a prescribed linear feature map. The target is specified explicitly, and the sampler is derived as the corresponding Doob $h$-transform of independent diffusion dynamics. The resulting correction has a compact form: an interaction term that repels posterior denoised means, together with a curvature term that moves particles to the region of higher feature variance. This yields an interacting-particle sampler with a transparent probabilistic target rather than a heuristic repulsive drift.
This paper investigates convergence properties of regularized Nyström subsampling applied to the unsupervised domain adaptation problem under covariate shift. We focus on the low-smoothness (misspecified) case where the target function lies outside the reproducing kernel Hilbert space. By combining Tikhonov regularization with Nyström projection onto a subsampled subspace, we obtain upper bounds on the excess risk that hold with high probability and are expressed in terms of the source condition, the effective dimension, and the sample sizes. We further extend the analysis to the setting where the Radon-Nikodym derivative between the target and source marginal distributions is unknown and must be approximated, and we identify the minimal additional sample sizes required to maintain the same convergence rate as in the oracle case.
Conformal selection aims to identify test candidates whose unknown responses fall in a target region while controlling the false discovery rate. Existing methods often inherit prediction-oriented nonconformity scores, such as residual or clipped residual scores, from conformal prediction. We argue that the natural score for selection is instead the target-membership probability. This score directly addresses the binary event being selected, and any monotone transform of it gives the Neyman--Pearson oracle ranking at a fixed null selection level. This distinction is irrelevant for mean-monotone targets, where conventional scores induce essentially the same ranking, but becomes important for interval-valued, variance-driven, multimodal, or multi-condition targets, where prediction-oriented scores can be misaligned with selection power. We study membership-score-based conformal selection and isolate one conformal calibration route, Null-Calibrated Conformal Selection (NCCS), which ranks test scores against confirmed non-target calibration examples. Under null exchangeability, NCCS yields finite-sample valid null p-values, which can be combined with BY under arbitrary dependence or with BH under standard positive-dependence conditions. Experiments support the score principle: membership scores match conventional scores on mean-monotone targets, substantially improve over mean-score selection on variance-driven targets, and, when calibrated by NCCS, trade power for finite-sample null validity in rare-target regimes where direct empirical-FDP thresholding can be anti-conservative.
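One plausible reading of the calibration step described above can be sketched as follows: each test membership score is ranked against the scores of confirmed non-target (null) calibration examples to form a conformal p-value, and the p-values are passed to the Benjamini-Hochberg (BH) step-up rule. The p-value formula, the convention that larger scores are more target-like, the simulated Beta-distributed scores, and the function names are assumptions made for illustration. This is not claimed to be the exact NCCS procedure.

```python
import numpy as np

def null_conformal_pvalues(null_calib_scores, test_scores):
    """p_j = (1 + #{i : s_i^null >= s_j}) / (n0 + 1).

    Larger scores are assumed to indicate target membership, so a test
    point with a large score receives a small p-value.
    """
    null_calib_scores = np.asarray(null_calib_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    counts = (null_calib_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1.0 + counts) / (len(null_calib_scores) + 1.0)

def benjamini_hochberg(pvalues, alpha=0.1):
    """Boolean mask of rejections under the BH step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest index meeting its threshold
        selected[order[: k + 1]] = True
    return selected

# Simulated membership scores (assumption): nulls concentrate near 0,
# targets near 1.
rng = np.random.default_rng(0)
null_calib = rng.beta(2.0, 5.0, size=500)
test_null = rng.beta(2.0, 5.0, size=180)
test_target = rng.beta(8.0, 1.5, size=20)
test_scores = np.concatenate([test_null, test_target])

pvals = null_conformal_pvalues(null_calib, test_scores)
selected = benjamini_hochberg(pvals, alpha=0.1)
print("number selected:", int(selected.sum()))
```

Replacing `benjamini_hochberg` with a Benjamini-Yekutieli correction would correspond to the arbitrary-dependence option mentioned in the abstract.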
Principled regression for stochastic processes is a long-standing challenge with deep connections to scientific inverse problems. We introduce Flow Annealing Posterior Sampling (FAPS), to our knowledge the first function-space posterior sampling framework that unifies stochastic-process regression and PDE inverse problems. Built on pretrained function-space flow-matching priors, FAPS enables likelihood-guided posterior inference from sparse and noisy observations, supports variable query discretizations, and avoids explicit prior-density evaluation. Its Langevin correction uses a low-rank covariance preconditioner to exploit dominant function-space correlations across discretizations. Across Gaussian and non-Gaussian stochastic-process regression benchmarks and diverse PDE inverse problems, FAPS produces coherent posterior samples with accurate uncertainty quantification, significantly outperforming existing functional regression baselines and achieving competitive or better PDE noisy inverse performance than diffusion-based posterior samplers while reducing test-time sampling cost.
With data growing in scale and complexity, traditional linear dimension reduction techniques are becoming inadequate in some settings. Manifold fitting offers an important alternative by capturing low-dimensional latent geometric structures within high-dimensional spaces. This capability allows it to support downstream analysis in complex data settings. In this review, we explore the development and applications of manifold fitting. First, we introduce the basic concepts of manifold fitting and distinguish it from related techniques such as manifold embedding and denoising. We review the development of manifold fitting with three distinct stages: early nonparametric statistical methods, insights from mathematical analysis, and contemporary practical statistical approaches. Furthermore, we present diverse applications of manifold fitting, particularly in neural networks and bioinformatics, which illustrate its utility in complex data scenarios. Despite considerable progress, manifold fitting remains a fertile area for research. Many theoretical and practical questions remain unanswered, and ongoing investigations will further clarify its role in modern data science as a geometric tool for a wide range of data analysis challenges.
Structure-agnostic (SA) models introduced by Balakrishnan et al. (2026) aim to reflect the general lack of knowledge of structural assumptions on data-generating laws such as smoothness or sparsity in practice. Roughly speaking, SA models restrict the observed-data generating law to be in some $r_n$-neighborhood of (black-box machine learning) estimates, treated as given and fixed, where $r_n$ encodes the convergence rates of the estimates to the truth. Under SA models, Balakrishnan et al. (2026) show that the popular Double Machine Learning (DML) estimators for three functionals, the quadratic functional in the Gaussian sequence model, the quadratic density integral functional and the expected conditional covariance, are minimax. However, minimax estimators may be inadmissible. In this paper, we show that, for the first two of the three functionals, the DML estimator is asymptotically inadmissible under the SA model. In particular, we show that these two functionals fall into a class of functionals, which we refer to as the monotone bias class. For this class, we exhibit second-order (U-statistic) estimators, which asymptotically dominate DML estimators, under the SA model. These second-order estimators are empirical higher-order influence function (HOIF) estimators introduced in Liu et al. (2017). Furthermore, the empirical HOIF estimator, like the DML estimator, is minimax for the third functional (the expected conditional covariance), although neither asymptotically dominates the other.
Penalization is a widely used approach to model selection with roots in information theory and Bayesian inference. We study a model selection problem involving non-nested candidate models for which penalization is counterproductive. We propose a Maximum Likelihood Criterion for this non-nested setting that selects the candidate model with the highest maximum likelihood. This criterion does not take into consideration the number of parameters of a candidate model. It is well-suited for situations where all candidate models are regarded as equal with no preference for models having fewer parameters. We establish the consistency of this criterion and compare its performance with that of existing penalization-based criteria.
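The selection rule can be sketched for two non-nested candidate families. The exponential and lognormal families are chosen here only because their maximum likelihood estimates have closed forms; the candidate set, the simulated data, and the function names are assumptions, not taken from the paper. The rule picks the family with the larger maximized log-likelihood and applies no penalty for the number of parameters.

```python
import numpy as np

def loglik_exponential(x):
    """Maximized exponential log-likelihood (MLE rate = 1 / mean)."""
    n = len(x)
    return -n * np.log(x.mean()) - n

def loglik_lognormal(x):
    """Maximized lognormal log-likelihood (MLE mean and variance of log x)."""
    n = len(x)
    logx = np.log(x)
    sigma2 = logx.var()  # MLE uses ddof=0
    return -logx.sum() - 0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * n

def select_by_max_likelihood(x):
    """Return the candidate with the highest maximized log-likelihood."""
    x = np.asarray(x, dtype=float)
    scores = {
        "exponential": loglik_exponential(x),  # 1 parameter
        "lognormal": loglik_lognormal(x),      # 2 parameters, no penalty applied
    }
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
for label, sample in [
    ("lognormal data", rng.lognormal(mean=0.0, sigma=0.25, size=2000)),
    ("exponential data", rng.exponential(scale=1.0, size=2000)),
]:
    winner, scores = select_by_max_likelihood(sample)
    print(label, "->", winner, {k: round(v, 1) for k, v in scores.items()})
```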
In studies with survival endpoints, the sample size is crucial. Sample size calculation for survival endpoints was first proposed by Freedman, based on the logrank statistic. Later work incorporated additional factors such as dropout rate and delayed onset, proposed various estimation methods including the lrstat method, and produced corresponding statistical programs. However, these specialized statistical functions are relatively complex to apply. To address this, we develop a visual web application built on the R packages lrstat and powerSurvEpi, and test its functions and implementation. The test results are essentially consistent with the reference case, supporting the validity of the application.
Sparse signal discovery is a fundamental problem in large-scale inference, where the goal is to identify a small number of active signals hidden among a large collection of null effects. Despite the prevalence of dependence in modern applications, relatively little is known about how much dependence can be exploited for efficient sparse signal recovery from a Bayes-risk perspective. In this paper, we develop a Bayesian Step-Down (BSD) procedure for sparse signal discovery under arbitrary known covariance dependence. BSD adopts a posterior-guided model-pursuit strategy that sequentially accumulates evidence for competing sparse signal configurations while explicitly incorporating the data's covariance structure. To assess its effectiveness, we introduce a Bayes Oracle for a class of sparse one-factor dependence models and compare BSD with the Oracle, the recently proposed MRD-GBS procedure of Ghosh and Chakrabarti (2026), the original MRD procedure of Cohen et al. (2009), and the Benjamini-Hochberg method. Our simulation studies reveal a striking phenomenon: across a broad range of dimensions, sparsity levels, and dependence structures, BSD exhibits near-oracle behavior and is often virtually indistinguishable from the Bayes Oracle in terms of Bayes risk and support recovery performance. Remarkably, a similarly close agreement is observed between BSD and MRD-GBS despite their fundamentally different Bayesian and frequentist foundations. These findings provide new insight into the attainable Bayes-risk frontier for sparse signal discovery under dependence and suggest that BSD may serve as a useful benchmark when exact Oracle calculations are unavailable. Finally, we show that BSD admits a residual representation, thereby yielding admissibility under arbitrary covariance dependence and substantial computational simplifications.
We show that replacing the standard MSE denoising loss in diffusion models with a nonlinear transformation induced by an f-divergence yields a simple robust training surrogate that empirically improves performance under data contamination, with small additional computational overhead. The theoretical foundation rests on a local divergence construction: under the Gaussian reverse-kernel structure of DDPM, each per-step likelihood ratio follows a lognormal distribution parameterized by a scalar mismatch, so the conditional f-divergence at each step reduces to a one-dimensional function of the denoising error. Summing these local divergences yields a training objective that unifies diffusion training as divergence-induced weighted denoising, where the derivative of the induced divergence acts as a residual-space influence weight that controls the contribution of each sample. Bounded-influence divergences (Hellinger, negative exponential) suppress large-error samples, with Hellinger yielding an explicit exponential weight, connecting the framework to robust M-estimation. Empirically, on CIFAR-10 under 30% contamination, the negative exponential divergence (NED) reduces FID from 93.0 (KL) to 77.5, while also outperforming standard robust losses such as Huber and clipped MSE.
We introduce OASIS, a simulation-based inference framework for scientific settings where observations are distorted by measurement error, selection effects, and other survey-specific transformations. In many real applications, simulators generate latent, noiseless quantities, while the data are observed only after passing through a complex observational pipeline. Standard simulation-based inference methods often ignore this distinction, comparing observations to idealized simulator outputs or relying on low-dimensional summaries that can miss important structure. OASIS addresses this mismatch by explicitly embedding the observation model into the simulator and performing inference directly at the level of observed-data distributions. The method constructs a pseudo-posterior by reweighting prior samples according to a maximum mean discrepancy (MMD) loss between the empirical distributions of the observed data and forward-simulated observations, thereby avoiding both handcrafted summaries and learned neural surrogates. We provide theoretical guarantees for Monte Carlo consistency, convergence of the empirical pseudo-posterior to its population counterpart, and posterior concentration on the MMD-identified parameter set, with consistency for the true parameter under correct specification and identifiability. In controlled errors-in-variables regression experiments, OASIS delivers robust parameter recovery and well-calibrated uncertainty under heterogeneous and non-Gaussian measurement noise. We then demonstrate the method on a realistic cosmological application involving galaxy cluster observations across multiple wavelengths, in which latent physical properties are linked to observables through nonlinear scaling relations, heteroscedastic errors, selection functions, and incomplete coverage.
Gaussian Processes (GPs) are a powerful tool for Bayesian time-series modeling, yet their cubic computational cost remains a severe barrier for application to long, high-cadence datasets in astronomy. While specialized scalable solvers like Celerite elegantly reduce this scaling to linear time, repeatedly evaluating the exact likelihood during iterative Bayesian sampling is a bottleneck for developing more complex models, like hierarchical or additive models in which Celerite is only one component. To make this inference computationally tractable, we introduce a generative surrogate framework. By utilizing a Variational Autoencoder (VAE) to learn a compressed representation of the Celerite prior, we map highly correlated stochastic dependencies into a low-dimensional, isotropic manifold. This transition completely bypasses exact covariance operations, shifting the computational burden to a rapid neural network forward pass. Through an extensive simulation study, we show that the generative surrogate accurately reproduces the structural fidelity of exact physical kernels like Celerite. Finally, we demonstrate embedding our VAE approximation into an additive model that combines Celerite and a hidden Markov model (HMM) for stellar flare detection in time series data of stars. We evaluate the joint VAE+HMM architecture against the exact Celerite+HMM framework on empirical astrophysical time series and demonstrate that the proposed methodology achieves significant reductions in computational time, enabling the rigorous, large-scale characterization of stellar flares across massive data archives.
This paper extends the approximate Bayesian estimation framework for Stochastic Volatility in Mean (SVM) models to accommodate heavy-tailed distributions from the Scale Mixture of Normals (SMN) family. To overcome the computational challenges arising from these models, we propose a numerically stable estimation procedure that exploits special functions to eliminate the need for direct numerical integration. Furthermore, the implementation incorporates parallel computing strategies that substantially reduce computational costs. Simulation studies and empirical applications demonstrate that the proposed approach delivers accurate inference while achieving computational times that are approximately an order of magnitude smaller than those required by conventional Markov chain Monte Carlo (MCMC) methods.
We revisit the problem of approximating a bivariate distribution with finite support by another such distribution which is totally positive of order two (TP2). Approximation is meant in a maximum likelihood sense.
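For reference, a probability table $p$ on a finite grid is TP2 when $p_{ij}\,p_{kl} \ge p_{il}\,p_{kj}$ for all $i<k$ and $j<l$, that is, when every $2\times 2$ minor is non-negative. The sketch below only checks this defining condition for a given table and does not implement the maximum likelihood approximation itself. The function name `is_tp2` and the numerical tolerance are assumptions.

```python
import numpy as np
from itertools import combinations

def is_tp2(p, tol=1e-12):
    """Check that every 2x2 minor of the table p is non-negative.

    All row pairs i < k and column pairs j < l are checked, so the test
    remains valid when the table contains zero entries.
    """
    p = np.asarray(p, dtype=float)
    rows, cols = p.shape
    for i, k in combinations(range(rows), 2):
        for j, l in combinations(range(cols), 2):
            if p[i, j] * p[k, l] - p[i, l] * p[k, j] < -tol:
                return False
    return True

independent = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])  # all minors are zero
concordant = np.array([[0.4, 0.1], [0.1, 0.4]])      # mass on the diagonal
discordant = np.array([[0.1, 0.4], [0.4, 0.1]])      # mass off the diagonal

for name, table in [("independent", independent),
                    ("concordant", concordant),
                    ("discordant", discordant)]:
    print(name, is_tp2(table))
```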
This paper presents a robust Expectation-Maximization framework for covariance estimation in Scale-Invariant Random Vector (SIRV) models with missing data under ignorable missingness mechanisms. By adopting an inverse-gamma prior on the scale variables, the resulting observation model leads to a complex multivariate Student-t distribution and allows closed-form E-step and M-step updates. The proposed algorithm incorporates numerical robustness techniques such as computation reuse for common observation patterns, regularized matrix inversions, and explicit enforcement of Hermitian positive semidefinite structure. Experiments on synthetic data and Sentinel-1 interferograms show effective missing value reconstruction and denoising performance under both MCAR and MNAR scenarios.
Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment. Yet these advantages create challenges for statistical inference due to adaptivity. We study inference with contextual-bandit data without assuming a well-specified outcome model. In this setting, we show a previously overlooked issue: standard algorithms such as LinUCB may fail to stabilize under misspecified working models, leading to non-Gaussian estimator behavior and invalid inference. This issue is practically important, as misspecified working models -- such as approximations of complex dynamical systems -- are often employed by online agents in real-world adaptive experiments to balance reward, computational tractability, and robustness. We develop an inverse-probability-weighted Z-estimation framework for a broad class of marginal moment targets, including projection parameters, structural parameters with noisy contexts, and off-policy values. We identify a stability condition tailored to this framework, scaled inverse-propensity convergence, under which the IPW-Z estimator is consistent and asymptotically normal with a consistent sandwich variance estimator. We further establish sufficient conditions for scaled inverse-propensity convergence for several policy classes, including multi-armed bandit algorithms and smooth contextual allocation policies. Simulations and a HeartSteps V1 real-data-calibrated application show reliable coverage and competitive performance across multiple targets. Overall, our results highlight the importance of stability-aware adaptive design for valid post-experiment inference.
This paper introduces Wittgenstein's Rule Following (WRF) data evolution, a framework in philomatics for evolving or generating a new dataset from a sequence of previously observed datasets. The method is inspired by Ludwig Wittgenstein's rule-following considerations and his notion of family resemblance in Philosophical Investigations. Unlike standard synthetic data generation, where the goal is usually to sample from or augment a fixed distribution, WRF aims to continue the implicit rule expressed by a historical sequence of datasets while preserving resemblance to the previous datasets. WRF represents each dataset by structural descriptors rather than pointwise correspondences. These descriptors summarize geometric, distributional, clustering, and, in the supervised case, label-based properties of the data. The method predicts a rule-following target by extrapolating descriptor trajectories and a family-resemblance target by averaging historical descriptors. Candidate datasets are then generated from the observed history through balanced or bounded mixture recombination, scored according to these targets, and optionally refined through differentiable optimization in descriptor space. The proposed framework allows both sample size and feature dimension to vary over time and does not assume that the next dataset is a direct transformation of the last one. Simulations on synthetic and image datasets show that WRF can generate meaningful continuations of evolving datasets in both unsupervised and supervised settings.
In Bayesian analysis, the prior effective sample size (ESS) expresses the information carried by a prior distribution in units of observations, quantifying how much independent information the prospective data must provide to outweigh an informative prior elicited from a previous study. For network models such as Gaussian graphical models (GGMs), the prior ESS is not straightforward to compute. The Wishart and G-Wishart priors induce dependence among the entries of the precision matrix, and their informativeness has never been expressed in an interpretable, observation-equivalent unit. As a result, researchers eliciting an informative prior for a GGM have had no principled basis for sample size planning. In this paper, we close this gap by formalizing a pre-data ESS for GGMs under the Wishart and G-Wishart priors. We adapt five ESS estimators to the GGM setting and compute each through two aggregation schemes: a global ESS measure based on a determinant ratio, and a parameterwise version based on a Cholesky decomposition. Building on these measures, we introduce two complementary planning strategies: the Data-to-Prior Information Ratio (DPIR), which determines the sample size at which the data dominate the prior, and a GGM extension of Bayes Factor Design Analysis (BFDA), which determines the sample size required for conclusive edge-based evidence. Simulation studies show that the two procedures target complementary design goals and that the ESS estimators differ systematically in their sensitivity to network structure and geometry. We conclude by outlining extensions to other graphical models, including time-dependent variants, as well as to matrix-variate mixture priors.
We introduce a broad class of normalized scale-invariant indices (NPRIs) generated by homogeneous functions and encompassing several well-known measures, including the Gini coefficient, generalized Gini indices, entropy-based measures, and variability indices. Explicit expressions are obtained for these indices under gamma populations. Exploiting the independence between the total sum and the associated Dirichlet proportions, we derive a simple unbiased estimator based on a U-statistic. The resulting estimator is shown to be unbiased for any NPRI when the underlying population follows a gamma distribution. Several examples are provided to illustrate the general theory. A Monte Carlo simulation study is carried out that shows the good performance of the unbiased estimator in several scenarios of index choices. We also present a simulation study that goes beyond the established theory by examining the estimator's applicability in settings characterized by a generalized gamma distribution. We evaluate the effectiveness of the NPRIs and their estimates in modeling a real-world dataset related to gross domestic product per capita in the Americas.
Likelihood is standard for Hawkes-process inference, while less computationally demanding methods have largely developed separately. We show that least squares, Takács--Fiksel, and related moment-based estimators form a single class of compensator-based estimating equations, with the likelihood score as the efficient benchmark. For fixed-dimensional multivariate Hawkes processes with compact memory, nonlinear positive links, and signed kernels allowing inhibition, every suitably regular predictable functional of a fixed lag window yields an unbiased estimating equation when integrated against $\text{d} N-\lambda\,\text{d} t$. Under common regularity, identification, and rank conditions, estimators based on every admissible finite library achieve uniform high-probability and pointwise almost-sure $\mathcal O(\sqrt{\log(T)/T})$ rates, asymptotic normality with Godambe covariance, and admit feasible two-step optimal weighting. A projection identity quantifies each library's exact efficiency loss as the score information outside its predictable span; a two-point bound shows the root-$T$ scale cannot be improved uniformly. Although compact memory localizes the intensity rather than the stationary law, exponential forgetting yields Bernstein-type concentration and transfers the theory to nonstationary starts after a logarithmic burn-in. Within this scope, the compensator class is exhaustive for finite-library comparisons: it contains the score, gives admissible libraries common guarantees, and quantifies their efficiency gaps exactly.
The ability to communicate statistical results to domain experts and stakeholders is an important goal of the undergraduate statistics and data science curriculum. However, as large language models (LLMs) have become more accessible, a major concern is that students are offloading important cognitive tasks to generative AI. Using a corpus of over 1,600 undergraduate students' data analysis reports from 2021 to 2025, we show how students' writing style and verb usage have become more similar to that of LLMs. This shift is most pronounced in the first and fifth quintiles of students' reports, which roughly map onto the introduction and conclusion sections, respectively. At the same time, we demonstrate that students' writing style has become more similar to that of statistics experts with the addition of LLMs. We end by discussing the implications of our findings for statistics and data science educators. In particular, we propose alternative modes of assessment that still emphasize statistical thinking, such as targeted writing assignments for structuring a report introduction.
Distribution shift between training and deployment is a pervasive challenge for modern AI systems. In many cases, the target marginals of covariates and response are known or specified through population-level observations, boundary conditions, properties of simulator configurations, or alignment-time distributional constraints. Such knowledge may provide valuable side information for regression estimation. We study this problem in the multivariate linear regression setting with a stable conditional mean $E[Y\mid X]$ across source and target, and identify the hybrid-loss estimator, which jointly incorporates both target marginals, as a benchmark target-aware estimator. Its direct computation, however, requires solving a coupled nonlinear optimization that is expensive at scale. Our main contribution is to develop and evaluate two computationally tractable alternatives: a constrained moment-matching estimator and a two-stage estimator that augments ordinary least squares with a calibration step. For all three estimators, we derive and compare closed-form asymptotic mean squared errors, yielding conditions under which the tractable alternatives match or closely approximate the hybrid benchmark, and regimes in which they do not. Monte Carlo experiments across three controlled shift regimes validate the theoretical results, investigate the accuracy-runtime tradeoffs among the three estimators, and translate into guidance on estimator choice. In particular, the two-stage estimator nearly matches the hybrid benchmark in the high signal-to-noise regime at essentially no additional cost, providing theoretical grounding for empirical observations in nonlinear settings.
Semiparametric efficiency theory provides the mathematical foundation for influence-function-based estimation, including one-step estimators, targeted minimum loss estimators, and many modern inferential methods used in causal inference and missing data analysis. Despite its widespread use, the theory is often presented through a collection of technical constructions whose geometric meaning remains opaque. As a result, influence functions are often derived and applied without an intuitive understanding of the principles connecting scores, tangent spaces, nuisance tangent spaces, and efficient influence functions. This tutorial develops a geometric exposition of semiparametric efficiency theory as a form of differential calculus on a space of probability distributions. Drawing systematic parallels with ordinary multivariable calculus, we show that paths of distributions play the role of curves, scores play the role of velocity vectors, influence functions play the role of gradients, and efficient influence functions arise as projected gradients. This perspective provides a unified explanation for several foundational questions, including why perturbation directions are represented by functions, why tangent spaces depend only on the statistical model whereas nuisance tangent spaces depend on the parameter of interest, and why efficient influence functions arise through orthogonal projection. The resulting framework offers a geometric perspective on semiparametric efficiency theory and influence-function-based inference.
Bayesian modelling workflows often consider multiple candidate models of varying complexity. Model selection is commonly used to navigate potential trade-offs between model complexity and generalisability to new data. We study when model selection is unnecessary or can even be harmful for predictive performance in finite data regimes and find that the need for selecting simpler models can depend on prior choice. We formalise predictively consistent priors, which keep prior predictive implications stable as model complexity increases. Across examples and numerical experiments, including adding covariates in linear and logistic regression, forward variable selection, and nonlinear modelling, flexible models with predictively consistent priors typically match or outperform selected simpler models in out-of-sample predictive performance. When selection helps, it can indicate poor joint prior implications, such as excessive prior mass on implausible predictive values. Based on our findings, we propose replacing the notion of sparsity or parsimony at the level of model components with specifying priors that remain sensible in predictive space as models become more complex.
Density regression extends conventional parametric regression by allowing the entire distribution of the response to vary flexibly with covariates rather than just low-order moments. In the Bayesian setting, logistic Gaussian process (GP) priors have been widely used for density estimation and extend naturally to density regression. The prior can be centred on a base density model, with the nonparametric component providing an interpretable correction that is useful for model criticism. However, logistic GP density regression models have seen limited use, since they require computation of a normalizing constant for every observation, typically via numerical integration. We address this difficulty by proposing a generalized Bayesian approach using a loss function based on the Hyvärinen score. The Hyvärinen score depends only on derivatives of the log density with respect to the response, eliminating the need to compute normalizing constants. Since GP computations remain expensive, we also employ sparse inducing point approximations and variational inference to develop a scalable approach. We demonstrate the method on one simulated and two real datasets, including a German weather dataset with more than 150,000 observations.
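The claim that the score needs no normalizing constant can be checked numerically. The sketch below assumes one common convention for a univariate response, $S(y, p) = 2\,\partial_y^2 \log p(y) + (\partial_y \log p(y))^2$ (other references scale this by one half), and evaluates it by central finite differences. The function name and step size are hypothetical choices for illustration, not part of the method in the abstract.

```python
def hyvarinen_score(log_density, y, h=1e-3):
    """Hyvarinen score of a univariate log density at y (one common convention).

    log_density may be unnormalized: adding a constant to it changes neither
    the first nor the second derivative, so the score is unaffected.
    """
    f_plus = log_density(y + h)
    f_mid = log_density(y)
    f_minus = log_density(y - h)
    first = (f_plus - f_minus) / (2.0 * h)               # d/dy log p(y)
    second = (f_plus - 2.0 * f_mid + f_minus) / (h * h)  # d^2/dy^2 log p(y)
    return 2.0 * second + first * first


def unnormalized_gaussian(y):
    # Standard normal log density with the constant -0.5*log(2*pi) dropped.
    return -0.5 * y * y


if __name__ == "__main__":
    print(hyvarinen_score(unnormalized_gaussian, 1.5))
    # Shifting the log density by an arbitrary constant gives the same score.
    print(hyvarinen_score(lambda y: unnormalized_gaussian(y) + 10.0, 1.5))
```

For the standard normal example the derivatives are $-y$ and $-1$, so the score at $y = 1.5$ is $-2 + 2.25 = 0.25$ under this convention.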
Nested sampling is a Monte Carlo algorithm enabling posterior estimation and Bayesian model comparison, and is especially robust for multi-modal posteriors. This is because nested sampling maintains a population of live points sampled from the entire prior. In each iteration, the population is advanced above a likelihood threshold, potentially discarding modes ruled out by the data. However, the Monte Carlo nature of point replenishment can also accidentally discard a mode. We draw a connection to the neutral Moran process in genetics, and quantify the occurrence probability of this failure mode of nested sampling with a simple symmetric random walk model on the live point occupancy. We find a simple rule for setting the minimum number of live points so that mode die-out is made unlikely.
Nested sampling is a Monte Carlo algorithm for posterior estimation and Bayesian model comparison. It maintains a population of $K$ live points sampled from the prior, and at each iteration discards the lowest-likelihood point and replaces it with a new sample drawn from the prior restricted to exceed the discarded likelihood. Achieving this likelihood-restricted prior sampling efficiently and reliably is the central computational challenge. For low-to-moderate dimensional problems, MLFriends is a general and robust region-based approach that constructs a proposal region by bootstrap aggregation over the current live points and rejects proposals outside this region. We present a self-contained mathematical formulation of MLFriends and derive, under a homogeneous Binomial point process model for the live points, heuristic bounds on the expected fraction of the likelihood-restricted prior not covered by the proposal region. These bounds decay as $(\frac{1}{3}Km)^{-3/2}$, where $m$ is the number of bootstrap rounds, and are negligibly small for practical parameter choices. We show heuristically that the resulting bias in the marginal likelihood estimate is negligible compared to the inherent statistical variance of a nested sampling run. While a fully rigorous treatment remains an open problem, these results provide the first analytical characterisation of a fully specified and practically implementable nested sampling algorithm, without assuming an idealised or asymptotic sampling procedure.
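The basic loop described above (discard the lowest-likelihood live point, replace it with a draw from the likelihood-restricted prior) can be sketched on a toy problem. The example below assumes a one-dimensional standard normal likelihood with a uniform prior on $[-5, 5]$, where the restricted prior is simply an interval and can be sampled exactly. It does not implement MLFriends or any region construction; all names and settings are hypothetical.

```python
import math
import random


def log_likelihood(x):
    # Standard normal density as a toy likelihood.
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def toy_nested_sampling(n_live=200, n_iter=2000, half_width=5.0, seed=0):
    """Estimate the marginal likelihood Z with a minimal nested sampling loop."""
    rng = random.Random(seed)
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    evidence = 0.0
    volume_prev = 1.0  # remaining prior volume, X_0 = 1
    for i in range(1, n_iter + 1):
        # Identify and remove the lowest-likelihood live point.
        worst = min(range(n_live), key=lambda j: log_likelihood(live[j]))
        threshold = math.exp(log_likelihood(live[worst]))
        # Deterministic approximation of the shrinking volume, X_i = exp(-i/K).
        volume = math.exp(-i / n_live)
        evidence += threshold * (volume_prev - volume)
        volume_prev = volume
        # Likelihood-restricted prior: here L(x) > L* is equivalent to |x| < r,
        # so an exact draw is uniform on (-r, r). Real problems need a
        # region-based or MCMC sampler at this step.
        radius = abs(live[worst])
        live[worst] = rng.uniform(-radius, radius)
    # Contribution of the remaining live points.
    evidence += volume_prev * sum(math.exp(log_likelihood(x)) for x in live) / n_live
    return evidence


if __name__ == "__main__":
    print(toy_nested_sampling())  # true value is close to 0.1
```

The replacement step is the only part that changes in realistic problems: the exact interval draw used here stands in for the region-based proposal whose coverage the abstract analyses.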
This paper studies kurtosis in multivariate normal variance-mean mixtures through its fourth-cumulant representation. We obtain an explicit expression for the fourth cumulant whose structure separates naturally into a rank-one directional component, a mixed direction--covariance component, and a covariance-pairing component induced by the mixing variable. This formulation shows that kurtosis in this class is not merely a directional tail phenomenon, but also reflects the interaction between mean variation, covariance structure, and stochastic mixing. We further derive the standardized fourth cumulant, relate it to Mardia's multivariate excess kurtosis, and study directional excess kurtosis through projection pursuit. Statistical applications are developed for cumulant-based diagnostics of multivariate non-Gaussianity, dominant-tail-direction analysis, and influential-tail-event detection. The practical relevance of the theoretical results is illustrated with simulated data and daily stock returns.
Likelihood-free inference methods can perform Bayesian inference when evaluating the likelihood is impractical but simulating synthetic data from the model is feasible. Approximate Bayesian computation (ABC) is a well-established likelihood-free approach that constructs particle posterior approximations by evaluating the similarity between simulated and observed data using a distance function, which is used in rejection or weighting steps. Here we extend previous work on adaptive distance learning for ABC to misspecified time series, while also exploring applications in neural posterior estimation using prior-data fitted networks (NPE-PFN) with localization. The adaptation of the distance that we consider optimizes out-of-sample predictive performance using a scoring rule. We also establish a connection between linear pooling for forecast combination and our posterior estimation methods with randomized distances, showing that empirical estimation of pooling weights can be interpreted as another form of adaptive distance learning. For both ABC algorithms and NPE-PFN methods with localization, adaptive distance learning improves forecasting performance in simulated and real examples.
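For readers unfamiliar with the rejection step mentioned above, a minimal ABC rejection sampler looks roughly as follows. The sketch assumes a toy model (normal data with unknown mean and unit variance, a uniform prior on $[-5, 5]$) and a fixed absolute-difference distance on the sample mean. It does not include the adaptive distance learning, scoring rules, or NPE-PFN components of the abstract, and every name in it is hypothetical.

```python
import random


def abc_rejection(observed_mean, n_obs, n_draws=20000, eps=0.1, seed=0):
    """Return prior draws whose simulated summary lies within eps of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw a parameter from the prior
        # Simulate a synthetic dataset and reduce it to its summary statistic.
        sim_mean = sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs
        # Distance between simulated and observed summaries.
        if abs(sim_mean - observed_mean) <= eps:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    particles = abc_rejection(observed_mean=1.0, n_obs=20)
    print(len(particles), sum(particles) / len(particles))
```

The accepted draws form the particle approximation to the posterior. The choice of distance and tolerance governs its quality, which is the part that adaptive distance learning targets.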
We develop a comprehensive theory for regularized M-estimation in reproducing kernel Hilbert spaces. Under mild conditions on the loss we establish existence and measurability of the estimator, covering a wide range of convex and non-convex losses, including bounded robust losses. We further prove sharp rates of convergence with an explicit bias-variance decomposition governed by a novel complexity measure. We show that the variance is independent of misspecification, while the bias depends on a source condition parameter known in the learning literature. For tensor product Sobolev spaces we obtain new rates that connect to spaces of functions with dominating mixed smoothness, substantially extending existing results and explaining why these estimators circumvent the curse of dimensionality. Our methodology, combining elements from both functional analysis and empirical process theory, allows for an asymptotic linearisation of the objective function that avoids both closed-form solutions and global Lipschitz assumptions, and may be of independent interest. The estimators are implemented in C++ and theory is supported by numerical experiments.
Decision-making in hierarchical systems requires probabilistic forecasts at all cross-sectional levels. Current hierarchical forecasting methods typically generate independent forecasts at each level and reconcile them post hoc to ensure coherence between upper and lower levels. Such post hoc corrections do not incorporate hierarchical structure or decision goals into the underlying parameter estimation. We propose a fully Bayesian hierarchical forecasting framework that shares information more effectively between and across levels than reconciliation alone. Our approach has the flexibility to softly penalise incoherence, subject to model specification, and to focus the global model and coherence update on hierarchical levels most relevant to decision outcomes. This yields parameter estimates that are focused towards the forecasting goals and capture the requirement for coherency, removing the need to estimate covariance matrices for multi-step forecasting horizons. We demonstrate improvements in predictive accuracy metrics on both simulated data and Australian domestic tourism forecasting.
Functional data are often modeled through one likelihood-linked curve, while the scientific target is a larger state containing rates, accumulated quantities, boundary values, or nonlinear functionals of several linked levels. These targets require more than smoothing the observed curve: derivative uncertainty, cross-level covariance, and integration constants must be handled jointly. We introduce anchored Gaussian process differential ensembles, embedding an anchor \(f_0\) in a joint Gaussian state with its mean-square derivatives and repeated integrals. Integral levels add explicit Gaussian integration constants. This separates the anchor-induced covariance from finite-dimensional boundary uncertainty and clarifies why anchor-only observations do not identify independent integration constants. For stationary one-dimensional kernels, we compute the ensemble with a transformed Hilbert space Gaussian process approximation that applies derivative and integral operators to Laplacian--Dirichlet basis functions while retaining the integration-constant covariance exactly. We establish operator-level approximation bounds and conditional finite-grid posterior convergence. We introduce TARTARE, a target-aware calibration procedure for finite-rank differential ensemble approximations, to address derivative under-resolution by anchor-calibrated bases. In second-order simulations, derivative-aware calibration improves derivative posterior recovery relative to anchor-only calibration while preserving anchor and integral summaries. A motorcycle crash analysis illustrates coherent posterior inference on a coupled kinematic state and short-horizon turning-point functionals.
The deployment of data-driven models in 6G wireless networks is increasingly challenged by frequent distribution shifts that degrade performance over time. Unsupervised Domain Adaptation (UDA) offers an alternative approach by adapting the trained model to a shifted domain without requiring labels. However, UDA pipelines are often more complex than single-task training due to additional modules and optimization procedures, raising a practical question: do the benefits of adaptation come at a higher energy cost, and how does this trade-off compare to retraining when labeling effort is also considered? In this work, we investigate the energy consumption of UDA and compare it to that of single-task training. We further propose a way to determine the minimum number of target domains for which UDA becomes more energy-efficient than retraining, taking into account the labeling cost. Our results aim to clarify when UDA should be preferred over classical train-from-scratch approaches from an energy- and labeling-aware perspective.
Drawing on the observed regularity of typical trips, this study proposes a method for identifying commuting patterns without using the individual identifiers of public bike-share users. The approach combines an analysis of trip duration distributions, mixture models and logistic regression based on the posterior probabilities of belonging to the 'fast' and 'typical' journey categories.
Generalized linear models are central to actuarial modelling of binary risk, claim frequency, utilization, and cost-related outcomes. Yet fairness diagnostics often rely on linear-model intuitions, although GLM predictions are obtained by transporting a latent score through a nonlinear inverse link. We develop a moment-based decomposition framework for diagnosing group disparities in fitted GLM predictions. In an exact linear-Gaussian benchmark, the Wasserstein barycentric criterion for distributional demographic-parity violation reduces to a two-moment criterion and decomposes into direct mean, indirect mean, interaction, and structural components. For GLMs, we distinguish the empirical output-scale criterion $U_2(f)$, a within-group proxy $\widetilde U_2(f)$, and a leading decomposition $D_1(f)$. This leading term preserves the four linear channels and adds two curvature components induced by the inverse link: curvature coupling and curvature amplification. We derive explicit formulas for logistic, Poisson, and Tweedie specifications and illustrate the diagnostic on medical-expenditure survey data. The framework is not a legal test of discrimination, nor a full characterization of distributional parity outside the linear-Gaussian case. It is a tractable actuarial diagnostic for identifying whether fitted prediction disparities arise from explicit sensitive effects, proxy-mediated covariate profiles, covariance-structure differences, or nonlinear link effects.
Health-related quality-of-life (HRQoL) outcomes are increasingly incorporated into oncology research to complement traditional survival endpoints by capturing patients' well-being over time. These outcomes are typically collected through multidimensional questionnaires yielding longitudinal ordinal data, and are often subject to dropout due to disease progression or death. In this context, joint models provide a well-established framework to account for the dependence between longitudinal HRQoL trajectories and time-to-event outcomes, but fully joint estimation rapidly becomes computationally prohibitive when multiple latent dimensions and random effects are involved. We propose a novel slope-corrected two-stage (SC2S) approach for the joint analysis of multivariate ordinal HRQoL data and survival outcomes within a multidimensional latent trait framework. The proposed approach propagates longitudinal information to the survival model through informative priors on the random effects, while additionally re-estimating longitudinal slope parameters. This strategy substantially reduces bias in both longitudinal and survival submodels while preserving much of the computational efficiency of two-stage procedures. Through simulation studies and an application to HRQoL data from patients with progressive glioblastoma, we show that the proposed method closely approximates fully joint Bayesian estimation while requiring notably less computation time.
In high-dimensional data settings, dimensionality reduction and variable selection are key steps when using statistical learning techniques. Principal Covariate Regression (PCovR)-type methods aim to perform both dimensionality reduction and (regularized) regression steps in one analysis. However, existing PCovR methods cannot simultaneously select dimensionalities and estimate regularized coefficients, forcing researchers to make ad hoc choices about the order of these steps. In this study, we propose a novel method called Principal Covariate Regression with Nuclear Norm Penalty (PcovRnnp) that allows simultaneous dimension reduction and estimation of regularized coefficients.
The objective of modern early oncology dose-finding is to identify an optimal biological dose (OBD), rather than simply the maximum tolerated dose. In basket trials, the dose-toxicity and dose-efficacy relationships may differ across biomarker or disease-defined subtrials, so a single common dose from pooled analysis may be suboptimal. We propose a flexible exchangeability-non-exchangeability (EXNEX) dose-finding design (DF-EXNEX design) for subtrial-specific OBD selection in basket phase I/II trials with binary toxicity and continuous efficacy endpoints. Patient toxicity is modelled by a monotone logistic regression and efficacy by a quadratic dose-response curve. Robust borrowing is introduced through extended EXNEX mixture priors on the subtrial-specific curve parameters, allowing the strength of borrowing to adapt to the similarity of subtrials. Dose recommendation is based on an admissible set defined by posterior safety and futility rules, and an OBD-oriented utility function combining toxicity and efficacy on comparable scales. The operating characteristics were evaluated in a large-scale simulation study of a basket trial with four subtrials and five dose levels, across 70 scenarios covering all non-redundant combinations of true subtrial-specific OBD locations. Results showed that, compared with a no-borrowing NEX design, the DF-EXNEX design can increase correct OBD selection in most scenarios while reducing overly toxic recommendations as the final OBD. The improvement increased with subtrial similarity due to robust information borrowing, but a small number of mixed low/high OBD scenarios showed negative or near-zero gains, consistent with occasional over-borrowing towards intermediate doses. These results support robust borrowing for subtrial-specific OBD finding while highlighting the need to monitor borrowing behaviour when true OBDs are widely separated.
We present a framework for online and adaptive forecasting and hierarchical reconciliation using linear regression models. We begin by formalizing hierarchies using graphs, and motivated by their structure, formulate a multivariate linear model using the matrix normal distribution to characterize residuals. Parameter estimation is posed as a ridge regression problem and applied to hierarchical forecast reconciliation. The connections between ridge regression, Bayesian estimation and shrinkage for hierarchical reconciliation are discussed, and results for uncertainty quantification in parameters and forecasts are provided. Based on the ridge regression formulation, a recursive inference scheme inspired by recursive least squares is described. The algorithm is implemented in the PyOnlineForecast package. Finally, the proposed methodology is demonstrated on a case study for district heating load forecasting using a temporal hierarchy. Our results provide a reference for implementation of forecast reconciliation via multivariate linear models in an online setting. The case study furthermore highlights practical considerations of using temporal hierarchies in an online setting and demonstrates the usefulness of the proposed framework and implementation, both for district heating load forecasting and more generally for online hierarchical forecasting.
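The link between ridge regression and a recursive update can be illustrated with a textbook recursive least squares (RLS) recursion. The sketch below assumes a single output series, no forgetting factor, and initialisation $P_0 = I/\lambda$, $\theta_0 = 0$, under which the recursion reproduces the batch ridge estimate exactly. It is not the PyOnlineForecast API and does not cover the multivariate or hierarchical parts of the framework; all names are hypothetical.

```python
def rls_ridge(xs, ys, lam=1.0):
    """Recursive least squares that reproduces ridge regression with penalty lam."""
    d = len(xs[0])
    # P tracks (lam * I + sum of x x^T)^{-1}; it starts as I / lam.
    P = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    theta = [0.0] * d
    for x, y in zip(xs, ys):
        Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
        gain = [v / denom for v in Px]
        # One-step-ahead prediction error before the update.
        err = y - sum(x[i] * theta[i] for i in range(d))
        theta = [theta[i] + gain[i] * err for i in range(d)]
        # Sherman-Morrison rank-one downdate of P (P is symmetric).
        P = [[P[i][j] - gain[i] * Px[j] for j in range(d)] for i in range(d)]
    return theta


def batch_ridge_2d(xs, ys, lam=1.0):
    """Closed-form ridge solution for two regressors, used as a reference."""
    a = lam + sum(x[0] * x[0] for x in xs)
    b = sum(x[0] * x[1] for x in xs)
    c = lam + sum(x[1] * x[1] for x in xs)
    g0 = sum(x[0] * y for x, y in zip(xs, ys))
    g1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a * c - b * b
    return [(c * g0 - b * g1) / det, (a * g1 - b * g0) / det]


if __name__ == "__main__":
    xs = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.5]]
    ys = [1.0, 2.1, 2.9, 4.2]
    print(rls_ridge(xs, ys))
    print(batch_ridge_2d(xs, ys))
```

Each new observation costs a fixed amount of work that does not grow with the number of past observations, which is what makes the recursive form attractive in an online setting.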
Neuron degeneration is the underlying mechanism for the development of many diseases. Quantifying the association between increasing levels of toxic exposure and progressive neuronal damage is a critical component of understanding this development. We investigate this association by analyzing a novel dataset of ordinal neuronal damage scores derived from a series of toxicological assays of C. elegans, including variables such as toxicant concentration, maternal treatment, and direct chemical exposure. We propose a computationally efficient parameter-constrained Bayesian ordinal regression that captures the monotonic association between neuron damage scores and corresponding treatments. Power analysis via simulation studies reinforces the advantages of our model over standard alternatives used in existing work by practitioners. Analysis of the novel C. elegans assays indicates that maternal toxicity increases susceptibility in progeny, with the offspring generation exhibiting amplified neuronal damage upon later-life rotenone exposure even under mild parental developmental treatment.
To address the computational challenges arising from large-scale longitudinal data, an optimal Poisson subsampling algorithm is proposed for quantile regression. The proposed method can substantially alleviate computational burden. Under some regularity conditions, we derive the asymptotic properties of the estimators from weighted quantile generalized estimating equations. For practical implementation, an efficient algorithm is proposed for parameter estimation. Furthermore, asymptotic theory is established for penalized weighted smooth quantile generalized estimating equations, and regularized parameter estimation is performed within the optimal Poisson subsampling framework. Both numerical simulations and a real data application demonstrate that the proposed optimal Poisson subsampling algorithm outperforms the uniform Poisson subsampling algorithm, and the regularized estimation exhibits satisfactory performance as well.
Gaussian mixtures of regressions are commonly implemented via a Gibbs sampler. This Markov chain Monte Carlo (MCMC) algorithm can be computationally burdensome because of the need to update discrete-valued latent component allocation parameters whose dimension increases as the sample size increases. In this article, we propose applying the method of composition to a Gaussian finite mixture model with a Normal-Inverse-Gamma (NIG) prior, which allows one to write the posterior distribution as the product of two distributions: the conditional distribution of the parameters given the data and mixture labels, and the marginal posterior of the mixture labels. The conditional distribution of the parameters given the data and mixture labels can be sampled from directly, instead of using MCMC. The marginal posterior of the mixture labels is known up to a proportionality constant, and we adapt existing approaches in Bayesian selective inference to constrain the space of component labels to those arising from preliminary estimators, which alleviates a commonly encountered bottleneck. In simulation studies, we consider several settings and compare several versions of our constrained mixture of NIG models to two different MCMC-based strategies, and we demonstrate their use on natality data from the CDC.
Time series classification involves learning a mapping from a continuous, temporally ordered sequence of real-valued observations to a discrete response variable, such as a class label. This task is fundamental in domains such as health monitoring, where the temporal structure of data is critical for accurate prediction. Dynamic Time Warping (DTW) is a standard technique for measuring similarity between sequences varying in time or speed. However, DTW is restricted to discrete point matching. To move beyond pairwise alignment, we propose a theoretical framework that learns mappings between real-valued functions. These mappings approximate the flow associated with the characteristic curves of a linear transport equation with a space-dependent velocity field, providing a diffeomorphic transformation between two time series. Using the method of characteristics, we transform this partial differential equation into ordinary differential equations (ODEs) modeling system dynamics. The objective function used to learn these ODEs derives from the fundamental theorem of calculus. To enable flexible, expressive representations of the velocity field, we utilize reproducing kernel Hilbert spaces and optimal control methods. Our method, Diffeomorphic Time Warping (DiffTW), provides a theoretically grounded dissimilarity measure. Using a 1-nearest-neighbor classifier, DiffTW outperforms DTW on 60 of 86 datasets.
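For reference, the DTW baseline mentioned above is the classical dynamic program over pairs of sample indices, which is what makes it a discrete point-matching method. The sketch below is a standard unconstrained DTW with absolute-difference cost. It is not the proposed DiffTW method, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classical dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # time-stretched copy
    print(dtw_distance([0, 0], [1, 1]))
```

A sequence and a time-stretched copy of it have distance zero, because a sample may be matched to several samples of the other series. The alignment is defined only on the observed indices, not on the underlying continuous functions.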
Over the past decade, deep neural networks (DNNs) have achieved remarkable success on complex machine-learning tasks, yet the theoretical foundations of their performance remain incomplete. From a statistical viewpoint, a natural question is: can DNNs attain feature-learning and prediction consistency comparable to that of classical models? While a full characterization is open, we provide positive results for a broad subclass. We establish feature-learning consistency guarantees for sublinearly structured DNNs (architectures whose input/output dimensions and number of hidden neurons grow sublinearly with the sample size) when learning hierarchically compositional target functions. Importantly, this consistency still holds even in the conventional "over-parameterized" regime where the total number of parameters exceeds the number of training samples. Empirically, sublinearly structured DNNs match or surpass wide DNNs in prediction. A structural audit further indicates that widely used convolutional neural networks (CNNs), including AlexNet, VGGNet, ResNet, and GoogLeNet, are sublinearly structured on their image classification benchmarks. We further prove that the sublinearly structured DNNs achieve universal approximation for hierarchically compositional functions in the large-sample limit. Moreover, images exhibit an inherent hierarchical, compositional structure. Taken together, these results explain, through a statistical lens, why many large-scale deep learning models succeed after adequate training on massive image datasets.
Existing power and design methods for multiple-intervention stepped wedge designs (M-SWDs) typically assume exchangeable cluster-period correlation, despite evidence that correlation often decays over time. Misspecification of this correlation structure can substantially distort variance estimation and power, particularly for treatment interaction effects. We develop a unified covariance framework for M-SWDs that separates intracluster correlation from an explicit cluster-period correlation matrix. This formulation accommodates exchangeable, autoregressive, and more general distance-dependent correlation structures while preserving closed-form expressions for the variance of treatment effect estimators under linear mixed models. Using analytic results and simulation studies, we demonstrate that assuming uniform correlation when the true structure is time-dependent can lead to substantial power mischaracterization. Specifically, we find that designs calibrated under independence assumptions may be overly conservative, whereas those calibrated under compound symmetry can be either optimistic or conservative. These findings demonstrate the importance of explicitly modeling cluster-period correlation at the design stage of M-SWDs and provide practical guidance for power calculation and design selection in realistic settings.
We study how the choice of default prior for a common Gaussian scale affects high-dimensional shrinkage risk, highlighting the role played by high-dimensional geometry. Formally, we consider a high-dimensional setting in which the near-zero behavior of the common scale prior has first-order consequences for shrinkage risk, and show that priors that are flat on the variance and those flat on the standard deviation allocate markedly different mass near the zero-scale boundary, leading to distinct shrinkage behavior and informing principled default prior selection. Specifically, under a radial-power benchmark, we establish that the SD-flat benchmark has a one-unit asymptotic risk advantage near the origin, crosses over in the critical regime, and is second-order equivalent to the variance-flat benchmark for strong signals. Proper single global-scale hyperpriors and bounded coordinate-multiplier mixtures inherit these limits through the near-zero exponent of their SD-scale density. For heavier-tailed or sparse priors, that exponent still classifies the common global-scale component, while local-scale tails, model-size priors, or allocation priors can also affect risk.
Frameworks for ensuring fairness in machine learning typically focus on learning fair models from existing data. But this endeavor is often undermined by biases already present in that data. We therefore look to modify the data acquisition process itself to help gather fairer data that is inherently more suitable for training fair predictors. To this end, we introduce FairBED, which provides novel formulations for quantifying the fairness of datasets themselves based on the idea that fair datasets should be uninformative about sensitive attributes. We then use this to construct practical fairness-aware Bayesian experimental design (BED) objectives that maximize expected information gain about the target quantity of interest while minimizing expected information gain about sensitive attributes. We further derive a theoretical link between FairBED and demographic parity, and show empirically that models trained on data gathered using FairBED provide improved fairness-accuracy trade-offs compared to randomly acquired data and conventional BED.
We develop an operator-theoretic framework for finite-dimensional, regime-dependent Volterra equations with completely monotone memory kernels, dissipative network coupling, and Hawkes-type self-excitation. For each fixed regime we construct the associated Volterra resolvent family and prove global well-posedness, continuity across regime switches, and explicit a priori bounds. The main stability result is sharp in the commuting case: after simultaneous diagonalization of the network Laplacian and the excitation operator, each mode obeys a scalar characteristic equation, and global asymptotic stability holds exactly when every modal branching ratio lies below the intensity damping threshold. We also give a norm-based sufficient condition for noncommuting operators and a Perron--Frobenius spectral criterion for nonnegative intensity blocks, showing when norm estimates are conservative. Beyond mean stability, we prove a pathwise finite-range power law for burst amplitudes generated by residence in a Hurwitz but nonnormal regime: under a cone-alignment event, the survival exponent is the ratio of the regime exit rate to a cone-corrected finite-time growth rate bounded above by the logarithmic norm of a fixed Markovian realization in the chosen Euclidean metric. A complementary idealized-feedback result shows how a logarithmic-norm contraction caps the amplification band. Finally, we derive the deterministic intensity block as a mean-field limit of a relaxing long-memory Hawkes system with regimes. Numerical experiments on modal equations, a small-world network, and a switched nonnormal ODE validate the sharp threshold and the finite-range amplification mechanism without using the closed-form tail formula as input.
Neural networks are a commonly used prediction tool in computer science and statistics. However, the barrier to entry into this interesting field remains high, particularly for classical statisticians trained in a frequentist perspective. In this letter, we demystify neural networks by describing networks that approximate a linear regression and by describing common customizations that provide a foundation for further study.
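As a minimal sketch of the correspondence described above (not the letter's own code), a network with a single neuron and an identity activation, trained by gradient descent on mean squared error, should land on essentially the same coefficients as ordinary least squares. The simulated data, the learning rate, and the function names `ols_fit` and `train_linear_neuron` are illustrative assumptions.

```python
import random

def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def train_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """One neuron, identity activation, full-batch gradient descent on MSE."""
    b, w = 0.0, 0.0  # bias and weight play the roles of intercept and slope
    n = len(xs)
    for _ in range(epochs):
        residuals = [b + w * x - y for x, y in zip(xs, ys)]
        grad_b = 2.0 * sum(residuals) / n
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= lr * grad_b
        w -= lr * grad_w
    return b, w

# Hypothetical data: y = 1.5 + 0.8 x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(100)]
ys = [1.5 + 0.8 * x + rng.gauss(0, 0.3) for x in xs]

b_ols, w_ols = ols_fit(xs, ys)
b_nn, w_nn = train_linear_neuron(xs, ys)
print(f"OLS:    intercept={b_ols:.4f}, slope={w_ols:.4f}")
print(f"Neuron: intercept={b_nn:.4f}, slope={w_nn:.4f}")
```

Adding a hidden layer with a nonlinear activation is the kind of customization that would move this model away from plain linear regression.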
Diffusion models are known to exploit unknown low-dimensional structure to accelerate sampling. However, existing convergence theory under low-dimensional data structure has largely focused on update rules with narrowly prescribed coefficient choices. This raises a fundamental question: is adaptation to low-dimensional structure sensitive to the precise choice of update coefficients? In this paper, we show that such adaptation is a robust property of diffusion models. For a broad class of update coefficients, we prove that $\widetilde{O}(k/\varepsilon)$ iterations suffice to generate an $\varepsilon$-accurate sample in total variation (TV) distance, independently of the ambient dimension. Our framework substantially broadens the class of diffusion samplers known to enjoy low-dimensional adaptation and applies to several methods commonly used in practice. These results provide a theoretical justification for the empirical effectiveness of diffusion samplers across different coefficient choices when applied to structured, high-dimensional data.
Exposure misclassification is a common concern in studies of respiratory infections in cystic fibrosis. Throat swabs are frequently used in place of expectorated or induced sputum cultures, although they have imperfect sensitivity and specificity to detect Pseudomonas aeruginosa and Staphylococcus aureus. We develop calibration weighting and control variate estimators for causal inference with multiple misclassified binary exposures and clustered observations. The calibration approach treats misclassification as a missing data problem, achieving consistency without modelling the misclassification mechanism. The control variate adjustment integrates information from error-prone observations to reduce variance while preserving the consistency of the gold-standard estimator. We show that the resulting estimator inherits double robustness from its component estimators. We also characterize a structural ceiling on efficiency gains in the bivariate setting, where joint correct classification of both exposures limits the variance reduction achievable relative to univariate applications. Simulation studies confirm the consistency and double robustness of the proposed estimators under model misspecification. We then apply these methods to a cohort of $651$ cystic fibrosis patients ages $6$-$21$. Swab-based estimates attenuate the effect of P. aeruginosa on percent predicted FEV$_1$ by approximately $69\%$ relative to sputum-based estimates ($-2.67$ vs. $-8.52$ percentage points; $95\%$ CI for sputum: $-13.40$, $-3.63$). These findings suggest that relying on throat swabs may lead to under-treatment of P. aeruginosa infections. More broadly, the methods provide a framework for causal inference with multiple misclassified exposures.
Bayesian experimental design (BED) has traditionally been based on maximising expected uncertainty reductions from prior to posterior. A major shortfall of this approach is that it leads to doubly intractable objectives that are difficult to optimise, while customising them to particular downstream tasks of interest can also be difficult. Following first principles decision theory, we demonstrate that BED can alternatively be formulated in terms of an expected future loss (EFL) on downstream actions, providing a simple and naturally task-driven framework. Critically, we then show that all such EFLs can be rearranged into singly intractable objectives that can be jointly optimised with respect to both the design policy and a downstream action policy using stochastic gradients, an approach we refer to as ACTION-BED. This formulation further sidesteps the need for any explicit posterior or marginal likelihood estimation and is naturally implicit, requiring only the ability to sample from the joint model over model parameters and data, and evaluate the downstream loss function. It thus allows design policies to be learned more effectively, efficiently, and simply than existing methods, while providing easy customisation to different downstream tasks and losses.
Large language models (LLMs) are increasingly woven into expert cognitive work in daily research, yet we know little about how human expertise should adapt when an AI system can execute substantial technical reasoning on its own. Here we use statistical proof development, a demanding and structured form of expert reasoning, as a window into this broader question. Drawing on day-to-day proof problems, we find that current general-purpose LLMs occupy a useful but limited role: they can execute technical components given a precisely formulated problem and targeted guidance, but become unreliable when the problem is open-ended or requires a long reasoning chain with multiple interdependent steps. This execution-strategy gap is rooted in what makes research-level statistical proof distinctive: unlike pure mathematics, where problems arrive pre-formulated and often demand novel techniques, statistical proof requires first modeling a scientific question into a statistical framework with appropriate assumptions, and then identifying and adapting the right strategy from a repertoire of reusable domain-specific tools. Each step requires deep expertise in both the statistical literature and the real-world context being modeled. In such work, current AI assistance does not reduce the need for human expertise; it relocates that expertise to where human decision-making matters most, such as problem formulation and verification of AI-generated results, and may raise the bar for both. These findings yield practical suggestions for how statisticians can structure AI-assisted proof workflows, and point to a broader community agenda for shared resources, better AI tools, and training the next generation of researchers. Using statistical proof as a window, our study has implications for how experts structure human-AI collaboration in technical cognitive domains more broadly.
Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $\lambda$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $\lambda(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $\lambda(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at this https URL.
Multi-agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce \textit{Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR)}, an inference-time protocol that dynamically reconfigures communication roles and sparse topologies across consecutive debate rounds. By strategically switching agent-to-role assignments based on evolving agent states, PEAR prevents any agent from permanently occupying a privileged network position and distributes influence more evenly across the debate. We theoretically characterize PEAR as an equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization. Comprehensive empirical evaluations across four reasoning benchmarks and six diverse LLM backbones demonstrate that PEAR significantly improves average accuracy over the strongest debate baselines. The code is at this https URL.
Significant progress has been made in aligning LLMs with target value functions. We argue that, even when an LLM has been well aligned in (post-)training, it may still fail to maximise the aligned value in reasoning. We mathematically formalise this gap as rational value risk: the utility discrepancy between a model's deployed reasoning strategy and its rational counterpart, defined as the responses that maximise expected utility in the steepest direction. The estimation error of rational value risk is further decomposed into three components arising from finite candidates, finite prompts, and imperfect verifiers. Extensive experiments cover the Llama-3.1, Qwen-2.5, and Tülu-3 model families (7B-72B), GPT-5.2, GPT-5.5, and DeepSeek-V4, on the benchmarks UltraFeedback, AlpacaEval, GSM8K, MATH, HumanEval, and MathArena. The results validate that (1) rational value risk is widespread; (2) value alignment can reduce, but cannot eliminate, it; (3) the risk is highly sensitive to inference-time reasoning strategy; and (4) longer reasoning improves rationality with diminishing returns. The code is at this https URL.
Asymptotic statistical theory is a challenging domain for AI-assisted formalization: its central results mix convergence statements, asymptotic expansions, functional analysis, and regularity conditions that are far from what existing Lean 4 infrastructure supports. To address these challenges, we propose a hypothesis-disciplined Lean 4 formalization pipeline built from multiple agents: a manager that coordinates seven specialist roles for proof planning, skeleton scaffolding, Mathlib reconnaissance, proof construction, integration, independent review, and audit. The main methodological discipline is the hypothesis-disciplined audit, implemented by the Auditor agent: every main-theorem hypothesis and concept-layer field must be anchored in the source mathematical prose, justified as a Lean encoding adapter, marked as source-implied, or rejected as an unsupported strengthening. Using this workflow, we build a systematic formalization of asymptotic statistical theory, especially asymptotic distribution and efficiency results for parametric and semi-parametric models. The resulting Lean development is axiom-clean and source-faithful, with Lean-checked and human-audited proofs of core parametric and semi-parametric theorems organized so that theorem-agnostic infrastructure and statistical concept definitions are separated from theorem-specific assembly. The formalization results are available at this https URL.
This paper argues that Artificial Intelligence should be understood as a form of monism: a unified substance that cannot be decomposed into separate elements such as data, algorithms, or technical architectures. Drawing from philosophical traditions of monism, dualism, and holism, the paper contends that AI is not merely a collection of components but a single, indivisible essence reflecting the phenomena it replicates. Treating AI as monism has deep implications across multiple dimensions. Epistemologically, it positions AI as the central interpretive force across technological, organisational, and societal domains, while raising ethical and existential concerns regarding singularity, the homogenisation of innovation, and the concentration of decision-making power. At the organisational level, a monistic approach challenges traditional siloed structures, advocating instead for transversal, problem-centric teams whose mandate derives from the integrity of the problem rather than from departmental hierarchy. In project management, it implies a unified vision and an integrated evaluation of complexity in which no single stakeholder perspective dominates the assessment of outcomes. In data and information management, it calls for architectures that reflect the irreducible unity of the phenomena being modelled. Ultimately, this paper calls for a paradigm shift in how AI is conceptualised, governed, and integrated, suggesting that only by embracing AI as monism can organisations achieve genuine agility and avoid the structural inefficiencies inherent to reductionist approaches.
Explainable Artificial Intelligence aims to make black-box models more trustworthy by presenting, in a human-understandable manner, the elements that lead to the model's output. This involves both (i) identifying components and connections with genuine causal influence on outputs and (ii) translating such structures into an interpretable representation. For the former, we introduce CIExplainer, a novel perturbation-based method grounded in causal inference for explaining Graph Neural Networks (GNNs). CIExplainer identifies the subgraph with the highest causal effects on GNN predictions using the Potential Outcome Framework. We evaluate and compare CIExplainer on various GNN architectures (GCN, GraphSAGE, GAT, GIN) and datasets. To bridge subgraph explanations with human interpretability, we further propose G2TeXplainer, a method that transforms causal subgraphs into natural language explanations that capture both feature-level and relational information.
Emergent misalignment (EM) is a phenomenon in which models generalize from narrow fine-tuning, leading to broad (yet uneven) misalignment across evaluation questions. We study EM and its variability directly through the components of fine-tuning: training dynamics, model priors, and data. (1) We first explored how in-domain training loss relates to out-of-domain alignment scores across datasets and model families. Then, we tried to induce potential alternative local minima through different learning schedules for one narrow fine-tuning, but did not find strong runs with better broad alignment scores conditioned on similar or lower training loss. (2) We found that although the means and standard deviations of the misaligned model scores are usually statistically different from those of the pre-trained model, there are some signals of an overall positive correlation. The evaluation prompt-only activations from both the pre-trained and the original instruct models (prior to narrow fine-tuning) could predict fine-grained alignment scores after narrow fine-tuning. (3) Finally, we compared activation deltas before and after narrow fine-tuning and found moderate-to-high subspace overlap and similarity between the resulting activation shifts for training and evaluation prompts. Subspace overlaps between training and evaluation prompt activations correlate with the similarities of their shifts when measured with the last prompt-token activations. The train-evaluation prompt overlap is controlled against overlap computed from random vectors and evaluation-prompt activations.
Synthetic tabular data enables microdata sharing in regulated domains, yet deploying continuous-time generative models requires balancing analytical utility, disclosure risk, and computational cost. Latent-space flow models are flexible, but theoretical equivalences across learning targets, probability paths, and sampling dynamics can translate into different behaviour under finite-step integration and explicit compute budgets. We present an empirical study of tabular latent flow models across seven datasets, evaluating velocity, score, noise, and posterior matching objectives under optimal transport (OT) and variance-preserving (VP) paths, ODE and SDE sampling, and varying integration budgets. Our contributions are threefold: (1) we show that the learning target largely determines the utility-risk operating regime, with velocity and posterior matching tending to yield higher utility, while score and noise matching tend to achieve lower disclosure risk; (2) we demonstrate that configuration and sampling choices shift performance, with midpoint often improving distributional fidelity and OT paths often tolerating earlier stopping than VP, enabling compute savings under fixed budgets or risk thresholds; and (3) we distil these findings into actionable defaults and practical configuration guidance to support pre-release model selection under disclosure risk and resource constraints. The code implementation and supplementary materials can be accessed at this https URL.
Estimating causal effects in industrial time series requires handling temporal dynamics, time-varying treatments, and unobserved confounders. Existing causal foundation models (CausalPFN, CausalFM) operate only on static cross-sectional data; neural temporal methods (CRN, G-Net) require per-dataset training; and concurrent temporal-PFN proposals have not been demonstrated at industrial scale. None output explicit per-pair reliability signals alongside their CATE estimates. We introduce Temporal Causal Prior-Data Fitted Networks (TCPFN), a foundation model for zero-shot temporal causal discovery with learned reliability signals. TCPFN makes four contributions: (1) a Causal Judgment Head that jointly predicts null-effect probability, confounding strength, identifiability, mediation fraction, and causal regime; (2) a mixed training prior covering six causal regimes (independent, direct, confounded, mediated, time-varying confounded, feedback) plus CausalFM-style front-door and instrumental-variable priors; (3) a discrete-token panel-data architecture with cross-attention masking that prevents inter-horizon leakage; (4) zero-shot inference at industrial scale via FAISS-based context selection and one-step posterior correction. On 19 benchmark datasets across five domains, TCPFN achieves competitive zero-shot causal discovery: AUROC 0.96 on Tennessee Eastman, 0.93 on SWaT, 0.98 on Causal Rivers, 0.97 on CAUSRCA. The null detector reaches NullF1 0.94, AUROC 0.99. TCPFN scales to V=1,275 on a proprietary Kraft pulp-and-paper dataset in 6 hours on a single GPU; PCMCI, a CPU-only library, on a V=666 sub-panel of the same data took 81.5 hours, extrapolating by O(V^2) to ~12.5 days at V=1,275. TCPFN's top edges identify cross-subsystem causal relationships while PCMCI's surface within-instrument controller-measurement coupling -- a scalability case study.
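The quadratic extrapolation quoted above can be checked with a few lines of arithmetic. This is only a sketch that takes the stated $O(V^2)$ scaling at face value; the function name `extrapolate_quadratic` is a hypothetical helper, and the inputs are the figures given in the abstract.

```python
def extrapolate_quadratic(hours_measured, v_measured, v_target):
    """Scale a measured runtime by (v_target / v_measured)^2, assuming O(V^2) cost."""
    return hours_measured * (v_target / v_measured) ** 2

# Figures from the abstract: 81.5 hours at V=666, extrapolated to V=1,275.
pcmci_hours = extrapolate_quadratic(81.5, 666, 1275)
pcmci_days = pcmci_hours / 24.0

# Ratio to the reported 6-hour TCPFN run at V=1,275.
ratio_vs_tcpfn = pcmci_hours / 6.0

print(f"Extrapolated PCMCI runtime: {pcmci_hours:.1f} h ({pcmci_days:.2f} days)")
print(f"Ratio to the 6-hour TCPFN run: {ratio_vs_tcpfn:.1f}x")
```

The ratio compares a CPU-only run with a single-GPU run, so it describes wall-clock time on different hardware rather than algorithmic cost alone.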
Differential privacy (DP) has become the gold standard for privacy protection in machine learning and statistical algorithms in recent decades. A plethora of algorithms and methods have been developed to enhance the utility of DP algorithms while maintaining the same level of DP. However, these are often overly complex or computationally inefficient. We propose a novel approach focusing on denoising the output of the simple additive Gaussian mechanism by adopting the idea of \textit{empirical Bayes estimation}. We highlight that the empirical Bayes approach can reduce the mean-squared error solely by taking the output of the Gaussian mechanism as input. Our numerical studies show that this simple yet powerful approach can be applied to improve upon various statistical problems, including histogram release, principal component analysis, and linear regression, often outperforming existing private algorithms.
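To illustrate the general idea of denoising a Gaussian-mechanism output, the sketch below uses positive-part James-Stein shrinkage, a classical empirical Bayes estimator. It is a stand-in chosen for illustration and is not claimed to be the paper's estimator. The simulated true vector, the noise level `sigma`, and all function names are assumptions. Because the shrinkage uses only the released noisy vector and the noise scale, it is post-processing and leaves the DP guarantee unchanged.

```python
import random

def gaussian_mechanism(values, sigma, rng):
    """Add i.i.d. N(0, sigma^2) noise per coordinate (calibrating sigma to a privacy budget is assumed done elsewhere)."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def james_stein_denoise(noisy, sigma):
    """Positive-part James-Stein shrinkage toward zero, using only the noisy output and sigma."""
    d = len(noisy)
    sq_norm = sum(y * y for y in noisy)
    shrink = max(0.0, 1.0 - (d - 2) * sigma ** 2 / sq_norm)
    return [shrink * y for y in noisy]

def squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

# Hypothetical setting: a 200-dimensional statistic released with noise scale 2.
rng = random.Random(1)
truth = [rng.gauss(0.0, 1.0) for _ in range(200)]
sigma = 2.0

noisy = gaussian_mechanism(truth, sigma, rng)
denoised = james_stein_denoise(noisy, sigma)

err_raw = squared_error(noisy, truth)
err_js = squared_error(denoised, truth)
print(f"squared error, raw release:    {err_raw:.1f}")
print(f"squared error, after denoising: {err_js:.1f}")
```

Shrinkage helps most when the noise is large relative to the signal, as in this configuration; when `sigma` is small the shrinkage factor approaches one and the two errors nearly coincide.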
Despite the success of deep learning, training deep networks in biologically plausible and hardware-efficient ways remains an open challenge. Feedback alignment (FA) methods address this by replacing backpropagation's symmetric backward weights with fixed random matrices, but their effectiveness depends critically on whether they can be accurately evaluated. The standard evaluation relies on two quantities: task accuracy and cosine similarity between the method's credit signal and the backpropagation gradient. We show that this reporting pair is insufficient by identifying two independent failure modes, both silent under current reporting: (1) measurement degeneracy, where the BP reference gradient collapses to the numerical floor in terminal-LayerNorm residual architectures, rendering cosine uninterpretable; and (2) aggregation collapse, where the aggregate cosine masks layerwise heterogeneity that concentrates credit at one end of the network. To address these limitations, we propose a diagnostic evaluation protocol based on three checks -- scale stability, reference validity, and depth utility -- together with per-layer rather than aggregate cosine reporting. Across multiple architectures and methods, the standard reporting pair gives no signal of failure in any audited case, while our protocol identifies all failures with wide calibration margins. The two failure modes are causally independent: a per-block scale penalty alleviates Mode 1 (residual scale explosion driving reference collapse) without affecting Mode 2 (cosine ranking that contradicts every functional metric we measured). Identifying these silent failures prevents researchers from building on non-functional credit assignment and provides actionable guidance for developing FA methods that genuinely train deep layers.
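The aggregation-collapse failure mode described above can be reproduced with a toy example. In this sketch, with made-up two-layer gradients and hypothetical variable names, a large-norm layer that is perfectly aligned dominates the aggregate cosine and hides a small-norm layer whose credit signal points the opposite way.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-layer vectors: backprop reference vs. feedback-alignment signal.
bp_grads = [[3.0, 4.0], [0.1, 0.0]]     # layer 0 has a large norm, layer 1 a small one
fa_signals = [[3.0, 4.0], [-0.1, 0.0]]  # layer 0 aligned, layer 1 anti-aligned

# Per-layer reporting: one cosine per layer.
per_layer = [cosine(g, s) for g, s in zip(bp_grads, fa_signals)]

# Aggregate reporting: a single cosine over the concatenated vectors.
flat_bp = [x for layer in bp_grads for x in layer]
flat_fa = [x for layer in fa_signals for x in layer]
aggregate = cosine(flat_bp, flat_fa)

print("per-layer cosines:", [round(c, 4) for c in per_layer])
print("aggregate cosine: ", round(aggregate, 4))
```

The aggregate value stays close to one while the second layer's cosine is close to minus one, which is the kind of layerwise heterogeneity that a single aggregate number cannot show.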
The Virginia Tech Transportation Safety Index (VTTSI) is a real-time, cloud-native framework for quantifying intersection safety using multimodal connected-vehicle telemetry and multi-year VDOT crash history. Traditional crash-based methods rely on lagged, aggregated data and cannot reflect rapidly changing operational conditions. VTTSI addresses this gap through a hybrid modeling approach that fuses Empirical Bayes (EB) crash stabilization, uplift factors derived from speed and conflict behavior, and a CRITIC-weighted multi-criteria decision-making (MCDM) module combining SAW, EDAS, and CODAS. The system produces interpretable, exposure-adjusted safety scores on a 0--100 scale every 15 minutes. A cloud-deployed architecture built on FastAPI, PostgreSQL, PostGIS, and Streamlit supports interactive visualization of traffic volumes, VRU exposure, speed variance, and real-time incident activity. Validation across intersections demonstrates coherent diurnal patterns, consistency among MCDM methods, and sensitivity to observable operational turbulence. Sensitivity analysis further shows that the RT--SI is robust to parameter perturbations, with deviations typically remaining below one point on the 0--100 scale. By integrating long-term crash risk with short-term behavioral dynamics, VTTSI provides a transparent, adaptive, and proactive safety-monitoring framework suitable for transportation agencies, traffic management centers, fleet operators, and autonomous vehicle systems.
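The sketch below shows how CRITIC weighting and simple additive weighting (SAW) fit together in general. It follows the textbook definitions of both methods and is not the VTTSI implementation: EDAS and CODAS are omitted, the four intersections and three criteria are invented, every criterion is assumed to be "higher means riskier", and the final mapping to a 0-100 safety score is an illustrative assumption.

```python
import math

def min_max_normalize(matrix):
    """Rescale each criterion column to [0, 1]."""
    cols = list(zip(*matrix))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
            for row in matrix]

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def critic_weights(norm):
    """CRITIC: weight_j is proportional to std_j * sum_k (1 - corr_jk)."""
    cols = [list(c) for c in zip(*norm)]
    info = []
    for col in cols:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        conflict = sum(1.0 - pearson(col, other) for other in cols)
        info.append(sd * conflict)
    total = sum(info)
    return [c / total for c in info]

def saw_scores(norm, weights):
    """SAW: weighted sum of normalized criteria for each alternative."""
    return [sum(w * x for w, x in zip(weights, row)) for row in norm]

# Hypothetical intersections; columns: EB crash estimate, speed variance, conflict rate.
raw = [
    [2.0, 5.0, 0.10],
    [6.0, 9.0, 0.40],
    [4.0, 12.0, 0.20],
    [9.0, 7.0, 0.55],
]

norm = min_max_normalize(raw)
weights = critic_weights(norm)
risk = saw_scores(norm, weights)
safety = [100.0 * (1.0 - r) for r in risk]  # assumed mapping: higher score = safer

print("CRITIC weights:", [round(w, 3) for w in weights])
print("safety scores: ", [round(s, 1) for s in safety])
```

CRITIC gives more weight to criteria that vary widely and are weakly correlated with the others, so here the speed-variance column, which moves differently from the two crash-related columns, receives a comparatively large weight.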
Singular learning theory characterises the complexity of a deep network through the geometry of its loss singularities. The local learning coefficient (LLC), the standard estimator of Watanabe's real log canonical threshold (RLCT, $\lambda$), reads this geometry as an integrated Bayesian scalar through SGLD, which needs per-task calibration and $10^4$-$10^6$ forward-backward passes per checkpoint. We introduce Dead-Direction Signatures (DDS), a family of cheap closed-form spectral readings of singular structure: each reads a network's activation matrix or per-sample-gradient Fisher-Gram at a chosen layer, replacing the SGLD posterior chain with spectral linear algebra. The readings rest on a dead-direction framework that predicts a structural correlation between activation- and Fisher-side spectra at any singular minimum, and a rank-multiplicative volume identity that single-eigenvalue monitors cannot produce: the active-volume $\log\det^{+}(G)$ slope counts the dead directions, tracking the rank-deficit $r$ across $r \in \{1,2,3,4\}$ (slope ratios $2.0, 3.1, 4.0$ at $r{=}2,3,4$ against the predicted $2,3,4$), where the smallest eigenvalue is rank-blind. On reduced-rank regression with closed-form $\lambda$, calibrated LLC recovers $\lambda$ at $99\%$ mean and the DDS observables rank-track it at the framework-predicted sign; on a non-linear modular-addition transformer DDS separates $d_{\mathrm{model}}$ across eighteen orders of magnitude where calibrated LLC at the protocol budget is rank-flat. Complementary to LLC's integrated posterior reading, DDS gives a directional, layer-local handle on a network's dead directions, read in closed form from its activation and gradient spectra.
In this article, we analyse convergence of the averaged Adam optimizer to an attracting zero of the Adam vector field. We provide a central limit theorem that, in particular, quantifies exactly the speed of convergence. The order of convergence is $n^{-1/2}$ in the number of steps of the algorithm which coincides with the order observed for classical stochastic approximation algorithms. The covariance in the central limit theorem is given in terms of properties of the Adam algorithm in the state of the attractor.
Two recent papers by Borusyak and Hull (2023, 2026) propose using known formulas to adjust linear instrumental variable estimators for confounding covariates. Implementing this "formula instrument" approach requires making a parametric assumption on the distribution of the unobserved shocks that generated the instrument. We develop a method for systematically evaluating the sensitivity of formula instrument estimates to this parametric assumption. The method is straightforward to implement using our companion R package formulaiv. We use our method to reanalyze the applications in both Borusyak and Hull (2023) and Borusyak and Hull (2026). In both applications, we find that a variety of estimates of different signs and magnitudes can be recovered by slightly changing the shock distribution.
We equip the space of beliefs with a cost geometry (what it costs to pass from one belief to another): optimal transport in Wasserstein space, reweighted conformally by Fisher information (the price of the precision at stake), distinct from the Fisher-Rao metric. In the setting we consider, a finite machine maintains a digital twin of a system; observing the territory through finite, noisy sensors, we model its coherent output as a belief: a probability density over states, the Bayes posterior. Certainty (the perfect twin) is denied twice, by observation and by physics, both read off the Fisher information. On the conformal class, essentially location-scale, three results emerge, all invariants of one change of cost unit. A wall: a well-posed inference rejects certainty to infinite distance as soon as the cost dominates the Fisher information (necessity conjectured beyond power laws). An honesty: an honest (eikonal) cost, each nat the same length everywhere, selects the geometries proportional to the Fisher information. A rigidity: these geometries are hyperbolic, and the Stam bound crowns the Gaussian, the most hyperbolic location-scale belief. Changing the unit dilates the geometry yet preserves the wall, the curvature ordering, and the extremality of the Gaussian: an absolute cost says nothing, only relative cost carries meaning, the value -1/4 being one of its images. The cost of reaching a given precision then has a geometric floor diverging at certainty. Thermodynamics fixes the cost unit and motivates this framework; the results are geometric, in nats.
Minimum Spanning Trees have been used in unsupervised learning, particularly in clustering tasks, due to their ability to recognize clusters by removing edges that are considered inconsistent in defining those clusters. This paper aims to study the use of Minimum Spanning Trees in supervised learning. Specifically, we propose a classification algorithm based on Minimum Spanning Trees. To improve its performance, we introduce a robust version of the method that is also computationally more efficient. We evaluate the effectiveness of our proposed method through an extensive simulation study. We also apply the proposed methodology to a real-world case study involving aircraft trajectories.
Subdiffusive fractional Brownian motions produce localized aggregation when particles are stopped at exponentially distributed times. In applications where clumping and long-distance dispersal events are observed simultaneously, such as in some instances of seed dispersal, this model fails to describe the tails of the data. The resulting redistribution kernel has only an exponentially decaying tail, whereas a heavier tail is needed for modeling the long-distance dispersal observed. Here we propose a model in which subdiffusive particles stop at exponentially distributed times, but with a rate parameter that is Gamma distributed. This heterogeneity in stopping rates causes the density of final radial positions to have a heavy-tailed distribution. Our model retains the strong localized clumping characteristic of subdiffusive fractional Brownian motion while simultaneously generating the heavy tails required for realistic long-distance dispersal.
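The source of the heavy tail can be seen at the level of the stopping time alone. The following is a minimal worked step, assuming a shape-rate parameterization $\Lambda\sim\mathrm{Gamma}(\alpha,\beta)$ for the stopping rate and $T\mid\Lambda=\lambda\sim\mathrm{Exp}(\lambda)$ for the stopping time; the symbols $T$, $\Lambda$, $\alpha$, $\beta$ are notation introduced here for illustration and are not taken from the abstract.

```latex
\[
\mathbb{P}(T>t)
  =\int_0^\infty e^{-\lambda t}\,
   \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}\,d\lambda
  =\frac{\beta^{\alpha}}{\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha)}{(\beta+t)^{\alpha}}
  =\Bigl(\frac{\beta}{\beta+t}\Bigr)^{\alpha},
  \qquad t\ge 0 .
\]
```

Under these assumptions the mixed stopping time follows a Lomax (Pareto type II) law whose survival function decays like $t^{-\alpha}$, in contrast to the $e^{-\lambda t}$ decay for a fixed rate. How this polynomial tail in time carries over to the radial position depends on the Hurst exponent and is not derived in this sketch.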
Inferring directed interactions between neural systems from EEG and MEG remains challenging due to noise, nonstationarity, and the high sample complexity of information-theoretic estimators. Transfer Entropy (TE) provides a principled and model-free measure of directed information flow; however, its practical estimation is not stable in finite data regimes (particularly as embedding dimension increases). This work introduces Embedded Polygon Symbolic Transfer Entropy (EPSTE), a framework that reframes TE estimation as a learnable problem operating on structured symbolic representations of local temporal morphology rather than raw signal amplitudes. Neural time series are decomposed into sequences of geometric primitives derived from local triplets of samples encoding complementary aspects of waveform structure such as magnitude, curvature and directional change. These primitives are discretised into symbolic tokens, yielding a compact but expressive state space over which symbolic TE is estimated. A recurrent neural network with attention-based multiple-instance learning is trained to predict surrogate-validated TE values from bags of symbolic temporal windows. The method is evaluated on source-reconstructed MEG data parcellated using the AAL90 atlas and compared against a standard symbolic baseline using identical architectures and supervision. The results demonstrate that while local window-level predictions are noisy, aggregation across trials and channel pairs yields stable directed dependencies. At the pair level, EPSTE achieves near-perfect recovery of ground-truth directed structure and significantly lower absolute error than the baseline, indicating that representational geometry plays a critical role in enabling practical learnability of information-theoretic dependencies.
We present AdaPrivate-TS, a differentially private contextual bandit algorithm that combines Thompson Sampling with batched zCDP composition. Our key insight is that differential privacy noise inflates the posterior covariance in a structured way: adding Gaussian noise $N(0,\sigma^2 I)$ to $b$ yields sampling covariance $v^2 A^{-1} + \sigma^2 A^{-2}$, which Thompson Sampling interprets as increased uncertainty rather than pure corruption. Under event-level privacy (protecting individual interactions) with stochastic contexts, we prove that the privacy cost is only $O(\sqrt{d}\,\log T/\sqrt{\rho})$, logarithmic in $T$, because parallel composition amortizes noise across batches. Additionally, we explore privacy amplification via Poisson subsampling, which can reduce effective noise at stringent privacy budgets. Experiments on synthetic and real-world datasets demonstrate: (1) AdaPrivate-TS achieves 93-99% of non-private performance at $\varepsilon \in [0.5, 5]$, outperforming UCB by 0.5-3.7% and up to 18% with tuned adaptive exploration at extreme $\varepsilon$; (2) privacy amplification provides additional 2-5% gains at low $\varepsilon$; (3) on MovieLens and Jester, AdaPrivate-TS achieves the best overall performance among event-level baselines, dominating at $\varepsilon \geq 2$; (4) under DP-SVD private features, TS's advantage over UCB grows to +11%, confirming noise-as-uncertainty is not limited to reward privacy. We provide rigorous proofs for privacy guarantees under interactive zCDP composition and comprehensive evaluation including convergence curves, 12-seed CIs, and DP-SVD feature ablation.
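The covariance expression quoted above can be recovered in two lines. This is a sketch under assumptions not stated in the abstract: the point estimate is taken to be $\hat\theta=A^{-1}b$ with $A$ symmetric positive definite and not itself perturbed, the privacy noise is $\eta\sim N(0,\sigma^2 I)$, and the Thompson draw uses an independent $z\sim N(0,I)$. The symbols $\hat\theta$, $\eta$, $z$, $\tilde\theta$ are introduced here for illustration.

```latex
\[
\tilde\theta \;=\; A^{-1}(b+\eta) \;+\; v\,A^{-1/2}z
            \;=\; \hat\theta \;+\; A^{-1}\eta \;+\; v\,A^{-1/2}z ,
\]
\[
\operatorname{Cov}(\tilde\theta \mid A,b)
  \;=\; A^{-1}\bigl(\sigma^{2}I\bigr)A^{-1} \;+\; v^{2}A^{-1/2}A^{-1/2}
  \;=\; \sigma^{2}A^{-2} \;+\; v^{2}A^{-1}.
\]
```

Read this way, the privacy noise enters as an extra, structured term in the sampling covariance, which is the sense in which it can be treated as additional uncertainty.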
Plugging predictions of unknown parameters into downstream optimization problems, often referred to as the ``predict-then-optimize'' paradigm, has long been a standard approach in decision-making under uncertainty. However, improved predictive accuracy does not, in general, translate into improved decision quality. This disconnect has motivated growing interest in decision-focused learning (DFL) within the operations research community. This tutorial reviews recent developments in DFL and highlights key methodological insights, with a particular focus on stochastic linear programming as the downstream decision-making problem. We discuss why several widely used tools in traditional statistical learning are not directly suited to decision-focused settings and must be rethought, including (i) data collection strategies driven purely by predictive uncertainty and (ii) distributional distance measures such as the Wasserstein distance. We summarize properties of DFL that distinguish it from conventional predictive modeling and provide insights into the development of new decision-focused tools.
Clustering is a fundamental problem in statistics and machine learning. We propose the first one-bit clustering method for two-component sub-Gaussian mixture models. The method uses only one bit per entry of each sample obtained via a dithered quantizer. Under a mild non-spikiness condition on the cluster centers, we show that a variant of Lloyd's algorithm achieves a misclassification rate that decays exponentially with a signal-to-noise ratio comparable to that in the unquantized setting. This result further implies exact recovery under an explicit separation condition, which exceeds the optimal threshold for unquantized data by only a logarithmic factor. When the dimension $p$ is sufficiently large, the non-spikiness condition can be enforced by applying a random rotation using a Haar distributed matrix prior to quantization. In particular, it holds with high probability when $p \gtrsim 1$ for partial recovery and $p \gtrsim \log n \log\log n$ for exact recovery, where $n$ is the sample size. We also establish a minimax lower bound, showing that the misclassification rate and separation condition exhibit sharp constants in general. Numerical results are provided to corroborate the theory and demonstrate the efficacy of the proposed method.
We study fixed-precision ranking-and-selection in structured settings where the answer may be non-unique and where noisy estimates may temporarily admit no valid answer at all. This phenomenon arises naturally in problems such as multi-fidelity ranking-and-selection and identifying a Condorcet winner from pairwise comparisons. To address this, we propose a unified framework based on answer-wise acceptance sets, restricted generalized likelihood ratio stopping, and an answer-pitfall decomposition that yields a max-max-min characteristic value and a common sampling principle. We introduce ENDS, a general procedure that combines estimation, nomination, pitfall detection, and cost-aware information-directed selection. We instantiate ENDS for various problems by deriving explicit formulas. Extensive numerical experiments show that this unified recipe performs well across a broad range of pure-exploration problems and offers a practical framework and proof-of-concept algorithmic recipe.
We study high-dimensional differentially private (DP) covariance estimation in the operator norm, and principal component analysis (PCA), under $k$-row-column sparsity ($k$-RCS) of the covariance matrix. In the non-private setting, it is known that $\mathsf{poly}(k, \log d)$ samples suffice to solve both of these problems. However, the only comparable result known under DP (Wang et al. 2021) requires $\Omega(d)$ samples under standard parameterizations of the problem. We investigate when this curse of dimensionality is inherent for sparse covariance estimation tasks under DP. On the upper bound front, we show that a $\mathsf{poly}(k, \log d)$ sample complexity for PCA is possible under DP, if we also posit sparsity of the leading eigenvector. We complement this result with $\mathsf{poly}(d)$ lower bounds under DP for both sparse covariance estimation and PCA, establishing an exponential gap between the private and non-private variants of these problems when $k = \mathsf{polylog}(d)$. To our knowledge, no such separation has previously been demonstrated for any sparse estimation problems in private high-dimensional statistics. Our techniques are flexible enough that they imply stronger lower bounds even for the well-studied problem of standard DP PCA, without sparsity assumptions.
Fingerprints are the most widely deployed biometric. Verifying whether two impressions come from the same finger typically relies on minutiae, small landmarks such as skin ridge endings and bifurcations. These landmarks are extracted through a multi-stage pipeline of image enhancement, skeletonization, minutiae detection, and alignment. We investigate an alternative: using topological data analysis to represent the full pattern of skin ridges and valleys directly, bypassing minutiae detection and the downstream matching pipeline. We apply persistent homology, a topological tool that tracks how loops in the ridge pattern form and fill in across spatial scales, producing multi-scale summaries of ridge geometry. We develop and compare a range of verification methods on a standard benchmark dataset, FVC2000 DB1. Even the simplest topological summaries, with no trained parameters, substantially outperform geometry-only baselines. A trained method achieves an AUC of 0.91, while an optimal-transport method excels at the strictest false-accept thresholds, suggesting they capture different aspects of the ridge pattern. Fusing these two approaches yields the best performance at every low false-accept threshold we examine. Our results establish that these topological summaries capture substantial fingerprint identity information, far more effective for verification than raw pixel-level geometry. Because the entire pipeline is openly specified, it offers a transparent complement to minutiae-based systems, and we provide a modular framework for constructing, evaluating, and combining topological verification methods.
Public randomness is a security primitive whose deployed behavior is often observable only through a finite transcript. We study black-box auditing of $k$-subset draws from $m$ labels under the exact uniform-without-replacement null. The outcome space has size $\binom{m}{k}$, and unrestricted uniformity testing therefore requires $\Theta(\sqrt{\binom{m}{k}}/\varepsilon^2)$ samples, establishing an information-theoretic limit on transcript-only certification. For structured faults, we construct generator-agnostic audits on the hypersimplex using marginal chi-square, pair maxima, serial overlap, anchored-box discrepancy, and low-dimensional $H_0$/MST geometry, all calibrated under the exact combinatorial null. We also prove a finite-witness guarantee whose sample complexity depends logarithmically on the number of audited witnesses rather than on the full support size. Across observed and reference-source audits, no statistic remains significant after false-discovery correction (minimum BH $q=0.279$). GPU Monte Carlo experiments, using up to 300,000 null and 60,000 alternative replications per condition, show that marginal-preserving deviations can evade one-dimensional tests while remaining detectable through joint geometry. At $n=1{,}956$, a block-cluster alternative of strength 0.04 yields power 0.638 for pair maxima versus 0.051 for marginal chi-square; a band-repulsion alternative of strength 0.08 yields power 0.741 for anchored boxes versus 0.051. These results characterize which structured deviations finite public transcripts can detect and the sample sizes required for doing so.
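To get a feel for the information-theoretic limit, the short sketch below evaluates the outcome-space size $\binom{m}{k}$ and the scale $\sqrt{\binom{m}{k}}/\varepsilon^2$. The function name and the example parameters are hypothetical, and the constant hidden by $\Theta(\cdot)$ is not specified in the abstract, so the second number is an order-of-magnitude scale only, not a sample-size recommendation.

```python
from math import comb, sqrt


def uniformity_sample_scale(m: int, k: int, eps: float) -> tuple[int, float]:
    """Return (support size, sqrt(support) / eps^2) for k-subsets of m labels.

    The second value is the Theta-scale from the text with the unknown
    constant set to 1 (an assumption made purely for illustration).
    """
    support = comb(m, k)              # number of possible k-subsets
    scale = sqrt(support) / eps ** 2  # sqrt(C(m, k)) / eps^2
    return support, scale


if __name__ == "__main__":
    # Hypothetical example: 6 labels drawn from 49, distance parameter 0.1.
    support, scale = uniformity_sample_scale(49, 6, 0.1)
    print(support, round(scale))
```

Even for moderate $m$ and $k$ the scale grows quickly, which is consistent with the abstract's move to structured, witness-based audits.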
This paper introduces statistical inference procedures for unit-specific coefficients in panel data models, where the coefficients exhibit a latent group structure. The proposed methods achieve efficiency gains by clustering units into a small number of groups, while explicitly accounting for the statistical uncertainty of group assignments. The core idea is to integrate standard inference procedures, such as the $t$-test and Wald tests, with confidence sets for group membership. Two methods are proposed: the first takes the minimum of the test statistics over the confidence set for group membership, and the second corrects for bias caused by possible group misassignment. The former can produce shorter but possibly disconnected sets, while the latter guarantees connected, interpretable intervals at some cost in length. We also develop standard errors that are adjusted for possible group misassignment and valid even with short time periods, which may be of independent interest. Monte Carlo simulations demonstrate that our approach yields narrower confidence sets for units with relatively large error variances than unit-by-unit time-series methods. In contrast, ignoring statistical uncertainty in the group membership estimation leads to distortions in size and coverage. We illustrate the method with an empirical example that estimates the effect of the minimum wage in each U.S. state.
For integers $n\ge d+1$, let $\mathsf{M}_d(n)$ denote the maximum size of a $(d+1)$-uniform family on an $n$-element ground set with VC-dimension at most $d$. For $n\ge2d+2$, the classical construction of Ahlswede and Khachatrian, later generalized by Mubayi and Zhao, gives \[ \mathsf{M}_d(n)\ge \binom{n-1}{d}+\binom{n-4}{d-2}. \] We introduce a two-cover lifting construction and prove the recursive lower bound \[ \mathsf{M}_d(n)\ge \binom{n-1}{d}+\binom{n-4}{d-2}+\mathsf{M}_{d-3}(n-5) \] for every $d\ge 3$ and $n\ge d+3$. Consequently, \[ \mathsf{M}_d(n)\ge \binom{n-1}{d}+\binom{n-4}{d-2}+\binom{n-6}{d-3}. \] Thus the Mubayi--Zhao conjecture on the exact value of $\mathsf{M}_d(n)$ for $n\ge2(d+2)$ is false for any $d\ge 3$. The proof is elementary and proceeds entirely through an explicit analysis of traces.
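The step from the recursive bound to the explicit binomial bound can be filled in as follows. This is a sketch of one natural route, assuming the standard star-family lower bound; the abstract does not state which argument the authors use. Take all $(d-2)$-subsets of an $(n-5)$-element ground set that contain a fixed element $x$. No $(d-2)$-set $S$ is shattered: realizing the trace $S$ forces $S$ to be a member, hence $x\in S$, but then no member has empty trace on $S$.

```latex
\[
\mathsf{M}_{d-3}(n-5)\;\ge\;\binom{(n-5)-1}{d-3}\;=\;\binom{n-6}{d-3},
\qquad n-5\ge d-2 ,
\]
\[
\mathsf{M}_d(n)\;\ge\;\binom{n-1}{d}+\binom{n-4}{d-2}+\mathsf{M}_{d-3}(n-5)
\;\ge\;\binom{n-1}{d}+\binom{n-4}{d-2}+\binom{n-6}{d-3}.
\]
```

The condition $n-5\ge d-2$ is the same as $n\ge d+3$, matching the range stated for the recursive bound, and the extra term $\binom{n-6}{d-3}$ is at least $1$ in that range.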
One of the powerful techniques in data modeling is accounting for features that are available at the training stage, but are not available when the trained model is used to classify or predict test data -- the Learning Using Privileged Information (LUPI) paradigm. Sequential Minimal Optimization (SMO) methods have been developed for supervised Support Vector Machines (SVM), unsupervised one-class SVM, and SVM with privileged information (SVM+). The missing brick in this research has long been a one-class SVM with privileged information (OC-SVM+). In this paper, we propose an SMO algorithm for OC-SVM+ that significantly outperforms non-sequential algorithms for training the OC-SVM+ model. Its finite-time convergence is established. The experiments show how privileged information affects a descriptive domain in the space of original features. Comparative benchmark tests demonstrate that our algorithm is superior to interior point algorithms.
Predictive dependence in time series need not be confined to the conditional mean. Outside the Gaussian setting, causal content may arise through conditional scale, tail behavior, asymmetry, or other distributional features, implying that no single Granger-type test provides a complete characterization of predictive dependence. This paper develops a framework for distributional Granger causality based on a finite collection of channel-specific restrictions. Under suitable determinacy conditions, the channel menu is shown to be complete, yielding an identification result that links distributional Granger non-causality to a finite set of testable hypotheses. Building on this representation, we develop an adaptive sequential testing procedure that allocates inferential resources across channels while maintaining familywise error control through an alpha-investing mechanism. A policy-invariant validity theorem establishes finite-sample size control under arbitrary admissible selection rules, while an asymptotic efficiency theorem shows that a confidence-bound allocation rule achieves power equivalent to that of an infeasible oracle benchmark. The theoretical guarantees are derived from primitive mixing and moment conditions together with a circular-block permutation scheme.
The integration of multiple thematic data layers into a single composite map, known as the cartographic synthesis problem, is typically addressed through expert-driven weighting schemes. This study presents a multi-objective formulation of cartographic synthesis grounded in spatial autocorrelation structure. We develop a bi-objective evolutionary framework, GIS-moGA, that estimates layer weights by simultaneously maximizing global spatial structure, measured by Global Moran's I, and minimizing local spatial heterogeneity, measured by the variance of Local Indicators of Spatial Association (LISA). Because naive evaluation of spatial relationships requires O(N^2) operations, direct computation becomes impractical for larger datasets. We address this challenge by exploiting the 97.7% sparsity of queen contiguity matrices, reducing effective complexity to O(N k) and enabling scalable municipal-level analysis. The framework is evaluated on a high-dimensional spatial epidemiology dataset with N = 523 units from Araraquara, Brazil. A 64-scenario experimental design is used to examine evolutionary behavior across parameter settings. Results show that higher mutation rates are important for maintaining population diversity and preventing premature convergence in spatially autocorrelated fitness landscapes, where crossover operators can disrupt geographically coherent structures. Compared with expert-derived Analytic Hierarchy Process baselines, the resulting Pareto fronts show substantial hypervolume gains and significant improvements in spatial coherence (p < 0.001, Cliff's delta = 0.87). These findings provide a systematic and scalable framework for data-driven geographic multi-criteria decision analysis.
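The complexity reduction described above comes from summing only over contiguous pairs. The sketch below illustrates that idea for Global Moran's I on a regular grid with binary queen contiguity, stored as neighbor lists. It is an illustration only, not the GIS-moGA implementation: the function names, the grid layout, and the unstandardized binary weights are all assumptions made here.

```python
def queen_neighbors(rows: int, cols: int) -> dict[int, list[int]]:
    """Neighbor lists for a rows x cols grid under queen contiguity."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            nbrs[r * cols + c] = [
                (r + dr) * cols + (c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
    return nbrs


def morans_i(values: list[float], nbrs: dict[int, list[int]]) -> float:
    """Global Moran's I with binary weights, visiting only stored neighbors.

    Cost is proportional to the number of neighbor pairs (about N * k for
    average degree k) rather than N^2.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(len(js) for js in nbrs.values())  # sum of all weights
    cross = sum(z[i] * z[j] for i, js in nbrs.items() for j in js)
    denom = sum(zi * zi for zi in z)
    return (n / s0) * cross / denom


if __name__ == "__main__":
    nbrs = queen_neighbors(4, 4)
    gradient = [float(c) for r in range(4) for c in range(4)]
    checker = [float((r + c) % 2) for r in range(4) for c in range(4)]
    print(morans_i(gradient, nbrs), morans_i(checker, nbrs))
```

On the toy grid, a smooth left-to-right gradient gives a positive index and a checkerboard gives a negative one, as expected for clustered versus alternating patterns. In a weighted-layer setting, `values` would be the composite surface produced by a candidate weight vector.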
Adam is one of the most widely implemented and influential modern optimizers. Why is it effective across different optimization problems in practice? This has arguably been a central question for the optimization community over the last decade and has motivated a substantial body of work aimed at understanding its convergence behavior. However, existing studies have mainly focused on the convergence rate of Adam in smooth nonconvex optimization, which does not adequately capture practical settings, since many real-world problems are nonsmooth, such as those arising in training neural networks. Thus, these studies cannot fully explain the popularity and empirical success of Adam. Recently, a powerful framework called Online-to-Nonconvex Conversion has opened a new way to analyze Adam for nonsmooth nonconvex optimization. Unfortunately, prior works along this line share two common limitations. First, all of them ignore the important bias-correction term in the original Adam algorithm. Second and more importantly, many of them require extra operations that are not used in Adam, such as a clipping step. Therefore, the convergence guarantee for the original Adam method remains unclear. In this work, we present the first finite-time analysis for the classical form of Adam, i.e., with the bias-correction step and without further algorithmic modifications, and prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{\frac{2}{13}}$ for nonsmooth nonconvex optimization. Moreover, our result provably applies to the modern heavy-tailed noise regime, which is closer to practice. Interestingly, our theory is established under the parameter choice $\beta_1=\beta_2$, aligning with recent empirical studies.
Attention mechanisms have demonstrated remarkable empirical success in identifying relevant information from large collections of tokens, yet the theoretical principles underlying this behavior remain poorly understood. We study a stylized softmax-attention model in which a query vector is learned by stochastic gradient ascent from a collection of informative and nuisance tokens. Exploiting the symmetry of the model, we derive a population objective and characterize the limiting ordinary differential equation governing the learning dynamics. Using tools from stochastic approximation and dynamical systems theory, we establish a rigorous connection between the stochastic learning algorithm and its deterministic limit. Our main result shows that, under suitable high-dimensional scaling assumptions and standard step-size conditions, the learned query converges almost surely to the one-dimensional signal subspace spanned by the latent informative direction. Equivalently, the query asymptotically recovers the latent signal up to the intrinsic sign ambiguity. These results provide a rigorous theoretical foundation for understanding attention mechanisms as signal extraction procedures in high-dimensional noisy environments and offer a dynamical-systems perspective on how attention discovers relevant information in the presence of substantial noise.
We propose a new treatment of nonlinear regression with serially correlated disturbances that incorporates autoregressive moving average structures into feedforward neural networks. The resulting model provides an alternative to modeling temporal dependence using lagged variables. In simulations, the proposed method accurately recovers regression functions of varying complexity and the underlying error dynamics across a range of time-series lengths and signal-to-noise ratios. Finite-sample properties and out-of-sample predictive performances are shown to be robust to model misspecification induced by omitted lagged variables and incorrect specification of the error dynamics. Cloud cover is an important factor in climate projections. In an empirical study of cloud cover prediction for a grid of locations within and around the Mediterranean Sea, our proposed model yields more accurate predictions than existing methods, including long short-term memory networks. Improvements are observed broadly and are particularly pronounced in mountain areas relative to linear models with serially correlated errors, consistent with the presence of stronger nonlinear effects in cloud composure in such regions.
In open-ended generation, LLMs frequently fall into the "likelihood trap", marked by repetitive degeneration and vocabulary dullness, creating a discrepancy between machine-generated and human-written text. While post-hoc tail truncation (e.g., Top-$p$, Min-$p$) avoids sampling from the unreliable tail, it can over-sample from the uncalibrated head and misalign generation with human lexical preferences; fixed scalar repetition penalties likewise ignore variation in logit scale across inference steps, potentially disrupting semantic coherence. To address both limitations, we propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that reshapes the probability distribution before truncation through two dynamic mechanisms: (1) Contextual Searchlight via PMI, which suppresses global stopwords while elevating context-evoked tokens, and (2) Adaptive Self-Debiasing, which uses real-time logit standard deviation for scale-invariant penalization. Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. With negligible computational overhead, VCM integrates with existing decoding strategies, improving diversity, coherence, and, particularly at higher decoding temperatures, reasoning accuracy.
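The idea of scale-invariant penalization can be illustrated with a small sketch. This is not the VCM formula, which the abstract does not give: the function name, the `strength` parameter, and the choice to subtract a fixed multiple of the logit standard deviation from already-generated tokens are all assumptions made here to show what measuring a penalty in units of the current logit scale might look like.

```python
from statistics import pstdev


def std_scaled_repetition_penalty(
    logits: list[float], seen_token_ids: list[int], strength: float = 1.0
) -> list[float]:
    """Lower the logits of previously generated tokens by strength * std(logits).

    Hypothetical illustration: because the penalty is expressed in units of
    the step's own logit standard deviation, rescaling all logits by a
    positive constant rescales the penalty by the same constant.
    """
    sigma = pstdev(logits)  # population standard deviation of this step's logits
    penalized = list(logits)
    for token_id in set(seen_token_ids):  # penalize each seen token once
        penalized[token_id] -= strength * sigma
    return penalized


if __name__ == "__main__":
    step_logits = [2.0, 0.5, -1.0, 3.0]
    print(std_scaled_repetition_penalty(step_logits, [3, 3, 0], strength=0.5))
```

A fixed scalar penalty would not have this property: the same constant is a large correction when logits are tightly clustered and a negligible one when they are widely spread.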
We develop a framework for learning dependence structures from empirical dependence operators. Rather than treating cluster, factor, and sparse dependence as maintained assumptions, we represent them as covariance geometries in a common Hilbert space and summarize dependence through a low-dimensional dependence profile based on projection similarity scores. We establish identification under a principal-angle separation condition, prove consistency and asymptotic normality of the estimated profile, and derive finite-sample classification error bounds. We further show that when covariance-geometry tangent spaces overlap, no statistical procedure can distinguish the geometries at first order, providing a formal characterization of ambiguous dependence structures. Projection-residual diagnostics assess absolute goodness-of-fit and detect misspecified covariance dictionaries. Finally, we establish oracle adaptivity of profile-guided inference: dependence profiles can be used to select dependence-robust procedures in a data-driven manner, yielding inference that is asymptotically equivalent to an infeasible oracle that knows the dominant covariance geometry in advance.
We prove an inverse-polynomial spectral-gap bound for the lazy swap chain on binary matrices with prescribed row and column sums. This chain is a standard sampler for fixed-margin null models in ecology, statistics, and network analysis, and its rapid mixing for arbitrary feasible margins was conjectured by Kannan, Tetali, and Vempala in 1997. We show that for every feasible set of margins on an $m\times n$ binary matrix, the lazy swap chain has spectral gap at least $$ \binom{m}{2}^{-1}\binom{n}{2}^{-1}, $$ which is tight in the worst case. The proof compares the swap chain with a two-row heat-bath chain, reduces the analysis from arbitrary $m\times n$ matrices to the case of three rows, and proves the resulting three-row inequality by decomposing functions according to the column-count variable and the associated Johnson harmonic sectors. The proof itself was generated by ChatGPT 5.5 Pro. ChatGPT proposed the whole proof strategy, including the comparison with the two-row heat-bath chain, the reduction to the three-row case, and the decomposition of the three-row function space into the count sector and the Johnson harmonic sectors. It also generated all the technical lemmas and initial proofs. The author's role was to pose the problem, guide the search direction, evaluate the AI-generated arguments, rewrite the proof, and take responsibility for the final form and validity of the result.
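For readers unfamiliar with the chain, the sketch below shows one common form of a lazy swap step together with a helper that evaluates the stated lower bound. The details are assumptions made for illustration: the holding probability of 1/2, the uniform choice of two rows and two columns, and the function names are not taken from the abstract.

```python
import random
from math import comb


def lazy_swap_step(M: list[list[int]], rng: random.Random) -> None:
    """One lazy swap move on a binary matrix, modifying M in place.

    Picks two rows and two columns; if the 2x2 submatrix is a checkerboard
    ([[1, 0], [0, 1]] or [[0, 1], [1, 0]]), flips it to the other
    checkerboard. Row and column sums are unchanged by construction.
    """
    m, n = len(M), len(M[0])
    i, j = rng.sample(range(m), 2)
    a, b = rng.sample(range(n), 2)
    if rng.random() < 0.5:  # laziness: hold with probability 1/2 (assumed)
        return
    is_checkerboard = (
        M[i][a] == M[j][b] and M[i][b] == M[j][a] and M[i][a] != M[i][b]
    )
    if is_checkerboard:
        M[i][a], M[i][b] = M[i][b], M[i][a]
        M[j][a], M[j][b] = M[j][b], M[j][a]


def spectral_gap_lower_bound(m: int, n: int) -> float:
    """Evaluate the bound 1 / (C(m, 2) * C(n, 2)) quoted in the text."""
    return 1.0 / (comb(m, 2) * comb(n, 2))


if __name__ == "__main__":
    rng = random.Random(0)
    M = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
    for _ in range(1000):
        lazy_swap_step(M, rng)
    print(M, spectral_gap_lower_bound(3, 4))
```

Running many steps from any feasible matrix leaves the margins fixed, which is what makes the chain usable as a sampler for fixed-margin null models.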
We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $\mu_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))\mu_{\mathsf{ref}}(dy)$, and take the Markov chain starting from $\mu_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(\mu_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat{f}(y)=\int \mu_{\mathsf{ref}}(dx) U(x, y)$, and starting from the second iteration, both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.
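On a finite set of outputs the kernel definition reduces to a row-normalized matrix. The sketch below builds that matrix and iterates the chain from the reference distribution; it is a discrete illustration under the assumption of finitely many outputs, and the function names are hypothetical rather than taken from the paper.

```python
import math


def mchf_kernel(U: list[list[float]], mu_ref: list[float]) -> list[list[float]]:
    """Row-stochastic matrix with P[x][y] proportional to exp(U[x][y]) * mu_ref[y]."""
    kernel = []
    for row in U:
        weights = [math.exp(u) * p for u, p in zip(row, mu_ref)]
        total = sum(weights)
        kernel.append([w / total for w in weights])
    return kernel


def step(dist: list[float], kernel: list[list[float]]) -> list[float]:
    """Push a distribution through the kernel: (dist P)[y] = sum_x dist[x] P[x][y]."""
    n = len(kernel)
    return [sum(dist[x] * kernel[x][y] for x in range(n)) for y in range(n)]


def run_chain(U: list[list[float]], mu_ref: list[float], iterations: int) -> list[float]:
    """Distribution of the chain after `iterations` steps, started from mu_ref."""
    kernel = mchf_kernel(U, mu_ref)
    dist = list(mu_ref)
    for _ in range(iterations):
        dist = step(dist, kernel)
    return dist


if __name__ == "__main__":
    f = [0.0, 1.0, 2.0]                          # hypothetical scalar reward
    U = [[fy - fx for fy in f] for fx in f]      # purely additive utility
    print(run_chain(U, [1 / 3] * 3, 1))
```

In the example, $U(x,y)=f(y)-f(x)$ is purely additive, so every row of the kernel is the same and a single step already lands on the distribution proportional to $\exp(f(y))\,\mu_{\mathsf{ref}}(y)$. Utilities with a non-additive component make the rows differ, and more steps are then needed.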
Mapping the directed flow of information between brain regions -- their effective connectivity -- is central to understanding brain function, yet large-scale recordings sample only a fraction of the brain at a time: sessions, animals, and laboratories cover different, partially overlapping regions, usually without a shared temporal reference. Established directed-connectivity methods (Granger causality, dynamic causal modeling, and partial directed coherence [PDC]) require all regions to be recorded simultaneously and with a common clock. We introduce SPIDER, a non-parametric, frequency-domain framework that recovers directed information flow from such incomplete, asynchronous recordings: it stitches local power-spectral estimates from overlapping channel subsets into a global spectral matrix and obtains frequency-resolved directed interactions by canonical spectral factorization and PDC, without temporal alignment, while nuclear-norm completion fills in never-co-observed region pairs. With consistency guarantees, we validate SPIDER on simulations, two-photon calcium imaging, and the International Brain Laboratory Neuropixels dataset, recovering directed flow among 50 areas from 43 sessions in 12 laboratories never recorded together. Beyond validation, SPIDER reveals what no single recording can: brain-wide spontaneous flow is largely recurrent, but in the theta band it forms a significant feedforward hierarchy with the hippocampal formation at its source. Applied to resting human intracranial EEG (43 patients, non-overlapping coverage), it recovers the same theta-band hierarchy across species and modality. SPIDER makes whole-brain effective-connectivity analysis tractable for multi-session, multi-animal datasets previously incompatible with directed-flow inference.
Statistical matching combines partially overlapping datasets that share covariates $X$ but observe the target $Y$ and auxiliary variables $Z$ separately. Classical approaches typically invoke the conditional independence assumption (CIA), which makes the problem identifiable but fundamentally implies that the imported auxiliary variable provides no additional predictive power for $Y$ once $X$ is known. To capture this latent $Y$--$Z$ dependence, we propose a novel dependency-aware Schrödinger bridge for predictive statistical matching. Our approach couples the two separated databases by tilting the conservative CIA baseline with a transportation-based compatibility cost, recovering an informative joint distribution. The resulting statistical learning framework yields full probabilistic posterior rules for bidirectional imputation. Theoretically, we establish a sufficient condition under which the learned bridge strictly improves over the CIA baseline, alongside an exact joint recovery guarantee in the Gaussian setting under an appropriate cost. Across synthetic benchmarks and real-world datasets (CelebA and Adult), we demonstrate that our dependency-aware completion consistently improves downstream predictive utility, proving especially beneficial in settings like data recoding where the underlying population exhibits strong $Y$--$Z$ dependence.
Learning instability is a long-standing problem across machine learning, but it is especially acute in the overparameterized regime that defines modern deep learning: large models fine-tuned or trained on limited data traverse flat loss landscapes with many nearly-equivalent minima, and stochastic factors (initialization, data order, dropout, hardware non-determinism) can route optimization to very different solutions. The rise of large pretrained models (LPMs) makes the problem more urgent: training cost is high, downstream data is often small, and repeated runs for variance reduction are prohibitive. We introduce \textbf{GRAIN} (\textbf{G}roup \textbf{A}ggregation via m\textbf{IN}-norm objective), a lightweight training algorithm that replaces the mean aggregation used in mini-batch optimization (both across mini-batches and within a mini-batch) with a min-norm convex combination of group-wise gradients. GRAIN guarantees a non-negative inner product between the aggregated update and every group gradient, resolving intra- and inter-batch gradient conflict, and retains an $\mathcal{O}(1/T)$ convergence rate comparable to SGD. Under mild smoothness and absolute-continuity assumptions, the min-norm solution differs almost surely from the arithmetic mean, which yields a uniform-stability bound for GRAIN strictly tighter than the standard bound for SGD. Empirically across generation, classification, and regression at LPM scale, GRAIN delivers consistent improvements in mean performance and reductions in run-to-run variance over a broad suite of tasks, with no extra training-time or storage cost beyond a single backward pass.
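As a rough illustration of the min-norm aggregation idea in the abstract above, the following sketch handles only the simplest case of two group gradients, where the min-norm convex combination has a closed form. The function name `min_norm_pair` and the toy gradients are assumptions made for illustration; this is not the authors' implementation, which aggregates over more general groups.

```python
import numpy as np

def min_norm_pair(g1, g2):
    """Min-norm point on the segment {a*g1 + (1-a)*g2 : a in [0, 1]}.

    Hypothetical helper: a two-gradient special case of a min-norm
    convex combination, not the algorithm from the paper.
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: every convex weight gives the same vector.
        return 0.5, g1.copy()
    # Unconstrained minimiser of ||g2 + a*(g1 - g2)||^2, clipped to [0, 1].
    a = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return a, a * g1 + (1.0 - a) * g2

# Toy example with conflicting gradients (values chosen for illustration).
g1 = np.array([1.0, 0.0])
g2 = np.array([-3.0, 1.0])
mean_update = 0.5 * (g1 + g2)
weight, min_norm_update = min_norm_pair(g1, g2)

print("mean . g1     =", mean_update @ g1)      # negative: mean conflicts with g1
print("min-norm . g1 =", min_norm_update @ g1)  # non-negative
print("min-norm . g2 =", min_norm_update @ g2)  # non-negative
```

In this toy case the arithmetic mean has a negative inner product with the first gradient, while the min-norm combination has a non-negative inner product with both, which is the property the abstract attributes to the aggregated update.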
Most tools for measuring political positions (manifesto coding, expert surveys, text-scaling models) were built and validated on Western party systems; outside that setting they work poorly, and often not at all. This paper is an attempt at a method for those settings. It treats a large language model not as a measurement device but as a single, fallible rater in a panel, roughly the way an expert survey treats one expert: the value comes from pooling many judges rather than trusting any one of them. I describe the panel, an applicability rule that keeps a score of zero distinct from a blank, and a lens system that separates what an actor says from what it does. I report three results. First, holding a definition-free round fixed, adding written axis definitions moves scores by a mean of 1.8 points on a 21-point scale and tightens agreement between raters (mean absolute gap 2.81 to 2.50; r 0.81 to 0.89); they make two independent raters agree more closely, which an arbitrary steer would not. Second, across nine models from eight laboratories in two countries, Krippendorff's alpha is 0.86 on both an interval and an ordinal metric, and it stayed put as the panel grew from five raters to nine. That is reliability, the reproducibility of a reading, and not validity, its correctness. Third, where the panel does disagree, the disagreement is informative: the sharpest split, a full-scale divergence on an actor's stance toward its state's foundational order, points to a referent problem, and a blind triple-coding puts about two-thirds of it down to interpretation rather than error. I try to be plain about what the method can't do, including the human validation it still lacks, and I release the instrument and data in full. The worked example is the Middle East and North Africa, but I'd expect the method to carry to any region these standard tools leave out.
We study mirror flows generated by a convex quadratic loss and a general convex lower semicontinuous mirror potential. We show that, when initialized near the boundary of the domain of the mirror potential, their rescaled trajectories converge to a limiting mirror flow whose potential is the indicator function of the domain. In this limit, the primal variable minimizes the loss over a time-dependent hypothesis set: the subdifferential of the support function of the domain, evaluated at the dual variable. This characterization provides a general mechanism for incremental learning in mirror flows.
We study Transformer-type residual layers under cross-entropy training through a continuous-depth mean field control viewpoint. Depth is treated as time, layer parameters as controls, and the residual Transformer recursion as an explicit Euler scheme for a controlled hidden-state flow. For fixed controls, we prove an $O(\varepsilon)$ pathwise approximation of finite-depth trajectories by the continuous flow and combine this with high-probability sampling bounds for the empirical cross-entropy risk. We formulate the limiting population problem as a first-order transport control problem for the law of hidden states and derive a Pontryagin condition whose terminal adjoint contains the softmax residual. We also give finite-class and metric-entropy uniform estimates, compare optimal values, and discuss existence, stability, continuous-to-discrete recovery, initialization, and range estimates for continuous minimizers.
In this paper we prove some concentration inequalities for two types of error probabilities in the Empirical Risk Principle (ERP) in statistical learning, which provide a lower bound and an upper bound for the minimal risk (in terms of the minimal empirical risk) with non-asymptotic high confidence. The usual boundedness condition of the empirical risk function is relaxed to the Gaussian or exponential integrability condition. The confidence of the lower bound of the minimal risk is shown to be independent of the number of training parameters and the dimension of the input vectors, allowing one to detect the deficiency of a learning machine efficiently; and the confidence of the upper bound of the minimal risk is proved to be high provided that the sample size $n$ is much greater than the box dimension of the parameter set $\Theta$ in the Orlicz metric $d_{\psi_1}$ associated with the risk functions. Our work is based on Talagrand's concentration inequalities (the sharp versions by Bousquet and Klein-Rio), transport-entropy inequalities and the recent progress in the theory of empirical processes and statistical learning.
We prove that the Kim--Milman flow map enjoys favorable stability properties with respect to variations in the target measure, provided that one of the target measures is sufficiently regular. Our results include stability in relative entropy, and more notably, Lipschitz stability in the $2$-Wasserstein distance up to a logarithmic factor. We complement our results with a general existence theorem for these maps for any target measure with finite second moment.
Higher-order structures are powerful relational modeling tools, yet existing spectral operators decompose the topology into separate ranks, leaving practitioners to fuse the information back to vertices through ad hoc choices. We introduce Collapsed Effective Operators, which condense higher-order degrees of freedom into a single vertex-level operator via Schur complementation of a graded Laplacian. This yields a (generally dense) operator that encodes long-range interactions mediated by topology and is applicable to arbitrary higher-order constructs. We show it preserves positive semi-definiteness with a spectral upper bound relative to the rank-0 Hodge Laplacian, effectively lowering system energy under higher-order connectivity. Empirically, our operator improves spectral clustering and signal smoothing, and enables the inclusion of topological features in neural network architectures via positional encoding. The project page can be found at this http URL.
Hyperparameter tuning almost always means search: fit the model at every value on a grid, score each by cross-validation, and keep the winner. For spline regression that search is unnecessary. The optimal resolution can be solved for in closed form, to the accuracy an exhaustive search reaches, at a fraction of the compute. Three ingredients make this possible: classical approximation theory pins the squared bias to a known power of the resolution G, exactly the Kolmogorov n-width of the smoothness class; the basis dimension is an explicit polynomial in G; and leave-one-out error follows from a single fit via the PRESS identity. Balancing the two known curves gives the minimizer analytically. We extend this calculus to many coordinates by replacing ambient input dimension with interaction order, the number of active low-order components in an ANOVA decomposition, yielding a scaling law in which the optimal resolution and error are power functions of the effective density (sample size per active component), with input dimension absent from the exponent. The law becomes an algorithm. KORE (Kolmogorov-optimal Order-aware Resolution Estimation) fits two pilot resolutions, solves a leverage-calibrated 2x2 system for the bias and noise scales, and evaluates the closed-form plug-in resolution with a tiny leave-one-out certificate: about a dozen fits instead of a full grid sweep, with a consistency guarantee as the sample grows. Across additive and sparse pairwise targets up to 80 input dimensions, KORE matches exhaustive 3-fold cross-validation and the full classical ladder (GCV, Mallows' Cp, AIC, BIC) while fitting roughly 8x fewer models; on 36 real tabular datasets it ranks first among 21 methods in accuracy per unit of compute, ahead of tuned boosters and kernel machines. When complexity lives in low interaction order, solving for the resolution beats searching for it.
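The PRESS identity mentioned in the abstract above can be checked numerically. The sketch below assumes an ordinary least-squares fit on a generic full-column-rank design matrix (a stand-in for a spline basis, which the abstract does not specify); the function names are hypothetical and this is not the KORE procedure itself.

```python
import numpy as np

def press_loo_residuals(X, y):
    """Leave-one-out residuals of least squares obtained from a single fit.

    Uses the PRESS identity e_i / (1 - h_ii), where h_ii is the i-th
    diagonal entry of the hat matrix. Assumes X has full column rank.
    """
    # Thin QR gives H = Q Q^T, so h_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return residuals / (1.0 - leverage)

def brute_force_loo_residuals(X, y):
    """Reference implementation: refit n times, each with one row held out."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta
    return out

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

fast = press_loo_residuals(X, y)
slow = brute_force_loo_residuals(X, y)
print("max abs difference:", np.max(np.abs(fast - slow)))
```

The two routes agree up to floating-point error, which is why a leave-one-out score costs one fit rather than one fit per observation.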
Recent attempts to combine large language models (LLMs) with causal discovery ask models to infer pairwise directions, propose graph structures, or inject language-model outputs as priors and constraints. These approaches promise faster analysis, but they also obscure whether causal evidence is supported by data and assumptions or by textual associations, prompt artifacts and hallucinated mechanisms. We argue for a different role for agents in causal discovery. Agents should inspect data, retrieve context, explain method assumptions and clarify graph outputs, but they should not supply edges, orientations, priors, constraints or causal conclusions. We propose the principle that agents assist the workflow, while causal claims remain grounded in data, explicit assumptions, formal algorithms, diagnostics and user or domain-expert decisions. We instantiate this principle in causal-learn+, an online platform that coordinates data analysis, preprocessing, method recommendation, expert-knowledge incorporation, formal discovery and interpretation around the algorithmic ecosystem of causal-learn. A case study on Big Five personality data illustrates an agent-assisted causal discovery pipeline that does not turn language-model unreliability into causal evidence. The platform is available at this http URL.
In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.
We present an efficient protocol for learning an unknown $k$-local Lindblad generator on $n$ qubits using only product-state preparations, short-time evolution, and single-qubit Pauli measurements, without prior knowledge of the interaction structure. For fixed $k$ and bounded weighted interaction strength, the protocol estimates all Hamiltonian and dissipative Pauli--GKSL coefficients to entrywise accuracy $\varepsilon$ with probability at least $1-\delta$ using $\widetilde{\mathcal O}_k(\varepsilon^{-2}n^{2k}\log(1/\delta))$ samples and polylogarithmically many evolution times. A semidefinite projection converts these estimates into a valid $k$-local Lindblad generator with diamond-norm error at most $\varepsilon$ using $\widetilde{\mathcal O}_k(\varepsilon^{-2}n^{4k}\log(1/\delta))$ samples and polynomial-time classical postprocessing. If a suitable set of influential coefficients is supplied and satisfies a stable sparsity condition, the dependence on $n$ can improve from polynomial to logarithmic; in particular, exact supports of bounded intersection degree require only $\widetilde{\mathcal O}_k(\varepsilon^{-2}\log(n/\delta))$ samples, with analogous reductions in system-size dependence for sufficiently decaying long-range interactions. We also provide a robust structure-learning procedure, extend the guarantees to model misspecification, and prove complementary sample-complexity lower bounds. To our knowledge, these are the first efficient learning guarantees for general $k$-local dissipative quantum dynamics under such limited experimental control.
AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.
In observational studies, achieving covariate balance in pair matching between treatment and control groups or exposed and unexposed groups is essential. This balance enables testing treatment effects or examining associations between exposures and multivariate response variables in pair-matched data. Paired design studies involve taking multiple measurements for the same subjects under different conditions. All these call for an effective test for multivariate paired data. However, current methods for assessing covariate balance in matched observational studies often ignore the paired structure, leading to reduced performance in some cases. The multivariate paired Hotelling's $T^2$ test can be used for paired data, but its power decreases rapidly as dimensions increase. To address these issues, we propose a new non-parametric test for paired data, significantly improving power across various scenarios. We also derive the test's asymptotic distribution, making it user-friendly for practical applications. Our proposed test's effectiveness is demonstrated through an analysis of real data on Alzheimer's disease research.
The classical multivariate normal means problem remains conceptually unresolved. While shrinkage and empirical Bayes methods improve risk by imposing external geometric or hierarchical structure, they fail to explain how information is shared across independent coordinates for a fixed, unstructured mean vector. We address this gap using the prior-free Inferential Models framework. By formulating a generalized probability integral transform (GPIT) for independent, non-i.i.d. observations combined with a reweighted Anderson-Darling predictive random set, we leverage the global shape of ordered observations for valid, efficient inference. Crucially, the auxiliary structure of this formulation provides a novel explanation for Stein's paradox, demonstrating that the maximum likelihood estimator becomes structurally implausible for $n\geq 3$. To ensure scalability, we introduce an i.i.d. sampling-with-replacement surrogate that connects our exact fixed-mean formulation to overparameterized $g$-modeling. Furthermore, we develop a maximin criterion for combining plausibility contours. Under squared error loss, our estimators are competitive with state-of-the-art auto-modeling methods and outperform classical shrinkage and empirical Bayes methods.
Group number selection is a key problem for group panel data modeling. In this work, we develop a cross-validation (CV) method to tackle this problem. Specifically, we split the panel data into two data folds on the time span with a buffer zone, with group structure preserved for individuals. We first estimate the group memberships and parameters on one data fold, then plug in the estimates and utilize the other data fold to evaluate a designed criterion. Subsequently, the group number is estimated by minimizing the average criterion across all data folds. The proposed CV method has two advantages compared to existing approaches. First, the method is totally data-driven; thus no further model-specific tuning parameters are involved. Second, the method can be flexibly applied to a wide range of panel data models. Theoretically, we establish the estimation consistency by taking advantage of the optimization process on the training data fold. Experiments are carried out with a variety of synthetic datasets and panel models to further illustrate the advantages of the proposed method. Lastly, the CV method is employed to analyze the heterogeneous patterns of stock volatilities in the Chinese stock market during the 2008 financial crisis.
Counting-process notation separates predictable risk-set information from observed event jumps through decompositions of the form $dN(t)=Y(t)\alpha(t)\,dt+dM(t)$. This article develops a unified event-history learning framework for censored, truncated, recurrent, multistate, and covariate-dependent data. Rather than cataloguing survival methods, the treatment translates each partially observed learning target into five recurring objects: risk process, jump process, compensator, estimating equation, and limiting argument. The framework connects right-censored survival curves, product-integral estimators, bivariate and interval-censored survival estimators, log-rank tests, Cox-Andersen-Gill regression, additive hazards, accelerated failure-time models, panel-count data, landmark prediction, semi-Markov models, Bayesian nonparametric transition models, and instrumental-variable methods. The original contribution is threefold. Computationally, the article turns risk-set sweeps, product-integral updates, interval-likelihood calculations, semi-Markov elapsed-time bookkeeping, Bayesian transition-hazard updating, and cross-fitted validation into reusable algorithms and simulation diagnostics. Theoretically, it gives proof templates for the recurring martingale, likelihood, product-integral, and empirical-process arguments, and proves a new out-of-fold compensator validation identity for cross-fitted censored learners. For applications, it maps biomedical, reliability, operational, economic, financial, literary, historical, and causal survival examples onto the same risk-set and compensator language. The resulting account provides a common mathematical language for deriving, checking, and comparing classical and machine-learning methods for censored, recurrent, and multistate event-history data.
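To make the risk-process and jump-process language of the abstract above concrete, here is a minimal sketch of the standard Nelson-Aalen cumulative-hazard estimator for right-censored data, which sums the observed jumps divided by the size of the risk set at each event time. It is a textbook estimator used here for illustration, not one of the article's algorithms; the function name and toy data are assumptions.

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard for right-censored data.

    times  : observed follow-up times
    events : 1 if the time is an observed event, 0 if it is censored
    Returns the distinct event times and the estimate at each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    cum_hazard = np.empty(len(event_times))
    total = 0.0
    for k, t in enumerate(event_times):
        at_risk = np.sum(times >= t)                  # risk process Y(t)
        jumps = np.sum((times == t) & (events == 1))  # jump dN(t)
        total += jumps / at_risk                      # increment dN(t) / Y(t)
        cum_hazard[k] = total
    return event_times, cum_hazard

# Toy data: five subjects, two of them censored (events == 0).
t, H = nelson_aalen([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for ti, Hi in zip(t, H):
    print(f"t = {ti:.0f}, cumulative hazard = {Hi:.2f}")
```

Censored subjects contribute to the risk set until they leave it but never produce a jump, which is the separation of predictable and observed parts that the decomposition expresses.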
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.
Matrix-valued time series data are frequently observed in a broad range of areas and have attracted great attention recently. In this work, we model network effects for high dimensional matrix-valued time series data in a matrix autoregression framework. To characterize the potential heterogeneity of the subjects and handle the high dimensionality simultaneously, we assume that each subject has a latent group label, which enables us to cluster the subject into the corresponding row and column groups. We propose a group matrix network autoregression (GMNAR) model, which assumes that the subjects in the same group share the same set of model parameters. To estimate the model, we develop an iterative algorithm. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly or possibly over-specified. An information criterion for group number estimation is also provided to consistently select the group numbers. Lastly, we implement the method on a Yelp dataset to illustrate the usefulness of the method.
This paper develops a novel statistical approach that allows for the {\em first time} the {\em cross}-oscillatory characterisation of temporally localised interactions between channels in a functional brain network. Brain signals are often nonstationary and the proposed framework uses wavelets as an effective tool for capturing (i) single-scale channel transient features, due to their adaptiveness to the dynamic signal properties, and (ii) cross-scale channel interactions, due to their multiscale nature. Our approach introduces scale-specific {\em subprocesses} and {\em cross-scale (CS) dependencies} for a new class of multivariate locally stationary (MvLSW) wavelet processes that we refer to as CS-MvLSW. Under this new model, we develop two consistent estimation procedures for the {\em localised} single- and cross-scale channel dependence. Extensive simulation studies demonstrate that the theoretically established properties hold in practice. The proposed CS-MvLSW framework remains accurate under pronounced cross-scale dependence, whereas existing MvLSW coherence estimates dramatically deteriorate even for single-scales when such complex structure is present. The proposed approach was used for electroencephalogram (EEG) data to study alterations in the functional connectivity structure in children diagnosed with attention deficit hyperactivity disorder (ADHD), and identified novel clinically pertinent cross-scale interactions in the functional brain network across the left and right hemispheres, differentiating brain connectivity between control and ADHD groups.
Directed Acyclic Graphs (DAGs) are solid structures used to describe and infer the dependencies among variables in multivariate scenarios. Having a thorough comprehension of the accurate DAG-generating model is crucial for causal discovery and estimation. Our work suggests utilizing a non-conjugate prior for Gaussian DAG structure learning to enhance the posterior probability. We employ the idea of using the Bessel function to address the computational burden, providing faster MCMC computation compared to the use of conjugate priors. In addition, our proposal exhibits a greater rate of adaptation when compared to the conjugate prior, specifically for the inclusion of nodes in the DAG-generating model. Simulation studies demonstrate the superior accuracy of DAG learning, and we obtain the same maximum a posteriori and median probability model estimate for the AML data, using the non-conjugate prior.
Time series analysis of d18O and d13C from benthic foraminifera for paleoclimatology poses significant challenges. The data span tens of millions of years, with sparse early records, dense later ones, uneven time stamps, and occasional multiples. These time series are largely non-stationary, exhibiting temporary, varying trends. We propose a continuous-time state space framework that handles these irregularities effectively. Univariate signal-plus-noise models are specified for d18O and d13C, with parameters estimated via maximum likelihood using Kalman filter recursions for signal extraction and likelihood evaluation. The framework interprets state space models as time-domain Butterworth filters. Measurement-error variances are differentiated by deep-sea drill site, including site-specific level offsets, and the record is partitioned into sub-periods reflecting the distinct climate states that drive the transition variance. Two extensions of the univariate model are explored: (i) modifying the signal specification for the Kalman filter to approximate a Butterworth filter of any order, and (ii) specifying a bivariate signal-plus-noise model for joint analysis. Results reveal substantial signal changes during the ``icehouse'' period (3.3 to 0.0006 Ma); the correlation between d18O and d13C signals is generally positive but turns negative during this period.
In many causal inference settings, the treatment of interest is not directly observed; instead, one or more error-prone proxy measurements are available, creating a fundamental identification challenge. Building on identification methods for hidden treatments with proxies, we develop a general semiparametric framework for causal effect estimation that accommodates several common estimands. Hidden treatment models differ fundamentally from classical missing-data settings, where semiparametric theory relies on positivity requiring a nonzero probability of observing a complete case for each treatment value. Here this assumption fails by design because the true treatment is never observed, creating new challenges for semiparametric characterization and efficient estimation. To overcome these challenges, we develop a new semiparametric characterization for hidden treatment models by deriving a formal mapping between the orthogonal complement to the nuisance tangent space, which contains all influence functions for a causal functional in the oracle full-data model, and its counterpart in the observed hidden treatment model. This mapping gives closed-form observed-data influence functions and yields the semiparametric efficiency bound. It also leads to semiparametric efficient estimators with a new form of multiple robustness or mixed bias property, enabling inference with nonparametric nuisance estimators. A further challenge is that some nuisance functions depend on the hidden treatment, preventing direct use of standard nonparametric regression methods. We introduce an iterative estimation algorithm and establish its large-sample properties. Simulations demonstrate the finite-sample performance of the proposed estimators, and an application estimates the causal effect of Alzheimer's disease on hippocampal volume using data from the Alzheimer's Disease Neuroimaging Initiative.
In this paper, we derive closed-form estimators for the parameters of some probability distributions belonging to the exponential family. A bootstrap bias-reduced version of these proposed closed-form estimators is also derived. A Monte Carlo simulation is performed to assess the estimators. The results are quite favorable to the proposed bootstrap bias-reduced estimators.
Data fusion enables powerful and generalizable analyses across multiple sources. However, different data collection capacities across different sources lead to blockwise missingness (BM), which poses challenges in practice. Meanwhile, the high cost of obtaining gold-standard labels leaves the majority of samples unlabeled, known as the semi-supervised (SS) problem. In this paper, we propose a novel Data-adaptive Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE) that handles both BM and SS issues in the presence of distributional shifts across data sources under a missing at random (MAR) mechanism. DEFUSE starts with a complete-data-only estimator derived from the primary data source, and uses data-adaptive and distributional-shift-adjusted procedures to successively incorporate the data with BM covariates and the large unlabeled sample to effectively reduce the estimation variance without incurring bias. To further avoid bias due to fusion of misaligned data violating the MAR assumption, a screening method is developed to identify and exclude data sources that are not aligned with the primary source. Compared to existing approaches, DEFUSE offers two main improvements. First, it offers a new data-adaptive control variate approach to handle BM, which achieves intrinsic efficiency and robustness against distributional shifts. Second, it reveals a more essential role for the unlabeled sample in the BM regression problem, leading to improved estimation. These advantages are theoretically guaranteed and empirically supported by simulation studies and two real-world biomedical applications.
Data points in many scientific experiments originate from an ordered structure, yet this ordering is often unknown. We consider noisy data points with the correct ordering to be recovered. The underlying structure naturally places the data on a 1-dimensional manifold. Because the eigenfunctions of the Laplacian on a 1-dimensional manifold are trigonometric functions, and the manifold Laplacian can be approximated by the graph data Laplacian, the data ordering can be recovered by inverting the data Laplacian eigenvectors. We propose two spectral algorithms, one for the periodic structure (closed loop) and one for the non-periodic structure (open curve). We derive a uniform error bound for the algorithms, which is composed of two parts: the discretization error between the manifold eigenfunctions and the noiseless graph Laplacian eigenvectors, and the eigenvector error caused by data noise. In numerical studies, our spectral seriation algorithms outperform other manifold learning methods. The superior performance of our algorithms is demonstrated further on a biomolecule data example.
Statistical methods that leverage external trial information to help accelerate drug development are becoming increasingly popular. Bayesian methods facilitate dynamic borrowing, where the similarity of the response guides how much information is used. We have proposed a semiparametric Bayesian borrowing model for time-to-event data, with smoothing priors that allow the baseline hazard to take any form via an ensemble average. By accurately modelling the baseline hazard, rather than approximating its form via fixed piecewise intervals, power is improved and bias of the estimated treatment effect reduced when the borrowing assumption of parameter exchangeability holds. A ``lump-and-smear'' borrowing prior makes the model robust to non-exchangeable historical data by increasing the sensitivity of borrowing to the presence of prior-data conflict, reducing the potential for type I error inflation. We present BayesFBHborrow, an R package that implements our semiparametric Bayesian borrowing model with a historical control. We demonstrate how to select the optimal borrowing hyperparameters. The model supports covariate-adjusted borrowing, which can reduce prior-data conflict and improve power when differences in outcomes are attributable to changes in the covariate distribution. As the treatment effect estimator is non-collapsible, the marginal hazard ratio can be estimated via Bayesian G-computation, while still permitting an adjusted analysis to account for control group drift. We illustrate the Bayesian flexible baseline hazard model on a simulated and a real dataset with a marginal estimand, for both unadjusted and adjusted analyses.
Brownian motion is a compact mathematical language for continuous-time uncertainty in biostatistics. This tutorial develops the process from construction and path properties to tools that recur in applied biomedical work: the Markov and strong Markov properties, the Karhunen-Loeve expansion, functional principal component analysis (Functional PCA), reflection principles, local time, stochastic differential equations (SDEs), Brownian bridges, and empirical-process limits. The applications emphasize longitudinal biomarkers, degradation modelling, first-passage endpoints, dynamic frailty, group-sequential monitoring, calibration diagnostics, recurrent-event processes, electronic health records, and wearable streams. A short cross-domain section uses literary and historical archives to make Brownian-bridge thinking concrete without shifting the paper away from biostatistics, and includes a reproducible chapter-level experiment on Frankenstein. The Black-Merton-Scholes model is included as a solved SDE template, not as a finance application in its own right. The aim is to connect rigorous probability with modelling decisions faced by biostatisticians when biological processes evolve between noisy observation times.
Flow Matching (FM) is a simulation-free method for learning a continuous, invertible flow that interpolates between two distributions, and in particular generates data from noise. Inspired by the variational nature of the diffusion process as a gradient flow, we introduce a stepwise FM model, Local Flow Matching (LFM), which sequentially learns a sequence of FM submodels, each matching a diffusion process up to the time-step size in the data-to-noise direction. In each step, the two distributions to be interpolated by the sub-flow model are closer than those in the full-flow matching model, which interpolates data to noise distributions, enabling smaller models with more efficient training. This variational perspective also allows us to prove a theoretical generation guarantee for the proposed flow model in terms of the $\chi^2$-divergence between the generated and true data distributions, leveraging the contraction property of the diffusion process. In practice, the stepwise structure of LFM is naturally amenable to model distillation, and various distillation techniques can be applied to accelerate generation. We empirically demonstrate that LFM achieves competitive generative performance compared to FM on unconditional generation of tabular and image datasets, and on conditional generation of robotic manipulation policies.
We consider the problem of estimating a time-varying sparse precision matrix, which is assumed to evolve in a piecewise constant manner. Building upon the Group Fused LASSO and LASSO penalty functions, we estimate both the precision matrix and the change points. We propose an alternative estimator to the commonly employed Gaussian likelihood loss, namely the D-trace loss. We provide the conditions for the consistency of the estimated change points and of the sparse estimators in each block. We show that the solutions to the corresponding estimation problem exist when some conditions relating to the tuning parameters of the penalty functions are satisfied. Unfortunately, these conditions are not verifiable in general, posing challenges for tuning the parameters in practice. To address this issue, we introduce a modified regularizer and develop a revised problem that always admits solutions: these solutions can be used for detecting possible unsolvability of the original problem or obtaining a solution of the original problem otherwise. An alternating direction method of multipliers (ADMM) is then proposed to solve the revised problem. The relevance of the method is illustrated through numerical experiments.
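For orientation, the D-trace loss is usually written as below in the precision-matrix literature. The abstract does not state the formula, so the notation here ($\Theta$ for a symmetric candidate precision matrix, $\hat{\Sigma}$ for a sample covariance matrix) is an assumption, and the penalty terms of the paper's full objective are not reproduced.

```latex
L_D(\Theta, \hat{\Sigma})
  = \tfrac{1}{2}\operatorname{tr}\!\left(\Theta^{2}\hat{\Sigma}\right) - \operatorname{tr}(\Theta),
\qquad
\nabla_{\Theta} L_D
  = \tfrac{1}{2}\left(\Theta\hat{\Sigma} + \hat{\Sigma}\Theta\right) - I .
```

Setting the gradient to zero gives $\Theta = \hat{\Sigma}^{-1}$ whenever $\hat{\Sigma}$ is positive definite. The loss is quadratic in $\Theta$ and has no log-determinant term, which is the main practical difference from the Gaussian likelihood loss.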
Detecting changepoints in a time series of length $N$ entails evaluating up to $2^{N-1}$ possible changepoint models, making exhaustive enumeration computationally infeasible. Genetic algorithms (GAs) provide a stochastic way to identify the structural changes: a population of candidate models evolves via selection, crossover, and mutation operators until it converges on one changepoint model that balances the goodness-of-fit with parsimony. The R package changepointGA encodes each candidate model as an integer chromosome vector and supports both the basic single-population model GA and the island model GA. Parallel computing is implemented on multi-core hardware to further accelerate computation. Users may supply custom fitness functions or genetic operators, while a user-friendly wrapper streamlines routine analyses. Extensive simulations demonstrate that our package runs significantly faster than binary-encoded GA alternatives. Additionally, this package can simultaneously locate changepoints and estimate their effects, as well as other model parameters and any integer-valued hyperparameters. Applications to array-based comparative genomic hybridization data and a century-long temperature series further highlight the package's value in biological and climate research.
We investigate lower bounds on the subgeometric convergence of adaptive Markov chain Monte Carlo under any adaptation strategy. In particular, we prove general lower bounds in total variation and on the weak convergence rate under general adaptation plans. If the adaptation diminishes sufficiently fast, we also develop comparable convergence rate upper bounds that are capable of approximately matching the convergence rate in the subgeometric lower bound. These results provide insight into the optimal design of adaptation strategies and also limitations on the convergence behavior of adaptive Markov chain Monte Carlo. Applications to an adaptive unadjusted Langevin algorithm as well as adaptive Metropolis-Hastings with independent proposals and random-walk proposals are explored.
We consider statistical inference for a class of mixed-effects models with system noise described by a non-Gaussian integrated Ornstein-Uhlenbeck process. Under the asymptotics where the number of individuals goes to infinity with possibly unbalanced sampling frequency across individuals, we prove some theoretical properties of the Gaussian quasi-likelihood function, followed by the asymptotic normality and the tail-probability estimate of the associated estimator. In addition to the joint inference, we propose and investigate the three-stage inference strategy, revealing that they are first-order equivalent while quantitatively different in the second-order terms. Numerical experiments are given to illustrate the theoretical results.
Obtaining accurate water level predictions is essential for water resource management and implementing flood mitigation strategies. Several data-driven models can be found in the literature. However, there has been limited research with regard to addressing the challenges posed by large spatio-temporally referenced hydrological datasets, in particular, the challenges of maintaining predictive performance and uncertainty quantification. Gaussian Processes (GPs) are commonly used to capture complex space-time interactions. However, GPs are computationally expensive and suffer from poor scaling as the number of locations increases due to required covariance matrix inversions. To overcome the computational bottleneck, the Nearest Neighbor Gaussian Process (NNGP) introduces a sparse precision matrix providing scalability without having to make inferential compromises. In this work we introduce an innovative model in the hydrology field, specifically designed to handle large datasets consisting of a large number of spatial points across multiple hydrological basins, with daily observations over an extended period. We investigate the application of a Bayesian spatiotemporal NNGP model to a rich dataset of daily water levels of rivers located in Ireland. The dataset comprises a network of 301 stations situated in various basins across Ireland, measured over a period of 90 days. The proposed approach allows for prediction of water levels at future time points, as well as the prediction of water levels at unobserved locations through spatial interpolation, while maintaining the benefits of the Bayesian approach, such as uncertainty propagation and quantification. Our findings demonstrate that the proposed model outperforms competing approaches in terms of accuracy and precision.
The instrumental variable model of Imbens and Angrist (1994) and Angrist et al. (1996) identifies the local average treatment effect, also known as the complier average causal effect (CACE). In practice, however, the treatment and outcome are often missing, and when they are missing not at random (MNAR), the CACE is generally not identifiable without further assumptions, because the underlying data distribution itself cannot be recovered. We study when the CACE remains identifiable under MNAR. Through an exhaustive search over missingness mechanisms, we characterize all those that identify the CACE without auxiliary information, in two scenarios: (1) missing data in either the treatment or the outcome alone, and (2) missing data in both the treatment and outcome under prospective data collection. Along the way, we unify existing results and establish many new ones, giving a complete picture of identifiability in each case. Our theory suggests that before any practical data analysis under the instrumental variable model, it is important to check whether the CACE is identifiable under the proposed missingness mechanism; moreover, because the true mechanism is typically unknown and untestable, it is more robust to conduct sensitivity analyses across multiple plausible missingness mechanisms.
We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal ``imputed loss'' through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
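As a point of reference for the surrogate idea above, the following sketch shows the baseline prediction-powered estimate of a mean, not the recalibrated method proposed in the abstract. The function name `ppi_mean` and the toy numbers are hypothetical.

```python
from statistics import fmean

def ppi_mean(labeled_outcomes, labeled_predictions, unlabeled_predictions):
    """Baseline prediction-powered estimate of a mean outcome."""
    # Average prediction on the large unlabeled sample (the cheap surrogate).
    surrogate_mean = fmean(unlabeled_predictions)
    # Average prediction error on the small labeled sample (the correction term).
    correction = fmean(y - f for y, f in zip(labeled_outcomes, labeled_predictions))
    return surrogate_mean + correction

# Toy example: the predictions run 0.5 too high on the labeled data,
# so the surrogate mean of 3.0 is corrected downward.
estimate = ppi_mean([1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [2.0, 4.0])
```

The correction term is what keeps the estimate valid when the predictions are systematically off; the recalibration step described in the abstract replaces the raw predictions with a learned transformation, which this sketch does not attempt.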
When prior information is lacking, the go-to strategy for probabilistic inference is to combine a "default prior" and the likelihood via Bayes's theorem. Objective Bayes, (generalized) fiducial inference, etc. fall under this umbrella. This construction is natural, but the corresponding posterior distributions generally only offer limited, approximately valid uncertainty quantification. The present paper takes a reimagined approach that yields posterior distributions with stronger reliability properties. The proposed construction starts with an inferential model (IM), one that takes the mathematical form of a data-driven possibility measure and features exactly valid uncertainty quantification, and then returns a so-called inner probabilistic approximation thereof. This inner probabilistic approximation inherits many of the original IM's desirable properties, including credible sets with exact coverage and asymptotic efficiency. The approximation also agrees with the familiar Bayes/fiducial solution in applications where the model has a group invariance structure. A Monte Carlo method for evaluating the probabilistic approximation is presented, along with numerical illustrations.
With the rapid development of machine learning and deep learning techniques, actuaries and the broader insurance industry face a persistent trade-off between predictive accuracy and interpretability. This paper provides a comprehensive applied assessment of Explainable Boosting Machines (EBM) in a car insurance framework, focusing on claim frequency and severity modeling. EBM combines the additive structure of generalized additive models (GAM) with a cyclic gradient boosting algorithm, resulting in a glass-box model whose predictions are interpretable by design. Using real-world data, we empirically illustrate its practical relevance and compare EBM with modern benchmark models used in non-life insurance pricing. The evaluation considers (i) out-of-sample predictive accuracy, including Murphy diagrams and Bregman dominance tests, and (ii) calibration assessment using T-reliability diagrams and Murphy's score decomposition. Finally, we highlight the link between EBM predictions and Shapley values, showing how predictions can be transparently decomposed into exact main and pairwise interaction effects, providing actionable insights beyond predictive performance.
We propose a new approach for modeling large datasets of nonstationary spatial processes that combines a latent low rank process and a sparse covariance model. The low rank component coefficients are endowed with a flexible graphical Gaussian Markov random field model. The utilization of a low rank and compactly-supported covariance structure combines the full-scale approximation and the basis graphical lasso; we term this new approach the full-scale basis graphical lasso (FSBGL). Estimation employs a graphical lasso-penalized likelihood, which is optimized using a difference-of-convex scheme. We illustrate the proposed approach on synthetic fields as well as with a challenging high-resolution simulation dataset of the thermosphere. In a comparison against state-of-the-art spatial models, the FSBGL performs better at capturing salient features of the thermospheric temperature fields, even with limited available training data.
In this article we explore the data available through the Stanford Open Policing Project. The data consist of information on millions of traffic stops across close to 100 different cities and highway patrols. Using a variety of metrics, we find that the data are not missing completely at random. Furthermore, we develop ways of quantifying and visualizing missingness trends for different variables across the data sets. Because of the way we have identified the missingness dependence on key variables, imputation is not possible and a sensitivity analysis is required. We perform a sensitivity analysis to extend work done on the outcome test as well as to extend work done on sharp bounds on the average treatment effect. We demonstrate that bias calculations can fundamentally shift depending on the assumptions made about the observations for which the race variable has not been recorded. We suggest ways that our missingness sensitivity analysis can be extended to myriad contexts.
When the sample path of a Hawkes process is observed discretely, such that only the total event counts in disjoint time intervals are known, the likelihood function becomes intractable. To overcome the challenge of likelihood-based inference in this setting, we propose to use a likelihood-free approach that uses simulated data to train a fully connected neural network (NN) to estimate the parameters of the Hawkes process from a summary statistic of the count data. A naive imputation estimate of the parameters forms the basis for our summary statistic, which is fast to generate and requires minimal expert knowledge to design. The resulting NN estimator is comparable to the best extant approximate likelihood estimators in terms of mean-squared error but requires significantly less computational time. We implement NN quantile estimation for fast uncertainty quantification. The proposed estimation procedure is applied to weekly count data for two infectious diseases, with a time-varying background rate used to capture seasonal fluctuations in infection risk.
In two-sample Mendelian randomization (MR), Egger regression is widely used as a sensitivity analysis when directional pleiotropy is detected. However, the increasing complexity of modern MR studies, characterized by many weak instruments, renders the original Egger method less efficient. We first identify the source of weak instrument bias in Egger regression and introduce a debiased Egger (dEgger) estimator that restores consistency and asymptotic normality under substantially weaker conditions. To boost statistical power and ensure the validity of results, we then embed a random instrument selection procedure and present the rerandomized Egger (REgger) estimator along with an associated directional pleiotropy test. Recognizing the challenge of obtaining closed-form variances, we derive simple regression-residual-based variance estimators by truncating higher-order terms. The REgger estimator simultaneously removes the weak instrument bias and winner's curse while retaining robustness to directional pleiotropy, and is asymptotically normal when the effective sample size and post-selection instrument count are sufficiently large. Under balanced pleiotropy, REgger matches the rerandomized inverse-variance-weighted estimator, differing only in having marginally wider confidence intervals; under directional pleiotropy, it achieves substantially greater precision. Extensive simulations and real-data analyses confirm REgger's superior statistical properties, making it a valuable addition to two-sample MR sensitivity analyses.
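The classical Egger regression that the abstract takes as its starting point can be sketched as a weighted regression with an intercept. This is the standard estimator only, not dEgger or REgger, and the function name `egger_regression` and the toy summary statistics are hypothetical.

```python
import numpy as np

def egger_regression(beta_exposure, beta_outcome, se_outcome):
    """Classical MR-Egger: weighted regression of outcome on exposure associations."""
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    weights = 1.0 / np.asarray(se_outcome, dtype=float) ** 2
    # Orient each variant so that its exposure association is positive.
    sign = np.where(bx < 0, -1.0, 1.0)
    bx, by = bx * sign, by * sign
    design = np.column_stack([np.ones_like(bx), bx])
    # Weighted least squares: solve (X' W X) coef = X' W y.
    xtw = design.T * weights
    intercept, slope = np.linalg.solve(xtw @ design, xtw @ by)
    # The intercept captures directional pleiotropy; the slope is the causal estimate.
    return intercept, slope

# Noise-free toy data with pleiotropy intercept 0.05 and causal effect 0.5.
bx_demo = np.array([0.1, 0.2, 0.3, 0.4])
by_demo = 0.05 + 0.5 * bx_demo
intercept_hat, slope_hat = egger_regression(bx_demo, by_demo, np.full(4, 0.1))
```

With many weak instruments the estimation error in the exposure associations biases this slope, which is the problem the debiased and rerandomized estimators in the abstract are designed to address.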
The Hopfield network made associative memory (AM) the model system of neural computation, but it solves the problem only for a \emph{stationary} world: a fixed set of memories, stored once into frozen weights. Real environments are non-stationary (e.g., memories arrive over time, drift, recur, and must be told apart from noise), where the classical formulation fails by catastrophic interference (the palimpsest problem) and by a capacity fixed in advance. We argue that this is not a peripheral limitation but the crux: under non-stationarity, memory and learning cease to be separate problems, and \emph{adaptation}, rather than one-shot optimization, becomes the operative capacity. We give a fresh formulation of the AM problem for non-stationary environments and a \emph{self-sizing} continual associative memory that generalizes Hopfield's: it stores new memories without erasing old ones (no forgetting), re-binds drifting and recurring memories, allocates a genuinely new memory only for true novelty, and grows its store to the environment's \emph{intrinsic} memory demand and no further. We rigorously show that this demand is the Urysohn width of the problem and can be estimated from data via a contrastive-similarity (CS) operator. The memory's size converges to this capacity online with no preset value and no validation search, matching an oracle capacity search. We use experiments with synthetic datasets to show that the generalization buys self-sizing and retention under non-stationarity, \emph{not} higher per-item recall fidelity, on which it matches strong baselines.
Recent advances have sparked significant interest in the development of privacy-preserving Principal Component Analysis (PCA). However, many existing approaches rely on restrictive assumptions, such as sub-Gaussian data, or are vulnerable to data contamination. Additionally, some methods are computationally expensive or depend on unknown model parameters that must be estimated, limiting their accessibility for data analysts seeking privacy-preserving PCA. In this paper, we propose a differentially private PCA method applicable to heavy-tailed and potentially contaminated data. Our approach leverages the property that the covariance matrix of properly rescaled data preserves eigenvectors and their order under elliptical distributions, which include Gaussian and heavy-tailed distributions. By applying a bounded transformation, we enable straightforward computation of principal components in a differentially private manner. Additionally, boundedness guarantees robustness against data contamination. We conduct both theoretical analysis and empirical evaluations of the proposed method, focusing on its ability to recover the subspace spanned by the leading principal components. Extensive numerical experiments demonstrate that our method consistently outperforms existing approaches in terms of statistical utility, particularly in non-Gaussian or contaminated data settings.
This paper introduces a framework for uncertainty quantification in regression models defined on metric spaces. Using a proposed notion of homoscedasticity, we define a conformal prediction algorithm that provides finite-sample marginal coverage guarantees and fast convergence rates to the oracle prediction region. For heteroscedastic settings, we introduce a kNN procedure that yields locally adaptive prediction radii in general metric spaces. Although this procedure does not provide the same finite-sample guarantees as the conformal algorithm, it is designed to improve local coverage calibration without imposing smoothing assumptions. Both procedures are compatible with a broad range of regression algorithms and scale to large datasets, allowing practitioners to use their preferred models and incorporate domain-specific knowledge. Building on the heteroscedastic $k$NN approach, we also develop a flexible sequential extension for metric-space-valued time series based on nearest-neighbor expert aggregation. We establish the consistency of the proposed estimators under minimal conditions. Finally, we illustrate the practical utility of our framework in personalized medicine applications involving random objects such as probability distributions and graph Laplacians.
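A generic split-conformal construction gives a feel for the homoscedastic case: the prediction region is a metric ball around the fitted value, with a radius taken from calibration distances. This is a standard recipe sketched under assumptions, not necessarily the algorithm of the paper; `conformal_radius` and the toy distances are hypothetical.

```python
import math

def conformal_radius(calibration_distances, alpha=0.1):
    """Radius of a split-conformal prediction ball from calibration distances."""
    scores = sorted(calibration_distances)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

# Each score is d(y_i, prediction_i) for a held-out calibration pair, where d is
# any metric on the response space (hypothetical values shown here).
distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
radius = conformal_radius(distances, alpha=0.1)
# A new response is covered when d(y_new, prediction_new) <= radius.
```

Because only distances enter the computation, the same function applies whether responses are vectors, probability distributions, or graph Laplacians, provided a metric is available.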
With the development of big data and machine learning, privacy concerns have become increasingly critical, especially when handling heterogeneous datasets containing sensitive personal information. Differential privacy provides a rigorous framework for safeguarding individual privacy while enabling meaningful statistical analysis. In this paper, we propose a differentially private quantile regression method for high-dimensional data in a distributed setting. Quantile regression is a powerful and robust tool for modeling the relationships between the covariates and responses in the presence of outliers or heavy-tailed distributions. To address the computational challenges due to the non-smoothness of the quantile loss function, we introduce a Newton-type transformation that reformulates the quantile regression task into an ordinary least squares problem. Building on this, we develop a differentially private estimation algorithm with iterative updates, ensuring both near-optimal statistical accuracy and formal privacy guarantees. For inference, we further propose a differentially private debiased estimator, which enables valid confidence interval construction and hypothesis testing. Additionally, we propose a communication-efficient and differentially private bootstrap for simultaneous hypothesis testing in high-dimensional quantile regression, suitable for distributed settings with both small and abundant local data. Extensive simulations demonstrate the robustness and effectiveness of our methods in practical scenarios.
Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the ``Pac-Man'' attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the Average Crossing (AC) algorithm--a fully decentralized mechanism for duplicating RWs to prevent RW extinction in the presence of Pac-Man. Our theoretical analysis establishes that (i) the RW population remains almost surely bounded under AC and (ii) RW-based stochastic gradient descent remains convergent under AC, even in the presence of Pac-Man, with a quantifiable deviation from the true optimum. Our extensive empirical results on both synthetic and real-world datasets corroborate our theoretical findings. Furthermore, they uncover a phase transition in the extinction probability as a function of the duplication threshold. We offer theoretical insights by analyzing a simplified variant of the AC, which sheds light on the observed phase transition.
Unmeasured confounders are a major source of bias in regression-based effect estimation and causal inference. In this paper, we propose a new profiled transfer learning framework, ProTrans, to address confounding effects in the target dataset, when additional source datasets with similar confounding structures are available. We introduce the concept of profiled residuals to characterize the shared confounding patterns between source and target datasets. By incorporating these profiled residuals into the target debiasing step, we effectively mitigate the latent confounding effects. We also propose a source selection strategy to enhance the robustness of ProTrans to noninformative sources. As a byproduct, ProTrans can also be used to estimate treatment effects in the presence of potential confounders, without the use of auxiliary features such as instrumental or proxy variables, which are often challenging to select in practice. Theoretically, we prove that the resulting estimated model shift from the sources to the target is confounding-free without imposing specific assumptions on the true confounding structure, and that the target parameter estimation achieves the minimax optimal rate under mild conditions. Simulated and real-world experiments validate the effectiveness of ProTrans and support the theoretical findings.
Data assimilation (DA) estimates a dynamical system's state from noisy observations. Recent generative models like the ensemble score filter (EnSF) improve DA in high-dimensional nonlinear settings but are computationally expensive. We introduce the ensemble flow filter (EnFF), a training-free, flow matching (FM)-based framework that accelerates sampling and offers flexibility in flow design. EnFF uses Monte Carlo estimators for the marginal flow field and localized guidance for observation assimilation, and it employs a novel flow path that exploits the Bayesian DA formulation. It generalizes classical filters such as the bootstrap particle filter and ensemble Kalman filter. Experiments on high-dimensional benchmarks demonstrate EnFF's improved cost-accuracy tradeoffs and scalability, highlighting FM's potential for efficient, scalable DA. Code is available at this https URL.
We study general M-estimators of location on Riemannian manifolds, extending classical notions such as the Fréchet mean by replacing the squared loss with a broad class of loss functions. Under minimal regularity conditions on the loss function and the underlying probability distribution, we establish theoretical guarantees for the existence and uniqueness of such estimators. In particular, we provide sufficient conditions under which the population and sample M-estimators exist and are uniquely defined. Our results offer a general framework for robust location estimation in non-Euclidean geometric spaces and unify prior uniqueness results under a broad class of convex losses.
Left-truncated survival data commonly arise in prevalent cohort studies, where only individuals who have experienced disease onset and survived until enrollment are included in the study. When the onset process follows a stationary Poisson process, the resulting data are length-biased. This sampling mechanism induces a selection bias towards individuals with longer survival, and statistical methods for traditional survival data are not directly applicable. While tree-based methods developed for left-truncated data can be applied, they may be inefficient for length-biased data, as they do not account for the distribution of truncation times. To address this, we propose new survival trees and forests for length-biased right-censored data within the conditional inference framework. Our approach uses a score function derived from the full likelihood to construct permutation test statistics for variable splitting. For survival prediction, we consider two estimators of the unbiased survival function, differing in statistical efficiency and computational complexity. These elements enhance efficiency in tree construction and improve accuracy of survival prediction in ensemble settings. Simulation studies demonstrate efficiency gains in both tree recovery and survival prediction, often exceeding the gains from ensembling alone. We further illustrate the utility of the proposed methods using lung cancer data from the Cancer Public Library Database, a nationwide cancer registry in South Korea.
Response-adaptive randomization (RAR) methods can be used to adapt randomization probabilities based on accumulating data, aiming to increase the probability of allocating patients to effective treatments. A popular RAR method is Thompson sampling, which randomizes patients proportionally to the Bayesian posterior probability that each treatment is the most effective. However, its high variability can also increase the risk of assigning patients to inferior treatments and lead to inferential problems such as confidence interval undercoverage. We propose a principled method based on Bayesian hypothesis testing to address these issues: We introduce a null hypothesis postulating equal effectiveness of treatments. Bayesian model averaging then induces shrinkage toward equal randomization probabilities, with the degree of shrinkage controlled by the prior probability of the null hypothesis. Equal randomization and Thompson sampling arise as special cases when the prior probability is one or zero, respectively. A simulation study demonstrates that the method can mitigate issues with Thompson sampling and has comparable statistical properties to Thompson sampling with common ad hoc modifications such as power transformation and probability capping. Under the null hypothesis and a normal model, the randomization probabilities are shown to converge asymptotically to equal randomization, unlike those of Thompson sampling. We implement the method in the free and open-source R package brar, enabling experimenters to easily perform null hypothesis Bayesian RAR and support more effective randomization of patients.
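The shrinkage mechanism described above can be sketched for binary outcomes. This is one plausible reading of the abstract, not the brar implementation: the Beta(1, 1) priors, the Monte Carlo evaluation of the Thompson probabilities, and the function name `shrunk_randomization_probs` are all assumptions made for illustration.

```python
import math
import random

def log_beta(a, b):
    """Logarithm of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def shrunk_randomization_probs(successes, failures, prior_null, n_draws=20000, seed=1):
    """Model-averaged randomization probabilities for arms with binary outcomes."""
    arms = len(successes)
    # Marginal likelihoods under uniform Beta(1, 1) priors
    # (binomial coefficients are shared by both models and cancel).
    log_m0 = log_beta(1 + sum(successes), 1 + sum(failures))  # H0: one shared rate
    log_m1 = sum(log_beta(1 + s, 1 + f) for s, f in zip(successes, failures))  # H1
    shift = max(log_m0, log_m1)
    w0 = prior_null * math.exp(log_m0 - shift)
    w1 = (1 - prior_null) * math.exp(log_m1 - shift)
    post_null = w0 / (w0 + w1)
    # Thompson probabilities under H1: chance that each arm has the highest rate.
    rng = random.Random(seed)
    wins = [0] * arms
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        wins[draws.index(max(draws))] += 1
    thompson = [w / n_draws for w in wins]
    # Model average: equal allocation under H0, Thompson allocation under H1.
    probs = [post_null / arms + (1 - post_null) * t for t in thompson]
    return probs, post_null

# Hypothetical interim data: arm 0 has 12/20 responders, arm 1 has 6/20.
equal, _ = shrunk_randomization_probs([12, 6], [8, 14], prior_null=1.0)
pure_ts, _ = shrunk_randomization_probs([12, 6], [8, 14], prior_null=0.0)
mixed, post_null = shrunk_randomization_probs([12, 6], [8, 14], prior_null=0.5)
```

Setting `prior_null` to one or zero reproduces the two special cases named in the abstract, and intermediate values pull the allocation toward equal randomization in proportion to the posterior probability of the null.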
This work proposes a two-stage physics-informed deep learning framework that combines neural-network-based sampling with statistical inference and constrained parameter refinement. In the first stage, a dual-network physics-informed architecture is used, where a main-network approximates the PDE solution and an auxiliary coefficient sub-network provides a relaxed continuous soft approximation of the true discontinuous coefficient field. A gradient-adaptive weighting strategy is incorporated to improve residual training and enhance sampling reliability near possible discontinuity regions. The sampled coefficient values are then analyzed using Bayesian learning for Gaussian mixture models and birth-death Markov chain model selection, which identify the number of coefficient regimes and provide candidate intervals for coefficient values and transition regions. In the second stage, the inverse problem is reformulated as a constrained physics-informed estimator, in which the coefficient is replaced by a form-consistent hard approximation explicitly represented as a piecewise-constant function over the spatiotemporal domain. Comprehensive numerical experiments on PDEs with jump-discontinuous coefficients demonstrate that the proposed framework achieves adaptability and accurate parameter identification with acceptable computational costs compared to existing methods. Applications to solution reconstruction further illustrate its practical potential. This work provides a generalizable computational approach for inverse problems governed by PDEs with discontinuous parameter structures, particularly in non-stationary and heterogeneous systems.
Analyzing data from multiple sources offers valuable opportunities to improve the estimation efficiency of causal estimands. However, this analysis also poses many challenges due to population heterogeneity and data privacy constraints. While several advanced methods for causal inference in federated settings have been developed in recent years, many focus on difference-based averaged causal effects and are not designed to study effect modification. In this study, we introduce a novel targeted-federated learning framework to study the heterogeneity of treatment effects (HTEs) for a targeted population by proposing a projection-based estimand. This HTE framework integrates information from multiple data sources without sharing raw data, while accounting for covariate distribution shifts among sources. Our proposed approach is shown to be doubly robust, conveniently supporting both difference-based estimands for continuous outcomes and odds ratio-based estimands for binary outcomes. Furthermore, we develop a communication-efficient bootstrap-based selection procedure to detect non-transportable data sources, thereby enhancing robust information aggregation without introducing bias. The superior performance of the proposed estimator over existing methods is demonstrated through extensive simulation studies, and the utility of our approach has been shown in a real-world data application using nationwide Medicare-linked data.
Sequential analysis encompasses simulation theories and methods in which the sample size is determined dynamically based on accumulating data. Since its conceptual inception, numerous sequential stopping rules have been introduced, and many more are currently being refined and developed. This article aims to deliver an up-to-date review of recent developments in sequential stopping rules, with a deliberate emphasis on Monte Carlo methods for estimating an unknown expectation, including binomial proportions, primarily under standard iid sampling and also under certain lightly generalized settings. These methodologies have long served, and likely will continue to serve, as fundamental bases for both theoretical and practical developments in stopping rules for general statistical inference, advanced Monte Carlo techniques, and their modern applications. Building upon over a hundred references and empirical studies, we explore the essential aspects of these methods, such as core assumptions, numerical algorithms, convergence properties, and practical trade-offs to guide further developments, particularly at the intersection of sequential stopping rules and related areas of research.
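As a concrete instance of the kind of rule the review above covers, the sketch below implements a classical fixed-width stopping rule for a Monte Carlo mean: sampling continues until the normal-approximation half-width falls below a tolerance. It is an illustration only, not a rule taken from the article; the function name, the minimum sample size, and the uniform test distribution are assumptions.

```python
import math
import random


def sequential_mean(sample, eps, z=1.96, n_min=30, n_max=1_000_000):
    """Draw from `sample()` until z * s / sqrt(n) <= eps (or n_max is hit).

    Uses Welford's running update for the mean and the sum of squared
    deviations, so no samples need to be stored.
    """
    n, mean, m2 = 0, 0.0, 0.0
    half_width = math.inf
    while n < n_max:
        x = sample()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= n_min:  # avoid stopping on an unstable early variance estimate
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= eps:
                break
    return mean, n, half_width


rng = random.Random(1)
estimate, n_used, hw = sequential_mean(rng.random, eps=0.01)
print(estimate, n_used, hw)
```

The minimum sample size guards against premature stopping, which is one of the practical trade-offs such rules must handle; the nominal coverage of the resulting interval holds only asymptotically.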
In cognitive psychology and neuroscience, adjudicating between competing theoretical models is a common methodological challenge. Researchers often rely on either first-order direct mapping approaches (e.g., linear regression) or second-order abstraction methods (e.g., Representational Similarity Analysis [RSA]). However, it remains unclear whether or how the nature of the underlying data and feature characteristics affect the performance of these methods. Here, we systematically evaluated regression, RSA, and Pattern Component Modeling (PCM) across distinct data-generating schemes, including first-order linear mappings and geometry-to-first-order transformations with either linear or nonlinear sigmoid readouts, using both univariate behavioral and multivariate fMRI spatial-pattern simulations. Our results suggest that the relative performance of these methods depends on the underlying generative mechanism. Under linear generative assumptions, regression and PCM showed higher model-selection accuracy than RSA. Under nonlinear but order-preserving transformations, rank-based RSA showed an advantage over regression and PCM. We also found that feature multicollinearity affected these methods differently across generative schemes, and that orthogonalizing the predictor space via principal component analysis (PCA) reduced several collinearity-related differences. Finally, analyses of empirical datasets were consistent with the simulation results under approximately linear conditions, with regression showing clearer model discrimination than RSA. Overall, these findings suggest that the relative performance of regression, RSA, and PCM depends on the form of the mapping between features and responses, as well as on the structure of the feature space.
Inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) is a central challenge in many scientific domains. This task is difficult because the number of rates grows quadratically with the state space, rates can be strongly dependent, and many transitions may be only partially observed. We introduce a Bayesian framework that models CTMC rates as flexible functions of covariates through Gaussian processes. This enables nonlinear covariate effects, improves inference by incorporating external information, and helps identify potential drivers of CTMC dynamics. For posterior inference, we use Hamiltonian Monte Carlo and develop scalable exact and approximate gradients for likelihoods involving repeated matrix exponentials. With $N$ observations and $K$ CTMC states, these gradients reduce the dominant cost of existing derivative calculations from $O(NK^3)$, with large constants, to $O(K^3+NK^2)$, with cheaper constants. We demonstrate the method in Bayesian phylogenetic and phylogeographic inference, where CTMCs are central, and show strong performance on synthetic and real datasets, including empirical quadratic scaling in $K$ even when $N<K$.
Two-phase sampling is a simple and cost-effective estimation strategy in survey sampling and is widely used in practice. Because the phase-2 sampling probability typically depends on low-cost variables collected at phase 1, naive estimation based solely on the phase-2 sample generally results in biased inference. This issue arises even when estimating causal parameters such as the average treatment effect (ATE), and there has been growing interest in recent years in the proper estimation of such parameters under complex sampling designs (e.g., Nattino et al., 2025). In this paper, we derive the semiparametric efficiency bound for a broad class of weighted average treatment effects (WATE), which includes the ATE, the average treatment effect on the treated/untreated (ATT/ATU), and the average treatment effect on the overlapped population (ATO), under two-phase sampling. In addition to straightforward weighting estimators based on the sampling probabilities, we also propose estimators that can attain strictly higher efficiency under suitable conditions. In particular, under outcome-dependent sampling, we show that substantial efficiency gains can be achieved by appropriately incorporating phase-1 information. We further conduct extensive simulation studies, varying the choice of phase-1 variables and sampling schemes, to characterize when and to what extent leveraging phase-1 information leads to efficiency gains.
Environmental exposures, such as air pollution and extreme temperatures, have complex effects on human health. These effects are often characterized by non-linear exposure-lag-response relationships and delayed impacts over time. Accurately capturing these dynamics is crucial for informing public health interventions. The Distributed Lag Non-Linear Model (DLNM) is a flexible statistical framework for estimating such effects in epidemiological research. However, standard DLNM implementations typically assume a homogeneous exposure-lag-response association across the study region, overlooking potential spatial heterogeneity, which can lead to biased risk estimates. To address this limitation, we introduce DLNM-Clust: a novel mixture of DLNMs that extends the traditional DLNM. Within a Bayesian framework, DLNM-Clust probabilistically assigns each geographic unit to one of $C$ latent spatial clusters, each of which is defined by a distinct DLNM specification. This approach allows capturing both common patterns and singular deviations in the exposure-lag-response surface. We demonstrate the method using municipality-level time-series data on the relationship between air pollution and the incidence of COVID-19 in Belgium. Our results emphasize the importance of spatially aware modeling strategies in environmental epidemiology, facilitating region-specific risk assessment and supporting the development of targeted public health initiatives.
The conditional average treatment effect (CATE) is frequently estimated in clinical studies to refute a homogeneous treatment effect hypothesis. Under this regime, all patients making up the population experience identical benefit from a given treatment relative to a comparator. Uncovering heterogeneous treatment effects through inference about the CATE, however, requires that covariates truly modifying the treatment effect be reliably collected at baseline. CATE-based techniques will necessarily fail to detect violations when effect modifiers are omitted from the data due to, for example, resource constraints. Severe measurement error has a similar impact. Clinical decision makers can be misled as a result. To address these limitations, we prove that a practical homogeneous treatment effect hypothesis can be gauged through inference about contrasts of the potential outcomes' variances even when effect modifiers are missing from the data. We derive causal machine learning estimators of these contrasts and study their asymptotic properties. We establish that these estimators are doubly robust and asymptotically linear under mild conditions, permitting formal hypothesis testing about the treatment effect heterogeneity. Numerical experiments demonstrate that these estimators' asymptotic guarantees are approximately achieved in finite-sample randomized and observational study data alike. These inference procedures are then used to detect heterogeneous treatment effects in the re-analysis of randomized controlled trials investigating targeted temperature management in cardiac arrest patients.
The Bayes factor, the data-based updating factor from prior to posterior odds, is a principled measure of relative evidence for two competing hypotheses. It is naturally suited to sequential data analysis in settings such as clinical trials and animal experiments, where early stopping for efficacy or futility is desirable. However, designing such studies is challenging because computing design characteristics, such as the probability of obtaining conclusive evidence or the expected sample size, typically requires computationally intensive Monte Carlo simulations, as no closed-form or efficient numerical methods exist. To address this issue, we extend results from classical group sequential design theory to sequential Bayes factor designs. The key idea is to derive Bayes factor stopping regions in terms of the z-statistic and use the known distribution of the cumulative z-statistics to compute stopping probabilities through multivariate normal integration. The resulting method is fast, accurate, and simulation-free. We illustrate it with examples from clinical trials, animal experiments, and psychological studies. We also provide an open-source implementation in the bfpwr R package. Our method makes exploring sequential Bayes factor designs as straightforward as classical group sequential designs, enabling experimenters to rapidly design informative and efficient experiments.
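The step of expressing a Bayes factor stopping region in terms of the z-statistic can be sketched for one simple case. The code below assumes a normal mean with known standard deviation and a zero-centered normal prior under the alternative; it does not use the bfpwr package, and the function names and parameter values are hypothetical. The multivariate normal integration step is not shown.

```python
import math


def bf10(z, n, tau=1.0, sigma=1.0):
    """Bayes factor for H1: theta ~ N(0, tau^2) against H0: theta = 0.

    With z = sqrt(n) * mean / sigma, z is N(0, 1) under H0 and
    N(0, 1 + g) marginally under H1, where g = n * tau^2 / sigma^2.
    """
    g = n * tau**2 / sigma**2
    return math.exp(0.5 * z * z * g / (1.0 + g)) / math.sqrt(1.0 + g)


def critical_z(k, n, tau=1.0, sigma=1.0):
    """Smallest |z| with BF10 >= k, or None if BF10 >= k for every z."""
    g = n * tau**2 / sigma**2
    z_sq = 2.0 * (1.0 + g) / g * (math.log(k) + 0.5 * math.log(1.0 + g))
    return math.sqrt(z_sq) if z_sq >= 0.0 else None


# Evidence threshold BF10 >= 10 at three hypothetical interim looks.
for n in (20, 50, 100):
    print(n, round(critical_z(10.0, n), 3))
```

In this setting the critical value grows with the sample size, so a fixed Bayes factor threshold corresponds to a moving boundary on the z scale.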
Pairwise comparison data are widely used to recover latent rankings, yet the models in dominant use assume stochastic transitivity. When preferences are in fact intransitive, a single scalar strength conflates genuine hierarchy with cycle-induced structure, biasing both the recovered ranking and any covariate effects attributed to it. To address this limitation, we propose the Covariate-Assisted Bayesian Intransitive Bradley-Terry (CA-BIBT) model, which uses a combinatorial Hodge decomposition to resolve the latent match-up into identifiable and mutually orthogonal flows, attributing the component lying in the covariate-induced subspace to observed covariates and assigning the remaining components to the residuals. A global-local shrinkage prior on the residual cycle-induced flow adapts the model from transitive to intransitive regimes without prespecifying the regime, and a Gibbs sampler yields, as posterior byproducts, calibrated uncertainty for each flow, the posterior probability of the level at which the entities are rankable, and a ranking summary for decision making that remains well-defined even under intransitivity. In simulations, the CA-BIBT model recovers all flow components accurately with near nominal coverage, and in an application to animal dominance data it reveals cyclic dominance that the observed covariates cannot account for, with strong posterior evidence for stochastic intransitivity.
Conditional independence testing (CIT) is essential for reliable scientific discovery. It prevents spurious findings and enables controlled feature selection. Recent CIT methods have used machine learning (ML) models as surrogates of the underlying distribution. However, model-agnostic approaches require a train-test split, which reduces statistical power. We introduce Semi-knockoffs, a CIT method that can accommodate any pre-trained model, avoids this split, and provides valid p-values and false discovery rate (FDR) control for high-dimensional settings. Unlike methods that rely on the model-$X$ assumption (known input distribution), Semi-knockoffs only require conditional expectations for continuous variables. This makes the procedure less restrictive and more practical for machine learning integration. To ensure validity when estimating these expectations, we present two new theoretical results of independent interest: (i) stability for regularized models trained with a null feature and (ii) the double-robustness property.
State-level policy studies often conduct heterogeneity analyses that quantify how treatment effects vary across state characteristics. These analyses may be used to inform state-specific policy decisions, or to infer how the effect of a policy changes in combination with other state characteristics. However, in state-level settings with varied contexts and policy landscapes, multiple versions of similar policies, and differential policy implementation, the causal quantities targeted by these analyses may not align with the inferential goals. This paper clarifies these issues by distinguishing several causal estimands relevant to heterogeneity analyses in state-policy settings, including state-specific treatment effects (ITE), conditional average treatment effects (CATE), and controlled direct effects (CDE). We argue that the CATE is often the easiest to identify and estimate, but may not be the most policy-relevant target of inference. Moreover, the widespread practice of coarsening distinct policies or implementations into a single indicator further complicates the interpretation of these analyses. Motivated by these limitations, we propose bounding ITEs as an alternative inferential goal, yielding ranges for each state's policy effect under explicit assumptions that quantify deviations from the ideal identifying conditions. These bounds target a well-defined and policy-relevant quantity, the effect for specific states. We develop this approach within a difference-in-differences framework and discuss how sensitivity parameters may be informed using pre-treatment data. Through simulations we demonstrate that bounding state-specific effects can more reliably determine the sign of the ITEs than CATE estimates. We then apply this method to examine the effect of the Affordable Care Act Medicaid expansion on high-volume buprenorphine prescribing.
We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.
What assortments (subsets of items) should be offered, to collect data for estimating a choice model over $n$ total items? We propose a structured, non-adaptive experiment design requiring only $O(\log n)$ distinct assortments, each offered repeatedly, that consistently outperforms randomized and other heuristic designs across an extensive numerical benchmark that estimates multiple different choice models under a variety of (possibly mis-specified) ground truths. We then focus on Nested Logit choice models, which cluster items into "nests" of close substitutes. Whereas existing Nested Logit estimation procedures assume the nests to be known and fixed, we present a new algorithm to identify nests based on collected data, which when used in conjunction with our experiment design, guarantees correct identification of nests under any Nested Logit ground truth. Our experiment design was deployed to collect data from over 70 million users at Dream11, an Indian fantasy sports platform that offers different types of betting contests, with rich substitution patterns between them. We identify nests based on the collected data, which lead to better out-of-sample choice prediction than ex-ante clustering from contest features. Our identified nests are ex-post justifiable to Dream11 management.
We propose a new test of uniformity on the hypersphere based on a Stein characterization associated with the Laplace-Beltrami operator. We identify a sufficient class of test functions for this characterization, linked to the moment generating function. Exploiting the operator's eigenfunctions to obtain a harmonic decomposition in terms of Gegenbauer polynomials, we show that the proposed procedure belongs to the class of Sobolev tests. We derive closed-form series representations for the asymptotic distribution of the test statistic under the null hypothesis and under fixed alternatives. To enhance power against a range of alternatives, we introduce a tuning parameter into the characterization and study its impact on rejection probabilities. We discuss data-driven strategies for selecting this parameter to maximize rejection rates for a given alternative and compare the resulting performance with that of related parametric tests. Additional numerical experiments compare the proposed test with competing Sobolev-class procedures, highlighting settings in which it offers clear advantages.
Bayesian inference for inverse problems involves computing expectations under posterior distributions--e.g., posterior means, variances, or predictive quantities--typically via Monte Carlo (MC) estimation. When the quantity of interest varies significantly under the posterior, accurate estimates demand many samples--a cost often prohibitive for partial differential equation-constrained problems. To address this challenge, we introduce conditional neural control variates, a modular method that learns amortized control variates from joint model-data samples to reduce the variance of MC estimators. To scale to high-dimensional problems, we leverage Stein's identity to design an architecture based on an ensemble of hierarchical coupling layers with tractable Jacobian trace computation. Training requires: (i) samples from the joint distribution of unknown parameters and observed data; and (ii) the posterior score function, which can be computed from physics-based likelihood evaluations, neural operator surrogates, or learned generative models such as conditional normalizing flows. Once trained, the control variates generalize across observations without retraining. We validate our approach on stylized and partial differential equation-constrained Darcy flow inverse problems, outperforming classical Stein control variates and achieving substantial variance reduction, even when the analytical score is replaced by a learned surrogate.
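The role of Stein's identity in the abstract above can be seen in a one-dimensional toy problem. The sketch below is a classical Stein control variate, not the conditional neural method: it assumes a standard normal target with score -x and a fixed test function g(x) = x, so that g'(x) + g(x) * score(x) = 1 - x^2 has zero mean. The integrand and sample size are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]


def f(x):
    """Quantity of interest; its exact mean under N(0, 1) is 2."""
    return (x + 1.0) ** 2


def stein_cv(x):
    """Zero-mean term from Stein's identity with g(x) = x, score(x) = -x."""
    return 1.0 - x * x


fs = [f(x) for x in xs]
hs = [stein_cv(x) for x in xs]
mean_f, mean_h = statistics.fmean(fs), statistics.fmean(hs)

# Least-squares coefficient that minimizes the variance of f + a * h.
# Fitting it on the same samples adds a small bias that vanishes with n.
cov_fh = statistics.fmean((u - mean_f) * (v - mean_h) for u, v in zip(fs, hs))
a = -cov_fh / statistics.pvariance(hs)

controlled = [u + a * v for u, v in zip(fs, hs)]
plain_estimate = mean_f
cv_estimate = statistics.fmean(controlled)
print(plain_estimate, cv_estimate)
print(statistics.pvariance(fs), statistics.pvariance(controlled))
```

Here the control variate removes the quadratic part of the integrand; a learned, observation-conditional g, as described in the abstract, generalizes this fixed choice.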
Algorithmic decisions about individuals require predictions that are not only accurate but also fair with respect to sensitive attributes such as gender and race. Causal notions of fairness align with legal requirements, yet many methods assume access to detailed knowledge of the underlying causal graph, which is a demanding assumption in practice. We propose a learning framework that achieves interventional fairness by leveraging a causal graph over \textit{clusters of variables}, which is substantially easier to estimate than a variable-level graph. With possible \textit{adjustment cluster sets} identified from such a cluster causal graph, our framework trains a prediction model by reducing the worst-case discrepancy between interventional distributions across these sets. To this end, we develop a computationally efficient barycenter kernel maximum mean discrepancy (MMD) that scales favorably with the number of sensitive attribute values. Extensive experiments show that our framework strikes a better balance between fairness and accuracy than existing approaches, highlighting its effectiveness under limited causal graph knowledge.
We introduce CCMnet, an R package designed to generate network ensembles that accurately reflect the uncertainty inherent in empirical data. While traditional network modeling often results in ensembles with fixed property values or model-determined levels of variability, CCMnet enables a continuous spectrum of variability for network properties, including edge counts, degree distribution, and mixing patterns. By defining probability distributions directly over congruence classes of networks, the package allows researchers to specify the uncertainty in network properties across the generated ensemble to match a specific sampling design or empirical distribution. Furthermore, this formulation provides a principled framework that encompasses several classic models (e.g., Erdős--Rényi model, stochastic block models, and certain exponential random graph models) that implicitly share this structural basis, while offering the flexibility to specify arbitrary, even non-parametric, distributions for network properties. CCMnet implements a Markov chain Monte Carlo (MCMC) framework to sample from these models. The utility of the package is illustrated by generating posterior predictive network ensembles representing school friendship networks.
Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $\Sigma$, whose top eigenvector $v \in \mathbb{R}^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $\Sigma = I_d + \gamma vv^\top$ for some $\gamma > 0$), whereas SDP-based algorithms require no additional assumptions on $\Sigma$. We demonstrate explicit counterexample covariances $\Sigma$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $\Sigma$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
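For orientation, the basic truncated power iteration that the abstract above builds on can be sketched as follows. This is the standard multiply, truncate, and normalize loop, not the variant analyzed in the paper; the uniform initialization, the iteration count, and the small spiked identity example are assumptions made for illustration.

```python
import math


def truncated_power_method(sigma, s, iters=50):
    """Approximate an s-sparse leading eigenvector of a covariance matrix.

    Each iteration multiplies by sigma, keeps the s entries of largest
    magnitude, zeroes the rest, and renormalizes to unit length.
    """
    d = len(sigma)
    x = [1.0 / math.sqrt(d)] * d  # uniform start (an assumption of this sketch)
    for _ in range(iters):
        y = [sum(sigma[i][j] * x[j] for j in range(d)) for i in range(d)]
        keep = set(sorted(range(d), key=lambda i: abs(y[i]), reverse=True)[:s])
        y = [y[i] if i in keep else 0.0 for i in range(d)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Spiked identity model: sigma = I + gamma * v v^T with a 2-sparse spike v.
d, s, gamma = 6, 2, 3.0
v = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0), 0.0, 0.0, 0.0, 0.0]
sigma = [[(1.0 if i == j else 0.0) + gamma * v[i] * v[j] for j in range(d)]
         for i in range(d)]

x_hat = truncated_power_method(sigma, s)
support = [i for i, value in enumerate(x_hat) if value != 0.0]
print(support, x_hat)
```

The example uses the spiked identity model because that is the easy case; the abstract's point is that simple combinatorial methods can fail once $\Sigma$ leaves this model, which this sketch does not address.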
Proper scoring rules are essential for evaluating probabilistic forecasts. We propose a simple algebraic rearrangement of the Yates covariance decomposition of the Brier score into three independently non-negative terms: a variance mismatch term, a correlation deficit term, and a calibration-in-the-large term. This rearrangement makes the optimality conditions for perfect forecasting transparent: the optimal forecast must simultaneously match the variance of outcomes, achieve perfect positive correlation with outcomes, and match the mean of outcomes. Any deviation from these conditions results in a positive contribution to the Brier score.
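One rearrangement consistent with the abstract above is $\mathrm{BS} = (s_f - s_o)^2 + 2 s_f s_o (1 - r) + (\bar{f} - \bar{o})^2$, where $s_f$ and $s_o$ are the population standard deviations of forecasts and outcomes and $r$ is their correlation. The sketch below checks this identity numerically; the exact form used in the paper may differ, and the function name and example data are hypothetical.

```python
import math


def brier_decomposition(forecasts, outcomes):
    """Return (variance mismatch, correlation deficit, calibration, Brier)."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(outcomes) / n
    # Population (divide-by-n) moments, so the identity holds exactly.
    s_f = math.sqrt(sum((f - mean_f) ** 2 for f in forecasts) / n)
    s_o = math.sqrt(sum((o - mean_o) ** 2 for o in outcomes) / n)
    cov = sum((f - mean_f) * (o - mean_o)
              for f, o in zip(forecasts, outcomes)) / n
    # If either standard deviation is zero the deficit term vanishes anyway.
    r = cov / (s_f * s_o) if s_f > 0 and s_o > 0 else 0.0

    variance_mismatch = (s_f - s_o) ** 2
    correlation_deficit = 2.0 * s_f * s_o * (1.0 - r)
    calibration_in_the_large = (mean_f - mean_o) ** 2
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return variance_mismatch, correlation_deficit, calibration_in_the_large, brier


forecasts = [0.9, 0.7, 0.2, 0.4, 0.6]
outcomes = [1, 1, 0, 0, 1]
vm, cd, cl, bs = brier_decomposition(forecasts, outcomes)
print(vm, cd, cl, bs)
```

Each term is a square or a product of non-negative factors, since $r \le 1$, which is why all three vanish only when variances match, the correlation is perfect, and the means agree.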
Deep learning models have become the dominant approach for multivariate time series anomaly detection (MTSAD), often reporting substantial performance improvements over classical statistical methods. However, these gains are frequently evaluated under heterogeneous thresholding strategies and evaluation protocols, making fair comparisons difficult. This work revisits OmniAnomaly, a widely used stochastic recurrent model for MTSAD, and systematically compares it with a simple linear baseline based on Principal Component Analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under identical thresholding and evaluation procedures, with experiments repeated across 100 runs for each of the 28 machines in the dataset. Performance is evaluated using Precision, Recall and F1-score at point-level, with and without point-adjustment, and under different aggregation strategies across machines and runs, with the corresponding standard deviations also reported. The results show large variability across machines and indicate that PCA can achieve performance comparable to OmniAnomaly, and even outperform it when point-adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practices and highlight the critical role of evaluation methodology in MTSAD research.
In this paper, we propose an invariant quantile regression (IQR) framework specifically designed for multi-environment datasets, which captures the invariance across different environments. This framework is closely related to transfer learning, causal inference, and fair machine learning, and is motivated by scenarios in which the conditional probability of the response given covariates varies, while certain key variables remain invariant. This perspective differs notably from previous works that restrict attention to the conditional mean, which is often insufficient to capture the full causal relationships between covariates and the response in heterogeneous environments. In contrast, quantile-based invariance naturally accommodates heterogeneity, and aligns more closely with structural causal models, in which variables invariant across environments at one or multiple quantile levels directly indicate potential and stable causal variables. Moreover, we show that IQR may yield a larger set of endogenous variables compared to the conditional mean framework, which in turn promotes more effective exclusion of spurious (non-causal) variables. To achieve this, we introduce a Kernel-Smoothed Invariant Quantile Regression (KS-IQR) estimator, which leverages the underlying invariance structure and heterogeneity among environments, ensuring stable estimation across multiple environments. We establish the causal discovery properties of our method, demonstrate its ability to overcome the ``curse of endogeneity'', and derive an $\ell_2$ error bound for our estimator, all in a non-asymptotic framework. We apply our method to real data for causal discovery and obtain biologically meaningful relationships, recovering known signaling pathways and revealing additional quantile-specific effects.
Simulation-based inference (SBI) with neural networks has accelerated and transformed cognitive modeling workflows. SBI enables modelers to fit complex models that were previously difficult or impossible to estimate, while also allowing rapid estimation across large numbers of datasets. However, the utility of SBI for iterating over varying modeling assumptions remains limited: changes to parameterizations, generative functions, priors, and design variables all necessitate model retraining, thereby diminishing the benefits of amortization. To address these issues, we pilot the CogFormer, a meta-amortized framework for cognitive modeling. Our framework trains a transformer-based architecture that remains valid across a combinatorial number of structurally similar models, allowing for changing data types, parameters, design matrices, and sample sizes. We present promising quantitative results across families of decision-making models for binary, multi-alternative, and continuous responses. Our evaluation suggests that CogFormer can accurately estimate parameters across model families with minimal amortization offset, making it a potentially powerful engine that catalyzes cognitive modeling workflows.
Calibration weighting is a fundamental tool in survey sampling for incorporating auxiliary population information into design-based estimators. Classical formulations measure distance between calibrated and design weights on the multiplicative ratio scale. We develop a unified framework based on Bregman divergence defined directly on the weight vector. The framework reveals a primal--dual symmetry in which both the weight-space and multiplier-space optimization problems are themselves Bregman projections, and the calibrated weights satisfy a generalized Pythagorean decomposition with respect to the constraint manifold. The resulting estimator is asymptotically equivalent to a debiased prediction estimator whose regression coefficient depends explicitly on the Bregman generator, in contrast to the generalized regression estimator equivalent of classical calibration. Exploiting this dependence, we identify a contrast-entropy generator that achieves design-optimality under Poisson sampling. Two extensions are developed: cross-fitted estimation under non-probability sampling, yielding doubly robust inference under standard product-rate conditions; and a regularized extension whose Lagrangian dual produces a Hölder-conjugate penalty for soft balance under high-dimensional auxiliary variables. Simulations and an analysis of National Oceanic and Atmospheric Administration (NOAA)'s Large Pelagics Intercept Survey illustrate the framework.
Clustering mixed-type data to uncover clinically meaningful subgroups within heterogeneous patient populations remains a major challenge in biomedical research. Most existing clustering methods impose restrictive assumptions like local independence, fail to accommodate censored biomarkers, or are unable to quantify variable importance. We propose a Bayesian finite mixture model (BFMM) clustering framework that addresses these limitations. BFMM flexibly models both continuous and categorical variables, incorporates three covariance structures to capture cluster-specific dependencies among continuous features, and handles censored observations through likelihood-based imputation. To facilitate feature prioritization, BFMM uses spike-and-slab priors to estimate variable importance on a continuous 0-1 scale. Simulation studies demonstrate that BFMM outperforms existing methods in clustering accuracy, particularly given strong within-cluster correlation or censored variables, and reliably distinguishes informative features from noise under varying conditions. We applied BFMM to two real-world datasets: (1) the SENECA cohort integrating electronic health records from patients with sepsis; and (2) the EDEN randomized trial of patients with acute lung injury. In both settings, BFMM identified clinically interpretable phenotypes and revealed variable-specific contributions to subgroup differentiation. In the EDEN trial, it also uncovered evidence of treatment heterogeneity. These findings validate BFMM as an effective, interpretable, and practically useful clustering tool for complex biomedical datasets.
Strict minimum message length (SMML) is an information-theoretic coding principle that represents a continuous statistical model by a finite set of assertions and a partition of the sample space. We show that the SMML objective decomposes into assertion entropy and conditional cross-entropy, balancing the cost of identifying an assertion against the cost of encoding data under the assigned model. For any fixed partition, the optimal codepoint for each cell is the model distribution that minimises Kullback-Leibler divergence from the data distribution restricted to that cell. Using the local Fisher-Rao geometry of regular parametric models, we show that, under high-resolution regularity conditions, optimal SMML partitions are asymptotically the pullback, through the maximum likelihood estimator, of weighted Fisher-Rao Voronoi tessellations in parameter space, with assertion probabilities appearing as additive weights. For regular exponential families, SMML codepoints satisfy a moment-matching condition and admit an interpretation as KL/Bregman centroids, while exact SMML cells are pullbacks of convex polyhedra in sufficient-statistic space. Together, these results show that SMML induces a natural information-geometric quantisation linking entropy-based coding, KL projection, and divergence-based Voronoi geometry.
A crucial question throughout statistics is whether an observed correlation between two variables is a direct correlation or only an indirect one mediated by a confounder. We organize the existing nonlinear measures of direct correlation into two families, each with a systematic construction: (i) removing the direct correlation from the joint distribution and quantifying the resulting distributional shift, and (ii) intervening on one variable via do-calculus and quantifying the response of the other. For every Kullback-Leibler-based measure in either family we propose a Jensen-Shannon-based regularized analogue; the regularized measures take values in $[0,1]$, satisfy the metric property, and are free of the singularities of the Kullback-Leibler divergence. We analyze the achievable upper bound of each regularized measure under the observed marginals, and derive the maximal value each measure can attain when only the alphabet sizes of the variables are fixed; the maxima admit closed forms built on a single binary-entropy function. The measures are compared on a decision-making model and on three public datasets (Titanic survival, UCI Adult income, and the 1973 Berkeley graduate admissions), with bootstrap confidence intervals for every reported value.
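The contrast between Kullback-Leibler and Jensen-Shannon divergences that motivates the regularized measures can be sketched on plain discrete distributions. This is a minimal illustration of the standard base-2 Jensen-Shannon divergence only, not of the paper's direct-correlation measures; the function names `kl_bits` and `js_divergence` are hypothetical.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        # Division by q = 0 where p > 0 yields +inf, the KL singularity.
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1] and is symmetric."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture is positive wherever p or q is positive
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print("KL(p || q) =", kl_bits(p, q))        # infinite for disjoint supports
print("JS(p, q)   =", js_divergence(p, q))  # attains the upper bound 1
```

With base-2 logarithms the Jensen-Shannon divergence is bounded by 1, and its square root is known to be a metric, whereas the Kullback-Leibler divergence diverges as soon as the supports differ.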
A central challenge of meta-analysis is that the populations underlying existing studies often differ from the target population in unknown ways. We study the problem of predicting function-valued quantities, such as regression and conditional average treatment effect functions, for a new target population using only study-level covariates and study-specific function estimates. We propose MetaHunt, a new meta-analysis methodology based on a shared low-rank structure, in which the true function from each study lies within the convex hull of a small set of latent basis functions. To recover these basis functions, we extend a vertex-hunting procedure called the Successive Projection Algorithm to the functional setting, incorporating a denoised basis-hunting step. We establish consistency of the recovered basis functions under mild regularity conditions. We then model the relationship between study-level covariates and the corresponding mixing weights using flexible semi-parametric or non-parametric methods. MetaHunt is privacy-preserving and enables meta-analytic prediction based on study-level information alone, even when individual-level data are unavailable to analysts. In addition, for each study, functions of interest can be estimated using possibly different machine learning algorithms. For uncertainty quantification, we construct prediction intervals via conformal prediction. We show that, under exchangeability and mild estimation-error conditions, these intervals achieve asymptotically valid marginal coverage. We demonstrate the effectiveness of MetaHunt through both simulation studies and empirical applications. An open-source software package is available for implementing MetaHunt.
Multi-institutional electronic health record (Multi-EHR) data have emerged as a powerful resource for developing predictive models to support clinical decisions and for generating reliable real-world evidence. By aggregating information from diverse patient populations and institutions, they enhance the robustness and generalizability of models and findings. However, analyzing multi-EHR remains challenging because disparate institutions rarely map all data elements to common ontologies, and raw EHR codes are often overly granular and institution-specific, fragmenting representations of the same clinical concept. Hence, integrative analysis must overcome two key hurdles: harmonizing codes with the same clinical meaning (synonymy), and aligning institutional feature spaces. To address these challenges, we propose SMILE, a Spherical Mixture Integration for Latent Embedding alignment across multi-source feature spaces, where embeddings from heterogeneous sources serve as privacy-preserving summaries of clinical concepts and sparse relational pairs provide weak supervision. Synonymy is modeled via a mixture of von Mises-Fisher distributions, yielding unified representations of semantically equivalent raw codes. We develop a composite quasi-likelihood estimator with non-asymptotic error bounds for the latent representations and mixture mean directions and consistent synonym-cluster recovery, quantifying the gains from integrating multiple sources and knowledge-graph information. Simulations and a multi-institutional EHR application demonstrate improved alignment and synonym clustering.
Identifying covariates that modify treatment effects is a central problem in causal inference. Yet existing data-adaptive procedures do not provide finite-sample control over the expected number of false discoveries, risking spurious findings that fail to replicate. We introduce causal stability selection, an algorithm that combines cross-fitted estimation of conditional average treatment effects with integrated path stability selection. The method accommodates arbitrary treatment effect estimators and arbitrary base selectors, and produces a selection set with an explicit, non-asymptotic bound on the expected number of false positives. Under standard causal identifying assumptions and regularity conditions on the base selector, we prove that the estimated selection probabilities converge to their oracle counterparts at the rate of the underlying treatment effect estimator. This establishes a direct connection between treatment effect estimation and effect modifier discovery. We illustrate the method on a randomized trial in oncology and on observational data on maternal smoking and infant birthweight.
Causal graphs may inform covariate adjustment for estimating causal effects and improve estimation efficiency by exploiting the graphical structure. In many applications, however, the target causal parameter may not be point-identified due to the presence of unmeasured confounding. Sensitivity analysis methods address this challenge by characterizing bounds on the causal parameter under varying assumptions about the magnitude or form of unmeasured confounding. We focus on semiparametric efficient estimation of causal effects in non-identifiable settings, assuming a known (or hypothesized) causal graph. We propose an influence function projection approach that exploits the conditional independence constraints implied by the graph to improve the efficiency of semiparametric estimators of upper and lower bounds on the average causal effect under a given sensitivity analysis model. Our approach applies across multiple sensitivity analysis frameworks and causal estimands, thereby connecting knowledge of graphical structure with the sensitivity analysis literature. We illustrate our approach through simulations and real data examples thought to be affected by unmeasured confounding, including the effect of a labor training program on post-intervention earnings and the effect of low ejection fraction on heart failure death.
Kunchenko's method of polynomial maximization provides a semiparametric apparatus for parameter estimation under non-Gaussian errors, but its classical power basis relies on finite higher-order integer moments. This paper introduces the Parametrically Adaptive Transition Polynomial (PATP), a signed-parity fractional-power family controlled by a continuous parameter alpha in [0,1]. The quadratic exponent map p_i(alpha) connects the fractal regime p_i(0)=1/i, the degenerate linear point p_i(1/2)=1, and the signed-parity integer-power regime p_i(1)=i. For the degree-S=2 case we derive a closed-form variance-reduction coefficient g_2(alpha) in terms of signed and absolute fractional moments, identify the singular behavior at alpha=1/2, and state the moment and regularity conditions under which the formula is meaningful. The construction should be read as a Form-B PATP analogue within Kunchenko's generalized apparatus, not as an exact recovery of the canonical even-power PMM basis at alpha=1. Numerical illustrations on canonical distributions are used to examine the finite-sample behavior of the signed-parity estimator and to mark the boundary of applicability for extremely heavy-tailed cases such as Cauchy.
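The three anchor values stated above pin down a unique degree-2 polynomial in alpha, which the following sketch computes exactly. It assumes that "quadratic" means a polynomial of degree two in alpha; the paper's actual map p_i(alpha) may be parameterized differently, and the function names here are hypothetical.

```python
from fractions import Fraction

def quadratic_exponent_coeffs(i):
    """Coefficients (a, b, c) of a + b*alpha + c*alpha**2 interpolating
    p(0) = 1/i, p(1/2) = 1, p(1) = i (an assumed reading of the abstract)."""
    i = Fraction(i)
    a = 1 / i
    b = -(i - 1) * (i - 3) / i
    c = 2 * (i - 1) ** 2 / i
    return a, b, c

def exponent(i, alpha):
    """Evaluate the interpolating quadratic exactly using rational arithmetic."""
    a, b, c = quadratic_exponent_coeffs(i)
    alpha = Fraction(alpha)
    return a + b * alpha + c * alpha ** 2

for i in (1, 2, 3, 4):
    values = [exponent(i, t) for t in (0, Fraction(1, 2), 1)]
    print(i, [str(v) for v in values])
```

Under this reading, every exponent equals 1 at alpha = 1/2, so all basis functions collapse to the same linear term there, which is consistent with the degenerate point mentioned in the abstract.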
Standard generative models struggle with heavy-tailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ coordinate-wise to data before training, then exponentiate samples after generation. A Hill diagnostic decides per-coordinate whether to transform, leaving light-tailed margins untouched at no added complexity. This compresses heavy tails into a range where standard flow matching succeeds, without heavy-tailed base distributions or architectural modifications. We provide theoretical intuition for why this works: the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations. On a 144-configuration multivariate benchmark (3 copulas, $d$ up to 100, 4 tail indices), Log-FM dominates specialized baselines on $W_1$, CVaR$_{99}$, and extreme-quantile metrics, and is the only method with zero severe divergences across 2{,}880 runs.
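The transform and its inverse can be sketched as follows. This is a minimal illustration using NumPy; the function names and the Student-t toy data are assumptions, and the Hill diagnostic and the flow-matching model itself are not shown.

```python
import numpy as np

def soft_log(x):
    """Soft-log transform phi(x) = sign(x) * log(1 + |x|), applied coordinate-wise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def soft_log_inverse(z):
    """Inverse transform: sign(z) * (exp(|z|) - 1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.expm1(np.abs(z))

rng = np.random.default_rng(0)
# Heavy-tailed toy data (Student-t with 1.5 degrees of freedom), illustrative only.
x = rng.standard_t(df=1.5, size=(1000, 3))

z = soft_log(x)               # compressed representation used for training
x_back = soft_log_inverse(z)  # applied to generated samples after sampling

print("max |x| :", np.abs(x).max())
print("max |z| :", np.abs(z).max())
print("round trip ok:", np.allclose(x_back, x, rtol=1e-9))
```

Using `log1p` and `expm1` keeps the round trip accurate for values near zero, where a naive `log(1 + |x|)` would lose precision.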
Privacy constraints have driven the rise of federated learning (FL), which enables multi-site analyses without sharing individual participant data. Existing FL estimators largely assume complete data, whereas multi-site studies often face missingness. We develop a framework for FL with missing data, identifying conditions under which the complete case (CC) estimator is preferred over the inverse probability weighting (IPW) estimator. For settings where the CC estimator leads to bias, we introduce a calibrated weight estimation approach that combines candidate weighting models across sites and remains consistent if at least one is correctly specified at each site; we further show that pooling many weighting candidate models with redundant information degrades the calibrated estimator, so a small set is preferable. Consistency conditions are stated at the site level, ensuring that the federated estimator inherits validity from site-level properties. We prove consistency and derive a sandwich variance estimator that accounts for uncertainty in the outcome model, and in both the estimated weighting models and the calibration step. Additionally, we show that all estimators require only one or a few communication rounds, making them practical under real-world data-governance constraints. We illustrate the framework by evaluating risk factors for 90-day mortality among patients with pleural infections treated with intrapleural enzyme therapy.
This research note investigates the impact of the experience museum Sensoria, opened in September 2024 in Holzminden, Germany, on local tourism demand and related direct and indirect effects. To this end, the study employs a novel approach by combining causal inference and demand-side economic analysis. A difference-in-differences approach is employed to quantify the number of additional guest overnight stays in the treatment city; the results are converted into industry-specific expenditures, from which the direct and indirect effects of Sensoria are determined. A positive and significant impact, corresponding to 4,691 additional overnight stays, is detected in the first year of operation of the new tourist attraction, resulting in an additional gross turnover of approximately 0.56 million EUR across the hospitality and retail industries and other services. The direct effects and indirect effects amount to approximately 0.23 and 0.21 million EUR, respectively. However, long-term effects cannot (yet) be determined. Additionally, positive effects from small and large events in the cities studied can be demonstrated. This brief study demonstrates that combining the two approaches holds promise but requires a more in-depth analysis; suggestions for how such an analysis could be conducted are also discussed.
While Conformal Prediction (CP) has proven to be a powerful framework for uncertainty quantification, guaranteeing conditional coverage remains a central challenge. Although finite-sample, distribution-free conditional validity is known to be impossible without structural assumptions, we show that for i.i.d. data, it is fundamentally equivalent to constructing a nonconformity score whose distribution is independent of the features. This theoretical characterization motivates PIT-CP, a new post-processing correction that maps any base nonconformity score to an approximately invariant one while preserving its geometry, interpretability, and marginal coverage. This perspective is particularly appealing in practice, since it may be neither economical nor time-effective to retrain a full generative model when a strong prediction-driven model already provides highly accurate point estimates. Our procedure reduces the problem to one-dimensional conditional density estimation on the induced score, rather than full conditional density estimation on the original outcome space. We show how to estimate this transform in practice and derive bounds on the conditional coverage gap, alongside volumetric and symmetric-difference bounds. We present known minimax-optimal conditional estimation techniques while also motivating the use of modern conditional density estimators, including Mixture Density Networks and Conditional Normalizing Flows. Finally, we empirically demonstrate on various datasets that our PIT-CP procedure matches or outperforms many state-of-the-art conformal prediction strategies with minimal effort and computational cost.
We study the mean-squared error of $k$-fold cross-validation as a risk estimator, with particular emphasis on how its accuracy depends on the number of folds $k$. Despite the widespread use of cross-validation, principled guidance for choosing $k$ is largely absent, mainly due to the complex dependence between fold-wise error estimates. To obtain sharp and interpretable results, we focus on the majority algorithm in binary classification, a minimal yet nontrivial empirical risk minimization procedure. We provide a fine-grained analysis of its cross-validation behavior, showing that even this simple algorithm exhibits subtle and delicate phenomena for which existing theory provides loose and even vacuous bounds. Leveraging this analysis, we introduce a minimax framework for cross-validation risk estimation and prove that no empirical risk minimization algorithm can achieve an $O(1/n)$ minimax mean-squared error when the number of folds grows with the number of samples $n$; instead, a lower bound of order $\Omega(\sqrt{k}/n)$ is unavoidable. Our results reveal fundamental limitations of cross-validation as a data-reuse strategy, clarify gaps and inaccuracies in prior theoretical work, and position the majority algorithm as a natural benchmark that any tight analysis of cross-validation should be able to explain.
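The object under study, the k-fold cross-validation estimate of the 0-1 risk of the majority algorithm, can be sketched in a few lines. This is an illustrative setup only: the tie-breaking rule, the Bernoulli data with parameter `p`, and the function names are assumptions, and the sketch computes the estimator itself rather than its mean-squared error.

```python
import numpy as np

def majority_fit(y):
    """Majority label of a binary 0/1 array (ties broken towards 1, an arbitrary choice)."""
    return int(2 * np.sum(y) >= len(y))

def kfold_cv_error(y, k, rng):
    """k-fold cross-validation estimate of the 0-1 risk of the majority rule."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    fold_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # all indices outside the held-out fold
        label = majority_fit(y[train])           # "train" the majority rule
        fold_errors.append(np.mean(y[fold] != label))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(0)
n, p = 200, 0.6
y = (rng.random(n) < p).astype(int)

for k in (2, 5, 10, n):  # k = n is leave-one-out
    print(f"k = {k:3d}  CV risk estimate = {kfold_cv_error(y, k, rng):.3f}")
```

Repeating the experiment over many independent draws of `y` and comparing against the true risk of the fitted rule would give a Monte Carlo view of how the estimator's accuracy varies with k.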
We propose a simple two-stage order selector for finite-order ARFIMA models. First, a preliminary log-periodogram estimate of the memory parameter is used to fractionally filter the data. Second, a Hannan--Rissanen residual construction is applied to the filtered series, and the autoregressive and moving-average orders are selected by a generalized information criterion over a growing candidate rectangle. The search bounds are allowed to satisfy \(P_n,Q_n\to\infty\), whereas the true orders remain fixed and finite. The penalty is allowed to be larger than the ordinary BIC penalty so that it dominates the error introduced by preliminary long-memory estimation and by the Hannan--Rissanen residual approximation. We prove a uniform residual-variance approximation over the growing rectangle and combine it with a population separation argument between the true finite ARMA representation and underfitted alternatives. The resulting generalized information-criterion selector is consistent.
We study high-dimensional linear regression under a general symmetric convex constraint. Rather than imposing a specific sparsity-inducing penalty, we start from an arbitrary sign-symmetric and permutation-invariant convex body $K\subseteq \mathbb R^p$ and construct the sparse convexification hierarchy \[ K^{(s)} = \operatorname{conv}\{v\in K:\|v\|_0\le s\}. \] We propose a penalized least-squares estimator that searches over this hierarchy. Under standard sub-Gaussian assumptions on the random design and noise, we prove an oracle inequality showing that the estimator adapts to the best sparse convex approximation of the target. For an $s$-sparse target, the result yields a squared-error rate governed by the noise level $\sigma$ and the Gaussian width of the sparse convexification $K^{(s)}$. The method applies broadly to symmetric norm balls and can be implemented using oracle access to the Minkowski functional of $K$. As a special case, the framework yields a consistency result for the constrained Lasso.
Current-status data arise when an event time is observed only through an indicator of whether it occurred before an examination time. This paper studies a nonparametric neural-network sieve maximum likelihood estimator of the conditional cumulative distribution function of the event time. Under Hölder smoothness assumptions, we establish an explicit convergence rate by combining approximation theory for rectified linear unit neural networks with empirical-process arguments. This result provides theoretical support for neural-network estimation and subsequent inference under current-status observation.
Images represent objects characterized by contours and textures. From a statistical perspective these features can be defined as observations of continuous random functions. However, most existing approaches rely on pixel-based discretizations which lead to high-dimensional representations and heavy computational costs. In this note, we introduce an alternative more frugal representation. This representation assumes that the object has a star-shaped domain interior. Under this condition, we explore the analysis of images from a functional data analysis perspective. The proposed framework is illustrated on a real data supervised image classification problem.
In high-dimensional semi-supervised linear regression, prediction-powered inference (PPI) corrects an external predictor with a rectifier estimated from the labeled data. In a linear model, however, this rectifier cancels the predictor: PPI and PPI++ reduce to ordinary least squares and can inflate variance when the predictor is close to the oracle. We propose the Debiased External-model-Assisted Lasso (DEAL), which routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. We prove coordinate-wise asymptotic normality with an adaptive variance, extend validity to the projection parameter under misspecification and nonlinear labelers, and show that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++; a shift-aware variant preserves coverage under covariate shift. In simulations, DEAL intervals are 0.49-0.87 of the debiased-Lasso length, and across six real-data applications spanning astronomy, chemistry, proteomics, and oncology, the last using a large-language-model oracle, they tighten in every case, with median length ratios of 0.23-0.53.
The Brody distribution, originally a phenomenological interpolation between Poisson and Wigner level-spacing statistics in quantum chaos, is calibrated here as a quantitative measure of short-range exclusion in 2D spatial point processes. Two results form the core. First, the 2D complete-spatial-randomness baseline is recalibrated to $\beta=0.96\pm0.15$, correcting the inappropriate 1D Poisson reference. Second, an empirical $\beta$--$r_{\text{excl}}$ calibration is validated against the effective hard-core radius with Spearman $\rho=0.988$. The framework is demonstrated on 58 manufactured surfaces (10 materials, 10 processes), phase-extracted interferometric profilometry of a certified roundness standard, and 2D binary embeddings of prime numbers. A sparse-integer control proves the prime $\beta=2.15$ signal is genuinely arithmetic ($\Delta\beta=+0.68$ over random-integer control), while a Cantor-embedding null result ($\beta=1.40$, TOST $p<0.01$) demonstrates that 2D exclusion is embedding-created rather than intrinsic. Density-thinning experiments establish that $\beta$ captures exclusion strength rather than point density, while absolute values are density-dependent. A distinct CSR baseline for binary fields at low fill fraction is identified, with a decision table provided. The $\beta$--$r_{\text{excl}}$ calibration, the CSR baseline correction, and the control protocols together constitute a calibrated measurement framework for reproducible characterisation of short-range exclusion in 2D spatial point processes.
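For reference, the Brody density in its standard unit-mean form is $P_\beta(s) = (\beta+1)\, b\, s^{\beta} \exp(-b\, s^{\beta+1})$ with $b = \Gamma\big((\beta+2)/(\beta+1)\big)^{\beta+1}$; the abstract does not spell this out, so the parameterization used below is an assumption about the paper's convention. The sketch checks numerically that the density integrates to one and has unit mean for a few values of $\beta$, including the values quoted above.

```python
import math
import numpy as np

def brody_pdf(s, beta):
    """Standard Brody density with unit mean spacing (assumed parameterization)."""
    s = np.asarray(s, dtype=float)
    b = math.gamma((beta + 2.0) / (beta + 1.0)) ** (beta + 1.0)
    return (beta + 1.0) * b * s**beta * np.exp(-b * s ** (beta + 1.0))

def trapezoid(y, x):
    """Composite trapezoidal rule on a one-dimensional grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

s = np.linspace(0.0, 20.0, 400001)
for beta in (0.0, 0.96, 1.0, 2.15):
    pdf = brody_pdf(s, beta)
    total = trapezoid(pdf, s)      # should be close to 1
    mean = trapezoid(s * pdf, s)   # should be close to 1
    print(f"beta = {beta:4.2f}  integral = {total:.5f}  mean = {mean:.5f}")
```

In this form, $\beta=0$ gives the Poisson (exponential) spacing law and $\beta=1$ gives the Wigner surmise $(\pi/2)\, s\, e^{-\pi s^2/4}$.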
Constraint-based causal discovery relies on repeated conditional independence tests, but fast nonparametric tests often sacrifice calibration, especially when variables depend on the conditioning set through nonlinear relationships. We introduce BLITZ (Broad-to-Local Independence Testing via residualiZation), a nonparametric conditional independence test designed to run well under a second while maintaining the accuracy needed for the thousands of queries performed by constraint-based causal discovery algorithms. BLITZ first removes broad smooth dependence on the conditioning set using low-order polynomial regression, then applies a small nonlinear feature map and residualizes those features with shallow tree regressions. The resulting statistic tests residual cross-covariance, with a moment-matched chi-square approximation to the null distribution. We show theoretically that the two-stage design reduces the effective complexity faced by the tree residualizers, allowing shallow trees to control residual conditional-mean bias while avoiding excessive overfitting. In simulations, BLITZ provides better null calibration than fast kernel, random-feature, and regression-based competitors while remaining among the fastest methods tested. In causal discovery experiments on synthetic graphs and flow-cytometry data, BLITZ yields more reliable endpoint orientations among retained adjacencies and competitive structural recovery. These results suggest that broad-to-local residualization is a practical route to calibrated, scalable nonparametric conditional independence testing for causal discovery.
A recent line of work has reframed individual decision trees as linear models on engineered features associated with their splits, opening routes for oracle inequalities and feature-importance reinterpretation, but leaving open the question of what unified geometric object a forest induces when one indexes its feature map by nodes rather than by splits. The present paper studies that object. KPP indexes the feature map by the nodes of the forest, weighted by a path metric that turns each coordinate into a component of a squared-Euclidean path-isometric embedding. KPP unifies four pillars under a single non-diagonal Gram that carries a metric: prediction, exact additive attribution, deterministic Lipschitz robust radius in the KPP metric, and uniform Rademacher risk bounds for regression and classification under fixed, honest, or cross-fit conditioning. All probabilistic guarantees are conditional on the representation and are stated under three explicit conditioning regimes; the robust-radius guarantee is deterministic in the KPP metric rather than in a norm on the raw input. Conjectured fast-rate refinements for both regression and classification are stated as open problems and are not claimed as theorems.
Accurately estimating treatment effects in time series is essential for evaluating interventions in real-world applications, especially when treatment assignment is biased by unobserved factors. In many practical settings, interventions are adopted at different times across individuals, leading to staggered treatment exposure and heterogeneous pre-treatment histories. In such cases, aggregating outcome trajectories across treated units is ill-defined, making individual treatment effect (ITE) estimation a prerequisite for reliable causal inference. We therefore study the problem of estimating the average treatment effect for the treated (ATT) by first recovering individual-level counterfactuals. We introduce a neural framework that simultaneously learns low-dimensional latent representations of individual time series and propensity scores. These estimates are then used to approximate the individual treatment effects through a flexible matching procedure that avoids classical convexity constraints commonly used in synthetic control methods. By operating at the individual level, our approach naturally accommodates staggered interventions and improves counterfactual estimation under latent bias, without relying on explicit temporal modeling assumptions. We illustrate our approach on both real-world energy consumption data and clinical time series, including high-frequency electricity demand-response programs and semi-synthetic data for individuals in an intensive care unit (ICU), where hidden confounding, staggered treatment adoption, and non-stationary dynamics are prevalent.
We consider contraction of Bayesian posterior distributions in nonparametric settings where coefficients of a function over a basis or dictionary are given priors with $p$--exponential tails, including Laplace tails $(p=1)$ and heavier tails $(p<1)$. It is shown that contraction rates improve as $p$ decreases and that full adaptation to smoothness, up to logarithmic factors, is obtained in an appropriate $p\to 0$ regime. As applications, we consider both series priors in white noise regression and shallow ReLU neural networks in random design regression. In particular, we show that overparametrised shallow ReLU networks can adapt to any regularity $0\le \beta\le 2$. Through a simulation study, we show strong empirical agreement with the behavior predicted by our theory.
This paper integrates deep neural networks (DNNs) into structural models to increase flexibility and capture rich heterogeneity while preserving interpretability. Economic (or scientific or domain-restricted) structure and machine learning are complements in empirical modeling, not substitutes: DNNs provide the capacity to learn complex, nonlinear heterogeneity, while the structure ensures the estimates remain interpretable and suitable for decision-making and policy analysis. We start with a standard parametric structural model and then enrich its parameters into fully flexible functions, which are estimated using a DNN with the model structure built in. We illustrate our framework with an application to demand estimation in consumer choice. We show that by enriching a demand model we can capture rich heterogeneity and exploit it to create personalized pricing. Optimization is not possible without structure, but it cannot be heterogeneous without machine learning. The same lessons apply to precision dosing, adaptive treatment, educational testing, and other targeting settings. We provide theoretical justification for our proposed methodology: nonasymptotic bounds and a novel and general influence function for feasible inference via double machine learning, so that the latter can be easily applied in numerous new contexts. These results may be of interest in other contexts as they generalize prior work.
Vertex-level clustering for directed graphs (digraphs) remains challenging as edge directionality breaks the key assumptions underlying popular spectral methods, which also incur the overhead of eigen-decomposition. This paper proposes Parametrized Power-Iteration Clustering (ParPIC), a random-walk-based clustering method for weakly connected digraphs. This builds over the Power-Iteration Clustering paradigm, which uses the rows of the iterated diffusion operator as a data embedding. ParPIC has three important features: the use of parametrized reversible random walk operators, the automatic tuning of the diffusion time, and the efficient truncation of the final embedding, which produces low-dimensional data representations and reduces complexity. Empirical results on synthetic and real-world graphs demonstrate that ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods.
Sequential data is ubiquitous -- it is routinely gathered to gain insights into complex processes such as behavioral, biological, or physical processes. Challengingly, such data not only has dependencies within the observed sequences, but the observations are also often high-dimensional, sparse, and noisy. These are all difficulties that obscure the inner workings of the complex process under study. One solution is to calculate a low-dimensional representation that describes (characteristics of) the complex process. This representation can then serve as a proxy to gain insight into the original process. However, uncovering such low-dimensional representation within sequential data is nontrivial due to the dependencies, and an algorithm specifically made for sequences is needed to guarantee estimator consistency. Fortunately, recent theoretical advancements on Block Markov Chains have resulted in new clustering algorithms that can provably do just this in synthetic sequential data. This paper presents a first field study of these new algorithms in real-world sequential data; a wide empirical study of clustering within a range of data sequences. We investigate broadly whether, when given sparse high-dimensional sequential data of real-life complex processes, useful low-dimensional representations can in fact be extracted using these algorithms. Concretely, we examine data sequences containing GPS coordinates describing animal movement, strands of human DNA, texts from English writing, and daily yields in a financial market. The low-dimensional representations we uncover are shown to not only successfully encode the sequential structure of the data, but also to enable gaining new insights into the underlying complex processes.
We consider nonlinear networks as perturbations of linear ones. Based on this approach, we present novel generalization bounds that become non-vacuous for networks that are close to being linear. The main advantage over the previous works which propose non-vacuous generalization bounds is that our bounds are a-priori: performing the actual training is not required for evaluating the bounds. To the best of our knowledge, they are the first non-vacuous generalization bounds for neural nets possessing this property.
We consider the effect of Gaussian perturbations on least-squares residuals, orthogonal projections, and QR-type algorithms. The problem that motivated our investigations is as follows: suppose that a full column-rank matrix \(B\in\mathbb{R}^{m\times n}\) has already been computed, and suppose that a new normalized column \(q=(x+y)/\|x+y\|_2\) is to be appended to \(B\), where \(x\perp\operatorname{span}(B)\) is the ideal orthogonal component and \(y\) represents the orthogonalization error. How large can the condition number \(\kappa([B,q])\) of the resulting matrix \([B,q]\) become? While we provide a Weyl-type bound on the singular values of \([B,q]\), in terms of the extremal singular values of \(B\) and the quantity \(\|B^T y\|_2/\|x+y\|_2\), we also derive exact probability laws for norms and projection residuals under Gaussian perturbations. Finally, we use these probability laws to derive probabilistic condition-number bounds for QR-type processes with imperfect orthogonalization and exact normalization.
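As a hedged illustration of why the quantity \(\|B^T y\|_2/\|x+y\|_2\) governs the conditioning, consider the simplest special case, worked out here as an assumption-laden example rather than taken from the paper: \(n=1\), \(B=b\) with \(\|b\|_2=1\), and \(x\neq 0\). The symbol \(c\) is introduced only for this example.

```latex
% Special case n = 1, B = b, \|b\|_2 = 1, x \perp b, x \neq 0.
\[
  c \;:=\; b^{T} q \;=\; \frac{b^{T}(x+y)}{\|x+y\|_2} \;=\; \frac{b^{T} y}{\|x+y\|_2},
  \qquad
  [b,q]^{T}[b,q] \;=\; \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}.
\]
% The Gram matrix has eigenvalues 1 + c and 1 - c, so the singular values
% of [b, q] are \sqrt{1 + |c|} and \sqrt{1 - |c|}, and therefore
\[
  \kappa([b,q]) \;=\; \sqrt{\frac{1+|c|}{1-|c|}},
  \qquad
  |c| \;=\; \frac{|b^{T} y|}{\|x+y\|_2} \;<\; 1 .
\]
```

In this special case the condition number equals \(1\) exactly when the error \(y\) has no component along \(b\), and it grows without bound as \(|c|\to 1\), that is, as the appended column aligns with the existing one.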
The conditional Gaussian nonlinear system (CGNS) is a broad class of nonlinear stochastic dynamical systems. Given the trajectories for a subset of state variables, the remaining follow a Gaussian distribution. Despite the conditionally linear structure, the CGNS exhibits strong nonlinearity, thus capturing many non-Gaussian characteristics observed in nature through its joint and marginal distributions. Desirably, it enjoys closed analytic formulae for the time evolution of its conditional Gaussian statistics, which facilitate the study of data assimilation and other related topics. In this paper, we develop a martingale-free approach to improve the understanding of CGNSs. This methodology provides a tractable approach to proving the time evolution of the conditional statistics by deriving results through time discretization schemes, with the continuous-time regime obtained via a formal limiting process as the discretization time-step vanishes. This discretized approach further allows for developing analytic formulae for optimal posterior sampling of unobserved state variables with correlated noise. These tools are particularly valuable for studying extreme events and intermittency and apply to high-dimensional systems. Moreover, the approach improves the understanding of different sampling methods in characterizing uncertainty. The effectiveness of the framework is demonstrated through a physics-constrained, triad-interaction climate model with cubic nonlinearity and state-dependent cross-interacting noise.
Data assimilation (DA) combines partial observations with dynamical models to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the utilization of observations only within a nearby window, thus reducing computational complexity and storage needs. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, the advantage of reduced computational storage expenditure is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence.
We derive universal approximation results for the class of (countably) $m$-rectifiable measures. Specifically, we prove that $m$-rectifiable measures can be approximated as push-forwards of the one-dimensional Lebesgue measure on $[0,1]$ using ReLU neural networks with arbitrarily small approximation error in terms of Wasserstein distance. What is more, the weights in the networks under consideration are quantized and bounded and the number of ReLU neural networks required to achieve an approximation error of $\varepsilon$ is no larger than $2^{b(\varepsilon)}$ with $b(\varepsilon)=\mathcal{O}(\varepsilon^{-m}\log^2(\varepsilon))$. This result improves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which $b(\varepsilon)$ tends to infinity as $\varepsilon$ tends to zero equals the rectifiability parameter $m$, which can be much smaller than the ambient dimension. We extend this result to countably $m$-rectifiable measures and show that this rate still equals the rectifiability parameter $m$ provided that, among other technical assumptions, the measure decays exponentially on the individual components of the countably $m$-rectifiable support set.
When evaluating policy interventions, researchers often pursue two related goals: identifying which individuals or contexts benefit most, and determining whether patterns of treatment effect heterogeneity can be used to aggregate evidence across environments. We develop a framework that aggregates treatment effect heterogeneity, defined over individual and environmental characteristics, into interpretable summaries while setting aside contexts in which extrapolation is unreliable and further evidence is needed. The procedure therefore learns both how to summarize heterogeneous effects and when researchers should admit ignorance. We derive finite-sample regret guarantees, provide data-driven guarantees for selecting the complexity of the summary class, and develop inference procedures that quantify the value of follow-up data collection. We illustrate the approach by reanalyzing a multifaceted anti-poverty program implemented in six countries.
Accurate and rapid prediction of wildfire trends is crucial for effective management and mitigation. However, the stochastic nature of fire propagation poses significant challenges in developing reliable simulators. In this paper, we introduce PyTorchFire, an open-access, PyTorch-based software that leverages GPU acceleration. With our redesigned differentiable wildfire Cellular Automata (CA) model, we achieve millisecond-level computational efficiency, significantly outperforming traditional CPU-based wildfire simulators on real-world-scale fires at high resolution. Real-time parameter calibration is made possible through gradient descent on our model, aligning simulations closely with observed wildfire behavior both temporally and spatially, thereby enhancing the realism of the simulations. Our PyTorchFire simulator, combined with real-world environmental data, demonstrates superior generalizability compared to supervised learning surrogate models. Its ability to predict and calibrate wildfire behavior in real-time ensures accuracy, stability, and efficiency. PyTorchFire has the potential to revolutionize wildfire simulation, serving as a powerful tool for wildfire prediction and management.
Transformers are the state-of-the-art architecture for large language models, and a key to their scalability is the strategic usage of low-precision arithmetic. We develop a mixed-precision analysis of transformer inference, deriving bounds for the condition numbers and forward error of the architecture's constituent parts. Notably, we compare the numerical stability of LayerNorm and RMSNorm in the massive-outlier regime, tighten the error bound of softmax in the presence of attention sinks, and quantify the impact of its shifted evaluation on the sensitivity to perturbations. Furthermore, we derive novel sequence-length-independent bounds on the local Lipschitz constant of self-attention. Our worst-case error bound for transformer inference suggests that its numerical stability is determined by the interplay between weight magnitude and the growth of the residual stream. Crucially, and as validated by experiments with GPT-2, our analysis establishes that the scaling of residual-projection weights preserves the propagation of the relative rounding error unless it forces a qualitative transition in the dynamics of the residual stream.
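The "shifted evaluation" of softmax is read here as the common max-subtraction trick, which is an assumption about the abstract's meaning. The sketch below only illustrates why the shift matters in floating point: both functions compute the same quantity mathematically, but the unshifted form overflows for large logits. Function and variable names are hypothetical, and no error bound from the paper is reproduced.

```python
import math

def softmax_naive(x):
    """Direct evaluation exp(x_i) / sum_j exp(x_j); overflows for large x_i."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_shifted(x):
    """Shifted evaluation exp(x_i - max x) / sum_j exp(x_j - max x)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]  # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

def overflows(fn, x):
    """Return True if fn(x) raises OverflowError."""
    try:
        fn(x)
    except OverflowError:
        return True
    return False

large_logits = [1000.0, 999.0, 998.0]   # e.g. a massive-outlier activation
small_logits = [0.5, -1.0, 2.0]

naive_fails = overflows(softmax_naive, large_logits)
shifted_probs = softmax_shifted(large_logits)
max_gap = max(abs(a - b) for a, b in
              zip(softmax_naive(small_logits), softmax_shifted(small_logits)))
print(naive_fails, shifted_probs, max_gap)
```

On moderate inputs the two evaluations agree to rounding error, so the shift changes the numerical behaviour rather than the function being computed.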
With the growing interest in quantum machine learning, the perceptron, a fundamental building block in traditional machine learning, has emerged as a valuable model for exploring the potential of quantum algorithms. In this work, we make two principal contributions. First, we revisit the \emph{quantum version space perceptron} algorithm proposed by Kapoor et al. (2016), by identifying and correcting a flawed complexity assumption. We show that the query complexity of the algorithm is dimension-dependent, which has significant implications for its behaviour in high-dimensional regimes under worst-case scenarios. Second, we propose and analyse two \emph{quantum-enhanced} cutting-plane algorithms for perceptron learning. Specifically, we leverage established quantum subroutines such as \emph{Grover's search} and \emph{quantum walk search}, and provide detailed algorithmic constructions together with query and arithmetic complexity analyses. Our results establish improved complexity bounds under an idealised implementation framework and noise-free quantum computational models, offering insights into the trade-offs between margin dependence, dimensional dependence, and quantum resources. These findings provide a refined understanding of quantum perceptron models and their theoretical computational complexity properties.
Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its training dynamics. However, most existing studies focus on model-based settings and provide only asymptotic guarantees, hindering their principled understanding and practical deployment, especially in data-limited scenarios. We aim to close this gap by proposing a model-free algorithm, \textbf{Robust Halpern Iteration (RHI)}. We first design our algorithm based on a black-box sampling oracle, which can estimate the worst-case performance accurately. We then derive the finite sample complexity of RHI under the generative model setting, assuming the sampling oracle. To concretely design such an oracle, we propose a $K$-order multi-level Monte-Carlo estimator, which is shown to have a lower bias compared to prior methods. We further instantiate our design for multiple uncertainty models, including KL and $\chi^2$ divergence sets, and show that our RHI algorithm achieves an $\varepsilon$-optimal robust policy with a sample complexity of $\tilde{\mathcal{O}}\left( \frac{SA\mathcal{H}^2}{\varepsilon^{(2+o(1))}}\right)$, where $S,A$ are the number of states and actions, and $\mathcal{H}$ is the robust optimal span. Our result asymptotically matches the best complexity in robust average reward RL.
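For readers unfamiliar with the underlying fixed-point scheme, the following is a minimal sketch of the classical Halpern iteration $x_{k+1} = \beta_k x_0 + (1-\beta_k)\,T(x_k)$ with the common choice $\beta_k = 1/(k+2)$. It is applied to a toy nonexpansive map (a 90-degree rotation of the plane, chosen here purely for illustration), not to a robust Bellman operator; RHI's sampling oracle and multi-level Monte-Carlo estimator are not reproduced, and all names are hypothetical.

```python
import math

def rotate90(x):
    """Toy nonexpansive map: rotate a 2D point by 90 degrees (fixed point: origin)."""
    a, b = x
    return (-b, a)

def halpern(T, x0, steps):
    """Halpern iteration: anchor each step back toward the starting point x0."""
    x = x0
    for k in range(steps):
        beta = 1.0 / (k + 2)          # anchoring weight, decays like 1/k
        tx = T(x)
        x = tuple(beta * a + (1.0 - beta) * b for a, b in zip(x0, tx))
    return x

def picard(T, x0, steps):
    """Plain fixed-point iteration x <- T(x), shown for comparison."""
    x = x0
    for _ in range(steps):
        x = T(x)
    return x

x0 = (1.0, 0.0)
halpern_norm = math.hypot(*halpern(rotate90, x0, 2000))
picard_norm = math.hypot(*picard(rotate90, x0, 2000))
print(halpern_norm, picard_norm)
```

The plain iteration circles the fixed point forever, while the anchored iteration approaches it. That behaviour on merely nonexpansive maps is one reason Halpern-type schemes are attractive in average-reward settings, where the Bellman operator is not a contraction.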
Motivated by Online Configuration Optimization in large, dynamic parameter spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB (Adversarial Bandit over Bandits), a hierarchical framework that decomposes the configuration space into clusters to accelerate learning and adapt to changing environments. We evaluate ABoB using standard algorithms such as EXP3 and Tsallis-INF on a real-world production storage system, demonstrating significant performance gains of up to $50\%$ compared to state-of-the-art "flat" bandit algorithms. Extensive simulations further confirm that ABoB effectively exploits metric structures, achieving up to $91\%$ improvement in adversarial metric scenarios while significantly reducing computational running time. Theoretical analysis grounds this empirical success: we prove that ABoB maintains a worst-case "safety net" bound of $O(\sqrt{kT})$, matching traditional methods, where $T$ is the number of rounds and $k$ is the number of arms, while capable of accelerating learning to $O(k^{1/4}\sqrt{T})$ under favorable Lipschitz conditions. This combination of operational efficiency and theoretical soundness makes ABoB a practical solution for automated system tuning.
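As background for the "flat" baselines mentioned above, here is a minimal sketch of the textbook EXP3 algorithm with rewards in $[0,1]$. It is not ABoB: the hierarchical clustering of arms is not reproduced, and the exploration rate `gamma`, the toy reward function, and all names are assumptions made for illustration.

```python
import math
import random

def exp3(reward_fn, k, horizon, gamma=0.1, seed=0):
    """Textbook EXP3: exponential weights with importance-weighted reward estimates."""
    rng = random.Random(seed)
    weights = [1.0] * k
    counts = [0] * k
    for t in range(horizon):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)            # only the pulled arm is observed
        estimate = reward / probs[arm]        # unbiased estimate of that arm's reward
        weights[arm] *= math.exp(gamma * estimate / k)
        top = max(weights)                    # rescale to avoid overflow
        weights = [w / top for w in weights]
        counts[arm] += 1
    return counts

# Toy oblivious adversary (an assumption): arm 2 always pays 1, the others pay 0.
counts = exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, k=3, horizon=2000)
print(counts)
```

A hierarchical scheme in the spirit of the abstract would run one such learner over clusters and another over the arms inside the chosen cluster; how ABoB does this in detail is not specified here.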
Identifying locations that offer maximum visual exposure to passing vehicular traffic is a core problem in urban analytics, with applications spanning urban design, navigation, location-based services, and the placement of street-level assets. Traditional site selection methods often rely on static traffic counts or subjective assessments. This research introduces a data-driven methodology to objectively quantify location visibility by analyzing large-scale connected vehicle trajectory data within urban environments. We model the dynamic driver field-of-view using a forward-projected visibility area for each vehicle position derived from interpolated trajectories. By integrating this with building vertex locations extracted from OpenStreetMap, we quantify the cumulative visual exposure, or ``visibility count'', for thousands of potential points of interest along roadways. The core technical contribution involves the construction of a BallTree spatial index over building vertices. This enables highly efficient ($O(\log N)$ complexity) radius queries to determine which vertices fall within the viewing circles of millions of trajectory points across numerous trips, significantly outperforming brute-force geometric checks. Analysis reveals two key findings: 1) Visibility is highly concentrated, identifying distinct ``visual hotspots'' receiving disproportionately high exposure compared to average locations. 2) The aggregated visibility counts across vertices conform to a Log-Normal distribution.
The public often attributes human-like qualities to large language models (LLMs), assuming that they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies three flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages LLMs' internal representations to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, using three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
This paper proposes a frequency-domain estimator for low-order systems from repeated noisy measurements. The estimator minimizes a quadratic data-fitting term regularized by the nuclear norm of a Loewner matrix, subject to a convex stability constraint enforced via a semidefinite program. We prove a finite-sample error bound at the sampled frequencies and extend it to all frequencies through rational interpolation. The bound characterizes the dependence on the number of repeated experiments, number of frequency points, system order, and noise level. Numerical experiments on SISO and MIMO systems demonstrate the low-order-promoting effect of the method and validate the predicted scaling laws.
We prove the Kolmogorov-Feller weak law of large numbers for sample Fréchet means on non-compact symmetric spaces. The result covers independent, non-identically distributed data, extending beyond the i.i.d. setting. Examples of symmetric positive-definite matrices and product symmetric spaces are provided.
Interfacing quantum and classical processors is an important subroutine in full-stack quantum algorithms. The so-called ``classical shadow'' method efficiently extracts essential classical information from quantum states, enabling the prediction of many properties of a quantum system from only a few measurements. However, for a small number of highly non-local observables, or when classical post-processing power is limited, the classical shadow method is not always the most efficient choice. Here, we address this issue quantitatively by performing a full-stack resource analysis that compares classical shadows with direct quantum measurement. Under certain assumptions, our analysis illustrates an efficiency frontier between classical shadows and direct quantum measurement in the information-extraction stage. For observables expressed as linear combinations of Pauli matrices, the classical shadow method outperforms direct measurement when the number of observables is large and the Pauli weight is small. For observables in the form of large Hermitian sparse matrices, the classical shadow method shows an advantage when the number of observables, the sparsity of the matrix, and the number of qubits fall within a certain range. The key parameters influencing this behavior include the number of qubits $n$, observables $M$, sparsity $k$, Pauli weight $w$, accuracy requirement $\epsilon$, and failure tolerance $\delta$. We also compare the resource consumption of the two methods on different types of quantum computers and identify break-even points where the classical shadow method becomes more efficient, which vary depending on the hardware. This paper opens a new avenue for quantitatively designing optimal strategies for hybrid quantum-classical tomography and provides practical insights for selecting the most suitable quantum measurement approach in real-world applications.
Many LLMs plan before they act, yet planning and execution are often still entangled in one long generation trace, enforced only through prompts, or split across separate components. We argue that these two stages call for different computation: planning benefits from diversity and breadth, whereas execution demands precision and faithful adherence to a chosen strategy. Treating them as a single undifferentiated chain wastes tokens on routine derivation and makes it costly to explore alternative strategies at test time. We present the \textbf{Explore-Execute Chain (E\textsuperscript{2}C)}, which keeps both stages in one model but separates them structurally: a stochastic \textit{Exploration} phase drafts a concise high-level plan, and a deterministic \textit{Execution} phase carries it out. Causal SFT and RL train this split so that exploration stays informative and execution remains plan-faithful. Once plans are short yet decisive, extra inference compute can be directed to exploration rather than to repeatedly decoding full solutions. On AIME'2024 at $K{=}32$, \textbf{E\textsuperscript{2}C-ReAct Loop} reaches 53.3\% accuracy with only 12.4k tokens, outperforming Tree-of-Thoughts ($N{=}32$: 50.0\%, 71.3k). The same structure also supports lightweight domain adaptation: \textbf{Exploration-Focused SFT (EF-SFT)} updates only the planning phase, uses 3.5\% of the tokens required by standard SFT, and improves medical benchmark accuracy by up to 14.5\%.
We introduce vector diffusion wavelets (VDWs), a novel family of wavelets inspired by the vector diffusion maps algorithm that was introduced to analyze data lying in the tangent bundle of a Riemannian manifold. We show that these wavelets may be effectively incorporated into a family of geometric graph neural networks, which we refer to as VDW-GNNs. We demonstrate that such networks are effective on synthetic point cloud data, as well as on real-world data derived from wind field and neural activity measurements. Theoretically, we prove that these new wavelets have desirable frame theoretic properties, similar to traditional diffusion wavelets. Additionally, we prove that these wavelets have useful symmetries with respect to rotations and translations.
Nuclear fusion plays a pivotal role in the quest for reliable and sustainable energy production. A major roadblock to viable fusion power is understanding plasma turbulence, which significantly impairs plasma confinement, and is vital for next-generation reactor design. Plasma turbulence is governed by the nonlinear gyrokinetic equation, which evolves a 5D distribution function over time. Due to its high computational cost, reduced-order models are often employed in practice to approximate turbulent transport of energy. However, they omit nonlinear effects unique to the full 5D dynamics. To tackle this, we introduce GyroSwin, the first scalable 5D neural surrogate that can model 5D nonlinear gyrokinetic simulations, thereby capturing the physical phenomena neglected by reduced models, while providing accurate estimates of turbulent heat transport. GyroSwin (i) extends hierarchical Vision Transformers to 5D, (ii) introduces cross-attention and integration modules for latent 3D$\leftrightarrow$5D interactions between electrostatic potential fields and the distribution function, and (iii) performs channelwise mode separation inspired by nonlinear physics. We demonstrate that GyroSwin outperforms widely used reduced numerics on heat flux prediction, captures the turbulent energy cascade, and reduces the cost of fully resolved nonlinear gyrokinetics by three orders of magnitude while remaining physically verifiable. GyroSwin shows promising scaling laws, tested up to one billion parameters, paving the way for scalable neural surrogates for gyrokinetic simulations of plasma turbulence.
Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.
Toeplitz covariance estimation is a classical problem in statistical signal processing, yet the geometry of the Gaussian maximum-likelihood objective remains only partially understood. Recent algorithms, including Newton-type, majorization-minimization, and gradient-based methods, indicate that the nonconvex problem can often be globally solved when the number of samples is sufficiently large, but they also reveal a difficult computational landscape. In this work, we study this phenomenon through an overparameterized Carathéodory representation of positive definite Toeplitz covariance matrices. The Carathéodory decomposition parameterizes the covariance using a combination of steering vectors with different frequencies and amplitudes. Our first result shows that fixed-grid amplitude optimization is fundamentally insufficient. Even in the population setting, and even with arbitrarily many fixed frequency grid points, amplitude-only optimization can have a strictly positive error floor under grid mismatch. This motivates optimizing both amplitudes and frequencies. In this case, our main theoretical result proves that the joint optimization has a benign population landscape: every stationary point that produces a positive definite covariance matrix recovers the true Toeplitz covariance. These findings suggest a simple interpretation of the Toeplitz covariance problem: the population landscape is globally benign, but may be highly ill-conditioned. In our numerical experiments, overparameterization improves convergence speed and finite-sample accuracy. In particular, it allows simple gradient descent to approach the Cramér-Rao bound while keeping the implementation simple.
Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, assuming standard scaling conditions on the width, the learning rate, and the parameter initialization. In the limit of infinitely large widths and infinitesimal learning rate, the obtained kernel provides a description of the learned model's output via a closed-form solution dependent on the architecture and the activation function. The Neural Tangent Kernel, central to this description, remains constant throughout training, a phenomenon that is referred to as ``lazy training'' or within the ``lazy regime''. Prior works show that the ``lazy regime'' leads to non-varying hidden neuron activations in infinitely-wide networks. Moreover, as infinitely-wide networks increase in depth, the Neural Tangent Kernel induces a closed-form solution that is data-independent, hence trivial. The Neural Tangent Kernel seemingly fails to describe the complexity of overparameterized neural networks on two distinct axes: large widths and large depths. In this work, we challenge these two conclusions and open the door to re-evaluating the Neural Tangent Kernel's role in describing the output of overparameterized neural networks. Specifically, we show experimentally that while deviations in the activations of individual hidden neurons vanish, the aggregate norm of these deviations does not. We support this finding with a theoretical result showing that the activations of the last hidden layer do not remain constant. Furthermore, we demonstrate that properly scaling the depth and stopping time in infinitely-wide ReLU networks yields a well-behaved, non-trivial output at large dataset sizes. We empirically evaluate the stability of this behavior on large datasets, and we describe the essential properties that enable the generalization of our results to other kernels.
Distributed Fiber Optic Sensing (DFOS) has emerged as a promising technology for long-range and real-time perimeter security in critical infrastructure monitoring. However, DFOS signals collected from different field deployments often exhibit substantial distribution shifts caused by variations in fiber installation, structural coupling, and environmental noise. These deployment-dependent changes make reliable event recognition difficult in practical perimeter security systems, especially when labeled samples from new target sites are scarce or unavailable. To address these challenges, this paper proposes DUPLE, an intelligent cross-deployment recognition framework for fiber-optic perimeter security under label-scarce target deployments. DUPLE employs statistically guided meta-learning to enhance recognition robustness across unseen deployments. Specifically, a dual-domain multi-prototype learner jointly models temporal and frequency-domain evidence to capture intra-class variability under deployment shifts. A statistical guidance network estimates sample-specific domain reliability from raw signal statistics, while a query-aware aggregation mechanism adaptively selects relevant prototypes for each test sample. Extensive experiments on two real-world cross-deployment DFOS benchmarks demonstrate that DUPLE consistently outperforms representative traditional machine learning, deep learning, domain generalization, and meta-learning baselines. Ablation, few-shot, per-deployment, and efficiency analyses further verify the effectiveness and practicality of DUPLE for reliable DFOS-based perimeter security monitoring.
We inspect the deductive connection between the neural scaling law and Zipf's law -- two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model -- such as a large language model -- changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf's law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf's law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps' law on the vocabulary growth from Zipf's law, Hilberg's hypothesis on the entropy scaling from Heaps' law, and the neural scaling from Hilberg's hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all four statistical laws.
Empirical models of multi-product demand rely on low-dimensional product representations to capture substitution patterns, increasingly using proxies built from unstructured data. When proxies are imperfect, standard workflows yield biased counterfactuals and invalid inference. We develop a practical toolkit to address these issues. Our methods apply to market-level and/or individual data, require minimal additional computation, provide simple standard-error formulas, and accommodate proxies from fine-tuned models. Further, we propose diagnostics to assess proxy quality. Our methods yield meaningful improvements in predicting substitution in empirically calibrated simulations and in an application where we assess counterfactual prediction performance against a ground truth.
The empirical success of deep learning is often attributed to deep networks' ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results still focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained with gradient-based methods and standard input-label pairs, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models -- a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training of each individual layer suffices to hierarchically learn the target function.
Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a well-defined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, it turns out to be first-order and solvable by the method of characteristics. We observe a very good agreement of our theoretical predictions with experimental results.
We consider the problem of heteroskedastic generalized linear bandits (GLBs) with adversarial corruptions, which subsumes heteroskedastic linear bandits and logistic/Poisson bandits. We propose HCW-GLB-OMD, which consists of two components: an online mirror descent (OMD)-based estimator and Hessian-based confidence weights to achieve corruption robustness. The algorithm is computationally efficient in that it requires only $O(1)$ space and time per iteration. Under the self-concordance assumption on the link function, we show a regret bound of $\tilde{O}\left( d \sqrt{\sum_t g(\tau_t) \dot{\mu}_{t,\star}} + d^2 g_{\max} \kappa + d (g_{\max} + \kappa) C \right)$, where $\dot{\mu}_{t,\star}$ is the slope of $\mu$ around the optimal arm at time $t$, the $g(\tau_t)$ are potentially exogenously time-varying dispersions (e.g., $g(\tau_t) = \sigma_t^2$ for heteroskedastic linear bandits, $g(\tau_t) = 1$ for Bernoulli and Poisson), $g_{\max} = \max_{t \in [T]} g(\tau_t)$ is the maximum dispersion, and $C \geq 0$ is the total corruption budget of the adversary. We complement this with a lower bound of $\tilde{\Omega}(d \sqrt{\sum_t g(\tau_t) \dot{\mu}_{t,\star}} + d C)$, unifying previous problem-specific lower bounds. Thus, our algorithm achieves, up to a $\kappa$-factor in the corruption term, instance-wise minimax optimality simultaneously across various instances of heteroskedastic GLBs with adversarial corruptions.
Random-restart heuristics are widely used in nonconvex optimization and equilibrium computation: practitioners run a local algorithm from many initial conditions and interpret repeated convergence to the same output as evidence that the result is robust, dominant, or even unique. Despite its widespread use, this reasoning is usually informal. We provide a probabilistic framework for interpreting restart evidence. We give broad, easy-to-verify sufficient conditions under which repeated runs of a solver can be treated as independent draws from a categorical distribution induced by random initial conditions. Within this framework, we develop Bayesian inference from repeated identical outputs. We derive posterior concentration rates for basin size and uniqueness. These rates demonstrate that uniqueness is inherently harder than learning basin size: posterior concentration for uniqueness is polynomial, whereas basin size concentrates exponentially fast. We also provide a verification protocol for checking whether a given problem fits our framework. We demonstrate the protocol on a widely used equilibrium solver for mixed-logit demand with multi-product firms, and complement the verification exercise with posterior tables that apply to any restart experiment satisfying the protocol. We conclude by delineating limits of restart-based inference, including failures induced by solver--problem mismatch and limited visibility of alternative outcomes.
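To make the idea of Bayesian inference from repeated identical outputs concrete, here is a deliberately simplified sketch. It is not the paper's model: it assumes a uniform Beta(1, 1) prior on the probability mass $p$ of the reference output's basin and that every one of the subsequent independent restarts returned that same output, so the posterior is Beta(m + 1, 1) with CDF $p^{m+1}$. The function names are hypothetical.

```python
import math


def prob_basin_mass_exceeds(threshold, num_matching_runs):
    """Posterior P(p > threshold) after `num_matching_runs` identical outputs,
    assuming a uniform prior on the basin mass p (posterior CDF is p**(m + 1))."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if num_matching_runs < 0:
        raise ValueError("num_matching_runs must be non-negative")
    return 1.0 - threshold ** (num_matching_runs + 1)


def matching_runs_needed(threshold, confidence):
    """Smallest number of identical outputs m with P(p > threshold) >= confidence."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    needed = math.ceil(math.log(1.0 - confidence) / math.log(threshold)) - 1
    return max(needed, 0)


for m in (0, 10, 50):
    print(m, round(prob_basin_mass_exceeds(0.9, m), 4))
print(matching_runs_needed(0.9, 0.95))
```

In this toy setting the posterior mass below the threshold is `threshold ** (m + 1)`, which decays geometrically in the number of matching runs. This is consistent with the exponential concentration for basin size described above, but it says nothing about uniqueness, which the abstract identifies as the harder question.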
A matching platform is a system that matches participants of different types, such as companies and job-seekers. In such a platform, maximizing matches may concentrate assignments on popular participants, increasing dissatisfaction among others, and eventually causing churn, which reduces the platform's profit opportunities. To address this issue, we propose a novel online learning problem, Combinatorial Allocation Bandits (CAB), which incorporates the notion of *arm satisfaction*. In CAB, at each round, the learner observes feature vectors for $K$ arms and $N$ users, assigns users to arms, and observes feedback following a generalized linear model (GLM). Unlike prior work, the objective is to maximize arm satisfaction rather than the amount of positive feedback. For CAB, we develop an upper confidence bound algorithm that uses an approximate optimization oracle and achieves an approximate regret upper bound, whose dependence on $d$, $T$, and $N$ matches the known lower bound for contextual combinatorial linear bandits up to logarithmic factors. We also analyze a Thompson sampling algorithm with a standard regret bound under an exact optimization oracle, and propose a cheaper one-pass variant retaining sublinear approximate regret under a self-concordance assumption. Experiments on synthetic data support the objective and show that CAB-UCB achieves higher cumulative satisfaction than baselines.
Reliable uncertainty quantification is paramount for forecasting multivariate time series and spatiotemporal data. While Transformer architectures excel at sequence modeling, current probabilistic approaches typically rely on restrictive parametric likelihoods or quantile-based objectives, thereby limiting their ability to capture complex joint distributions in correlated time series. To overcome these limitations, we propose \textit{Enformer} and its spatiotemporal extension, \textit{GEnformer}. These models synthesize the expressive power of Transformers with engression, a stochastic learning paradigm for modeling conditional distributions. By injecting stochastic noise and optimizing a strictly proper scoring objective, our frameworks directly learn conditional predictive distributions without imposing parametric assumptions. This design ensures the generation of coherent multivariate trajectories while maintaining the Transformer's efficacy in modeling long-range dependencies and cross-series interactions. The probabilistic capability of Enformer is achieved with an asymptotic overhead of only a constant factor over a deterministic Transformer with an identical configuration. We extensively evaluate our frameworks on prominent multivariate benchmarks for temporal dynamics and real-world epidemic datasets for spatiotemporal dynamics. Empirical results demonstrate that both frameworks yield calibrated probabilistic forecasts and consistently outperform state-of-the-art baselines.
This paper studies finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances. Our analysis considers trajectory-dependent regressors and allows marginally stable dynamics with polynomial mean-square state growth. We prove that the diameter of the feasible parameter set shrinks with sample complexity $\widetilde{\mathcal O}(1/\epsilon)$ where $\epsilon$ is the estimation error. Simulation supports the theory and illustrates the advantage of the proposed estimator for uncertainty quantification.
We develop a transport-geometric theory of cosmological peak statistics based on optimal transport and entropy geometry. The density field is treated as a probability measure in Wasserstein space, and its local structure is characterized by a logarithmic curvature tensor obtained as the localized response of an entropy functional. Peaks are thereby defined as positive-curvature stationary points, and their abundance is formulated as a curvature-conditioned measure on local tensor space. In the Gaussian-linear limit, this measure admits a finite-dimensional closure in terms of the spectral moments of the density field. The resulting peak abundance reduces exactly to the classical BBKS formula, identifying BBKS as a solvable Gaussian closure of a more general geometric structure. This formulation separates peak statistics into three fundamental ingredients: the probability distribution of local variables, the positive-curvature constraint, and the induced geometric measure. The theory extends naturally beyond the Gaussian approximation. Nonlinear evolution appears as a deformation of the logarithmic curvature geometry, while primordial non-Gaussianity is interpreted as a deformation of the curvature-conditioned measure itself. We further formulate two- and three-point peak statistics as higher-order curvature-conditioned measures and show that the resulting hierarchy can be organized as response functions to long-wavelength background modes, with conventional peak bias emerging as the lowest-order response coefficient. These results provide a unified description linking optimal transport, curvature geometry, peak statistics, and cosmological observables, and establish a systematic framework for studying nonlinearity, scale dependence, primordial non-Gaussianity, and higher-order peak correlations.
Genome engineering has achieved sequence-level precision, yet predicting the transcriptomic state a cell will occupy after perturbation remains open. Single-cell CRISPR screens measure how far cells move, but effect magnitude ignores whether the cells move together. We introduce Shesha perturbation stability ($S_p$), which quantifies directional coherence as the mean cosine similarity between individual cell shift vectors and the mean perturbation direction. Across five CRISPR datasets (2,200+ perturbations), stability correlates with magnitude (Spearman $\rho = 0.75$--$0.97$), but discordant cases expose regulatory architecture: pleiotropic regulators such as CEBPA pay a ``geometric tax,'' producing large but incoherent shifts, while lineage-specific factors such as KLF1 produce coordinated responses. $S_p$ and Song et al.'s perturbation-response score (PS) share partial overlap ($\rho_{\text{partial}} = +0.51$ after controlling for magnitude), but $S_p$ provides significant incremental prediction of UPR pathway activation beyond both PS and magnitude ($p < 10^{-18}$). In a split-half reproducibility assay, $S_p$ predicts directional reproducibility beyond magnitude ($\rho_{\text{partial}} = +0.384$) while PS does not ($\rho_{\text{partial}} = -0.193$), with the advantage consistent across all magnitude strata and both datasets. Geometric instability is independently associated with UPR activation across four datasets. $S_p$ is implemented in the open-source shesha-geometry Python package.
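The stability definition above (mean cosine similarity between per-cell shift vectors and the mean perturbation direction) can be sketched in a few lines. This is not the shesha-geometry package's API. The function name is hypothetical, and the choice to measure shifts relative to a control centroid is an assumption, since the abstract does not say how shift vectors are constructed.

```python
import math


def perturbation_stability(perturbed_cells, control_centroid):
    """Mean cosine similarity between each cell's shift vector and the mean shift.

    Assumption: a cell's shift vector is its expression vector minus the
    control centroid. Cells with a zero shift are skipped.
    """
    shifts = [[x - c for x, c in zip(cell, control_centroid)]
              for cell in perturbed_cells]
    dim = len(control_centroid)
    mean_shift = [sum(s[j] for s in shifts) / len(shifts) for j in range(dim)]
    mean_norm = math.sqrt(sum(v * v for v in mean_shift))
    if mean_norm == 0.0:
        raise ValueError("mean shift is zero; direction is undefined")

    cosines = []
    for s in shifts:
        norm = math.sqrt(sum(v * v for v in s))
        if norm == 0.0:
            continue  # no direction to compare for an unshifted cell
        dot = sum(a * b for a, b in zip(s, mean_shift))
        cosines.append(dot / (norm * mean_norm))
    if not cosines:
        raise ValueError("no cell has a non-zero shift")
    return sum(cosines) / len(cosines)


control = [0.0, 0.0]
coherent = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]   # different magnitudes, same direction
divergent = [[1.0, 1.0], [1.0, -1.0]]             # same magnitude, different directions
print(perturbation_stability(coherent, control))
print(perturbation_stability(divergent, control))
```

The two toy inputs illustrate why magnitude and coherence can disagree: the coherent cells move by different amounts but score the maximum stability, while the divergent cells move equally far yet score lower.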
Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N,P,K)$, dependent on data size $N$, model size $P$, and training steps $K$, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools from Lipschitz operator theory and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors: training steps, data size, and model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition $\mathrm{Lip}(T)=1$ for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.
Online conformal prediction must balance fast adaptation to distribution shift against stable coverage: feedback-driven methods react quickly but become volatile, while strongly discounted Bayesian methods lag and inflate intervals at tight coverage. We introduce \textbf{State-Adaptive Bayesian Conformal Prediction (SA-BCP)}, which forms the predictive quantile as a gated convex combination of long-term temporal inertia and local spatial evidence from a kernel density estimate, controlled by a single interpretable evidence threshold $K$. We establish three results: (i) asymptotic marginal validity of the resulting intervals up to a gate-controlled bias that vanishes as spatial evidence accumulates (exact under recurrent states); (ii) a closed-form expression for the MSE-optimal threshold, $K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$, trading the coverage-indicator (Bernoulli) variance against the temporal structural bias $M^{\mathcal{T}}$; and (iii) a rolling-origin procedure for selecting $K$ online -- consistent under stationarity, with $O(\sqrt{T\log N})$ regret against the best fixed $K$ and, for a segmented variant, a sublinear dynamic-regret bound under sublinearly many ($B_T=o(T)$) threshold shifts. Across four financial-volatility and weather datasets, three target coverage levels, and eight baselines, SA-BCP attains at-or-above-nominal coverage in most settings while producing substantially sharper intervals -- up to roughly $3\times$ lower Winkler score than discounted Bayesian CP at the tightest coverage -- and a coverage-matched audit confirms these efficiency gains are not an artifact of under-coverage. We disclose our principal limitation: a volatility-specialized CF-GARCH competitor remains more efficient on its home volatility-base series, though it does not transfer across domains.
Urban air quality forecasting is challenging because pollutant concentrations are nonlinear, nonstationary, spatiotemporally dependent, and often affected by anomalous observations caused by traffic congestion, industrial emissions, and seasonal meteorological variability. This study proposes a Graph Convolutional Support Vector Regression (GraphSVR) framework for robust spatiotemporal forecasting of urban air pollution. The model combines graph convolutional learning to capture inter-station spatial dependence with support vector regression to model nonlinear temporal dynamics while reducing sensitivity to outlier observations. The proposed framework is evaluated using air quality records from 37 monitoring stations in Delhi and 18 stations in Mumbai, representing inland and coastal metropolitan environments in India. Forecasting performance is assessed across multiple horizons and compared with established temporal and spatiotemporal benchmarks. The results show that GraphSVR consistently improves predictive accuracy and maintains stable performance across seasons and outlier-prone pollution episodes. Statistical tests further confirm the reliability of the proposed approach across the datasets. Furthermore, conformal prediction is integrated with GraphSVR to generate calibrated prediction intervals, enhancing its practical value for uncertainty-aware air quality monitoring and public health decision-making.
We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.
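The exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$ switches between two regimes at $N = (c_{1,k}/c_{2,k})^2$: below that count the $1/\sqrt{N}$ term is the smaller one, above it the $1/N$ term takes over. The short sketch below evaluates the bonus for fixed parameters; the function name and the guard for unvisited states are assumptions, and the paper's parameter schedules are not reproduced here.

```python
import math


def exploration_bonus(visit_count, c1, c2):
    """Bonus min(c1 / N, c2 / sqrt(N)) for visit count N.

    Assumption: unvisited entries (N = 0) are treated as N = 1 to avoid
    division by zero; the source does not specify this case.
    """
    n = max(visit_count, 1)
    return min(c1 / n, c2 / math.sqrt(n))


c1, c2 = 4.0, 1.0
crossover = (c1 / c2) ** 2  # the two terms coincide at this visit count
for n in (4, 16, 64):
    print(n, exploration_bonus(n, c1, c2), "crossover at", crossover)
```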
Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and applies to general priors. Five novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes $S$-dependence under uniform transitions, and score-marginal cancellation and exit-routing techniques that remove $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models, including principled choices of loss functions and dimension-free step complexity.
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss and reduce expert load imbalance in sparse MoE models, and in several cases they control final vocabulary-logit growth and improve router stability and overall training stability relative to the corresponding AdamW updates.
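The equivariance principle can be checked numerically on a toy update rule. The sketch below assumes one plausible reading of a "row-norm" update (each gradient row scaled to unit Euclidean norm) and uses a coordinate-wise sign update as a crude stand-in for Adam-like optimizers; neither is taken from the paper, and which symmetry group the paper assigns to which parameter block is not reproduced here.

```python
import math


def row_norm_update(grad):
    """Scale every row of the gradient to unit Euclidean norm (zero rows stay zero)."""
    out = []
    for row in grad:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out


def sign_update(grad):
    """Coordinate-wise sign, used here as a simple stand-in for Adam-like updates."""
    return [[float((v > 0) - (v < 0)) for v in row] for row in grad]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


theta = 0.3
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]      # orthogonal matrix acting on columns
G = [[3.0, 4.0], [1.0, 0.0], [-2.0, 5.0]]     # toy gradient block
perm = [2, 0, 1]                              # a row permutation

# Equivariance means update(G Q) == update(G) Q, and likewise for row permutations.
row_gap_rotation = max_abs_diff(row_norm_update(matmul(G, Q)),
                                matmul(row_norm_update(G), Q))
sign_gap_rotation = max_abs_diff(sign_update(matmul(G, Q)),
                                 matmul(sign_update(G), Q))
row_gap_perm = max_abs_diff(row_norm_update([G[i] for i in perm]),
                            [row_norm_update(G)[i] for i in perm])

print(row_gap_rotation, sign_gap_rotation, row_gap_perm)
```

In this toy case the row-normalized update commutes with both the right rotation and the row permutation up to floating-point error, while the coordinate-wise update does not commute with the rotation. That is the kind of mismatch the principle is meant to remove.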
Diffusion models have emerged as powerful generative approaches for missing-data imputation, yet most existing methods operate directly in data space and degrade when training data are heavily incomplete. We investigate whether shifting diffusion to a learned latent representation improves robustness under missing-completely-at-random (MCAR) corruption. To this end, we propose a two-stage framework: a robust VAE-based imputer first learns compact semantic features from incomplete observations, and a diffusion model is then trained in the resulting latent space. Across training missing rates, we perform a controlled comparison against pixel-space diffusion models under the same incomplete-data setting. The latent diffusion model maintains high sample quality and remains stable up to 50% missingness, while pixel-space diffusion degrades progressively as missingness increases. For downstream imputation, latent diffusion also achieves consistently better performance than pixel-space diffusion. These findings indicate that latent-space modeling mitigates artifact amplification from zero-imputed inputs and provides a more robust generative prior for incomplete-data learning. Overall, our results support latent diffusion as a strong and practically useful alternative to pixel-space diffusion for missing-data problems.
Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory. The source code for our approach is accessible at this https URL.
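The three-way routing described above can be sketched as a density-thresholded gate. This is an illustration under stated assumptions, not SAGE itself: it scores a candidate with an unnormalised von Mises-Fisher kernel density over stored embeddings, and it uses fixed placeholder thresholds because the abstract does not specify the adaptive threshold rule. The function names, `kappa`, `low`, and `high` are all hypothetical.

```python
import math


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def vmf_log_density_score(candidate, memory, kappa=10.0):
    """Log of the mean unnormalised vMF kernel exp(kappa * (cos - 1)) over memory.

    The vMF normalising constant is the same for every kernel and is omitted,
    so the score is 0 for an exact duplicate of a single stored embedding.
    """
    c = _unit(candidate)
    exponents = [kappa * (sum(a * b for a, b in zip(c, _unit(m))) - 1.0)
                 for m in memory]
    top = max(exponents)  # log-sum-exp shift for numerical stability
    return top + math.log(sum(math.exp(e - top) for e in exponents) / len(exponents))


def route(candidate, memory, kappa=10.0, low=-5.0, high=-0.5):
    """Route a candidate fact: low density -> ADD, high density -> NOOP,
    otherwise defer to an LLM merge step. Thresholds are fixed placeholders."""
    if not memory:
        return "ADD"
    score = vmf_log_density_score(candidate, memory, kappa)
    if score <= low:
        return "ADD"
    if score >= high:
        return "NOOP"
    return "LLM_MERGE"


memory = [[1.0, 0.0, 0.0]]
print(route([2.0, 0.0, 0.0], memory))    # same direction as a stored embedding
print(route([0.0, 1.0, 0.0], memory))    # orthogonal to everything stored
print(route([0.7, 0.714, 0.0], memory))  # partially similar
```

Only candidates whose score falls between the two thresholds reach the expensive merge step, which is the mechanism by which a gate of this kind would reduce write-time LLM calls.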
Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at this https URL.
We develop Bellman-sufficient information complexity, a representation-level framework for studying information-theoretic complexity in sequential decision making. The primitive object is an environment space $\Omega$ and an admissible algorithm class. The intrinsic object is a Bellman-sufficient state representation together with an information index $Y=\chi(\Omega)$, often the optimal decision or value object rather than the full environment. This replaces syntactic model realizability with representation-level sufficiency for decision making. On the upper-bound side, learning is organized as a dynamic program on the sufficient state with a logarithmic information potential for the index. In fixed-truth analysis this potential is represented by the coordinate log loss $\gamma\log(1/q_t(\chi(\omega^\star)))$; in the indexed Algorithmic Information Ratio (AIR) regret identities it gives rise to the log-posterior telescope, and after Bayesian posterior averaging it corresponds to an entropy term. On the lower side, a Bellman-Fano certificate uses the same state and index to compare the indexed information telescope with the ghost-good mass of low-regret reference trajectories. The central matching statement is therefore a conditional Bellman information-risk sandwich when the log-penalized Bellman upper value and the ghost-quantile lower certificate close on the same representation and at the same radius. UCB, E2D/DEC, and AMS/EBO then appear as tractable certificates or relaxations of this same log-potential Bellman program, rather than as separate notions of information complexity.
We propose a novel Bayesian framework for joint image reconstruction and uncertainty quantification from compressed sensing magnetic resonance imaging data. The problem is formulated as a linear inverse problem, where prior distributions are assigned to the unknown image parameters. Specifically, the image is assumed to be sparse in a given transform domain. We develop a general framework applicable to any sparsifying transform and demonstrate its performance using (1) a total variation transform based on image spatial gradients and (2) a wavelet-domain transform. Bayesian inference is performed using a split-and-augmented Gibbs sampler, while the resulting non-differentiable conditional distributions are efficiently sampled using a proximal Markov chain Monte Carlo method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sampling patterns and acceleration factors. The results demonstrate that the proposed Bayesian methods consistently outperform their optimisation-based counterparts in image reconstruction while providing uncertainty estimates for the reconstructed images. Furthermore, the estimated uncertainty maps correlate strongly with the true reconstruction errors and substantially outperform deep learning-based uncertainty estimation methods.