Independent component analysis is a core framework within blind source separation for recovering latent source signals from observed mixtures under statistical independence assumptions. In this work, we propose PDGMM-VAE, a source-oriented variational autoencoder in which each latent dimension, interpreted explicitly as an individual source signal, is assigned its own Gaussian mixture model prior. Unlike conventional VAE formulations with a shared simple prior, the proposed framework imposes per-dimension heterogeneous prior constraints, enabling the model to capture diverse non-Gaussian source statistics and thereby promote source separation under a probabilistic encoder-decoder architecture. Importantly, the parameters of these per-dimension GMM priors are not fixed in advance, but are adaptively learned and automatically refined toward convergence together with the encoder and decoder parameters under the overall training objective. Within this formulation, the encoder serves as a demixing mapping from observations to latent sources, while the decoder reconstructs the observed mixtures from the inferred components. This work provides a systematic study of an idea previously noted only in preliminary form, namely equipping different latent sources with different GMM priors for ICA, and formulates it as a full VAE framework with end-to-end training and per-dimension prior learning. Experimental results on both linear and nonlinear mixing problems demonstrate that PDGMM-VAE can recover latent source signals and achieve satisfactory separation performance.
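A minimal sketch of the per-dimension GMM prior central to PDGMM-VAE: each latent coordinate receives its own K-component Gaussian mixture, and the latent log-prior factorizes across dimensions. The parameter names and toy values below are illustrative stand-ins for the prior parameters that, per the abstract, are learned jointly with the encoder and decoder.

```python
import numpy as np
from scipy.special import logsumexp

def per_dim_gmm_log_prior(z, pi, mu, sigma):
    """log p(z) = sum_d log sum_k pi[d,k] N(z[d]; mu[d,k], sigma[d,k]^2).

    z     : (D,)   latent vector (one source per dimension)
    pi    : (D, K) mixture weights, rows sum to 1
    mu    : (D, K) component means
    sigma : (D, K) component standard deviations
    """
    z = z[:, None]                                   # (D, 1), broadcasts over K
    log_comp = (np.log(pi)
                - 0.5 * np.log(2 * np.pi * sigma**2)
                - 0.5 * ((z - mu) / sigma) ** 2)     # (D, K) component log-densities
    return logsumexp(log_comp, axis=1).sum()         # mix per dimension, then factorize

# Toy example: one bimodal source prior and one heavy-tailed-ish source prior.
pi = np.array([[0.5, 0.5], [0.9, 0.1]])
mu = np.array([[-2.0, 2.0], [0.0, 0.0]])
sigma = np.array([[0.5, 0.5], [0.3, 3.0]])
print(per_dim_gmm_log_prior(np.array([1.8, 0.1]), pi, mu, sigma))
```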
Understanding wafer-level spatial variations from in-situ process signals is essential for advanced plasma etching process monitoring. While most data-driven approaches focus on scalar indicators such as average etch rate, actual process quality is determined by complex two-dimensional spatial distributions across the wafer. This paper presents a spatial regression model that predicts wafer-level etch depth distributions directly from multichannel in-situ process time series. We propose a Time-LLM-based spatial regression model that extends LLM reprogramming from conventional time-series forecasting to wafer-level spatial estimation by redesigning the input embedding and output projection. Using the BOSCH plasma-etching dataset, we demonstrate stable performance under data-limited conditions, supporting the feasibility of LLM-based reprogramming for wafer-level spatial monitoring.
In clustering, strong dominance in the size of a particular cluster is often undesirable, motivating a measure of cluster size uniformity that can be used to filter such partitions. A basic requirement of such a measure is stability: partitions that differ only slightly in their point assignments should receive similar uniformity scores. A difficulty arises because cluster labels are not fixed objects; algorithms may produce different numbers of labels even when the underlying point distribution changes very little. Measures defined directly over labels can therefore become unstable under label-count perturbations. I introduce the Mass Agreement Score (MAS), a point-centric metric bounded in [0, 1] that evaluates the consistency of expected cluster size as measured from the perspective of points in each cluster. Its construction yields fragment robustness by design, assigning similar scores to partitions with similar bulk structure while remaining sensitive to genuine redistribution of cluster mass.
In this paper, we study semiparametric inference for linear multivariate Hawkes processes, a class of point processes widely used to describe self- and mutually exciting phenomena. We establish a convolution theorem giving the best limiting distribution for a regular estimator of a smooth functional. Then, in the Bayesian setting, we prove a semiparametric Bernstein-von Mises (BvM) theorem for nonparametric random series priors. We apply this result to histogram and wavelet based priors. Taken together, the convolution and BvM theorems show that, from a frequentist point of view, semiparametric Bayesian procedures are asymptotically optimal. Deriving the BvM property for random series priors led us to prove $L^2$ posterior contraction, complementing for these priors the results of Donnet, Rivoirard and Rousseau (2020).
The main purpose of this paper is to study the dynamical behavior of a stochastic SIS epidemic model driven by a mean-reverting inhomogeneous geometric Brownian motion process. First, we demonstrate the existence of a global-in-time solution and establish that it is unique and remains positive. Then we derive a sufficient condition for exponential extinction of infectious diseases and show that our extinction threshold in the stochastic case coincides with that of the deterministic case. Finally, we define an appropriate theoretical framework to guarantee the existence of an ergodic stationary distribution.
Background: Determining an adequate sample size is essential for developing reliable and generalisable clinical prediction models, yet practical guidance on selecting appropriate methods remains limited. Existing analytical and simulation-based approaches often rely on restrictive assumptions and focus on mean-based criteria. We present and validate pmsims, an R package that uses Gaussian process surrogate modelling to provide a flexible and computationally efficient simulation-based framework for sample size determination across diverse prediction settings. Methods: We conducted a comprehensive simulation study with two aims. First, we compared three search engines implemented in pmsims: a Gaussian process-based adaptive method, a deterministic bisection method, and a hybrid approach, across binary, continuous, and survival outcomes. Second, we benchmarked the best-performing pmsims engine against existing analytical (pmsampsize) and simulation-based (samplesizedev) methods, evaluating recommended sample sizes, computational time, and achieved performance on large independent validation datasets. Results: The Gaussian process-based method consistently produced the most stable sample size estimates, particularly in low-signal, high-dimensional settings. In benchmarking, pmsims achieved performance close to prespecified targets across all outcome types, matching simulation-based approaches and outperforming analytical methods in more challenging scenarios. Conclusions: pmsims provides an efficient and flexible framework for principled sample size planning in clinical prediction modelling, requiring fewer model evaluations than non-adaptive simulation approaches.
In the aftermath of the COVID-19 pandemic, empirical data have revealed that large-scale health crises not only cause immediate disruptions in mortality dynamics but also have persistent effects that may last for several years. Existing mortality models largely assume that mortality shocks are transitory and overlook how their effects can be long-lasting and heterogeneous across age groups and causes of death. In response to this limitation, we propose a novel stochastic mortality model that captures age- and cause-specific long-lasting effects of mortality jumps through a gamma-density-like decay function, estimated via a customized conditional maximum likelihood algorithm. Applying the model to recent U.S. mortality data, we reveal divergent persistence patterns across demographic groups and provide key insights into the tail risk profiles of life insurance and annuity products. Our scenario-based analyses further show that neglecting persistent shock effects can lead to suboptimal hedging, while the proposed model enables what-if testing to analyze such effects under potential future health crises.
Inverse probability of treatment weighting (IPTW) is widely used to estimate causal effects, but guidance is limited for count exposures. It is also unclear how IPTW performs when combined with multiple imputation in this context. In this study, we evaluated five IPTW methods applied to count exposures: multinomial binning, parametric and non-parametric covariate balancing propensity scores (CBPS, npCBPS), generalised boosted models (GBM), and energy balancing. Our simulations were informed by an example using data from the 1970 British Cohort Study, aiming to estimate the effect of psychological distress, measured as a count of symptoms at age 34, on self-reported longstanding illness at age 42. We compared these approaches on bias, coverage, effective sample size, and other metrics under truncated negative binomial and Poisson exposure distributions. We also assessed the performance of Rubin's rules under different missingness mechanisms. Under complete data, multinomial, CBPS, GBM, and energy weights produced low bias and near-nominal coverage, whereas npCBPS resulted in bias and poor coverage due to extreme weights. When data were missing completely at random, similar performance patterns were observed for IPTW with multiple imputation. Under missing at random, bias increased with higher missingness, but this was present for both IPTW and covariate-adjusted regression, possibly reflecting a limitation of the imputation model rather than a failure of IPTW. Overall, these findings support the use of multinomial, CBPS, GBMs, and energy weights for count exposures in similar settings while highlighting trade-offs between these methods and the need for imputation models accommodating right-truncated overdispersed counts.
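As a rough illustration of the multinomial-binning variant evaluated above, the sketch below bins a simulated count exposure, fits a multinomial propensity model, and forms stabilized weights P(bin)/P(bin | X). The binning rule and data-generating process are invented for illustration and do not reproduce the study's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                        # confounders
lam = np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1])        # exposure rate depends on X
A = rng.poisson(lam)                               # count exposure (e.g., symptom count)

bins = np.digitize(A, [1, 2, 4])                   # bins: 0, 1, 2-3, 4+ symptoms
ps_model = LogisticRegression(max_iter=1000).fit(X, bins)
p_cond = ps_model.predict_proba(X)[np.arange(n), bins]   # P(bin | X)
p_marg = np.bincount(bins) / n                           # P(bin)
w = p_marg[bins] / p_cond                                # stabilized IPTW weights

print("effective sample size:", w.sum() ** 2 / (w ** 2).sum())
```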
Many scientific systems, such as cellular populations or economic cohorts, are naturally described by probability distributions that evolve over time. Predicting how such a system would have evolved under different forces or initial conditions is fundamental to causal inference, domain adaptation, and counterfactual prediction. However, the space of distributions often lacks the vector space structure on which classical methods rely. To address this, we introduce a general notion of parallel dynamics at a distributional level. We base this principle on parallel transport of tangent dynamics along optimal transport geodesics and call it ``Wasserstein Parallel Trends''. By replacing the vector subtraction of classic methods with geodesic parallel transport, we can provide counterfactual comparisons of distributional dynamics in applications such as causal inference, domain adaptation, and batch-effect correction in experimental settings. The main mathematical contribution is a novel notion of fanning scheme on the Wasserstein manifold that allows us to efficiently approximate parallel transport along geodesics while also providing the first theoretical guarantees for parallel transport in the Wasserstein space. We also show that Wasserstein Parallel Trends recovers the classic parallel trends assumption for averages as a special case and derive closed-form parallel transport for Gaussian measures. We deploy the method on synthetic data and two single-cell RNA sequencing datasets to impute gene-expression dynamics across biological systems.
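The closed-form Gaussian case mentioned above rests on the classical optimal transport (Monge) map between Gaussian measures, sketched below; this shows the geodesic structure being exploited, not the paper's parallel-transport formula itself.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(m1, S1, m2, S2):
    """Monge map T(x) = m2 + A (x - m1) pushing N(m1, S1) to N(m2, S2),
    with A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}."""
    S1_half = np.real(sqrtm(S1))
    S1_half_inv = np.linalg.inv(S1_half)
    A = S1_half_inv @ np.real(sqrtm(S1_half @ S2 @ S1_half)) @ S1_half_inv
    return lambda x: m2 + A @ (x - m1)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, -1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
T = gaussian_ot_map(m1, S1, m2, S2)
x = np.random.default_rng(1).normal(size=2)
print(T(x))   # image of x under the optimal map; geodesics interpolate id -> T
```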
In an attempt to advance the current practice for assessing and predicting the risk of primary ovarian insufficiency (POI) in female childhood cancer survivors, we propose two estimating-function-based approaches for age-specific logistic regression. Both approaches adapt the inverse probability of censoring weighting (IPCW) strategy and yield consistent estimators with asymptotic normality. The first approach modifies the IPCW weights used by Im et al. (2023) to account for double censoring. The second approach extends the outcome-weighted IPCW approach to use the information from subjects censored before the analysis time. We consider variance estimation for the estimators and explore by simulation the two approaches in situations where the conditional right-censoring time distribution required in the IPCW weights is unknown and approximated using survival random forests, stratified empirical distribution functions, or the estimator under the Cox proportional hazards model. The numerical studies indicate that the second approach is more efficient when right-censoring is relatively heavy, whereas the first approach is preferable when right-censoring is light. We also observe that the performance of the two approaches relies heavily on the estimation of the censoring distribution in our simulation settings. The POI data from a childhood cancer survivor study are employed throughout the paper for motivation and illustration. Our data analysis provides new insight into understanding the POI risk among cancer survivors.
We consider the computer model calibration problem, where the goal is to find the parameters that minimize the discrepancy between the multivariate real-world and computer model outputs. We propose to solve an approximation using signed residuals that enables a root-finding approach and an accelerated search. We characterize the distance of the solutions of the approximation from the solutions of the original problem for strongly convex objective functions, showing that it depends on the variability of the signed residuals across output dimensions, as well as their variance and covariance. We develop a metamodel-based root-finding framework under kriging and stochastic kriging that is augmented with a sequential search space reduction. We derive three new acquisition functions for finding roots of the approximate problem, along with their derivatives usable by first-order solvers. Compared to kriging, stochastic kriging accounts for observational noise, promoting more robust solutions. We also analyze the case where a root may not exist. Our analysis of the asymptotic behavior in this context shows that, since the existence of roots in the approximation problem may not be known a priori, using the new acquisition functions will not compromise the outcome. Numerical experiments on data-driven and physics-based examples demonstrate significant computational gains over standard calibration approaches.
There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hajek--Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hajek--Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.
Predictive inference is a fundamental task in statistics, traditionally addressed using parametric assumptions about the data distribution and detailed analyses of how models learn from data. In recent years, conformal prediction has emerged as a rapidly growing alternative framework that is particularly well suited to modern applications involving high-dimensional data and complex machine learning models. Its appeal stems from being both distribution-free -- relying mainly on symmetry assumptions such as exchangeability -- and model-agnostic, treating the learning algorithm as a black box. Even under such limited assumptions, conformal prediction provides exact finite-sample guarantees, though these are typically of a marginal nature that requires careful interpretation. This paper explains the core ideas of conformal prediction and reviews selected methods. Rather than offering an exhaustive survey, it aims to provide a clear conceptual entry point and a pedagogical overview of the field.
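For readers new to the topic, a minimal sketch of split conformal prediction, the basic construction underlying much of the reviewed methodology; any black-box regressor can replace the linear model used here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=500)

# Split: fit on one half, calibrate on the other.
X_fit, y_fit = X[:250], y[:250]
X_cal, y_cal = X[250:], y[250:]
model = LinearRegression().fit(X_fit, y_fit)

alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))            # conformity scores
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

x_new = rng.normal(size=(1, 4))
pred = model.predict(x_new)[0]
# Marginal coverage >= 1 - alpha holds in finite samples under exchangeability.
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```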
Matérn random fields are one of the most widely used classes of models in spatial statistics. The fixed-domain identifiability of covariance parameters for stationary Matérn Gaussian random fields exhibits a dimension-dependent phase transition. For known smoothness $\nu$, Zhang \cite{Zhang2004} showed that when $d\le3$, two Matérn models with the same microergodic parameter $m=\sigma^2\alpha^{2\nu}$ induce equivalent Gaussian measures on bounded domains, while Anderes \cite{Anderes2010} proved that when $d>4$, the corresponding measures are mutually singular whenever the parameters differ. The critical case $d=4$ for stationary Matérn models has remained open. We resolve this case. Let $d=4$ and consider two stationary Matérn models on $\mathbb R^4$ with parameters $(\sigma_1,\alpha_1)$ and $(\sigma_2,\alpha_2)$ satisfying \[ \sigma_1^2\alpha_1^{2\nu}=\sigma_2^2\alpha_2^{2\nu}, \qquad \alpha_1\neq \alpha_2. \] We prove that the corresponding Gaussian measures on any bounded observation domain are mutually singular on every countable dense observation set, and on the associated path space of continuous functions. Our approach can be viewed as a spectral analogue of the higher-order increment method of Anderes \cite{Anderes2010}. Whereas Anderes isolates the second irregular covariance coefficient through renormalized quadratic variations in physical space, we detect the first nonvanishing high-frequency spectral mismatch via localized Fourier coefficients and use a normalized Whittle score to identify parameters. More broadly, the localized spectral probing framework used here for detecting subtle covariance differences in Gaussian random fields may be useful for studying identifiability and estimation in other spatial models.
Model selection is a cornerstone of statistical inference, where information criteria are widely employed to balance model fit and complexity. However, classical likelihood-based criteria are often highly sensitive to contamination, outliers, and model misspecification. In this paper, we develop a robust alternative based on the Exponential-Polynomial Divergence, a flexible extension of existing divergence measures that enhances adaptability to diverse data irregularities. The proposed Exponential-Polynomial Divergence Information Criterion preserves the objective of approximating the discrepancy between the true model and candidate models while incorporating robustness against anomalous observations. Its theoretical properties are established, and robustness is examined through influence function analysis, demonstrating controlled sensitivity to extreme data points. For practical implementation, a data-driven tuning parameter selection strategy based on generalized score matching is employed, ensuring improved computational stability and efficiency. The effectiveness of the proposed method is demonstrated through extensive simulation studies under varying contamination levels, as well as real data applications involving linear mixed-effects panel data models and neural network-based prediction tasks. The results consistently show improved stability and reliability compared to classical likelihood and density power divergence-based information criteria. The proposed framework thus provides a practical and unified approach for model selection in complex and contaminated data settings.
Understanding how animals move through heterogeneous landscapes is central to ecology and conservation. In this context, step selection functions (SSFs) have emerged as the main statistical framework to analyze how biotic and abiotic predictors influence movement paths observed by radio tracking, GPS tags, or similar sensors. A traditional SSF consists of a generalized linear model (GLM) that infers the animal's habitat preferences (selection coefficients) by comparing each observed movement step to random steps. Such GLM-SSFs, however, cannot flexibly capture non-linear or interacting effects unless those have been specified a priori. To address this problem, generalized additive models have been integrated into the SSF framework, but those GAM-SSFs are still limited in their ability to represent complex habitat preferences and inter-individual variability. Here we explore the utility of deep neural networks (DNNs) to overcome these limitations. We find that DNN-SSFs, coupled with explainable AI to extract selection coefficients, offer many advantages for analyzing movement data. In the case of linear effects, they effectively retrieve the same effect sizes and p-values as conventional GLMs. At the same time, however, they can automatically detect complex interaction effects, nonlinear responses, and inter-individual variability if those are present in the data. We conclude that DNN-SSFs are a promising extension of traditional SSFs. Our analysis extends previous research on DNN-SSFs by exploring differences and similarities of GLM-, GAM-, and DNN-based SSF models in more depth, in particular regarding the validity of statistical indicators derived from the DNN. We also propose new DNN structures to capture inter-individual effects that can be viewed as nonlinear random effects. All methods used in this paper are available via the 'citoMove' R package.
Shooting location is a core indicator of offensive style in invasion sports. Existing basketball shot-chart analyses often use spatial information for descriptive visualization, location-based efficiency modeling, or clustering players into shooting archetypes, yet few studies provide a unified framework for fair comparison of shot-type-specific tendencies. We propose the shot-type-aware areal multilevel Poisson (STAMP) model, which jointly models team-level field-goal attempts across predefined court regions, seasons, and shot types using a Poisson likelihood with a possession-based exposure offset. The hierarchical random-effects structure combines team, area, team-area, and team-side random effects with shot-type-specific random slopes for key shot categories. We fit the model using approximate Bayesian inference via the Integrated Nested Laplace Approximation (INLA), enabling efficient analysis of more than $3\times 10^{5}$ shots from two seasons of the B.LEAGUE (the men's professional basketball league in Japan). The STAMP model achieves better out-of-sample predictive performance than simpler baselines, yielding interpretable relative-rate maps and left-right bias summaries. Case studies illustrate how the model reveals team-specific spatial tendencies for comparative analysis, and we discuss its limitations and potential extensions.
Deep neural networks (DNNs) achieve remarkable predictive performance but remain difficult to interpret, largely due to overparameterization that obscures the minimal structure required for interpretation. Here we introduce DeepIn, a self-interpretable neural network framework that adaptively identifies and learns the minimal representation necessary for preserving the full expressive capacity of standard DNNs. We show that DeepIn can correctly identify the minimal representation dimension, select relevant variables, and recover the minimal sufficient network architecture for prediction. The resulting estimator achieves optimal non-asymptotic error rates that adapt to the learned minimal dimension, demonstrating that recovering minimal sufficient structure fundamentally improves generalization error. Building on these guarantees, we further develop hypothesis testing procedures for both selected variables and learned representations, bridging deep representation learning with formal statistical inference. Across biomedical and vision benchmarks, DeepIn improves both predictive accuracy and interpretability, reducing error by up to 30% on real-world datasets while automatically uncovering human-interpretable discriminative patterns. Our results suggest that interpretability and statistical rigor can be embedded directly into deep architectures without sacrificing performance.
Most algorithms for hyperspectral image unmixing produce point estimates of the fractional abundances of the materials to be separated. However, in the absence of reliable ground truth, the ability to perform abundance uncertainty quantification (UQ) should be an important feature of algorithms, e.g., to evaluate how hard the unmixing problem is and how much the results should be trusted. The usual modeling assumptions in Bayesian models for unmixing rely heavily on the Euclidean geometry of the simplex and typically disregard spatial information. In addition, to our knowledge, abundance UQ is close to nonexistent. In this paper, we propose to leverage Aitchison geometry from the compositional data analysis literature to provide practitioners with alternative tools for modeling prior abundance distributions. In particular, we show how to design simplex-valued Gaussian process priors using this geometry. Then we link Aitchison geometry to constrained sampling algorithms in the literature, and propose UQ diagnostics that comply with the constraints on abundance vectors. We illustrate these concepts on real and simulated data.
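To make the geometry concrete, the centered log-ratio (clr) transform below maps abundance vectors on the simplex to an unconstrained Euclidean subspace and back; this illustrates Aitchison geometry only, not the paper's simplex-valued GP prior construction.

```python
import numpy as np

def clr(x):
    """Centered log-ratio: simplex -> hyperplane {v : sum(v) = 0}."""
    lx = np.log(x)
    return lx - lx.mean(axis=-1, keepdims=True)

def clr_inv(v):
    """Inverse clr (softmax): back to the simplex."""
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

abund = np.array([0.7, 0.2, 0.1])          # abundance vector over 3 endmembers
v = clr(abund)                             # unconstrained coordinates
# Additive perturbations in clr space stay on the simplex after mapping back:
noisy = clr_inv(v + 0.1 * np.random.default_rng(0).normal(size=3))
print(noisy, noisy.sum())                  # still a valid abundance vector
```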
This paper proposes a scoring-rule-based method for ranking predictive distributions in the Fréchet domain that is able to distinguish between different tail indices. The approach is built on normalized order statistics and exploits proper scoring rules to compare tail limit distributions in a distributional framework, with direct relevance for insurance claim-severity tails. On the theoretical side, consistency and asymptotic normality for empirical tail scores based on normalized upper order statistics are obtained through residual estimation theory. Simulation results demonstrate that the scoring-rule-based approach is capable of discriminating between different tail behaviors in finite samples and that trends in the scaling have only a minor impact on stability. We further show that optimizing scoring rules (equivalently, minimizing the associated loss form) yields consistent tail-index estimators and that the classical Hill estimator arises as a special case. The performance of the proposed method is investigated and compared with the Hill estimator across a range of tail indices. Lastly, we analyze an automobile claim-severity data set to demonstrate how scoring rules can be used to rank predictive models based on tail predictions in actuarial settings.
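The Hill estimator recovered as a special case above has a familiar closed form based on the k upper order statistics; a standard implementation for reference.

```python
import numpy as np

def hill(x, k):
    """Hill estimate of the extreme value index gamma = 1/alpha
    from the k largest observations: mean of log(X_(n-i+1) / X_(n-k))."""
    xs = np.sort(x)
    top = xs[-k:]                     # k upper order statistics
    return np.mean(np.log(top / xs[-k - 1]))

rng = np.random.default_rng(0)
sample = rng.pareto(a=2.0, size=5000) + 1.0   # Pareto(2): true gamma = 0.5
print(hill(sample, k=200))                     # roughly 0.5
```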
Quantile regression (QR) relies on the estimation of conditional quantiles and explores the relationships between independent and dependent variables. At high probability levels, classical QR methods face extrapolation difficulties due to the scarcity of data in the tail of the distribution. Another challenge arises when the number of predictors is large and the quantile function exhibits a complex structure. In this work, we propose an estimation method designed to overcome these challenges. To enhance extrapolation in the tail of the conditional response distribution, we model block maxima using the generalized extreme value (GEV) distribution, where the parameters depend on covariates. To address the second challenge, we adopt an approach based on generalized random forests (grf) to estimate these parameters. Specifically, we maximize a penalized likelihood, weighted by the weights obtained through the grf method. This penalization helps overcome the limitations of the maximum likelihood estimator (MLE) in small samples, while preserving its optimality in large samples. The effectiveness of our method is validated through comparisons with other approaches in simulation studies and an application to U.S. wage data.
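A hedged sketch of the block-maxima/GEV step described above, using scipy's genextreme (whose shape convention is c = -xi relative to the usual GEV parameterization); the covariate-dependent, forest-weighted penalized likelihood of the paper is not reproduced here.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
data = rng.gumbel(loc=0.0, scale=1.0, size=10_000)     # toy response
block_max = data.reshape(100, 100).max(axis=1)         # maxima of 100 blocks

c, loc, scale = genextreme.fit(block_max)              # plain MLE
print(f"shape xi = {-c:.3f}, loc = {loc:.3f}, scale = {scale:.3f}")

# Extreme quantile of the block-maximum distribution from the fitted GEV:
print("99th percentile:", genextreme.ppf(0.99, c, loc, scale))
```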
In the end, the house always wins! This simple truth holds for all public games of chance. Nevertheless, for as long as lotteries have existed, people have tried everything to give luck a helping hand. This article compares objective scientific approaches to tackling the 6/49 lottery: probabilistic methods and combinatorial designs. The mathematical models developed herein can be modified and applied to other lotteries. Additionally, this work introduces the newly constructed (49, 6, 5) covering design, which meets the Schönheim bound. For lottery designs and for covering designs, a benchmark based on probabilistic methods is presented. It is demonstrated that common attempts to outwit the odds amount to restricting the selected numbers to subsets, which disproportionately reduces the chances of winning.
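For orientation, here is the baseline 6/49 arithmetic that any strategy must contend with: the number of possible draws and the hypergeometric match probabilities for a single ticket.

```python
from math import comb

total = comb(49, 6)                       # 13,983,816 possible draws
print("tickets needed to guarantee the jackpot:", total)

# P(exactly m matches) for one ticket: hypergeometric with 6 winning
# and 43 losing numbers.
for m in range(7):
    p = comb(6, m) * comb(43, 6 - m) / total
    print(f"P(match {m}) = {p:.8f}")
```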
A new dynamic latent space eigenmodel (LSM) is proposed for weighted temporal networks. The model accommodates integer-valued weights, an excess of zeros, time-varying node positions (features), and time-varying network sparsity. The latent positions evolve according to a vector autoregressive process that accounts for lagged and contemporaneous dependence across nodes and features, a characteristic neglected in the LSM literature. A Bayesian approach is used to address two of the primary sources of inference intractability in dynamic LSMs: latent feature estimation and the choice of latent space dimension. We employ an efficient auxiliary-mixture sampler that performs data augmentation and supports conditionally conjugate prior distributions. A point-process representation of the network weights and the finite-dimensional distribution of the latent processes are used to derive a multi-move sampler in which each feature trajectory is drawn in a single block, without recursions. This sampling strategy is new to the network literature and can significantly reduce computational time while improving chain mixing. To avoid trans-dimensional samplers, a Laplace approximation of the partial marginal likelihood is used to design a partially collapsed Gibbs sampler. Overall, our procedure is general, as it can be easily adapted to static and dynamic settings, as well as to other discrete or continuous weight distributions.
To predict smooth physical phenomena from observations, spline interpolation provides an interpretable framework by minimizing an energy functional associated with the Laplacian operator. This work proposes a methodology to construct a spline predictor on a compact Riemannian manifold, while quantifying the uncertainty inherent in the classical deterministic solution. Our approach leverages the equivalence between spline interpolation and universal kriging with a specific covariance kernel. By adopting a Gaussian random field framework, we generate stochastic simulations that reflect prediction uncertainty. However, on compact manifolds, the covariance kernel depends on the generally unknown spectrum of the Laplace-Beltrami operator. To address this, we introduce a finite element approximation based on a triangulation of the manifold. This leads to the use of intrinsic Gaussian Markov Random Fields (GMRF) and allows for the incorporation of anisotropies through local modifications of the Riemannian metric. The method is validated using a temperature study on a sphere, where the operator's spectrum is known, and is further extended to a test case on a cylindrical surface.
In this paper, we introduce a novel model for the meta-analysis of proportions that integrates the standard random-effects model (REM) with an extreme value theory (EVT)-based component. The proposed model, named XT-REM (Extreme-Tail Random Effects Model), extends the classical REM framework by explicitly accounting for extreme proportions through a partial segmentation of the study set based on a predefined threshold. While the majority of proportions are modeled using REM, proportions exceeding the threshold are analyzed using the Generalized Pareto Distribution (GPD). This formulation enables a dual interpretation of meta-analytic results, providing both an aggregate estimate for the central bulk of studies and a separate characterization of tail behavior. The XT-REM framework accommodates heteroskedastic variance structures inherent to proportion data, while preserving identifiability and consistency. Using real-world data on immunotherapy-related adverse events, together with simulation studies calibrated to empirical settings, we demonstrate that XT-REM yields a comparable central estimate while enabling a more explicit assessment of tail behavior, including high-percentile extreme proportions. Compared with the classical REM, XT-REM achieves higher log-likelihood values and lower AIC in the considered scenarios, indicating a better fit within this modeling framework. In summary, XT-REM offers a theoretically grounded and practically useful extension of random-effects meta-analysis, with potential relevance to clinical contexts in which extreme event rates carry important implications for risk assessment.
Survival analysis provides a well-established framework for modeling time-to-event data, with hazard and survival functions formally defined as population-level quantities. In applied work, however, these quantities are often interpreted as representing individual-level risk, despite the absence of a clear generative account linking individual risk mechanisms to observed survival data. This paper develops a latent hazard framework that makes this relationship explicit by modeling event times as arising from unobserved, individual-specific hazard mechanisms and viewing population-level survival quantities as aggregates over heterogeneous mechanisms. Within this framework, we show that individual hazard trajectories are not identifiable from survival data under partial information. More generally, the conditional distribution of latent hazard mechanisms given covariates is structurally non-identifiable, even when population-level survival functions are fully known. This non-identifiability arises from the aggregation inherent in survival data and persists independently of model flexibility or estimation strategy. Finally, we show that classical survival models can be systematically reinterpreted according to how they handle this unresolved conditional mechanism distribution. This paper provides a unified framework for understanding heterogeneity, identifiability, and interpretation in survival analysis, and clarifies how population-level survival models should be interpreted when individual risk mechanisms are only partially observed, thereby establishing explicit information constraints for principled modeling and inference.
Mortality forecasting methods in the Lee-Carter tradition extrapolate temporal components via time-series models, producing forecasts that can systematically underpredict life expectancy at long horizons and require ad hoc adjustments for sex coherence. We reframe forecasting as integrating a flow field through the low-dimensional score space of a Tucker tensor decomposition of multi-population mortality data from the Human Mortality Database. PCA reduction of the effective core matrices reveals that the mortality transition is essentially a one-dimensional flow: a scalar speed function advances the level, trajectory functions supply the structural scores, and the Tucker reconstruction produces complete sex-specific mortality schedules at each horizon. An era-weighted speed function adapts to contemporary dynamics at each forecast origin, and empirically calibrated convergence rates control relaxation from country-specific to canonical mortality structure. The system is evaluated by leave-country-out cross-validation with a 50-year horizon against Lee-Carter and Hyndman-Ullah benchmarks.
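A compact HOSVD-style Tucker decomposition on a synthetic mortality array, sketching the score-space reduction described above; the array shape and ranks are illustrative only, and the paper's era weighting and flow integration are omitted.

```python
import numpy as np

def unfold(T, mode):
    """Mode-m unfolding: move axis m to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Higher-order SVD: factor matrices from mode unfoldings, then the core."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    G = T
    for m, Um in enumerate(U):                       # multilinear projection
        G = np.moveaxis(np.tensordot(Um.T, np.moveaxis(G, m, 0), axes=1), 0, m)
    return G, U

rng = np.random.default_rng(0)
logm = rng.normal(size=(20, 24, 60))                 # countries x age groups x years
core, factors = tucker_hosvd(logm, ranks=(5, 6, 4))
print(core.shape, [U.shape for U in factors])        # (5, 6, 4) core, three factors
```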
Graph Neural Networks (GNNs) have achieved impressive performance in graph-related tasks. However, they suffer from poor generalization on out-of-distribution (OOD) data, as they tend to learn spurious correlations. These correlations manifest as a failure of GNNs to stably learn the mutual information between prediction representations and ground-truth labels under OOD settings. To address these challenges, we formulate a causal graph starting from the essence of node classification, adopt backdoor adjustment to block non-causal paths, and theoretically derive a lower bound for improving the OOD generalization of GNNs. To materialize these insights, we further propose a novel approach integrating causal representation learning and a loss replacement strategy. The former captures node-level causal invariance and reconstructs the graph posterior distribution. The latter introduces asymptotic losses of the same order to replace the original losses. Extensive experiments demonstrate the superiority of our method in OOD generalization and in alleviating the phenomenon of unstable mutual information learning.
Recently, Forré (arXiv:2104.11547, 2021) introduced transitional conditional independence, a notion of conditional independence that provides a unified framework for both random and non-stochastic variables. The original paper establishes a strong global Markov property connecting transitional conditional independencies with suitable graphical separation criteria for directed mixed graphs with input nodes (iDMGs), together with a version of causal calculus for iDMGs in a general measure-theoretic setting. These notes aim to further illustrate the motivations behind this framework and its connections to the literature, highlight certain subtleties in the general measure-theoretic causal calculus, and extend the "one-line" formulation of the ID algorithm of Richardson et al. (Ann. Statist. 51(1):334--361, 2023) to the general measure-theoretic setting.
Privacy and algorithmic fairness have become two central issues in modern machine learning. Although each has separately emerged as a rapidly growing research area, their joint effect remains comparatively under-explored. In this paper, we systematically study the joint impact of differential privacy and fairness on classification in a federated setting, where data are distributed across multiple servers. Targeting demographic disparity constrained classification under federated differential privacy, we propose a two-step algorithm, namely FDP-Fair. In the special case where there is only one server, we further propose a simple yet powerful algorithm, namely CDP-Fair, serving as a computationally-lightweight alternative. Under mild structural assumptions, theoretical guarantees on privacy, fairness and excess risk control are established. In particular, we disentangle the source of the private fairness-aware excess risk into a) intrinsic cost of classification, b) cost of private classification, c) non-private cost of fairness and d) private cost of fairness. Our theoretical findings are complemented by extensive numerical experiments on both synthetic and real datasets, highlighting the practicality of our designed algorithms.
We propose a neural network model for contextual regression, in which the regression model depends on contextual features that determine the active submodel, together with an algorithm to fit the model. The proposed simple contextual neural network (SCtxtNN) separates context identification from context-specific regression, resulting in a structured and interpretable architecture with fewer parameters than a fully connected feed-forward network. We show mathematically that the proposed architecture is sufficient to represent contextual linear regression models using only standard neural network components. Numerical experiments support the theoretical result, showing that the proposed model achieves lower excess mean squared error and more stable performance than feed-forward neural networks with comparable numbers of parameters, while larger networks improve accuracy only at the cost of increased complexity. The results suggest that incorporating contextual structure can improve model efficiency while preserving interpretability.
A recurring debate in the philosophy of statistics concerns what, exactly, should count as a measure of evidence for or against a given hypothesis. P-values, likelihood ratios, and Bayes factors all have their defenders. In this paper we add two additional candidates to this list: the e-value and its sequential analogue, the e-process. E-values enjoy several desirable properties as measures of evidence: they combine naturally across studies, handle composite hypotheses, provide long-run error rates, and admit a useful interpretation as the wealth accrued by a bettor in a game against the null distribution. E-processes additionally handle optional stopping and optional continuation. This work examines the extent to which e-values and e-processes satisfy the evidential desiderata of different statistical traditions, concluding that they combine attractive features of p-values, likelihood ratios, and Bayes factors, and merit serious consideration as interpretable and intuitive measures of statistical evidence.
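The betting interpretation in one loop: a test martingale for a fair-coin null whose wealth multiplies by a fixed alternative's likelihood ratio at each toss. Under the null the wealth is an e-process, and 1/wealth is an anytime-valid type-I error bound by Ville's inequality; the bias values below are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p0, p1 = 0.5, 0.6                 # null and (assumed) alternative head rates
tosses = rng.random(200) < 0.6    # data actually generated with bias 0.6

wealth = 1.0
for x in tosses:
    # Bet via the likelihood ratio; under H0 each factor has expectation 1,
    # so the wealth process is a nonnegative martingale (an e-process).
    wealth *= (p1 / p0) if x else ((1 - p1) / (1 - p0))

print(f"final e-value (bettor's wealth): {wealth:.1f}")
print(f"anytime-valid p-value bound:     {min(1.0, 1.0 / wealth):.4f}")
```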
Understanding how biomarker distributions evolve over time is a central challenge in digital health and chronic disease monitoring. In diabetes, changes in the distribution of glucose measurements can reveal patterns of disease progression and treatment response that conventional summary measures miss. Motivated by a 26-week clinical trial comparing the closed-loop insulin delivery system t:slim X2 with standard therapy in children with type 1 diabetes, we propose a probabilistic framework to model the continuous-time evolution of time-indexed distributions using continuous glucose monitoring (CGM) data collected every five minutes. We represent the glucose distribution as a Gaussian mixture, with time-varying mixture weights governed by a neural ODE. We estimate the model parameters using a distribution-matching criterion based on the maximum mean discrepancy. The resulting framework is interpretable, computationally efficient, and sensitive to subtle temporal distributional changes. Applied to the CGM trial data, the method detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analytical approaches.
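A minimal sketch of the distribution-matching criterion mentioned above: the (biased, V-statistic) squared maximum mean discrepancy between two samples under a Gaussian kernel. The mixture model and neural ODE are omitted, and the toy CGM numbers are invented.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Squared MMD between 1-D samples x and y with a Gaussian RBF kernel."""
    def gram(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * bandwidth**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
glucose_week1 = rng.normal(160, 40, size=500)    # toy CGM readings (mg/dL)
glucose_week26 = rng.normal(140, 25, size=500)   # tighter control later on
print("MMD^2:", mmd2(glucose_week1, glucose_week26, bandwidth=20.0))
```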
Distributionally balanced sampling designs are low-discrepancy probability designs obtained by minimizing the expected discrepancy between the auxiliary-variable distribution of a random sample and the target population distribution. Existing constructions rely on circular population sequences, which restrict the design space by forcing samples to be contiguous blocks of a sequence. We propose a new construction based on minimum tactical configurations that removes this topological constraint. The resulting designs are fixed-size, have equal inclusion probabilities, and belong to the class with minimum feasible configuration size. We develop both a simple initialization valid for arbitrary population and sample sizes and a spatial initialization that yields a lower initial expected discrepancy, together with a simulated annealing algorithm for optimization within this class. In simulations and empirical examples, the proposed method outperforms state-of-the-art alternatives in terms of distributional fit, balance, and spatial spread.
Forecasting infectious disease outbreaks is hard. Forecasting emerging infectious diseases with limited historical data is even harder. In this paper, we investigate ways to improve emerging infectious disease forecasting under operational constraints. Specifically, we explore two options likely to be available near the start of an emerging disease outbreak: synthetic data and genetic information. For this investigation, we conducted an experiment in which we trained deep learning models on different combinations of real and synthetic data, both with and without genetic information, to explore how these models compare when forecasting COVID-19 cases for US states. All models are developed with an eye towards forecasting the next pandemic. We find that models trained with synthetic data have better forecast accuracy than models trained on real data alone, and that models using genetic variants have better forecast accuracy than those that do not. All models outperformed a baseline persistence model (a feat accomplished by only 7 out of 22 real-time COVID-19 case forecasting models, as reported in [38]), and multiple models outperformed the COVIDHub-4_week_ensemble. This paper demonstrates the value of these underutilized sources of information and provides a blueprint for forecasting future pandemics.
While the mathematical foundations of score-based generative models are increasingly well understood for unconstrained Euclidean spaces, many practical applications involve data restricted to bounded domains. This paper provides a statistical analysis of reflected diffusion models on the hypercube $[0,1]^D$ for target distributions supported on $d$-dimensional linear subspaces. A primary challenge in this setting is the absence of Gaussian transition kernels, which play a central role in standard theory in $\mathbb{R}^D$. By employing an easily implementable infinite series expansion of the transition densities, we develop analytic tools to bound the score function and its approximation by sparse ReLU networks. For target densities with Sobolev smoothness $\alpha$, we establish a convergence rate in the $1$-Wasserstein distance of order $n^{-\frac{\alpha+1-\delta}{2\alpha+d}}$ for arbitrarily small $\delta > 0$, demonstrating that the generative algorithm fully adapts to the intrinsic dimension $d$. These results confirm that the presence of reflecting boundaries does not degrade the fundamental statistical efficiency of the diffusion paradigm, matching the almost optimal rates known for unconstrained settings.
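To convey the flavor of the series expansion the analysis relies on, in one dimension: the Neumann cosine eigenexpansion of the transition density of Brownian motion reflected at both ends of [0,1] (written here for generator (1/2)d^2/dx^2; other time scalings change only the exponent). The hypercube kernel is the product of such factors over coordinates.

```python
import numpy as np

def reflected_bm_density(t, x, y, n_terms=200):
    """p_t(x, y) = 1 + 2 * sum_k exp(-k^2 pi^2 t / 2) cos(k pi x) cos(k pi y),
    the Neumann heat kernel on [0, 1] for generator (1/2) d^2/dx^2."""
    k = np.arange(1, n_terms + 1)
    return 1.0 + 2.0 * np.sum(
        np.exp(-(k * np.pi) ** 2 * t / 2)
        * np.cos(k * np.pi * x) * np.cos(k * np.pi * y)
    )

# Sanity checks: integrates to 1 in y, and flattens to uniform as t grows.
ys = np.linspace(0, 1, 2001)
vals = np.array([reflected_bm_density(0.1, 0.3, y) for y in ys])
print("mass ~", vals.mean())                           # ~ 1.0 (Riemann average)
print("t=5:", reflected_bm_density(5.0, 0.3, 0.9))     # ~ 1.0 (uniform limit)
```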
Gaussian process (GP) emulators have become essential tools for approximating complex simulators, significantly reducing computational demands in optimization, sensitivity analysis, and model calibration. While traditional GP emulators effectively model continuous, Gaussian-distributed simulator outputs with homogeneous variability, they typically struggle with discrete, heteroskedastic Gaussian, or non-Gaussian data, limiting their applicability to increasingly common stochastic simulators. In this work, we introduce a scalable Generalized Deep Gaussian Process (GDGP) emulation framework designed to accommodate simulators with heteroskedastic Gaussian outputs and a wide range of non-Gaussian response distributions, including Poisson, negative binomial, and categorical distributions. The GDGP framework leverages the expressiveness of DGPs and extends them to latent GP structures, enabling it to capture the complex, non-stationary behavior inherent in many simulators while also modeling non-Gaussian simulator outputs. We make GDGP scalable by incorporating the Vecchia approximation for settings with a large number of input locations, while also developing efficient inference procedures for handling large numbers of replicates. In particular, we present methodological developments that further enhance the computational efficiency of the approach for heteroskedastic Gaussian responses. We demonstrate through a series of synthetic and empirical examples that these extensions enable the practical application of GDGP emulators and provide a unified methodology capable of addressing diverse modeling challenges. The proposed GDGP framework is implemented in the open-source R package dgpsi.
We study the problem of detecting local geometry in random graphs. We introduce a model $\mathcal{G}(n, p, d, k)$, where a hidden community of average size $k$ has edges drawn as a random geometric graph on $\mathbb{S}^{d-1}$, while all remaining edges follow the Erdős--Rényi model $\mathcal{G}(n, p)$. The random geometric graph is generated by thresholding inner products of latent vectors on $\mathbb{S}^{d-1}$, with each edge having marginal probability equal to $p$. This implies that $\mathcal{G}(n, p, d, k)$ and $\mathcal{G}(n, p)$ are indistinguishable at the level of the marginals, and the signal lies entirely in the edge dependencies induced by the local geometry. We investigate both the information-theoretic and computational limits of detection. On the information-theoretic side, our upper bounds follow from three tests based on signed triangle counts: a global test, a scan test, and a constrained scan test; our lower bounds follow from two complementary methods: truncated second moment via Wishart--GOE comparison, and tensorization of KL divergence. These results together settle the detection threshold at $d = \widetilde{\Theta}(k^2 \vee k^6/n^3)$ for fixed $p$, and extend the state-of-the-art bounds from the full model (i.e., $k = n$) for vanishing $p$. On the computational side, we identify a computational--statistical gap and provide evidence via the low-degree polynomial framework, as well as the suboptimality of signed cycle counts of length $\ell \geq 4$.
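The signed triangle statistic driving the tests above admits a one-line matrix form: with B the centered adjacency matrix (zero diagonal), each unordered triple contributes (A_ij - p)(A_jk - p)(A_ik - p), so the count equals tr(B^3)/6.

```python
import numpy as np

def signed_triangles(A, p):
    """Sum over unordered triples of (A_ij - p)(A_jk - p)(A_ik - p)."""
    B = A - p
    np.fill_diagonal(B, 0.0)          # kill terms with repeated indices
    return np.trace(B @ B @ B) / 6.0  # each triple appears 6 times in the trace

rng = np.random.default_rng(0)
n, p = 500, 0.1
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T        # symmetric Erdos-Renyi adjacency matrix
print("signed triangles (null, ~0 on average):", signed_triangles(A, p))
```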
Constrained optimization in high-dimensional black-box settings is difficult due to expensive evaluations, the lack of gradient information, and complex feasibility regions. In this work, we propose a Bayesian optimization method that combines a penalty formulation, a surrogate model, and a trust region strategy. The constrained problem is converted to an unconstrained form by penalizing constraint violations, which provides a unified modeling framework. A trust region restricts the search to a local region around the current best solution, which improves stability and efficiency in high dimensions. Within this region, we use the Expected Improvement acquisition function to select evaluation points by balancing improvement and uncertainty. The proposed Trust Region method integrates penalty-based constraint handling with local surrogate modeling. This combination enables efficient exploration of feasible regions while maintaining sample efficiency. We compare the proposed method with state-of-the-art methods on synthetic and real-world high-dimensional constrained optimization problems. The results show that the method identifies high-quality feasible solutions with fewer evaluations and maintains stable performance across different settings.
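The Expected Improvement acquisition used inside the trust region has the standard closed form for minimization under a Gaussian surrogate posterior, sketched below on the penalized objective; the candidate values are toy inputs.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """EI(x) = (f_best - mu - xi) Phi(z) + sigma phi(z),
    with z = (f_best - mu - xi) / sigma, for minimization."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (f_best - mu - xi) / sigma
    return (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Candidate points inside a trust region around the incumbent:
mu = np.array([1.2, 0.8, 1.0])        # surrogate means of the penalized objective
sd = np.array([0.1, 0.5, 0.05])       # surrogate standard deviations
print(expected_improvement(mu, sd, f_best=1.0))   # trades off mean vs. uncertainty
```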
Sentiment signals derived from sparse news are commonly used in financial analysis and technology monitoring, yet transforming raw article-level observations into reliable temporal series remains a largely unsolved engineering problem. Rather than treating this as a classification challenge, we propose to frame it as a causal signal reconstruction problem: given probabilistic sentiment outputs from a fixed classifier, recover a stable latent sentiment series that is robust to the structural pathologies of news data such as sparsity, redundancy, and classifier uncertainty. We present a modular three-stage pipeline that (i) aggregates article-level scores onto a regular temporal grid with uncertainty-aware and redundancy-aware weights, (ii) fills coverage gaps through strictly causal projection rules, and (iii) applies causal smoothing to reduce residual noise. Because ground-truth longitudinal sentiment labels are typically unavailable, we introduce a label-free evaluation framework based on signal stability diagnostics, information preservation lag proxies, and counterfactual tests for causality compliance and redundancy robustness. As a secondary external check, we evaluate the consistency of reconstructed signals against stock-price data for a multi-firm dataset of AI-related news titles (November 2024 to February 2026). The key empirical finding is a three-week lead-lag pattern between reconstructed sentiment and price that persists across all tested pipeline configurations and aggregation regimes, a structural regularity more informative than any single correlation coefficient. Overall, the results support the view that stable, deployable sentiment indicators require careful reconstruction, not only better classifiers.
Adapting large-scale foundation models to new domains with limited supervision remains a fundamental challenge due to latent distribution mismatch, unstable optimization dynamics, and miscalibrated uncertainty propagation. This paper introduces an uncertainty-aware probabilistic latent transport framework that formulates domain adaptation as a stochastic geometric alignment problem in representation space. A Bayesian transport operator is proposed to redistribute latent probability mass along Wasserstein-type geodesic trajectories, while a PAC-Bayesian regularization mechanism constrains posterior model complexity to mitigate catastrophic overfitting. The proposed formulation yields theoretical guarantees on convergence stability, loss landscape smoothness, and sample efficiency under distributional shift. Empirical analyses demonstrate substantial reduction in latent manifold discrepancy, accelerated transport energy decay, and improved covariance calibration compared with deterministic fine-tuning and adversarial domain adaptation baselines. Furthermore, bounded posterior uncertainty evolution indicates enhanced probabilistic reliability during cross-domain transfer. By establishing a principled connection between stochastic optimal transport geometry and statistical generalization theory, the proposed framework provides new insights into robust adaptation of modern foundation architectures operating in heterogeneous environments. These findings suggest that uncertainty-aware probabilistic alignment constitutes a promising paradigm for reliable transfer learning in next-generation deep representation systems.
Diffusion models often generate novel samples even when the learned score is only \emph{coarse} -- a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the \emph{manifold hypothesis}, this behavior can instead be explained by coarse scores capturing the \emph{geometry} of the data while discarding the fine-scale distributional structure of the population measure~$\mu_{\scriptscriptstyle\mathrm{data}}$. Concretely, whereas estimating the full data distribution $\mu_{\scriptscriptstyle\mathrm{data}}$ supported on a $k$-dimensional manifold is known to require the classical minimax rate $\tilde{\mathcal{O}}(N^{-1/k})$, we prove that diffusion models trained with coarse scores can exploit the \emph{regularity of the manifold support} and attain a near-parametric rate toward a \emph{different} target distribution. This target distribution has density uniformly comparable to that of~$\mu_{\scriptscriptstyle\mathrm{data}}$ throughout any $\tilde{\mathcal{O}}\bigl(N^{-\beta/(4k)}\bigr)$-neighborhood of the manifold, where $\beta$ denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that \emph{generalization} -- formalized as the ability to generate novel, high-fidelity samples -- occurs at a statistical rate strictly faster than that required to estimate the full population distribution~$\mu_{\scriptscriptstyle\mathrm{data}}$.
Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.
Deep neural networks (DNNs), particularly those using Rectified Linear Unit (ReLU) activation functions, have achieved remarkable success across diverse machine learning tasks, including image recognition, audio processing, and language modeling. Despite this success, the non-convex nature of DNN loss functions complicates optimization and limits theoretical understanding. In this paper, we highlight how recently developed convex equivalences of ReLU NNs and their connections to sparse signal processing models can address the challenges of training and understanding NNs. Recent research has uncovered several hidden convexities in the loss landscapes of certain NN architectures, notably two-layer ReLU networks and other deeper or varied architectures. This paper seeks to provide an accessible and educational overview that bridges recent advances in the mathematics of deep learning with traditional signal processing, encouraging broader signal processing applications.
Galaxy surveys provide finite catalogs of objects observed within bounded volumes, yet clustering statistics are often interpreted using theoretical frameworks developed for infinite point processes. In this work, we formulate key statistical quantities directly for finite point processes and examine the structural consequences of finite-number and finite-window constraints. We show that several well-known features of galaxy survey analysis arise naturally from finiteness alone. In particular, non-vanishing higher-order connected correlations can occur even in statistically independent samples when the total number of points is fixed, and the integral constraint in two-point statistics appears as an exact identity implied by the finite-number condition rather than as an estimator artifact. We further demonstrate that counts-in-cells and point-centered environmental measures correspond to distinct statistical ensembles. Using Palm conditioning, we derive an exact relation between random-cell and point-centered statistics, showing that the latter probe a tilted version of the underlying distribution. These results provide a probabilistic framework for separating structural effects imposed by finite sampling from correlations reflecting genuine astrophysical processes. The formulation presented here remains valid for realistic survey geometries and finite data sets and clarifies the interpretation of commonly used clustering statistics in galaxy surveys.
Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It is common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features aids both data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate the error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by $k$-means, i-IF-Learn simultaneously outputs an influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.
Using standard financial ratios as variables in statistical analyses has been linked to several serious problems, such as extreme outliers, asymmetry, non-normality, and non-linearity. The compositional-data methodology has been successfully applied to solve these problems and has consistently yielded substantially different results when compared to standard financial ratios. An under-researched area is the use of financial log-ratios computed with the compositional-data methodology to predict bankruptcy or the related outcomes of business default, insolvency, or failure. Another under-researched area is the combination of machine learning methods with compositional log-ratios. The present article adapts the classical Altman bankruptcy prediction model and some of its extensions to the compositional methodology with pairwise log-ratios and three common statistical and machine learning tools: logistic regression, k-nearest neighbours, and random forests, and compares the results with standard financial ratios. Data from the sector of the Spanish economy with the largest number of bankrupt firms according to the first two digits of the NACE code (46XX, "wholesale trade, except of motor vehicles and motorcycles") were obtained from the Iberian Balance Sheet Analysis System. The sample (31,131 firms, of which 97 were bankrupt) was divided into a training and a validation dataset. The training dataset was downsampled to one healthy firm for each bankrupt firm. No outliers were removed. Focusing on predictive performance, the results show that compositional methods outperform standard ratios in terms of sensitivity, with mixed results regarding specificity; compositional random forests and compositional logistic regression perform best.
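To make the transform concrete, here is a minimal sketch (ours, not the article's code) that maps strictly positive compositional parts to all pairwise log-ratios $\log(x_i/x_j)$ and feeds them to one of the three classifiers, a logistic regression; the Dirichlet data and the toy bankruptcy label are synthetic placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_log_ratios(X):
    """Map strictly positive compositional parts to all pairwise log-ratios log(x_i/x_j), i < j."""
    pairs = list(combinations(range(X.shape[1]), 2))
    Z = np.column_stack([np.log(X[:, i] / X[:, j]) for i, j in pairs])
    return Z, pairs

rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=[2.0, 3.0, 4.0, 1.5], size=500)  # stand-in for financial statement parts
y = (np.log(X[:, 0] / X[:, 1]) + 0.1 * rng.normal(size=500) > 0).astype(int)  # toy label

Z, pairs = pairwise_log_ratios(X)
clf = LogisticRegression(max_iter=1000).fit(Z, y)
print("in-sample accuracy:", clf.score(Z, y))
```

Unlike raw ratios, log-ratios are symmetric in the roles of numerator and denominator (swapping them only flips the sign), which is one reason the compositional transform tames outliers and asymmetry.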
In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.
The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where both combined determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyper-parameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyper-parameter space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements in LID estimation performance.
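As background for the hyper-parameter interplay described above, a minimal subbagging sketch (ours, with the standard MLE/Hill-type LID estimator; function names and defaults are illustrative): each bag draws a subsample without replacement at rate `rate`, which enlarges the k-NN radius around the query and thus trades locality for variance reduction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(knn_dists):
    """MLE (Hill-type) LID estimate from ascending k-NN distances r_1 <= ... <= r_k."""
    return -1.0 / np.mean(np.log(knn_dists / knn_dists[-1]))

def subbagged_lid(X, query, k=20, rate=0.5, n_bags=10, seed=None):
    """Average LID estimates over subsamples drawn without replacement (subbagging)."""
    rng = np.random.default_rng(seed)
    n, m = len(X), max(k + 1, int(rate * len(X)))
    estimates = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=m, replace=False)
        d, _ = NearestNeighbors(n_neighbors=k).fit(X[idx]).kneighbors(query[None, :])
        d = d[0][d[0] > 0]  # drop a zero distance if the query itself was sampled
        estimates.append(lid_mle(d))
    return float(np.mean(estimates))
```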
Uniform laws of large numbers form a cornerstone of Vapnik--Chervonenkis theory, where they are characterized by the finiteness of the VC dimension. In this work, we study uniform convergence phenomena in cartesian product spaces, under assumptions on the underlying distribution that are compatible with the product structure. Specifically, we assume that the distribution is absolutely continuous with respect to the product of its marginals, a condition that captures many natural settings, including product distributions, sparse mixtures of product distributions, distributions with low mutual information, and more. We show that, under this assumption, a uniform law of large numbers holds for a family of events if and only if the linear VC dimension of the family is finite. The linear VC dimension is defined as the maximum size of a shattered set that lies on an axis-parallel line, namely, a set of vectors that agree on all but at most one coordinate. This dimension is always at most the classical VC dimension, yet it can be arbitrarily smaller. For instance, the family of convex sets in $\mathbb{R}^d$ has linear VC dimension $2$, while its VC dimension is infinite already for $d\ge 2$. Our proofs rely on an estimator that departs substantially from the standard empirical mean estimator and exhibits a more intricate structure. We show that such deviations from the standard empirical mean estimator are unavoidable in this setting. Throughout the paper, we propose several open questions, with a particular focus on quantitative sample complexity bounds.
We introduce the Multilevel Euler-Maruyama (ML-EM) method to compute solutions of SDEs and ODEs using a range of approximators $f^1,\dots,f^k$ to the drift $f$ with increasing accuracy and computational cost, requiring only a few evaluations of the most accurate $f^k$ and many evaluations of the less costly $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e., it requires $\epsilon^{-\gamma}$ compute to be $\epsilon$-approximated for some $\gamma>2$, then ML-EM $\epsilon$-approximates the solution of the SDE with $\epsilon^{-\gamma}$ compute, improving over the traditional EM rate of $\epsilon^{-\gamma-1}$. In other words, it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to $64\times64$, where we measure $\gamma\approx2.5$. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications involving orders-of-magnitude larger networks.
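The exact construction is the paper's; purely to fix ideas about the cost structure, here is a deliberately simplified two-level sketch (ours) in which the cheap drift is evaluated at every step while an additive correction from the expensive drift is refreshed only every `m`-th step; all names and the refresh rule are our assumptions, not the ML-EM algorithm itself.

```python
import numpy as np

def two_level_em(x0, f_cheap, f_expensive, T=1.0, n_steps=1000, m=50, sigma=1.0, seed=None):
    """Euler-Maruyama with a lazily refreshed drift correction: n_steps cheap evaluations,
    but only n_steps/m expensive ones."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x, corr = float(x0), 0.0
    for i in range(n_steps):
        if i % m == 0:                       # few expensive evaluations
            corr = f_expensive(x) - f_cheap(x)
        x += (f_cheap(x) + corr) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

print(two_level_em(0.0, lambda x: -x, lambda x: -x + 0.1 * np.sin(x)))
```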
In this paper, we study limiting laws and consistent estimation criteria for the extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly, for fixed $p$, we propose a generalized estimation criterion that can consistently estimate, $k$, the number of spiked eigenvalues. Compared with the existing literature, we show that consistency can be achieved under weaker conditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we derive limiting distributions of the spiked sample eigenvalues using random matrix theory techniques. Notably, our results do not require the spiked eigenvalues to be uniformly bounded from above or tending to infinity, as have been assumed in the existing literature. Based on the above derived results, we formulate a generalized estimation criterion and show that it can consistently estimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We further show that the results in our work continue to hold under a general population distribution without assuming normality. The efficacy of the proposed estimation criteria is illustrated through comparative simulation studies.
There has been a misconception that only one type of error rate control is necessary in clinical trials, leading to debates over whether to prioritize the Familywise Error Rate (FWER) or the False Discovery Rate (FDR). This misconception has led to misleading statements about FWER control and proposals to shift towards FDR control, which could be manipulated by the industry. In reality, since the early 2000s, biopharmaceutical statistical practice has implicitly applied two layers of Type I error rate control. This aligns with Tukey's 1953 invention of the Error Rate per Family (ERpF) for controlling error across studies, while FWER applies within each study. Our paper clarifies this layering, using Platform trials to demonstrate the verifiable conditions needed across studies for the FDA to fulfill its regulatory mission. We show that controlling FWER within a study at $5\%$ inherently controls ERpF across studies at 5-per-100, regardless of study correlations. This supports current regulatory practices that protect public health while fostering innovation. We also address concerns about ERpF stability in Platform trials, where shared controls introduce dependencies. By applying the Conditionality Principle and utilizing an innovative Shiny app, we explore how correlations impact ERpF variability, providing deeper insights for informed decision-making. Our findings, supported by principles like the Layering of Error Rate Controls and the Conditionality Principle, are particularly relevant as Platform trials gain popularity for their efficiency in testing multiple treatments simultaneously.
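The within-to-across layering claim admits a one-line justification by linearity of expectation (our rendering, with $V_j$ the number of false rejections in study $j$ out of $m$ studies):

```latex
\[
\mathbb{E}\Bigl[\sum_{j=1}^{m}\mathbf{1}\{V_j \ge 1\}\Bigr]
  = \sum_{j=1}^{m}\Pr(V_j \ge 1)
  \;\le\; 0.05\,m ,
\]
```

so at most 5 erroneous families are expected per 100 studies, with no assumption whatsoever on the dependence between studies, since expectations add regardless of correlation.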
Importance sampling (IS) is an efficient stand-in for model refitting when performing leave-one-out (LOO) cross-validation (CV) on a Bayesian model. IS inverts the Bayesian update for a single observation by reweighting posterior samples. The resulting importance weights can have high variance; we resolve this issue through adaptation by transformation. We observe that removing a single observation perturbs the posterior by $\mathcal{O}(1/n)$, motivating bijective transformations of the form $T(\theta)=\theta + h Q(\theta)$ for $0<h\ll 1$. We introduce several such transformations: partial moment matching, which generalizes prior work on affine moment matching with a tunable step size; log-likelihood descent, which partially inverts the Bayesian update for an observation; and gradient flow steps that minimize the KL divergence or the IS variance. The gradient flow and likelihood descent transformations require Jacobian determinants, which are available via auto-differentiation; we additionally derive closed-form expressions for logistic regression and shallow ReLU networks. We test the methodology on classification ($n\ll p$), count regression (Poisson and zero-inflated negative binomial), and survival analysis problems, finding that no single transformation dominates but that their combination nearly eliminates the need to refit.
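As a baseline for what these transformations stabilize, here is the plain LOO-IS inversion on a conjugate normal toy model where the exact LOO posterior is available for comparison; the model, prior scale, and all names are our illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, tau = 50, 1.0                        # y_i ~ N(theta, 1), prior theta ~ N(0, tau^2)
y = rng.normal(0.5, 1.0, size=n)

def posterior(ys):                      # conjugate normal posterior N(m, s^2)
    s2 = 1.0 / (1.0 / tau**2 + len(ys))
    return s2 * ys.sum(), np.sqrt(s2)

m, s = posterior(y)
theta = rng.normal(m, s, size=100_000)  # draws from the full posterior

j = 0                                   # leave out observation j by reweighting
logw = -norm.logpdf(y[j], loc=theta, scale=1.0)   # weights proportional to 1 / p(y_j | theta)
w = np.exp(logw - logw.max()); w /= w.sum()

m_loo, _ = posterior(np.delete(y, j))
print("IS LOO mean:", np.sum(w * theta), " exact LOO mean:", m_loo)
```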
We study Leaky ResNets, which interpolate between ResNets and fully-connected nets depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy, which favors small layer derivatives $\partial_{p}A_{p}$, and a potential energy, which favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.
Principal stratification provides a causal inference framework for investigating treatment effects in the presence of a post-treatment variable. Principal strata play a key role in characterizing the treatment effect by identifying groups of units with the same or similar values of the potential post-treatment variable at all treatment levels. The literature has focused mainly on binary post-treatment variables; few papers have considered continuous post-treatment variables. In the presence of a continuous post-treatment variable, a challenge is how to identify and characterize meaningful coarsenings of the latent principal strata that lead to interpretable principal causal effects. This paper introduces the Confounders-Aware SHared atoms BAyesian mixture (CASBAH), a novel approach for principal stratification with a binary treatment and a continuous post-treatment variable. CASBAH leverages Bayesian nonparametric priors with an innovative hierarchical structure for the potential post-treatment outcomes that overcomes some of the limitations of previous works. Specifically, the novel features of our method allow for (i) identifying coarsened principal strata through a data-adaptive approach and (ii) providing a comprehensive quantification of the uncertainty surrounding stratum membership. Through Monte Carlo simulations, we show that the proposed methodology performs better than existing methods in characterizing the principal strata and estimating the principal effects of the treatment. Finally, CASBAH is applied to a case study in which we estimate the causal effects of US national air quality regulations on pollution levels and health outcomes.
In high-dimensional principal component analysis, important inferential targets include both leading spikes and the associated principal eigenspaces. Such problems arise naturally in high-dimensional factor models, where leading principal directions are interpreted as dominant loading directions and spike magnitudes reflect the strength of the corresponding common factors. We study inference based on the sample covariance matrix $\mathbf{S}$ and the sample correlation matrix $\widehat{\mathbf{R}}$ under generalized spiked models with arbitrary bulk spectrum. We establish almost sure limits and central limit theorems for spiked sample eigenvalues, and derive asymptotic distributions for functionals of sample spiked eigenspaces. Building on this theory, we develop procedures for one-sample inference for benchmark principal directions and for two-sample comparison of leading spike strengths across populations. Even in the covariance setting, our results substantially extend the existing literature by allowing a non-identity bulk structure. A real-data analysis on stock returns further illustrates the practical relevance of the proposed procedures, showing that covariance-based and correlation-based PCA can lead to markedly different conclusions.
Multivariate Gaussian distributions enjoy Gaussian conditional distributions that make conditioning easy: conditioning boils down to implementing analytical formulae for conditional means and covariances. For more general distributions, however, conditional distributions may not be available in analytical form and require demanding, approximate numerical approaches. Primarily motivated by probabilistic imputation problems, we review and discuss families of multivariate distributions that do enjoy analytical conditioning, also providing a few counter-examples. Proving that trans-dimensional stability under conditioning extends to mixtures and transformations, we demonstrate that a broader class of multivariate distributions inherits easy conditioning properties. Building on this insight, we develop a generative method to estimate conditional distributions from data by first fitting a flexible joint distribution using copulas and then performing analytical conditioning in a latent space. In our applications, we specifically opt for Gaussian Mixture Copula Models (GMCM), comparing in turn various fitting strategies. Through simulations and real-world data experiments, we showcase the efficacy of our method in tasks involving conditional density estimation and data imputation. We also touch upon links to Gaussian process modelling and how stability under mixtures and transformations carries over to easy conditioning of non-Gaussian processes.
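The Gaussian base case the paper builds on is the classical identity $\mu_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)$ and $\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$; a minimal sketch (ours), which mixture models such as GMCMs apply component-wise with reweighted mixture probabilities:

```python
import numpy as np

def gaussian_condition(mu, Sigma, obs_idx, x_obs):
    """Analytical conditioning of N(mu, Sigma) on X[obs_idx] = x_obs; returns the
    conditional mean and covariance of the remaining coordinates."""
    free = [i for i in range(len(mu)) if i not in set(obs_idx)]
    K = np.linalg.solve(Sigma[np.ix_(obs_idx, obs_idx)], Sigma[np.ix_(obs_idx, free)]).T
    cond_mean = mu[free] + K @ (x_obs - mu[obs_idx])
    cond_cov = Sigma[np.ix_(free, free)] - K @ Sigma[np.ix_(obs_idx, free)]
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.0]])
print(gaussian_condition(mu, Sigma, [2], np.array([0.5])))  # impute X0, X1 given X2 = 0.5
```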
High-throughput pheno-, geno-, and envirotyping allows characterization of plant genotypes and the trials they are evaluated in, producing different types of data. These different data modalities can be integrated into statistical or machine learning models for genomic prediction in several ways. One commonly used approach within the analysis of multi-environment trial data in plant breeding is to create linear or nonlinear kernels which are subsequently used in linear mixed models (LMMs) to model genotype by environment (G$\times$E) interactions. Current implementations of these kernel-based LMMs present a number of opportunities in terms of methodological extensions. Here we show how these models can be implemented in standard software, allowing direct restricted maximum likelihood (REML) estimation of all parameters. We also further extend the models by combining the kernels with unstructured covariance matrices for three-way interactions in genotype by environment by management (G$\times$E$\times$M) datasets, while simultaneously allowing for environment-specific genetic variances. We show how the models incorporating nonlinear kernels and heterogeneous variances maximize the amount of genetic variance captured by environmental covariables and perform best in prediction settings. We discuss the opportunities regarding models with multiple kernels or kernels obtained after environmental feature selection, as well as the similarities to models regressing phenotypes on latent and observed environmental covariables. Finally, we discuss the flexibility provided by our implementation in terms of modeling complex plant breeding datasets, allowing for straightforward integration of phenomics, enviromics, and genomics.
The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. A central consideration is the extent to which predictions can be trusted -- while existing approaches often require users to specify an aggregate trust level, modern machine learning models can provide estimates of prediction-level uncertainty. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves near-optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
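For concreteness, the classical prediction-augmented break-even rule for ski rental (in the spirit of Purohit et al., 2018), where a single aggregate trust parameter interpolates between following the advice and the 2-competitive baseline; this is background, not the paper's calibrated algorithm, which replaces the aggregate knob with prediction-level uncertainty.

```python
import math

def ski_rental_cost(buy_cost: int, true_days: int, predicted_days: int, trust: float = 0.5) -> int:
    """Total cost (1 per rental day) of the break-even rule with advice:
    buy early if the prediction says skiing lasts at least buy_cost days, late otherwise.
    trust in (0, 1]: small trust follows the advice aggressively; trust = 1 recovers
    the classical 2-competitive break-even rule."""
    if predicted_days >= buy_cost:
        threshold = math.ceil(trust * buy_cost)   # advice says "long season": buy early
    else:
        threshold = math.ceil(buy_cost / trust)   # advice says "short season": buy late
    if true_days < threshold:
        return true_days                          # rented every day, never bought
    return (threshold - 1) + buy_cost             # rented, then bought on day `threshold`

print(ski_rental_cost(buy_cost=10, true_days=30, predicted_days=25, trust=0.5))  # -> 9 + 10
```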
Markov Chain Monte Carlo (MCMC) algorithms are essential tools in computational statistics for sampling from unnormalised probability distributions, but can be fragile when targeting high-dimensional, multimodal, or complex target distributions. Parallel Tempering (PT) enhances MCMC's sample efficiency through annealing and parallel computation, propagating samples from tractable reference distributions to intractable targets via state swapping across interpolating distributions. The effectiveness of PT is limited by the often minimal overlap between adjacent distributions in challenging problems, which requires increasing the computational resources to compensate. We introduce a framework that accelerates PT by leveraging neural samplers -- including normalising flows, diffusion models, and controlled diffusions -- to reduce the required overlap. Our approach utilises neural samplers in parallel, circumventing the computational burden of neural samplers while preserving the asymptotic consistency of classical PT. We demonstrate theoretically and empirically on a variety of multimodal sampling problems that our method improves sample quality, reduces the computational cost compared to classical PT, and enables efficient free energy/normalising constant estimation.
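The building block being accelerated is the classical PT swap move between neighboring chains; a minimal sketch (ours) for tempered targets $\pi_k \propto \exp(\beta_k \log \pi)$, which the proposed framework supplements with neural-sampler proposals so that fewer interpolating temperatures are needed:

```python
import numpy as np

def pt_swap_sweep(states, betas, log_target, rng):
    """One sweep of neighbor swaps; chain k targets exp(betas[k] * log_target(x)).
    Acceptance uses the standard tempered-swap ratio, preserving each invariant law."""
    for k in range(len(states) - 1):
        x, y = states[k], states[k + 1]
        log_alpha = (betas[k] - betas[k + 1]) * (log_target(y) - log_target(x))
        if np.log(rng.uniform()) < log_alpha:
            states[k], states[k + 1] = y, x
    return states

rng = np.random.default_rng(0)
states = list(rng.normal(size=4))                # one state per temperature
betas = [0.1, 0.4, 0.7, 1.0]                     # reference ... target
log_target = lambda x: -0.5 * (x**2 - 4) ** 2    # bimodal unnormalized log-density
states = pt_swap_sweep(states, betas, log_target, rng)
```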
Providing theoretical guarantees for parameter estimation in exponential random graph models is a largely open problem. While maximum likelihood estimation has theoretical guarantees in principle, verifying the assumptions for these guarantees to hold can be very difficult. Moreover, in complex networks, numerical maximum likelihood estimation is computer-intensive and may not converge in reasonable time. To ameliorate this issue, local dependency exponential random graph models have been introduced, which assume that the network consists of many independent exponential random graphs. In this setting, progress towards maximum likelihood estimation has been made; however, the estimation is still computer-intensive. Instead, we propose so-called Stein estimators: we use Stein characterizations to obtain new estimators for local dependency exponential random graph models.
Reducing methane emissions from the oil and gas sector is a key component of short-term climate action. Emission reduction efforts are often conducted at the individual site-level, where being able to apportion emissions between a finite number of potentially emitting equipment is necessary for leak detection and repair as well as regulatory reporting of annualized emissions. We present a hierarchical Bayesian model, referred to as the multisource detection, localization, and quantification (MDLQ) model, for performing source apportionment on oil and gas sites using methane measurements from point sensor networks. The MDLQ model accounts for autocorrelation in the sensor data and enforces sparsity in the emission rate estimates via a spike-and-slab prior, as oil and gas equipment often emit intermittently. We use the MDLQ model to apportion methane emissions on an experimental oil and gas site designed to release methane in known quantities, providing a means of model evaluation. Data from this experiment are unique in their size (i.e., the number of controlled releases) and in their close approximation of emission characteristics on real oil and gas sites. As such, this study provides a baseline level of apportionment accuracy that can be expected when using point sensor networks on operational sites.
Among inferential problems in functional data analysis, domain selection is of practical interest, aiming to identify sub-interval(s) of the domain where desired functional features are displayed. Motivated by applications in quantitative ultrasound signal analysis, we propose a robust domain selection method, aiming in particular to discover a subset of the domain that presents distinct behaviors in location parameters among different groups. By extending the interval testing approach, we propose to take into account multiple aspects of functional features simultaneously to detect a practically interpretable domain. To further handle potential outliers and missing segments in the collected functional trajectories, we perform interval testing with a test statistic based on functional M-estimators. In addition, we introduce an effect size heatmap, computed from robustified effect sizes from the smallest to the largest scales over the domain, to reflect dynamic functional behaviors among groups, so that clinicians gain a comprehensive understanding and can select practically meaningful sub-interval(s). The performance of the proposed method is demonstrated through simulation studies and an application to the motivating quantitative ultrasound measurements.
High-dimensional data analysis using traditional models suffers from overparameterization. Two types of techniques are commonly used to reduce the number of parameters: regularization and dimension reduction. In this work, we combine them by imposing a sparse factor structure and propose a regularized estimator to further reduce the number of parameters in factor models. A challenge limiting the widespread application of factor models is that factors are hard to interpret, as both the factors and the loading matrix are unobserved. To address this, we introduce a penalty term when estimating the loading matrix to obtain a sparse estimate. As a result, each factor drives only a smaller subset of time series, those exhibiting the strongest correlation, improving factor interpretability. We investigate the theoretical properties of the proposed estimator and present simulation results confirming that our algorithm performs well. We apply our method to Hawaii tourism data.
The rank correlation $\xi(X,Y)$, recently established by Sourav Chatterjee and already popular in the statistics literature, takes values in $[0,1]$, where $0$ characterizes independence of $X$ and $Y$, and $1$ characterizes perfect dependence of $Y$ on $X$. Unlike concordance measures such as Spearman's $\rho$, which capture the degree of positive or negative dependence, $\xi$ quantifies the strength of functional dependence. In this paper, we study the attainable set of pairs $(\xi(X,Y),\rho(X,Y))$. The resulting $\xi$-$\rho$ region is a convex set whose boundary is characterized by a novel family of absolutely continuous, asymmetric copulas having a diagonal band structure. Moreover, we prove that $\xi(X,Y)\leq|\rho(X,Y)|$ whenever $Y$ is stochastically increasing or decreasing in $X$, and we identify the maximal difference $\rho(X,Y)-\xi(X,Y)$ as exactly $0.4$. Our proofs rely on a convex optimization problem under various equality and inequality constraints, as well as on ordering properties of $\xi$ and $\rho$. Our results contribute to a better understanding of Chatterjee's rank correlation, which typically yields substantially smaller values than Spearman's $\rho$ when quantifying positive dependence. In particular, when interpreting the values of Chatterjee's rank correlation on the scale of $\rho$, the quantity $\sqrt{\xi}$ appears to be more appropriate.
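For reference, the estimator itself in its no-ties form is $\xi_n = 1 - 3\sum_{i=1}^{n-1}|r_{i+1}-r_i|/(n^2-1)$, with $y$-ranks taken in $x$-order; a minimal sketch (ours):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n (version assuming no ties in y)."""
    order = np.argsort(x)
    r = np.argsort(np.argsort(y[order])) + 1      # ranks of y after sorting by x
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
print(chatterjee_xi(x, x**2))                     # functional dependence: xi near 1
print(chatterjee_xi(x, rng.uniform(size=2000)))   # independence: xi near 0
```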
To conduct causal inference in observational settings, researchers must rely on certain identifying assumptions. In practice, these assumptions are unlikely to hold exactly. This paper considers the bias of selection-on-observables, instrumental variables, and proximal inference estimates under violations of their identifying assumptions. We develop bias expressions for IV and proximal inference that show how violations of their respective assumptions are amplified by any unmeasured confounding in the outcome variable. We propose a set of sensitivity tools that quantify the sensitivity of different identification strategies, and an augmented bias contour plot visualizes the relationship between these strategies. We argue that the act of choosing an identification strategy implicitly expresses a belief about the degree of violations that must be present in alternative identification strategies. Even when researchers intend to conduct an IV or proximal analysis, a sensitivity analysis comparing different identification strategies can help to better understand the implications of each set of assumptions. Throughout, we compare the different approaches on a re-analysis of the impact of state surveillance on the incidence of protest in Communist Poland.
We develop a central limit theorem (CLT) for a non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on this result to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests are derived as a corollary, which make it possible to test whether the logged data are stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
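The estimator in question is the empirical transition-frequency MLE; a minimal sketch (ours) of that estimator together with the plug-in policy value to which the derived CLTs apply (shapes, the uniform fallback for unvisited pairs, and all names are our choices):

```python
import numpy as np

def estimate_transitions(transitions, S, A):
    """Nonparametric MLE P_hat(s'|s,a) from logged triples (s, a, s')."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals, out=np.full_like(counts, 1.0 / S), where=totals > 0)

def policy_value(P_hat, R, pi, gamma=0.95):
    """Plug-in discounted value of a stationary stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P_hat)     # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)           # expected one-step reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
```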
Causal discovery is the subfield of causal inference concerned with estimating the structure of cause-and-effect relationships in a system of interrelated variables, as opposed to quantifying the strength or describing the form of causal effects. As interest in causal discovery builds in fields such as ecology, public health, and environmental sciences where data are regularly collected with spatial and temporal structures, approaches must evolve to manage autocorrelation and complex confounding. As it stands, the few proposed causal discovery algorithms for spatiotemporal data require summarizing across locations, ignore spatial autocorrelation, and/or scale poorly to high dimensions. Here, we introduce our developing framework that extends time-series causal discovery to systems with spatial structure, building upon work on causal discovery across contexts and methods for handling spatial confounding in causal effect estimation. We close by outlining remaining gaps in the literature and directions for future research.
We study the problem of denoising when only the noise level is known, not the noise distribution. Independent noise $Z$ corrupts a signal $X$, yielding the observation $Y = X + \sigma Z$ with known $\sigma \in (0,1)$. We propose \emph{universal} denoisers, agnostic to both signal and noise distributions, that recover the signal distribution $P_X$ from $P_Y$. When the focus is on distributional recovery of $P_X$ rather than on individual realizations of $X$, our denoisers achieve order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie's formula, which achieves $O(\sigma^2)$ accuracy. They shrink $P_Y$ toward $P_X$ with $O(\sigma^4)$ and $O(\sigma^6)$ accuracy in matching generalized moments and densities. Drawing on optimal transport theory, our denoisers approximate the Monge--Ampère equation with higher-order accuracy and can be implemented efficiently via score matching. Let $q$ denote the density of $P_Y$. For distributional denoising, we propose replacing the Bayes-optimal denoiser, $$\mathbf{T}^*(y) = y + \sigma^2 \nabla \log q(y),$$ with denoisers exhibiting less-aggressive distributional shrinkage, $$\mathbf{T}_1(y) = y + \frac{\sigma^2}{2} \nabla \log q(y),$$ $$\mathbf{T}_2(y) = y + \frac{\sigma^2}{2} \nabla \log q(y) - \frac{\sigma^4}{8} \nabla \!\left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right)\!.$$
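A one-dimensional sketch (ours) of the half-step denoiser $\mathbf{T}_1$, with the score $\nabla \log q$ obtained from a Gaussian KDE rather than learned score matching; the bandwidth and two-point signal are illustrative.

```python
import numpy as np

def kde_score(y_eval, y_samples, h):
    """d/dy log q(y) from a Gaussian KDE: the normalizing constant cancels in dq/q."""
    diff = y_eval[:, None] - y_samples[None, :]
    K = np.exp(-0.5 * (diff / h) ** 2)
    return (-diff / h**2 * K).mean(axis=1) / K.mean(axis=1)

sigma = 0.3
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=2000)        # toy signal distribution P_X
y = x + sigma * rng.normal(size=2000)         # observations from P_Y

score = kde_score(y, y, h=0.15)
t_star = y + sigma**2 * score                 # Tweedie / Bayes-optimal map
t_1 = y + 0.5 * sigma**2 * score              # half-step map: shrinks P_Y half as aggressively
```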
This paper argues that DNNs implement a computational Occam's razor -- finding the `simplest' algorithm that fits the data -- and that this could explain their incredible and wide-ranging success over more traditional statistical methods. We start with the discovery that the set of real-valued function $f$ that can be $\epsilon$-approximated with a binary circuit of size at most $c\epsilon^{-\gamma}$ becomes convex in the `Harder than Monte Carlo' (HTMC) regime, when $\gamma>2$, allowing for the definition of a HTMC norm on functions. In parallel one can define a complexity measure on the parameters of a ResNets (a weighted $\ell_1$ norm of the parameters), which induce a `ResNet norm' on functions. The HTMC and ResNet norms can then be related by an almost matching sandwich bound. Thus minimizing this ResNet norm is equivalent to finding a circuit that fits the data with an almost minimal number of nodes (within a power of 2 of being optimal). ResNets thus appear as an alternative model for computation of real functions, better adapted to the HTMC regime and its convexity.
Consider the additive Gaussian model $Y = X + \sigma Z$, where $X \sim P$ is an unknown signal, $Z \sim N(0,1)$ is independent of $X$, and $\sigma > 0$ is known. Let $Q$ denote the law of $Y$. We construct a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty \colon \mathbb{R} \to \mathbb{R}$ that depend only on higher-order score functions $q^{(m)}/q$, $m \geq 1$, of $Q$ and require no knowledge of the law $P$. The $K$-th order denoiser $T_K$ involves scores up to order $2K{-}1$ and satisfies $W_r(T_K \sharp Q, P) = O(\sigma^{2(K+1)})$ for every $r \geq 1$; in the limit, $T_\infty$ recovers the monotone optimal transport map (Brenier map) pushing $Q$ onto $P$. We provide a complete characterization of the combinatorial structure governing this hierarchy through partial Bell polynomial recursions, making precise how higher-order score functions encode the Brenier map. We further establish rates of convergence for estimating these scores from $n$ i.i.d.\ draws from $Q$ under two complementary strategies: (i) plug-in kernel density estimation, and (ii) higher-order score matching. The construction reveals a precise interplay among higher-order Fisher-type information, optimal transport, and the combinatorics of integer partitions.
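A worked Gaussian sanity check (our example, under the assumption that the first-order map coincides with the half-step denoiser $y + \tfrac{\sigma^2}{2}\nabla\log q(y)$): take $P = N(0,\tau^2)$, so that $Q = N(0,\tau^2+\sigma^2)$ and $q'(y)/q(y) = -y/(\tau^2+\sigma^2)$. Then

```latex
\[
T_1(y) = \frac{\tau^2 + \sigma^2/2}{\tau^2 + \sigma^2}\, y,
\qquad
\operatorname{Var}\bigl[T_1 \sharp Q\bigr]
  = \frac{(\tau^2 + \sigma^2/2)^2}{\tau^2 + \sigma^2}
  = \tau^2 + \frac{\sigma^4}{4(\tau^2 + \sigma^2)}
  = \tau^2 + O(\sigma^4),
\]
```

consistent with $W_r(T_K \sharp Q, P) = O(\sigma^{2(K+1)})$ at $K = 1$, whereas the full Tweedie step lands at variance $\tau^2 - \sigma^2 + O(\sigma^4)$; the exact Brenier map here is $y \mapsto \tau y/\sqrt{\tau^2+\sigma^2}$, which the hierarchy approaches as $K \to \infty$.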
Dynamic Treatment Regimes (DTRs) provide a systematic framework for optimizing sequential decision-making in chronic disease management, where therapies must adapt to patients' evolving clinical profiles. Inverse probability weighting (IPW) is a cornerstone methodology for estimating regime values from observational data due to its intuitive formulation and established theoretical properties, yet standard IPW estimators face significant limitations, including variance instability and data inefficiency. A fundamental but underexplored source of inefficiency lies in the strict alignment requirement between observed and target treatment trajectories, which fails to account for partial compatibility and discards substantial information from individuals with only minimal deviations from the regime. We propose two novel methodologies that relax the strict inclusion rule through flexible compatibility mechanisms. Both methods provide computationally tractable alternatives that can be easily integrated into existing IPW workflows, offering more efficient approaches to DTR estimation. Theoretical analysis demonstrates that both estimators preserve consistency while achieving superior finite-sample efficiency compared to standard IPW, and comprehensive simulation studies confirm improved stability. We illustrate the practical utility of our methods through an application to HIV treatment data from the AIDS Clinical Trials Group Study 175 (ACTG175).
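To make the strict alignment requirement explicit, a minimal sketch (ours) of the standard stabilized IPW value estimator whose all-or-nothing indicator the proposed compatibility mechanisms relax; array shapes and names are illustrative.

```python
import numpy as np

def ipw_value(A, D, PI, Y):
    """Standard (Hajek-stabilized) IPW value of a deterministic regime.
    A[i, t]: observed treatments; D[i, t]: regime recommendations;
    PI[i, t]: logging propensities of the observed treatments; Y[i]: outcome.
    Only fully regime-aligned trajectories get nonzero weight, which is
    exactly the information loss the relaxed estimators target."""
    aligned = np.all(A == D, axis=1).astype(float)
    w = aligned / np.prod(PI, axis=1)
    return np.sum(w * Y) / np.sum(w)   # undefined if no trajectory aligns
```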
In many systems, the true data-generating process is unknown, requiring forecasters to rely on observed time series. This study proposes a pre-modeling diagnostic framework for horizon-specific forecastability assessment that evaluates forecastability before model selection begins. Forecastability is operationalized using auto-mutual information at lag h, which quantifies how much past observations reduce uncertainty about future values, estimated via a k-nearest-neighbor estimator computed strictly on training data to preserve out-of-sample validity. The diagnostic signal is validated against realized out-of-sample symmetric mean absolute percentage error across 42,355 time series spanning six temporal frequencies, using benchmark and higher-capacity probe models under a rolling-origin protocol. The results reveal a strong frequency-dependent relationship between measurable dependence and realized forecast error: for five of six frequencies, auto-mutual information exhibits a consistent negative rank association with realized error, supporting its use as a forecast triage signal for modeling investment decisions, whereas daily series show weaker discrimination despite measurable dependence. Across all frequencies, median forecast error declines monotonically from low to high forecastability terciles, demonstrating clear decision-relevant separation. Overall, the findings establish measurable past-future dependence as a practical screening tool for analytics-driven forecasting strategy, identifying when advanced models are likely to add value, when simple baselines suffice, and when attention should shift from accuracy improvement to robust decision design, thereby supporting a diagnostic-first approach to modeling effort and resource allocation in organizational forecasting contexts.
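A minimal sketch (ours) of the lag-$h$ diagnostic using scikit-learn's kNN-based (Kraskov-type) mutual information estimator; the AR(1) series and all settings are illustrative, and in practice the estimate would be computed on the training window only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def auto_mutual_information(x, h, k=3):
    """kNN estimate of I(x_t; x_{t+h}): how much the past reduces uncertainty at lag h."""
    return mutual_info_regression(x[:-h].reshape(-1, 1), x[h:], n_neighbors=k)[0]

rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(1, len(x)):                 # AR(1): dependence decays with lag
    x[t] = 0.8 * x[t - 1] + rng.normal()
print([round(auto_mutual_information(x, h), 3) for h in (1, 5, 20)])
```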
Tsagris (2025) proposed the generalized circular projected Cauchy (GCPC) distribution, which includes the wrapped Cauchy distribution as a special case. In this paper, we first derive the relationship with the wrapped Cauchy distribution, and then we attempt to characterize the distribution. We establish the conditions under which the distribution exhibits unimodality. We provide non-analytical formulas for the mean resultant length and the Kullback-Leibler divergence, and analytical forms for the cumulative distribution function and the entropy of the GCPC distribution. We propose log-likelihood ratio tests for one or two location parameters without assuming equality of the concentration parameters. We revisit maximum likelihood estimation with and without predictors. In the regression setting, we briefly discuss the addition of circular and simplicial predictors. Simulation studies illustrate (a) the performance of the log-likelihood ratio test when one falsely assumes that the true distribution is the wrapped Cauchy distribution, and (b) the empirical rate of convergence of the regression coefficients. Using a real data analysis example, we show how to avoid the log-likelihood being trapped in a local maximum, and we correct a mistake in the regression setting.
Four distinct admissibility geometries govern sequential and distribution-free inference: Blackwell risk dominance over convex risk sets, anytime-valid admissibility within the nonnegative supermartingale cone, marginal coverage validity over exchangeable prediction sets, and Cesàro approachability (CAA) admissibility, which reaches the risk-set boundary via approachability-style arguments rather than explicit priors. We prove a criterion separation theorem: the four classes of admissible procedures are pairwise non-nested. Each geometry carries a different certificate of optimality: a supporting-hyperplane prior (Blackwell), a nonnegative supermartingale (anytime-valid), an exchangeability rank (coverage), or a Cesàro steering argument (CAA). Martingale coherence is necessary for Blackwell admissibility and necessary and sufficient for anytime-valid admissibility within e-processes, but is not sufficient for Blackwell admissibility and is not necessary for coverage validity or CAA-admissibility. All four criteria can be viewed through a common schematic template (minimize Bayesian risk subject to a feasibility constraint), but the decision spaces, partial orders, and performance metrics differ by criterion, making them geometrically incompatible. Admissibility is irreducibly criterion-relative.
This paper investigates the evolving causal mechanisms of flight delays in the U.S. domestic aviation network from 2010 to 2024. Utilizing a three-level hierarchical Bayesian model fitted to Bureau of Transportation Statistics (BTS) on-time performance data, we disentangle the marginal contributions of weather, national aviation system (NAS), security delays, and late-arriving aircraft, using carrier delays as the baseline reference. Our findings suggest a structural shift: during the pre-pandemic decade (2010-2019), security delays functioned as an operational stabilizer with negative causal leverage ($\beta \approx -1.307$). In the post-pandemic period, however, they shift to a statistically marginal effect ($\beta \approx -0.130$). While the total volume of security delays remains a marginal fraction of the overall system latency, this structural shift points toward a potential change in the operational sensitivity of the system to security-related frictions. We show that while causal neutralization is characteristic of high-volume hubs ($n \ge 100$), a discernible directional shift into a positive delay driver ($\beta \approx 0.118$) is observed as the analysis scales down to include the broader network ($n \ge 30$). Our model identifies a significant change in how security delays propagate through high-volume nodes, evolving from an internalized operational buffer into a statistically discernible contributor to delay probability in the post-pandemic era.
Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performance on complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCE can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradients computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.
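For background, the existing probability-based GCE of Zhang and Sabuncu (2018) that MGCE reformulates; this sketch (ours) shows the non-convex original, not the proposed convex minimax version.

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized cross-entropy (1 - p_y^q) / q on predicted class probabilities.
    q -> 0 recovers cross-entropy; q = 1 gives the robust MAE-type loss."""
    p_y = probs[np.arange(len(labels)), labels]
    return np.mean((1.0 - p_y**q) / q)

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(gce_loss(probs, np.array([0, 1]), q=0.7))
```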
Demographers rely on a variety of tools and methods to work with mortality schedules (model life tables, fitting methods, summary-indicator prediction, and forecasting) that were largely developed independently and do not provide structurally coherent sex-specific outputs. The multi-dimensional mortality model (MDMx) unifies all four within one Tucker tensor decomposition, demonstrated using the Human Mortality Database (HMD). Period life tables from the HMD are organized as a four-way tensor of $\mathrm{logit}({}_{1}q_{x})$ indexed by sex, age, country, and year. Shared factor matrices for sex and age make every output schedule structurally coherent by construction. From this decomposition, four capabilities emerge: model life tables via clustering and smooth within-regime trajectories; life table fitting via a three-stage algorithm with Bayes-factor disruption detection; summary-indicator prediction mapping child or adult mortality to complete schedules, reformulating SVD-Comp in tensor coordinates; and forecasting via a damped local linear trend Kalman filter on PCA-reduced core matrices with hierarchical drift.
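The core operation is a rank-constrained Tucker decomposition of the four-way tensor; a minimal sketch (ours) using TensorLy, with purely hypothetical dimensions and ranks rather than the paper's settings.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Hypothetical shape: 2 sexes x 111 ages x 40 countries x 70 years of logit(1qx).
rng = np.random.default_rng(0)
X = tl.tensor(rng.normal(size=(2, 111, 40, 70)))

core, factors = tucker(X, rank=[2, 8, 10, 6])   # shared sex and age factor matrices
X_hat = tl.tucker_to_tensor((core, factors))    # structurally coherent reconstruction
print(X_hat.shape)                              # (2, 111, 40, 70)
```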
Post-click conversion rate (CVR) estimation is a fundamental task in developing effective recommender systems, yet it faces challenges from data sparsity and sample selection bias. To handle both challenges, entire-space multitask models are employed to decompose the user behavior trajectory into a sequence of exposure $\rightarrow$ click $\rightarrow$ conversion, constructing surrogate learning tasks for CVR estimation. However, these methods suffer from two significant defects: (1) intrinsic estimation bias (IEB), where the CVR estimates are higher than the actual values; and (2) false independence prior (FIP), where the causal relationship between clicks and subsequent conversions is potentially overlooked. To overcome these limitations, we develop a model-agnostic framework, namely the Entire Space Counterfactual Multitask Model (ESCM$^2$), which incorporates a counterfactual risk minimizer within the ESMM framework to regularize CVR estimation. Experiments conducted on large-scale industrial recommendation datasets and an online industrial recommendation service demonstrate that ESCM$^2$ effectively mitigates the IEB and FIP defects and substantially enhances recommendation performance.
Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-enhanced CounterFactual Regression (CFR-Pro) to exploit proximity for enhancing representation balancing within the HTE estimation context. Specifically, we introduce a pair-wise proximity regularizer based on optimal transport to incorporate the local proximity in discrepancy calculation. However, the curse of dimensionality renders the proximity measure and discrepancy estimation ineffective -- exacerbated by limited data availability for HTE estimation. To handle this problem, we further develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that CFR-Pro accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at this https URL.
Lyman Alpha Emitters (LAEs) are valuable high-redshift cosmological probes traditionally identified using specialized narrow-band photometric surveys. In ground-based spectroscopy, it can be difficult to distinguish the sharp LAE peak from residual sky emission lines using automated methods, leading to misclassified redshifts. We present a Bayesian spectral component separation technique to automatically determine spectroscopic redshifts for LAEs while marginalizing over sky residuals. We use visually inspected spectra of LAEs obtained using the Dark Energy Spectroscopic Instrument (DESI) to create a data-driven prior and can determine redshift by jointly inferring sky residual, LAE, and residual components for each individual spectrum. We demonstrate this method on 881 spectroscopically observed $z = 2-4$ DESI LAE candidate spectra and determine their redshifts with $>$90% accuracy when validated against visually inspected redshifts. Using the $\Delta \chi^2$ value from our pipeline as a proxy for detection confidence, we then explore potential survey design choices and implications for targeting LAEs with medium-band photometry. This method allows for scalability and accuracy in determining redshifts from DESI spectra, and the results provide recommendations for LAE targeting in anticipation of future high-redshift spectroscopic surveys.
Understanding the generalization properties of neural networks on simple input-output distributions is key to explaining their performance on real datasets. The classical teacher-student setting, where a network is trained on data generated by a teacher model, provides a canonical theoretical test bed. In this context, a complete theoretical characterization of fully connected one-hidden-layer networks with generic activation functions remains missing. In this work, we develop a general framework for such networks with large width, yet much smaller than the input dimension. Using methods from statistical physics, we derive closed-form expressions for the typical performance of both finite-temperature (Bayesian) and empirical risk minimization estimators in terms of a small number of order parameters. We uncover a transition to a specialization phase, where hidden neurons align with teacher features once the number of samples becomes sufficiently large and proportional to the number of network parameters. Our theory accurately predicts the generalization error of networks trained on regression and classification tasks using either noisy full-batch gradient descent (Langevin dynamics) or deterministic full-batch gradient descent.
The problem of classification in machine learning has often been approached in terms of function approximation. In this paper, we propose an alternative approach for classification in arbitrary compact metric spaces which, in theory, yields both the number of classes, and a perfect classification using a minimal number of queried labels. Our approach uses localized trigonometric polynomial kernels initially developed for the point source signal separation problem in signal processing. Rather than point sources, we argue that the various classes come from different probability measures. The localized kernel technique developed for separating point sources is then shown to separate the supports of these distributions. This is done in a hierarchical manner in our MASC algorithm to accommodate touching/overlapping class boundaries. We illustrate our theory on several simulated and real life datasets, including the Salinas and Indian Pines hyperspectral datasets and a document dataset.
We numerically investigate the feasibility and limits of jointly estimating flow fields and unknown particle properties (e.g., position, size, and density) from Lagrangian particle tracking (LPT) data. LPT offers time-resolved, volumetric measurements of particle trajectories, which are markers of the carrier fluid motion. However, experimental tracks are spatially sparse and potentially noisy, and the problem of reconstructing flow fields may be further complicated by inertial particle transport, such that particle slip velocities must be determined to access the velocity field of the carrier fluid. To address this problem, we develop a data assimilation framework that couples an Eulerian representation of the flow with Lagrangian particle models, enabling the simultaneous inference of carrier fields and particle properties under the governing equations of disperse multiphase flow. We show that flow fields and particle properties can be jointly estimated in three representative regimes: (1) in a turbulent boundary layer with noisy tracer tracks ($\mathrm{St} \to 0$), flow fields and true particle positions are jointly estimated, which amounts to a physics-informed particle tracking problem; (2) in homogeneous isotropic turbulence seeded with inertial particles ($\mathrm{St} \sim 1$-$5$), we demonstrate simultaneous recovery of flow states and particle diameters, showing the feasibility of implicit particle characterization; and (3) in a compressible, shock-dominated flow, we report the first joint reconstructions of velocity, pressure, density, and inertial particle properties (diameter and density), highlighting both the potential and certain limits of joint estimation in supersonic regimes. A systematic sensitivity study reveals how the seeding density, noise level, and Stokes number govern reconstruction accuracy for our method.
Flexible and accurate noise characterization is crucial for the precise estimation of gravitational-wave parameters. We introduce a Bayesian method for estimating the power spectral density (PSD) of long, stationary time series, explicitly tailored for LISA data analysis. Our approach models the PSD as the geometric mean of a parametric and a nonparametric component, combining the knowledge from parametric models with the flexibility to capture deviations from theoretical expectations. The nonparametric component is expressed by a mixture of penalized B-splines. Adaptive, data-driven knot placement, performed once at initialization, removes the need for reversible-jump Markov chain Monte Carlo, while hierarchical roughness-penalty priors prevent overfitting. Validation on simulated autoregressive AR(4) data demonstrates estimator consistency and shows that well-matched parametric components reduce the integrated absolute error compared to an uninformative baseline, requiring fewer spline knots to achieve comparable accuracy. Applied to one year of simulated LISA X-channel (univariate) noise, our method achieves relative integrated absolute errors of $\mathcal{O}(10^{-2})$, making it suitable for iterative analysis pipelines and multi-year mission data sets.
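The geometric-mean construction can be sketched as a penalized least-squares fit on the log scale; this illustration uses fixed knots, a flat parametric component, and ignores the known bias of the log-periodogram, so it is far simpler than the paper's hierarchical Bayesian sampler.

    import numpy as np
    from scipy.interpolate import BSpline

    # log S(f) = 0.5 * log S_par(f) + 0.5 * s(f): geometric mean of a parametric
    # PSD and a spline-based component s fit to the log-periodogram.
    rng = np.random.default_rng(3)
    x = rng.standard_normal(4096)                     # stand-in time series
    n = len(x)
    freqs = np.fft.rfftfreq(n)[1:]
    pdgm = np.abs(np.fft.rfft(x)[1:]) ** 2 / n        # periodogram

    S_par = np.ones_like(freqs)                       # flat parametric guess
    target = 2.0 * (np.log(pdgm) - 0.5 * np.log(S_par))   # implied log of spline part

    k, m = 3, 20                                      # cubic splines, fixed knots
    knots = np.quantile(freqs, np.linspace(0, 1, m))
    t = np.r_[[knots[0]] * k, knots, [knots[-1]] * k] # clamped knot vector
    B = BSpline.design_matrix(freqs, t, k).toarray()  # basis at the frequencies

    D = np.diff(np.eye(B.shape[1]), 2, axis=0)        # roughness (2nd-difference) penalty
    lam = 10.0
    coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ target)
    log_S = 0.5 * np.log(S_par) + 0.5 * (B @ coef)    # combined PSD estimate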
We study the problem of excess risk evaluation for empirical risk minimization (ERM) under convex losses. We show that by leveraging the idea of wild refitting, one can upper bound the excess risk through the so-called "wild optimism," without relying on the global structure of the underlying function class, assuming only black-box access to the training algorithm and a single dataset. We begin by generating two sets of artificial pseudo-outcomes, created by stochastically perturbing the loss derivatives with a carefully chosen scaling. Using these pseudo-labeled datasets, we refit the black-box procedure twice to obtain two wild predictors and derive an efficient excess risk upper bound in the fixed design setting. Requiring no prior knowledge of the complexity of the underlying function class, our method is essentially model-free and holds significant promise for theoretically evaluating modern opaque deep neural networks and generative models, where traditional learning theory can be infeasible due to the extreme complexity of the hypothesis class.
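The procedure can be sketched for squared loss as follows; the exact perturbation scaling and the definition of wild optimism in the paper differ in detail, so this is only a schematic of the refit-twice idea, with a ridge "black box" standing in for an arbitrary learner.

    import numpy as np

    def wild_optimism(X, y, fit_predict, rho=1.0, seed=0):
        # Perturb residuals with Rademacher signs (scaled by rho), refit the
        # black box on both pseudo-outcome sets, and measure how strongly the
        # refits chase the injected perturbation.
        rng = np.random.default_rng(seed)
        f_hat = fit_predict(X, y, X)                  # base fit on the single dataset
        resid = y - f_hat
        eps = rng.choice([-1.0, 1.0], size=len(y))    # Rademacher signs
        y_plus = f_hat + rho * eps * np.abs(resid)    # pseudo-outcome set 1
        y_minus = f_hat - rho * eps * np.abs(resid)   # pseudo-outcome set 2
        f_plus = fit_predict(X, y_plus, X)            # wild predictor 1
        f_minus = fit_predict(X, y_minus, X)          # wild predictor 2
        return np.mean(eps * np.abs(resid) * (f_plus - f_minus)) / 2.0

    def ridge(X, y, X_eval, lam=1e-1):                # black-box training algorithm
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        return X_eval @ w

    rng = np.random.default_rng(4)
    X = rng.standard_normal((200, 5))
    y = X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(200)
    print("wild optimism estimate:", wild_optimism(X, y, ridge))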
We study the amplitude-constrained additive white Gaussian noise channel. It is well known that the capacity-achieving input distribution for this channel is discrete and supported on finitely many points. The best known bounds show that the support size of the capacity-achieving distribution is lower-bounded by a term of order $A$ and upper-bounded by a term of order $A^2$, where $A$ denotes the amplitude constraint. It was conjectured in [1] that the linear scaling is optimal. In this work, we establish a new lower bound of order $A\sqrt{\log A}$, improving the known bound and ruling out the conjectured linear scaling. To obtain this result, we quantify the fact that the capacity-achieving output distribution is close to the uniform distribution in the interior of the amplitude constraint. Next, we introduce a wrapping operation that maps the problem to a compact domain and develop a theory of best approximation of the uniform distribution by finite Gaussian mixtures. These approximation bounds are then combined with stability properties of capacity-achieving distributions to yield the final support-size lower bound.
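The discreteness phenomenon is easy to observe numerically; the sketch below (an illustration only, unrelated to the paper's proof technique) runs the standard Blahut--Arimoto iteration on a finely discretized input alphabet in [-A, A] and shows the optimal input mass collapsing onto a few clusters of grid points.

    import numpy as np

    A, n_in, n_out = 3.0, 201, 801
    xs = np.linspace(-A, A, n_in)
    ys = np.linspace(-A - 5, A + 5, n_out)
    W = np.exp(-0.5 * (ys[None, :] - xs[:, None]) ** 2)   # Gaussian channel p(y|x)
    W /= W.sum(1, keepdims=True)                          # row-stochastic kernel
    p = np.full(n_in, 1.0 / n_in)                         # initial input distribution
    for _ in range(2000):                                 # Blahut--Arimoto updates
        q = p @ W                                         # induced output distribution
        D = np.sum(W * np.log(W / q[None, :]), axis=1)    # divergence per input letter
        p *= np.exp(D)
        p /= p.sum()
    print("grid points carrying mass > 1e-3:", xs[p > 1e-3])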
Attenuation bias -- the systematic underestimation of regression coefficients due to measurement errors in input variables -- affects astronomical data-driven models. For linear regression, this problem was solved by treating the true input values as latent variables to be estimated alongside model parameters. In this paper, we show that neural networks suffer from the same attenuation bias and that the latent variable solution generalizes directly to neural networks. We introduce LatentNN, a method that jointly optimizes network parameters and latent input values by maximizing the joint likelihood of observing both inputs and outputs. We demonstrate the correction on one-dimensional regression, multivariate inputs with correlated features, and stellar spectroscopy applications. LatentNN reduces attenuation bias across a range of signal-to-noise ratios where standard neural networks show large bias, providing a framework for improved neural network inference in the low signal-to-noise regime characteristic of astronomical data. The correction is most effective when measurement errors are smaller than roughly half the intrinsic data range; it weakens in the regime of very low signal-to-noise and few informative features. Code is available at this https URL.
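A minimal sketch of the construction, assuming Gaussian errors on both inputs and targets (the architecture, noise levels, and optimizer below are illustrative choices of mine, not the released implementation): the latent inputs z are optimized jointly with the network under the joint negative log-likelihood.

    import torch

    # -log p(x, y | z, f)  ~  ||f(z) - y||^2 / sy^2  +  ||z - x||^2 / sx^2
    torch.manual_seed(0)
    n, sx, sy = 500, 0.5, 0.05
    z_true = torch.linspace(-1, 1, n).unsqueeze(1)
    y = torch.sin(3 * z_true) + sy * torch.randn(n, 1)    # targets
    x = z_true + sx * torch.randn(n, 1)                   # noisy observed inputs

    net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 1))
    z = x.clone().requires_grad_(True)                    # latent inputs, init at x
    opt = torch.optim.Adam([{"params": net.parameters()}, {"params": [z]}], lr=1e-2)

    for step in range(3000):
        opt.zero_grad()
        nll = ((net(z) - y) ** 2).sum() / sy**2 + ((z - x) ** 2).sum() / sx**2
        nll.backward()
        opt.step()
    # A standard network trained directly on (x, y) flattens the fitted function
    # (attenuation); optimizing z jointly counteracts this.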
Applied work under interference typically models outcomes as functions of own treatment and a low-dimensional exposure mapping of others' treatments, even when that mapping may be misspecified. We ask what policy object such exposure-based procedures target. Taking the marginal policy effect as primitive, we show that any researcher-chosen exposure mapping induces a unique pseudo-true outcome model: the best approximation to the underlying potential outcomes within the class of functions that depend only on that mapping. This yields a decomposition of the marginal policy effect into exposure-based direct and spillover effects, and each component optimally approximates its oracle counterpart, with a sign-preserving interpretation under monotonicity. We then study a structured misspecification setting in which outcomes depend on both network spillovers and a global equilibrium channel, while the analyst may model only one. In this setting, we obtain a sharper asymptotic decomposition into direct, local, and global components, implying that existing estimators recover their respective oracle channel-specific effects even when the other channel is present but omitted from the maintained model. The analysis also yields phase transitions in convergence rates and higher-order expansions for Z-estimators. A semi-synthetic experiment calibrated to a large cash-transfer study illustrates the empirical relevance of the framework.
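The projection view can be illustrated by simulation (all functional forms below are my own toy choices): outcomes are regressed onto own treatment and a chosen exposure mapping, here the share of treated neighbors, while the data-generating process also carries a global equilibrium channel that the analyst omits.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 2000
    A = (rng.random((n, n)) < 10 / n).astype(float)   # sparse random network
    np.fill_diagonal(A, 0)
    d = rng.integers(0, 2, n)                         # binary treatments
    exposure = A @ d / np.maximum(A.sum(1), 1)        # share of treated neighbors
    global_rate = d.mean()                            # omitted equilibrium channel
    y = 1.0 * d + 0.8 * exposure + 0.5 * global_rate + rng.standard_normal(n)

    # Pseudo-true model: least-squares projection onto functions of (d, exposure)
    Z = np.column_stack([np.ones(n), d, exposure])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    # The constant global term is absorbed by the intercept, so the direct and
    # spillover coefficients still recover their oracle channel-specific values.
    print("direct, spillover:", beta[1], beta[2])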
A thermodynamic framework for asymptotic inference is developed in which sample size and parameter variance define a state space. Within this description, Shannon information plays the role of entropy, and an integrating factor organizes its variation into a first-law-type balance equation. The framework supports a cyclic inequality analogous to a reversed second law, derived for the estimation of the mean. A non-trivial third-law-type result emerges as a lower bound on entropy set by representation noise. Optimal inference paths, global bounds on information gain, and a natural Carnot-like information efficiency follow from this structure, with efficiency fundamentally limited by a noise floor. Finally, de Bruijn's identity and the I-MMSE relation in the Gaussian-limit case appear as coordinate projections of the same underlying thermodynamic structure. This framework suggests that ensemble physics and inferential physics constitute shadow processes evolving in opposite directions within a unified thermodynamic description.
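For reference, the two classical identities invoked here read, in standard form (with $h$ the differential entropy, $J$ the Fisher information, and $Z$ a standard Gaussian independent of $X$):
$$\frac{d}{dt}\, h\big(X + \sqrt{t}\, Z\big) = \frac{1}{2}\, J\big(X + \sqrt{t}\, Z\big), \qquad \frac{d}{d\gamma}\, I\big(X;\, \sqrt{\gamma}\, X + Z\big) = \frac{1}{2}\, \mathrm{mmse}(\gamma).$$
In the proposed framework, both appear as coordinate projections of the same underlying state-space structure.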
Classical deep networks are effective because depth enables adaptive geometric deformation of data representations. In quantum neural networks (QNNs), however, depth or state reachability alone does not guarantee this feature-learning capability. We study this question in the pure-state setting by viewing encoded data as an embedded manifold in $\mathbb{C}P^{2^n-1}$ and analysing infinitesimal unitary actions through Lie-algebra directions. We introduce Classical-to-Lie-algebra (CLA) maps and the criterion of almost Complete Local Selectivity (aCLS), which combines directional completeness with data-dependent local selectivity. Within this framework, we show that data-independent trainable unitaries are complete but non-selective, i.e. learnable rigid reorientations, whereas pure data encodings are selective but non-tunable, i.e. fixed deformations. Hence, geometric flexibility requires a non-trivial joint dependence on data and trainable weights. We further show that accessing high-dimensional deformations of many-qubit state manifolds requires parametrised entangling directions; fixed entanglers such as CNOT alone do not provide adaptive geometric control. Numerical examples validate that aCLS-satisfying data re-uploading models outperform non-tunable schemes while requiring only a quarter of the gate operations. Thus, the resulting picture reframes QNN design from state reachability to controllable geometry of hidden quantum representations.
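A single-qubit toy example of my own (far simpler than the aCLS machinery) makes the contrast concrete: the effective Lie-algebra direction $G = i\,(\partial_\theta U)\,U^\dagger$ is a fixed operator for a data-independent trainable gate, but scales with the input under data re-uploading.

    import numpy as np

    I2 = np.eye(2)
    Y = np.array([[0, -1j], [1j, 0]])
    Z = np.diag([1.0, -1.0])
    Ry = lambda a: np.cos(a / 2) * I2 - 1j * np.sin(a / 2) * Y
    Rz = lambda a: np.cos(a / 2) * I2 - 1j * np.sin(a / 2) * Z

    def effective_generator(x, theta, reupload):
        # Lie-algebra direction G = i (dU/dtheta) U^dagger driving the state.
        if reupload:
            U, dU = Ry(theta * x), (-1j * x / 2) * Y @ Ry(theta * x)
        else:
            U, dU = Rz(theta), (-1j / 2) * Z @ Rz(theta)
        return 1j * dU @ U.conj().T

    for x in [0.5, 1.0, 2.0]:
        G_fix = effective_generator(x, 0.7, False)   # always Z/2: rigid reorientation
        G_re = effective_generator(x, 0.7, True)     # x * Y/2: data-dependent direction
        print(x, np.allclose(G_fix, Z / 2), np.allclose(G_re, x * Y / 2))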
Standard masked discrete diffusion models face limitations in reasoning tasks due to their inability to correct their own mistakes along the masking path. Since they rely on a fixed number of denoising steps, they cannot adapt their computation to the complexity of a given problem. To address these limitations, we introduce a method based on learning a Markov transition kernel that is trained on its own outputs. This design enables tokens to be remasked, allowing the model to correct its previous mistakes. Furthermore, we replace the fixed time schedule with a trained stopping criterion, which adapts the number of function evaluations to the difficulty of the reasoning problem. Our adaptation adds two lightweight prediction heads, enabling reuse and fine-tuning of existing pretrained models. On the Sudoku-Extreme dataset we clearly outperform other flow-based methods with a validity of 95%. On Countdown-4, an average of only 10 steps suffices to solve almost 96% of the problems correctly, while many can already be solved in 2 steps.
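Inference with such a kernel could be sketched as below; the tiny GRU denoiser, vocabulary size, and 0.5 thresholds are stand-ins of my own, not the paper's architecture. The two lightweight heads drive remasking (self-correction) and stopping (adaptive compute).

    import torch

    MASK, V, d = 0, 12, 32                            # mask id, vocab size, hidden size

    class TinyDenoiser(torch.nn.Module):              # stand-in for a pretrained model
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(V, d)
            self.body = torch.nn.GRU(d, d, batch_first=True)
            self.to_logits = torch.nn.Linear(d, V)
        def forward(self, x):
            return self.body(self.emb(x))[0]          # (B, L, d) hidden states

    model = TinyDenoiser()
    remask_head = torch.nn.Linear(d, 1)               # lightweight head 1: remasking
    halt_head = torch.nn.Linear(d, 1)                 # lightweight head 2: stopping

    @torch.no_grad()
    def generate(x, max_steps=64):                    # batch size 1 assumed below
        for _ in range(max_steps):
            h = model(x)
            x = model.to_logits(h).argmax(-1)         # commit current best guesses
            p = torch.sigmoid(remask_head(h)).squeeze(-1)
            x[torch.bernoulli(p).bool()] = MASK       # remask: allow self-correction
            if torch.sigmoid(halt_head(h.mean(1))).item() > 0.5:
                break                                 # learned stopping criterion
        return x

    print(generate(torch.full((1, 16), MASK)))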
Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.
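The two-time-scale structure can be mimicked in a toy experiment of my own design: fast Langevin updates of the parameters of every population member, interleaved with much rarer selection--mutation steps on a scalar hyperparameter (a regularization strength), with the effective fitness given by the unregularized loss.

    import numpy as np

    rng = np.random.default_rng(6)
    P, dim = 64, 10                                   # population size, parameter dim
    w = rng.standard_normal((P, dim))                 # fast variables (parameters)
    lam = rng.uniform(0.0, 1.0, P)                    # slow variables (hyperparameters)
    w_star = np.ones(dim)                             # "data" optimum

    def loss(w, lam):
        return 0.5 * np.sum((w - w_star) ** 2, -1) + 0.5 * lam * np.sum(w ** 2, -1)

    eta, beta = 0.05, 50.0                            # step size, inverse temperature
    for t in range(5000):
        # fast time scale: SGD/Langevin-type update for every member
        grad = (w - w_star) + lam[:, None] * w
        w += -eta * grad + np.sqrt(2 * eta / beta) * rng.standard_normal(w.shape)
        if t % 100 == 0:                              # slow time scale: selection--mutation
            fit = np.exp(-loss(w, 0.0))               # effective fitness (unregularized)
            idx = rng.choice(P, P, p=fit / fit.sum()) # selection by resampling
            w, lam = w[idx], np.clip(lam[idx] + 0.02 * rng.standard_normal(P), 0.0, None)
    print("population-mean hyperparameter:", lam.mean())  # drifts toward the fittest (0)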