New articles on Statistics


[1] 2508.02759

Hedging with memory: shallow and deep learning with signatures

We investigate the use of path signatures in a machine learning context for hedging exotic derivatives under non-Markovian stochastic volatility models. In a deep learning setting, we use signatures as features in feedforward neural networks and show that they outperform LSTMs in most cases, with orders of magnitude less training compute. In a shallow learning setting, we compare two regression approaches: the first directly learns the hedging strategy from the expected signature of the price process; the second models the dynamics of volatility using a signature volatility model, calibrated on the expected signature of the volatility. Solving the hedging problem in the calibrated signature volatility model yields more accurate and stable results across different payoffs and volatility dynamics.
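
The abstract does not spell out the signature computation, but the truncated signature it uses as a feature is straightforward to compute for a sampled path. Below is a minimal Python sketch (not the authors' code) of the level-1 and level-2 signature of a piecewise-linear path via Chen's identity; the toy path and feature vector are purely illustrative.

```python
import numpy as np

def signature_level2(path):
    """Truncated signature (levels 1 and 2) of a piecewise-linear path.

    path: array of shape (N+1, d) holding the sampled path x_0, ..., x_N.
    Returns (S1, S2) with shapes (d,) and (d, d). For a piecewise-linear
    path these iterated integrals are exact.
    """
    increments = np.diff(path, axis=0)               # shape (N, d)
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for dx in increments:
        # Chen's identity: concatenate the running signature with one segment.
        S2 += np.outer(S1, dx) + 0.5 * np.outer(dx, dx)
        S1 += dx
    return S1, S2

# Example: signature features of a toy 2-d price/volatility path.
rng = np.random.default_rng(0)
path = np.cumsum(rng.normal(size=(100, 2)), axis=0) * 0.01
S1, S2 = signature_level2(path)
features = np.concatenate([S1, S2.ravel()])          # input to a feedforward net
print(features.shape)                                 # (6,)
```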


[2] 2508.02763

Polynomial complexity sampling from multimodal distributions using Sequential Monte Carlo

We study a sequential Monte Carlo algorithm to sample from the Gibbs measure with a non-convex energy function at a low temperature. We adopt the practical and popular geometric annealing schedule and run a Langevin diffusion at each temperature level. The Langevin diffusion only needs to run for a time that is long enough to ensure local mixing within energy valleys, which is much shorter than the time required for global mixing. Our main result shows convergence of Monte Carlo estimators with time complexity that, approximately, scales like the fourth power of the inverse temperature and the square of the inverse allowed error. We also study this algorithm in an illustrative model scenario where more explicit estimates can be given.
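
As a rough illustration of the algorithm being analyzed, here is a minimal numpy sketch of sequential Monte Carlo over a geometric grid of inverse temperatures with a few unadjusted Langevin steps at each level. The double-well energy, step sizes, and particle counts are made up, and the importance correction for the Gaussian initialization is omitted for brevity; this is not the paper's exact scheme or tuning.

```python
import numpy as np

rng = np.random.default_rng(1)

def U(x):
    # Illustrative non-convex energy: a double well in each coordinate.
    return np.sum((x**2 - 1.0)**2, axis=-1)

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

def smc_gibbs(n_particles=2000, d=2, beta_max=20.0, n_levels=40,
              langevin_steps=20, step=1e-3):
    # Geometric schedule of inverse temperatures up to beta_max.
    betas = beta_max ** (np.arange(1, n_levels + 1) / n_levels)
    x = rng.normal(size=(n_particles, d))   # broad start (correction omitted)
    beta_prev = 0.0
    for beta in betas:
        # Importance weights for moving from exp(-beta_prev U) to exp(-beta U).
        logw = -(beta - beta_prev) * U(x)
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n_particles, size=n_particles, p=w)]   # resampling
        # Unadjusted Langevin steps targeting exp(-beta U): only local mixing
        # within energy valleys is needed at each level.
        for _ in range(langevin_steps):
            x = x - step * beta * grad_U(x) \
                + np.sqrt(2 * step) * rng.normal(size=x.shape)
        beta_prev = beta
    return x

samples = smc_gibbs()
print(samples.mean(axis=0), samples.std(axis=0))
```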


[3] 2508.02888

Precision Profile Weighted Deming Regression for Methods Comparison

Errors in variables (Deming) regression of measurements spanning a wide range of values requires appropriate weighting to reflect nonconstant variance. Precision profile models, mathematical relationships between measurement variance and mean, are a route to these weights. The paper describes a methodology combining general precision profile models with Deming regression and describes R routines for the resulting calculations.
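
A hedged sketch of the kind of calculation involved: weighted Deming regression where the weights come from a precision profile evaluated at a working estimate of the true level. The quadratic precision profile, the error-variance ratio `delta`, and the simple fixed-point iteration are illustrative choices, not the paper's R routines.

```python
import numpy as np

def precision_profile(mu, c0=0.5, c1=0.05):
    # Illustrative precision profile: measurement SD grows linearly with the mean.
    return (c0 + c1 * mu) ** 2

def weighted_deming(x, y, delta=1.0, n_iter=20):
    """Weighted Deming regression.

    delta: assumed ratio Var(error in y) / Var(error in x).
    Weights are inverse precision-profile variances evaluated at a working
    estimate of the true level.
    """
    w = np.ones_like(x)
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        xw = np.sum(w * x) / np.sum(w)
        yw = np.sum(w * y) / np.sum(w)
        sxx = np.sum(w * (x - xw) ** 2)
        syy = np.sum(w * (y - yw) ** 2)
        sxy = np.sum(w * (x - xw) * (y - yw))
        # Closed-form Deming slope with the current weights.
        b = (syy - delta * sxx + np.sqrt((syy - delta * sxx) ** 2
                                         + 4 * delta * sxy ** 2)) / (2 * sxy)
        a = yw - b * xw
        mu = 0.5 * (x + (y - a) / b)           # working estimate of the true level
        w = 1.0 / precision_profile(mu)         # precision-profile weights
    return a, b

rng = np.random.default_rng(2)
truth = rng.uniform(1, 100, 200)
x = truth + rng.normal(scale=np.sqrt(precision_profile(truth)))
y = 1.05 * truth + 2 + rng.normal(scale=np.sqrt(precision_profile(truth)))
print(weighted_deming(x, y))
```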


[4] 2508.02922

A multi-stage Bayesian approach to fit spatial point process models

Spatial point process (SPP) models are commonly used to analyze point pattern data, including presence-only data in ecology. Current methods for fitting these models are computationally expensive because they require numerical quadrature and algorithm supervision (i.e., tuning) in the Bayesian setting. We propose a flexible and efficient multi-stage recursive Bayesian approach to fitting SPP models that leverages parallel computing resources to estimate point process model coefficients and derived quantities. We show how this method can be extended to study designs with compact observation windows and how it allows for posterior prediction of total abundance and points in unobserved areas, which can be used for downstream analyses. We demonstrate this approach using a simulation study and analyze data from aerial imagery surveys to improve our understanding of spatially explicit abundance of harbor seals (Phoca vitulina) in Johns Hopkins Inlet, a protected tidewater glacial fjord in Glacier Bay National Park, Alaska.


[5] 2508.02954

Sensitivity of weighted least squares estimators to omitted variables

This paper introduces tools for assessing the sensitivity, to unobserved confounding, of a common estimator of the causal effect of a treatment on an outcome that employs weights: the weighted linear regression of the outcome on the treatment and observed covariates. We demonstrate through the omitted variable bias framework that the bias of this estimator is a function of two intuitive sensitivity parameters: (i) the proportion of weighted variance in the treatment that unobserved confounding explains given the covariates and (ii) the proportion of weighted variance in the outcome that unobserved confounding explains given the covariates and the treatment, i.e., two weighted partial $R^2$ values. Following previous work, we define sensitivity statistics that lend themselves well to routine reporting, and derive formal bounds on the strength of the unobserved confounding with (a multiple of) the strength of select dimensions of the covariates, which help the user determine if unobserved confounding that would alter one's conclusions is plausible. We also propose tools for adjusted inference. A key choice we make is to examine only how the (weighted) outcome model is influenced by unobserved confounding, rather than examining how the weights have been biased by omitted confounding. One benefit of this choice is that the resulting tool applies with any weights (e.g., inverse-propensity score, matching, or covariate balancing weights). Another benefit is that we can rely on simple omitted variable bias approaches that, for example, impose no distributional assumptions on the data or unobserved confounding, and can address bias from misspecification in the observed data. We make these tools available in the weightsense package for the R computing language.
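
The omitted variable bias framework the abstract invokes leads to a simple bias bound once the two (weighted) partial $R^2$ sensitivity parameters are specified. The sketch below follows the standard $R^2$-parameterized OVB bound with weighted residual standard deviations plugged in; the inputs and random residuals are illustrative, and this is not the weightsense implementation.

```python
import numpy as np

def weighted_sd(resid, w):
    w = w / np.sum(w)
    m = np.sum(w * resid)
    return np.sqrt(np.sum(w * (resid - m) ** 2))

def ovb_bound(r2_y, r2_d, resid_y, resid_d, w):
    """Bias bound for a weighted regression coefficient under unobserved confounding.

    r2_y: share of weighted variance in the outcome (given covariates and
          treatment) explained by the confounder.
    r2_d: share of weighted variance in the treatment (given covariates)
          explained by the confounder.
    resid_y, resid_d: weighted-regression residuals of outcome and treatment.
    """
    bias_factor = np.sqrt(r2_y * r2_d / (1.0 - r2_d))
    return bias_factor * weighted_sd(resid_y, w) / weighted_sd(resid_d, w)

# Example: how much bias could confounding with both partial R^2 = 0.10 induce?
rng = np.random.default_rng(3)
w = rng.uniform(0.5, 2.0, 500)
resid_d = rng.normal(size=500)
resid_y = rng.normal(scale=3.0, size=500)
print(ovb_bound(r2_y=0.10, r2_d=0.10, resid_y=resid_y, resid_d=resid_d, w=w))
```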


[6] 2508.02965

Novel measures and estimators of income inequality

In this paper, we propose new income inequality measures that approximate the Gini coefficient and analyze the asymptotic properties of their estimators, including strong consistency and limiting distribution. Generalizations to the measures and estimators are developed. Simulation studies assess finite-sample performance, and an empirical example demonstrates practical relevance.
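
The new measures are not given in the abstract, but the quantity they approximate, the Gini coefficient, has a standard plug-in estimator from sorted data, sketched below for reference.

```python
import numpy as np

def gini(x):
    """Empirical Gini coefficient from a sample of non-negative incomes."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    # Equivalent to the mean absolute difference divided by twice the mean.
    return (2.0 * np.sum(ranks * x)) / (n * np.sum(x)) - (n + 1.0) / n

rng = np.random.default_rng(4)
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=10_000)
print(round(gini(incomes), 3))   # roughly 0.43 for a lognormal with sigma = 0.8
```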


[7] 2508.02970

Bayesian Sensitivity Analyses for Policy Evaluation with Difference-in-Differences under Violations of Parallel Trends

Violations of the parallel trends assumption pose significant challenges for causal inference in difference-in-differences (DiD) studies, especially in policy evaluations where pre-treatment dynamics and external shocks may bias estimates. In this work, we propose a Bayesian DiD framework to allow us to estimate the effect of policies when parallel trends is violated. To address potential deviations from the parallel trends assumption, we introduce a formal sensitivity parameter representing the extent of the violation, specify an autoregressive AR(1) prior on this term to robustly model temporal correlation, and explore a range of prior specifications - including fixed, fully Bayesian, and empirical Bayes (EB) approaches calibrated from pre-treatment data. By systematically comparing posterior treatment effect estimates across prior configurations when evaluating Philadelphia's sweetened beverage tax using Baltimore as a control, we show how Bayesian sensitivity analyses support robust and interpretable policy conclusions under violations of parallel trends.


[8] 2508.03059

Two-sample comparison through additive tree models for density ratios

The ratio of two densities characterizes their differences. We consider learning the density ratio given i.i.d. observations from each of the two distributions. We propose additive tree models for the density ratio along with efficient algorithms for training these models using a new loss function called the balancing loss. With this loss, additive tree models for the density ratio can be trained using algorithms originally designed for supervised learning. Specifically, they can be trained from both an optimization perspective that parallels tree boosting and from a (generalized) Bayesian perspective that parallels Bayesian additive regression trees (BART). For the former, we present two boosting algorithms -- one based on forward-stagewise fitting and the other based on gradient boosting, both of which produce a point estimate for the density ratio function. For the latter, we show that due to the loss function's resemblance to an exponential family kernel, the new loss can serve as a pseudo-likelihood for which conjugate priors exist, thereby enabling effective generalized Bayesian inference on the density ratio using backfitting samplers designed for BART. The resulting uncertainty quantification on the inferred density ratio is critical for applications involving high-dimensional and complex distributions in which uncertainty given limited data can often be substantial. We provide insights on the balancing loss through its close connection to the exponential loss in binary classification and to the variational form of f-divergence, in particular that of the squared Hellinger distance. Our numerical experiments demonstrate the accuracy of the proposed approach while providing unique capabilities in uncertainty quantification. We demonstrate the application of our method in a case study involving assessing the quality of generative models for microbiome compositional data.
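
Not the paper's balancing loss, but a closely related baseline that the exponential-loss connection hints at: the classifier-based (probabilistic classification) density-ratio estimate, here with gradient-boosted trees as the additive tree learner. The Gaussian samples are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)
x1 = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))   # sample from p1
x0 = rng.normal(loc=0.5, scale=1.2, size=(3000, 2))   # sample from p0

X = np.vstack([x1, x0])
z = np.concatenate([np.ones(len(x1)), np.zeros(len(x0))])

# Boosted trees estimate P(z = 1 | x); Bayes' rule converts this into p1/p0.
clf = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X, z)

def density_ratio(x):
    p = clf.predict_proba(x)[:, 1]
    prior_ratio = len(x0) / len(x1)                    # corrects for class sizes
    return prior_ratio * p / (1.0 - p)

grid = np.array([[0.0, 0.0], [2.0, 2.0]])
print(density_ratio(grid))
```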


[9] 2508.03074

Poisson Inventory Models with Many Items: An Empirical Bayes Approach

We consider inventory decisions with many items, each of which has Poisson demand. The rate of demand for individual items is estimated on the basis of observations of past demand. The problem is to determine the items to hold in stock and the amount of each one. Our setting provides a natural framework for the application of the empirical Bayes methodology. We show how to do this in practice and demonstrate the importance of making posterior estimates of different demand levels, rather than just estimating the Poisson rate. We also address the question of when it is beneficial to separately analyse a group of items which are distinguished in some way. An example occurs when looking at inventory for a book retailer, who may find it advantageous to look separately at certain types of book (e.g. biographies). The empirical Bayes methodology is valuable when dealing with items having Poisson demand, and can be effective even with relatively small numbers of distinct items (e.g. 100). We discuss the best way to apply an empirical Bayes methodology in this context, and also show that doing this in the wrong way will reduce or eliminate the potential benefits.
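
A minimal sketch of Gamma-Poisson empirical Bayes across many items, in the spirit of the abstract: fit a gamma prior to the observed counts by method of moments, then use each item's negative-binomial posterior predictive to set a stock level. The simulated demands, window length, and 95% service level are illustrative, not the paper's recipe.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Historical demand: one observation window of length t per item.
n_items, t = 100, 12.0
true_rates = rng.gamma(shape=1.5, scale=0.4, size=n_items)
counts = rng.poisson(true_rates * t)

# Method-of-moments fit of a Gamma(shape=a, rate=b) prior on the rates,
# subtracting the extra Poisson variance from the observed-count variance.
m = counts.mean() / t
v = (counts.var(ddof=1) - counts.mean()) / t**2
a_hat = m**2 / max(v, 1e-8)
b_hat = m / max(v, 1e-8)

# Per-item posterior is Gamma(a_hat + y, b_hat + t); the posterior predictive
# demand over a horizon h is negative binomial.
h = 1.0
post_shape = a_hat + counts
post_rate = b_hat + t
p_nb = post_rate / (post_rate + h)
stock = stats.nbinom.ppf(0.95, post_shape, p_nb)   # 95% service-level stock
print(stock[:10])
```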


[10] 2508.03282

Adaptive Data-Borrowing for Improving Treatment Effect Estimation using External Controls

Randomized controlled trials (RCTs) often exhibit limited inferential efficiency in estimating treatment effects due to small sample sizes. In recent years, the combination of external controls has gained increasing attention as a means of improving the efficiency of RCTs. However, external controls are not always comparable to RCTs, and direct borrowing without careful evaluation can introduce substantial bias and reduce the efficiency of treatment effect estimation. In this paper, we propose a novel influence-based adaptive sample borrowing approach that effectively quantifies the "comparability'' of each sample in the external controls using influence function theory. Given a selected set of borrowed external controls, we further derive a semiparametric efficient estimator under an exchangeability assumption. Recognizing that the exchangeability assumption may not hold for all possible borrowing sets, we conduct a detailed analysis of the asymptotic bias and variance of the proposed estimator under violations of exchangeability. Building on this bias-variance trade-off, we further develop a data-driven approach to select the optimal subset of external controls for borrowing. Extensive simulations and real-world applications demonstrate that the proposed approach significantly enhances treatment effect estimation efficiency in RCTs, outperforming existing approaches.


[11] 2508.03310

Robust fuzzy clustering with cellwise outliers

Fuzzy clustering is a technique for identifying subgroups in heterogeneous populations by quantifying unit membership degrees. The magnitude of the latter depends on the desired level of fuzzification, based on the purpose of the analysis. We combine the advantages of fuzzy clustering with a robust approach able to detecting cellwise outliers, i.e., anomalous cells in a data matrix. The proposed methodology is formulated within a probabilistic framework and estimated via an Expectation-Maximization algorithm for missing data. It includes an additional step for flagging contaminated cells, which are then treated as missing information. The strengths of the model are illustrated through two real-world applications: the first one identifies individuals at potential risk of obesity based on their physiological measurements, while the second one analyzes well-being across regions of the OECD countries. We also explore the effects of the model's tuning parameters and provide guidance for users on how to set them suitably.


[12] 2508.03314

A Dual Optimization View to Empirical Risk Minimization with f-Divergence Regularization

The dual formulation of empirical risk minimization with f-divergence regularization (ERM-fDR) is introduced. The solution of the dual optimization problem to the ERM-fDR is connected to the notion of normalization function introduced as an implicit function. This dual approach leverages the Legendre-Fenchel transform and the implicit function theorem to provide a nonlinear ODE expression to the normalization function. Furthermore, the nonlinear ODE expression and its properties provide a computationally efficient method to calculate the normalization function of the ERM-fDR solution under a mild condition.


[13] 2508.03431

A Note on The Rationale Behind Using Parental Longevity as a Proxy in Mendelian Randomization Studies

In many cohorts (such as the UK Biobank) on which Mendelian Randomization studies are routinely performed, data on participants' longevity is inadequate as the majority of participants are still living. To nevertheless estimate effects on longevity, it is increasingly common for researchers to substitute participants' `parental attained age', i.e. parental lifespan or current age (which is routinely collected in UK Biobank), as a proxy outcome. The common approach to performing this clever trick appears to be based on a solid understanding of its underlying assumptions. However, we have not seen these assumptions (or the causal effects whose identification they enable) clearly stated anywhere in the literature. In this note, we fill that gap.


[14] 2508.03504

A New Perspective on High Dimensional Confidence Intervals

Classically, confidence intervals are required to have consistent coverage across all values of the parameter. However, this will inevitably break down if the underlying estimation procedure is biased. For this reason, many efforts have focused on debiased versions of the lasso for interval construction. In the process of debiasing, however, the connection to the original estimates is often obscured. In this work, we offer a different perspective focused on average coverage in contrast to individual coverage. This perspective results in confidence intervals that better reflect the original assumptions, as opposed to debiased intervals, which often do not even contain the original lasso estimates. To this end we propose a method based on the Relaxed Lasso that gives approximately correct average coverage and compare this to debiased methods which attempt to produce correct individual coverage. With this new definition of coverage we also briefly revisit the bootstrap, which Chatterjee and Lahiri (2010) showed was inconsistent for lasso, but find that it fails even under this alternative coverage definition.
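
To make the average-coverage notion concrete, here is a small simulation sketch that evaluates the fraction of the $p$ coefficients covered by naive intervals from a lasso-then-OLS refit (a common special case of the relaxed lasso), with unselected coordinates treated as the degenerate interval $\{0\}$. This is only an illustration of the coverage criterion, not the paper's proposed procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, p, s = 100, 200, 5
beta = np.zeros(p); beta[:s] = 2.0

def average_coverage(n_rep=20, z=1.645):        # z for two-sided 90% intervals
    cov = []
    for _ in range(n_rep):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        sel = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
        covered = np.zeros(p, dtype=bool)
        covered[beta == 0] = True               # unselected coords: interval {0}
        if sel.size:
            Xs = X[:, sel]
            bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ bhat
            sigma2 = resid @ resid / (n - sel.size)
            se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
            lo, hi = bhat - z * se, bhat + z * se
            covered[sel] = (lo <= beta[sel]) & (beta[sel] <= hi)
        cov.append(covered.mean())              # average over all p coordinates
    return np.mean(cov)

print(average_coverage())
```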


[15] 2508.03508

UnMuted: Defining SARS-CoV-2 Lineages According to Temporally Consistent Mutation Clusters in Wastewater Samples

SARS-CoV-2 lineages are defined according to placement in a phylogenetic tree, but approximated by a list of mutations based on sequences collected from clinical sampling. Wastewater lineage abundance is generally found under the assumption that the mutation frequency is approximately equal to the sum of the abundances of the lineages to which it belongs. By leveraging numerous samples collected over time, I am able to estimate the temporal trends of the abundance of lineages as well as the definitions of those lineages. This is accomplished by assuming that collections of mutations that appear together over time constitute lineages, then estimating the proportions as before. Three main models are considered: Two that incorporate an explicit temporal trend with different constraints on the abundances, and one that does not estimate a temporal component. It is found that estimated lineage definitions correspond to known lineage definitions with matching temporal trends for the lineage abundances, despite having no information from clinical samples. I refer to this set of methods as "UnMuted" since the mutations are allowed to speak for themselves.
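
The deconvolution step the abstract refers to, with known lineage definitions, is a simplex-constrained least-squares problem: observed mutation frequencies are approximately the mutation-by-lineage indicator matrix times the abundance vector. A tiny sketch with a made-up definition matrix:

```python
import numpy as np
from scipy.optimize import nnls

# Rows: mutations observed in wastewater; columns: candidate lineages.
# M[i, j] = 1 if lineage j carries mutation i (a made-up definition matrix).
M = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

true_props = np.array([0.2, 0.5, 0.3])
freqs = M @ true_props + np.random.default_rng(8).normal(scale=0.02, size=5)

# Frequency of each mutation ~ sum of abundances of lineages that carry it:
# solve freqs ~= M @ p with p >= 0, then renormalize onto the simplex.
p, _ = nnls(M, freqs)
p = p / p.sum()
print(np.round(p, 3))
```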


[16] 2508.03546

Supervised Dynamic Dimension Reduction with Deep Neural Network

This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process. Assisted by a temporal neural network, we construct target-aware predictors by scaling the original predictors in a supervised manner, with larger weights assigned to predictors with stronger forecasting power. A principal component analysis is then performed on the target-aware predictors to extract the estimated SDDP factors. This supervised factor extraction not only improves predictive accuracy in the downstream forecasting task but also yields more interpretable and target-specific latent factors. Building upon SDDP, we propose a factor-augmented nonlinear dynamic forecasting model that unifies a broad family of factor-model-based forecasting approaches. To further demonstrate the broader applicability of SDDP, we extend our studies to a more challenging scenario when the predictors are only partially observable. We validate the empirical performance of the proposed method on several real-world public datasets. The results show that our algorithm achieves notable improvements in forecasting accuracy compared to state-of-the-art methods.


[17] 2508.03610

Density Estimation from Aggregated Data with Integrated Auxiliary Information: Estimating Population Densities with Geospatial Data

Density estimation for geospatial data ideally relies on precise geocoordinates, typically defined by longitude and latitude. However, such detailed information is often unavailable due to confidentiality constraints. As a result, analysts frequently work with spatially aggregated data, commonly visualized through choropleth maps. Approaches that reverse the aggregation process using measurement error models in the context of kernel density estimation have been proposed in the literature. From a methodological perspective, we extend this line of work by incorporating auxiliary information to improve the precision of density estimates derived from aggregated data. Our approach employs a correlation-based weighting scheme to combine the auxiliary density with the estimate obtained from aggregated data. We evaluate the method through a series of model-based simulation scenarios reflecting varying conditions of auxiliary data quality. From an applied perspective, we demonstrate the utility of our method in two real-world case studies: (1) estimating population densities from the 2022 German Census in Bavaria, using satellite imagery of nighttime light emissions as auxiliary data; and (2) analyzing brown hare hunting bag data in the German state of Lower Saxony. Overall, our results show that integrating auxiliary information into the estimation process leads to more precise density estimates.


[18] 2508.03617

Expanding the Standard Diffusion Process to Specified Non-Gaussian Marginal Distributions

We develop a class of non-Gaussian translation processes that extend classical stochastic differential equations (SDEs) by prescribing arbitrary absolutely continuous marginal distributions. Our approach uses a copula-based transformation to flexibly model skewness, heavy tails, and other non-Gaussian features often observed in real data. We rigorously define the process, establish key probabilistic properties, and construct a corresponding diffusion model via stochastic calculus, including proofs of existence and uniqueness. A simplified approximation is introduced and analyzed, with error bounds derived from asymptotic expansions. Simulations demonstrate that both the full and simplified models recover target marginals with high accuracy. Examples using the Student's t, asymmetric Laplace, and Exponentialized Generalized Beta of the Second Kind (EGB2) distributions illustrate the flexibility and tractability of the framework.
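
A minimal sketch of the translation construction described: simulate a standard Gaussian Ornstein-Uhlenbeck driver and push each marginal through $\Phi$ and a target inverse CDF (Student's t here), so the marginals are non-Gaussian while the temporal dependence comes from the Gaussian process. Parameters are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Simulate a standard OU process Z_t with stationary N(0, 1) marginals.
T, dt, theta = 10.0, 0.01, 1.0
n = int(T / dt)
z = np.empty(n); z[0] = rng.normal()
for k in range(1, n):
    z[k] = z[k - 1] - theta * z[k - 1] * dt + np.sqrt(2 * theta * dt) * rng.normal()

# Translation (copula) step: X_t = F^{-1}(Phi(Z_t)) has Student-t marginals
# while inheriting the temporal dependence of the Gaussian driver.
nu = 4.0
x = stats.t.ppf(stats.norm.cdf(z), df=nu)
print(x.mean(), x.std())     # heavy-tailed marginals, mean near 0
```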


[19] 2508.03636

Likelihood Matching for Diffusion Models

We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered to approximate each reverse transition density by a Gaussian distribution with matched conditional mean and covariance. The score and Hessian functions for the diffusion generation are estimated by maximizing the quasi-likelihood, ensuring consistent matching of the first two transition moments between every two time points. A stochastic sampler is introduced to facilitate computation that leverages both the estimated score and Hessian information. We establish consistency of the quasi-maximum likelihood estimation, and provide non-asymptotic convergence guarantees for the proposed sampler, quantifying the rates of the approximation errors due to the score and Hessian estimation, dimensionality, and the number of diffusion steps. Empirical and simulation evaluations demonstrate the effectiveness of the proposed Likelihood Matching and validate the theoretical results.


[20] 2508.03653

Optimized imaging prefiltering for enhanced image segmentation

The Box-Cox transformation, introduced in 1964, is a widely used statistical tool for stabilizing variance and improving normality in data analysis. Its application in image processing, particularly for image enhancement, has gained increasing attention in recent years. This paper investigates the use of the Box-Cox transformation as a preprocessing step for image segmentation, with a focus on the estimation of the transformation parameter. We evaluate the effectiveness of the transformation by comparing various segmentation methods, highlighting its advantages for traditional machine learning techniques, especially in situations where no training data is available. The results demonstrate that the transformation enhances feature separability and computational efficiency, making it particularly beneficial for models like discriminant analysis. In contrast, deep learning models did not show consistent improvements, underscoring the differing impacts of the transformation across model types and image characteristics.
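
A minimal sketch of the preprocessing step under study: estimate the Box-Cox parameter by maximum likelihood on the (strictly positive) pixel intensities and transform the image before segmentation. The synthetic two-population image and the global threshold standing in for a segmenter are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)

# Synthetic image: two skewed intensity populations (background, object).
img = np.where(rng.random((128, 128)) < 0.7,
               rng.gamma(2.0, 5.0, (128, 128)),
               rng.gamma(8.0, 5.0, (128, 128)))

# Box-Cox requires strictly positive data; lambda is estimated by MLE.
flat = img.ravel() + 1e-6
transformed, lam = stats.boxcox(flat)
img_bc = transformed.reshape(img.shape)
print(f"estimated lambda = {lam:.2f}")

# Downstream segmentation stand-in: a simple global threshold, applied on the
# variance-stabilized scale.
mask = img_bc > img_bc.mean()
print(mask.mean())
```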


[21] 2508.03675

A New Approach to Partial Conjunction Analysis in Neuroimaging

The problem of identifying the brain regions activated through a particular cognitive task is pivotal in neuroimaging. This problem becomes even more complex if we have several cognitive tasks or several subjects. In this paper, we view this problem as a partial conjunction (PC) hypotheses testing problem, i.e., we are testing whether a specific brain region is activated in at least $\gamma$ (for some pre-fixed $\gamma$) subjects. We propose the application of a recent advance in the simultaneous statistical inference literature to activation localization in neuroimaging. We apply the recently proposed CoFilter method to neuroimaging data to discover brain regions activated in at least $\gamma$ subjects. Our proposal has two distinct advantages. First, it alleviates the conservativeness displayed by the traditional multiple testing procedures in testing PC hypotheses by eliminating many of the conservative PC $p$-values. Second, it is especially suitable for several high-dimensional studies, each of which examines a large number of null hypotheses. We also compare the performance of our proposal with existing methods for testing PC hypotheses through extensive simulation studies on neuroimaging data and a real dataset.
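
For context, the classical Bonferroni-style partial conjunction p-value, against which methods of this kind are typically compared, combines the $n-\gamma+1$ largest elementary p-values. A short sketch (not the CoFilter method):

```python
import numpy as np

def pc_pvalue_bonferroni(pvals, gamma):
    """Partial-conjunction p-value for 'at least gamma of n nulls are false'.

    Under the PC null, at least n - gamma + 1 elementary nulls are true, so a
    Bonferroni combination of the n - gamma + 1 largest p-values (equivalently,
    (n - gamma + 1) times the gamma-th smallest p-value) is valid.
    """
    p = np.sort(np.asarray(pvals))
    n = p.size
    return min(1.0, (n - gamma + 1) * p[gamma - 1])

# One brain voxel tested in n = 8 subjects; is it active in at least 5?
pvals = [0.001, 0.003, 0.004, 0.02, 0.03, 0.40, 0.55, 0.70]
print(pc_pvalue_bonferroni(pvals, gamma=5))   # 4 * 0.03 = 0.12
```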


[22] 2508.03688

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta_j}, \boldsymbol{x}\rangle\right), \boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, $\sigma$ is the 2nd Hermite polynomial, and $\lbrace\boldsymbol{\theta}_j \rbrace_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^\beta$ for $\beta \in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients $\lambda_j\asymp j^{-\alpha}$ for $\alpha \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.


[23] 2508.02692

Overcoming the Loss Conditioning Bottleneck in Optimization-Based PDE Solvers: A Novel Well-Conditioned Loss Function

Optimization-based PDE solvers that minimize scalar loss functions have gained increasing attention in recent years. These methods either define the loss directly over discrete variables, as in Optimizing a Discrete Loss (ODIL), or indirectly through a neural network surrogate, as in Physics-Informed Neural Networks (PINNs). However, despite their promise, such methods often converge much more slowly than classical iterative solvers and are commonly regarded as inefficient. This work provides a theoretical insight, attributing the inefficiency to the use of the mean squared error (MSE) loss, which implicitly forms the normal equations, squares the condition number, and severely impairs optimization. To address this, we propose a novel Stabilized Gradient Residual (SGR) loss. By tuning a weight parameter, it flexibly modulates the condition number between the original system and its normal equations, while reducing to the MSE loss in the limiting case. We systematically benchmark the convergence behavior and optimization stability of the SGR loss within both the ODIL framework and PINNs-employing either numerical or automatic differentiation-and compare its performance against classical iterative solvers. Numerical experiments on a range of benchmark problems demonstrate that, within the ODIL framework, the proposed SGR loss achieves orders-of-magnitude faster convergence than the MSE loss. Further validation within the PINNs framework shows that, despite the high nonlinearity of neural networks, SGR consistently outperforms the MSE loss. These theoretical and empirical findings help bridge the performance gap between classical iterative solvers and optimization-based solvers, highlighting the central role of loss conditioning, and provide key insights for the design of more efficient PDE solvers.


[24] 2508.02709

Advancing Computational Tools for Analyzing Commutative Hypercomplex Algebras

Commutative hypercomplex algebras offer significant advantages over traditional quaternions due to their compatibility with linear algebra techniques and efficient computational implementation, which is crucial for broad applicability. This paper explores a novel family of commutative hypercomplex algebras, referred to as (alpha,beta)-tessarines, which extend the system of generalized Segre's quaternions and, consequently, elliptic quaternions. The main contribution of this work is the development of theoretical and computational tools for matrices within this algebraic system, including inversion, square root computation, LU factorization with partial pivoting, and determinant calculation. Additionally, a spectral theory for (alpha,beta)-tessarines is established, covering eigenvalue and eigenvector analysis, the power method, singular value decomposition, rank-k approximation, and the pseudoinverse. Solutions to the classical least squares problem are also presented. These results not only enhance the fundamental understanding of hypercomplex algebras but also provide researchers with novel matrix operations that have not been extensively explored in previous studies. The theoretical findings are supported by real-world examples, including image reconstruction and color face recognition, which demonstrate the potential of the proposed techniques.


[25] 2508.02715

Cholesky decomposition for symmetric matrices, Riemannian geometry, and random matrices

For each $n \geq 1$ and sign pattern $\epsilon \in \{ \pm 1 \}^n$, we introduce a cone of real symmetric matrices $LPM_n(\epsilon)$: those with leading principal $k \times k$ minors of signs $\epsilon_k$. These cones are pairwise disjoint and their union $LPM_n$ is a dense cone in all symmetric matrices; they subsume positive and negative definite matrices, and symmetric (P-,) N-, PN-, almost P-, and almost N- matrices. We show that each $LPM_n$ matrix $A$ admits an uncountable family of Cholesky-type factorizations - yielding a unique lower triangular matrix $L$ with positive diagonals - with additional attractive properties: (i) each such factorization is algorithmic; and (ii) each such Cholesky map $A \mapsto L$ is a smooth diffeomorphism from $LPM_n(\epsilon)$ onto an open Euclidean ball. We then show that (iii) the (diffeomorphic) balls $LPM_n(\epsilon)$ are isometric Riemannian manifolds as well as isomorphic abelian Lie groups, each equipped with a translation-invariant Riemannian metric (and hence Riemannian means/barycentres). Moreover, (iv) this abelian metric group structure on each $LPM_n(\epsilon)$ - and hence the log-Cholesky metric on Cholesky space - yields an isometric isomorphism onto a finite-dimensional Euclidean space. The complex version of this also holds. In the latter part, we show that the abelian group $PD_n$ of positive definite matrices, with its bi-invariant log-Cholesky metric, is precisely the identity-component of a larger group with an alternate metric: the dense cone $LPM_n$. This also holds for Hermitian matrices over several subfields $\mathbb{F} \subseteq \mathbb{C}$. As a result, (v) the groups $LPM_n^{\mathbb{F}}$ and $LPM_\infty^{\mathbb{F}}$ admit a rich probability theory, and the cones $LPM_n(\epsilon), TPM_n(\epsilon)$ admit Wishart densities with signed Bartlett decompositions.


[26] 2508.02874

Beyond Least Squares: Robust Regression Transformer (R2T)

Robust regression techniques rely on least-squares optimization, which works well for Gaussian noise but fails in the presence of asymmetric structured noise. We propose a hybrid neural-symbolic architecture where a transformer encoder processes numerical sequences, a compression NN predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. Using synthetic data, the training objective is to recover the original sequence after adding asymmetric structured noise, effectively learning a symbolic fit guided by neural parameter estimation. Our model achieves a median regression MSE of 6e-6 to 3.5e-5 on synthetic wearable data, which is a 10-300 times improvement when compared with an ordinary least squares fit and robust regression techniques such as Huber loss or SoftL1.
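
The baselines mentioned (ordinary least squares and robust losses such as Huber or SoftL1) can be reproduced with scipy's least_squares; the exponential-decay model and spiky asymmetric noise below are illustrative stand-ins for the wearable data.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(11)
t = np.linspace(0, 10, 300)
true = np.array([2.0, 0.5, 1.0])                 # amplitude, decay rate, offset
signal = true[0] * np.exp(-true[1] * t) + true[2]

# Asymmetric structured noise: occasional large positive spikes.
noise = rng.normal(scale=0.05, size=t.size)
spikes = rng.random(t.size) < 0.1
noise[spikes] += rng.exponential(1.5, size=spikes.sum())
y = signal + noise

def residuals(params):
    a, b, c = params
    return a * np.exp(-b * t) + c - y

x0 = np.ones(3)
fit_ols = least_squares(residuals, x0)                              # plain LS
fit_huber = least_squares(residuals, x0, loss="huber", f_scale=0.1)
fit_soft = least_squares(residuals, x0, loss="soft_l1", f_scale=0.1)
for name, fit in [("OLS", fit_ols), ("Huber", fit_huber), ("SoftL1", fit_soft)]:
    print(name, np.round(fit.x, 3))
```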


[27] 2508.02908

Random Effects Models for Understanding Variability and Association between Brain Functional and Structural Connectivity

The human brain is organized as a complex network, where connections between regions are characterized by both functional connectivity (FC) and structural connectivity (SC). While previous studies have primarily focused on network-level FC-SC correlations (i.e., the correlation between FC and SC across all edges within a predefined network), edge-level correlations (i.e., the correlation between FC and SC across subjects at each edge) have received comparatively little attention. In this study, we systematically analyze both network-level and edge-level FC-SC correlations, demonstrating that they lead to divergent conclusions about the strength of brain function-structure association. To explain these discrepancies, we introduce new random effects models that decompose FC and SC variability into different sources: subject effects, edge effects, and their interactions. Our results reveal that network-level and edge-level FC-SC correlations are influenced by different effects, each contributing differently to the total variability in FC and SC. This modeling framework provides the first statistical approach for disentangling and quantitatively assessing different sources of FC and SC variability and yields new insights into the relationship between functional and structural brain networks.


[28] 2508.02924

BoostTransformer: Enhancing Transformer Models with Subgrid Selection and Importance Sampling

Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.


[29] 2508.02945

LLM-based IR-system for Bank Supervisors

Bank supervisors face the complex task of ensuring that new measures are consistently aligned with historical precedents. To address this challenge, we introduce a novel Information Retrieval (IR) System tailored to assist supervisors in drafting both consistent and effective measures. This system ingests findings from on-site investigations. It then retrieves the most relevant historical findings and their associated measures from a comprehensive database, providing a solid basis for supervisors to write well-informed measures for new findings. Utilizing a blend of lexical, semantic, and Capital Requirements Regulation (CRR) fuzzy set matching techniques, the IR system ensures the retrieval of findings that closely align with current cases. The performance of this system, particularly in scenarios with partially labeled data, is validated through a Monte Carlo methodology, showcasing its robustness and accuracy. Enhanced by a Transformer-based Denoising AutoEncoder for fine-tuning, the final model achieves a Mean Average Precision (MAP@100) of 0.83 and a Mean Reciprocal Rank (MRR@100) of 0.92. These scores surpass those of both standalone lexical models such as BM25 and semantic BERT-like models.


[30] 2508.02964

Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie's formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criteria that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.
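
The unconditional posterior mean that the abstract starts from is Tweedie's formula; a short sketch in the VP/DDPM parameterization with a placeholder score model, including a Gaussian toy case where the answer is known in closed form.

```python
import numpy as np

def tweedie_posterior_mean(x_t, score, alpha_bar_t):
    """E[x_0 | x_t] via Tweedie's formula in the VP/DDPM parameterization.

    Assumes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and
    that `score` approximates the marginal score grad_x log p_t(x_t).
    """
    return (x_t + (1.0 - alpha_bar_t) * score(x_t)) / np.sqrt(alpha_bar_t)

# Toy check with x_0 ~ N(0, I): then p_t = N(0, I), the score is -x, and the
# true conditional mean is sqrt(alpha_bar_t) * x_t, matching the formula.
alpha_bar = 0.64
x_t = np.array([1.0, -2.0])
print(tweedie_posterior_mean(x_t, lambda x: -x, alpha_bar))   # 0.8 * x_t
```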


[31] 2508.02966

Measuring Human Leadership Skills with Artificially Intelligent Agents

We show that the ability to lead groups of humans is predicted by leadership skill with Artificially Intelligent agents. In a large pre-registered lab experiment, human leaders worked with AI agents to solve problems. Their performance on this 'AI leadership test' was strongly correlated with their causal impact on human teams, which we estimate by repeatedly randomly assigning leaders to groups of human followers and measuring team performance. Successful leaders of both humans and AI agents ask more questions and engage in more conversational turn-taking; they score higher on measures of social intelligence, fluid intelligence, and decision-making skill, but do not differ in gender, age, ethnicity or education. Our findings indicate that AI agents can be effective proxies for human participants in social experiments, which greatly simplifies the measurement of leadership and teamwork skills.


[32] 2508.03072

Achieving Limited Adaptivity for Multinomial Logistic Bandits

Multinomial Logistic Bandits have recently attracted much attention due to their ability to model problems with multiple outcomes. In this setting, each decision is associated with many possible outcomes, modeled using a multinomial logit function. Several recent works on multinomial logistic bandits have simultaneously achieved optimal regret and computational efficiency. However, motivated by real-world challenges and practicality, there is a need to develop algorithms with limited adaptivity, wherein we are allowed only $M$ policy updates. To address these challenges, we present two algorithms, B-MNL-CB and RS-MNL, that operate in the batched and rarely-switching paradigms, respectively. The batched setting involves choosing the $M$ policy update rounds at the start of the algorithm, while the rarely-switching setting can choose these $M$ policy update rounds in an adaptive fashion. Our first algorithm, B-MNL-CB extends the notion of distributional optimal designs to the multinomial setting and achieves $\tilde{O}(\sqrt{T})$ regret assuming the contexts are generated stochastically when presented with $\Omega(\log \log T)$ update rounds. Our second algorithm, RS-MNL works with adversarially generated contexts and can achieve $\tilde{O}(\sqrt{T})$ regret with $\tilde{O}(\log T)$ policy updates. Further, we conducted experiments that demonstrate that our algorithms (with a fixed number of policy updates) are extremely competitive (and often better) than several state-of-the-art baselines (which update their policy every round), showcasing the applicability of our algorithms in various practical scenarios.


[33] 2508.03210

Convergence of Deterministic and Stochastic Diffusion-Model Samplers: A Simple Analysis in Wasserstein Distance

We provide new convergence guarantees in Wasserstein distance for diffusion-based generative models, covering both stochastic (DDPM-like) and deterministic (DDIM-like) sampling methods. We introduce a simple framework to analyze discretization, initialization, and score estimation errors. Notably, we derive the first Wasserstein convergence bound for the Heun sampler and improve existing results for the Euler sampler of the probability flow ODE. Our analysis emphasizes the importance of spatial regularity of the learned score function and argues for controlling the score error with respect to the true reverse process, in line with denoising score matching. We also incorporate recent results on smoothed Wasserstein distances to sharpen initialization error bounds.
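
For reference, the Heun sampler analyzed here is a predictor-corrector (trapezoidal) step applied to the probability flow ODE. The sketch below uses the VE parameterization $dx/dt = -t\,\nabla_x \log p_t(x)$ with a known Gaussian score as a toy check; the schedule and starting point are illustrative.

```python
import numpy as np

def heun_step(x, t, t_next, score):
    """One Heun (predictor-corrector) step of the probability flow ODE.

    VE parameterization with noise level sigma = t: dx/dt = -t * score(x, t).
    """
    h = t_next - t
    d1 = -t * score(x, t)                   # Euler slope at (x, t)
    x_pred = x + h * d1                     # Euler predictor
    d2 = -t_next * score(x_pred, t_next)    # slope at the predicted point
    return x + 0.5 * h * (d1 + d2)          # trapezoidal corrector

# Toy check: data = N(0, I), so p_t = N(0, (1 + t^2) I) and the score is known.
score = lambda x, t: -x / (1.0 + t**2)
x = np.array([30.0, -30.0])                 # noisy state at t = 10
ts = np.linspace(10.0, 0.0, 50)
for t, t_next in zip(ts[:-1], ts[1:]):
    x = heun_step(x, t, t_next, score)
# Exact solution scales like sqrt(1 + t^2), so x(0) ~ x(10) / sqrt(101).
print(x)
```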


[34] 2508.03245

On Conformal Machine Unlearning

The increasing demand for data privacy, driven by regulations such as GDPR and CCPA, has made Machine Unlearning (MU) essential for removing the influence of specific training samples from machine learning models while preserving performance on retained data. However, most existing MU methods lack rigorous statistical guarantees, rely on heuristic metrics, and often require computationally expensive retraining baselines. To overcome these limitations, we introduce a new definition for MU based on Conformal Prediction (CP), providing statistically sound, uncertainty-aware guarantees without the need for the concept of naive retraining. We formalize conformal criteria that quantify how often forgotten samples are excluded from CP sets, and propose empirical metrics, the Efficiently Covered Frequency (ECF at c) and its complement, the Efficiently Uncovered Frequency (EuCF at d), to measure the effectiveness of unlearning. We further present a practical unlearning method designed to optimize these conformal metrics. Extensive experiments across diverse forgetting scenarios, datasets and models demonstrate the efficacy of our approach in removing targeted data.


[35] 2508.03272

The alpha-beta divergence for real and complex data

Divergences are fundamental to the information criteria that underpin most signal processing algorithms. The alpha-beta family of divergences, designed for non-negative data, offers a versatile framework that parameterizes and continuously interpolates several separable divergences found in existing literature. This work extends the definition of alpha-beta divergences to accommodate complex data, specifically when the arguments of the divergence are complex vectors. This novel formulation is designed in such a way that, by setting the divergence hyperparameters to unity, it particularizes to the well-known Euclidean and Mahalanobis squared distances. Other choices of hyperparameters yield practical separable and non-separable extensions of several classical divergences. In the context of the problem of approximating a complex random vector, the centroid obtained by optimizing the alpha-beta mean distortion has a closed-form expression, whose interpretation sheds light on the distinct roles of the divergence hyperparameters. These contributions may have wide potential applicability, as there are many signal processing domains in which the underlying data are inherently complex.
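
For the real, non-negative case that this work extends, the separable alpha-beta divergence has the standard closed form below (valid when $\alpha$, $\beta$ and $\alpha+\beta$ are all nonzero; boundary cases are defined by limits). At $\alpha=\beta=1$ it reduces to half the squared Euclidean distance, consistent with the abstract.

```python
import numpy as np

def ab_divergence(p, q, alpha, beta):
    """Separable alpha-beta divergence between non-negative vectors p and q.

    Assumes alpha, beta and alpha + beta are all nonzero; the remaining cases
    are obtained as limits in the literature and are not handled here.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = alpha + beta
    term = p**alpha * q**beta - (alpha / s) * p**s - (beta / s) * q**s
    return -np.sum(term) / (alpha * beta)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(ab_divergence(p, q, 1.0, 1.0))          # equals 0.5 * ||p - q||^2
print(0.5 * np.sum((p - q) ** 2))
```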


[36] 2508.03452

Reconstructing the Probability Measure of a Curie-Weiss Model Observing the Realisations of a Subset of Spins

We study the problem of reconstructing the probability measure of the Curie-Weiss model from a sample of the voting behaviour of a subset of the population. While originally used to study phase transitions in statistical mechanics, the Curie-Weiss or mean-field model has been applied to study phenomena, where many agents interact with each other. It is useful to measure the degree of social cohesion in social groups, which manifests in the way the members of the group influence each others' decisions. In practice, statisticians often only have access to survey data from a representative subset of a population. As such, it is useful to provide methods to estimate social cohesion from such data. The estimators we study have some positive properties, such as consistency, asymptotic normality, and large deviation principles. The main advantages are that they require only a sample of votes belonging to a (possibly very small) subset of the population and have a low computational cost. Due to the wide application of models such as Curie-Weiss, these estimators are potentially useful in disciplines such as political science, sociology, automated voting, and preference aggregation.


[37] 2508.03593

On the (In)Significance of Feature Selection in High-Dimensional Datasets

Extensive research has been done on feature selection (FS) algorithms for high-dimensional datasets aiming to improve model performance, reduce computational cost and identify features of interest. We test the null hypothesis of using randomly selected features to compare against features selected by FS algorithms to validate the performance of the latter. Our results show that FS on high-dimensional datasets (in particular gene expression) in classification tasks is not useful. We find that (1) models trained on small subsets (0.02%-1% of all features) of randomly selected features almost always perform comparably to those trained on all features, and (2) a "typical"-sized random subset provides comparable or superior performance to that of top-k features selected in various published studies. Thus, our work challenges many feature selection results on high-dimensional datasets, particularly in computational genomics. It raises serious concerns about studies that propose drug design or targeted interventions based on computationally selected genes, without further validation in a wet lab.


[38] 2508.03633

Pair Correlation Factor and the Sample Complexity of Gaussian Mixtures

We study the problem of learning Gaussian Mixture Models (GMMs) and ask: which structural properties govern their sample complexity? Prior work has largely tied this complexity to the minimum pairwise separation between components, but we demonstrate this view is incomplete. We introduce the \emph{Pair Correlation Factor} (PCF), a geometric quantity capturing the clustering of component means. Unlike the minimum gap, the PCF more accurately dictates the difficulty of parameter recovery. In the uniform spherical case, we give an algorithm with improved sample complexity bounds, showing when more than the usual $\epsilon^{-2}$ samples are necessary.


[39] 2508.03677

FairLangProc: A Python package for fairness in NLP

The rise in usage of Large Language Models to near ubiquitousness in recent years has raised societal concerns about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which leads to the development of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementation is diverse and far from centralized. As a response, this paper presents FairLangProc, a comprehensive Python package providing a common implementation of some of the more recent advances in fairness in Natural Language Processing, with an interface compatible with the popular Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found on this https URL.


[40] 2508.03679

Streaming Generated Gaussian Process Experts for Online Learning and Control

Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.


[41] 2109.09711

Quantifying Grid Resilience Against Extreme Weather Using Large-Scale Customer Power Outage Data

In recent decades, the weather around the world has become more irregular and extreme, often causing large-scale extended power outages. Resilience -- the capability of withstanding, adapting to, and recovering from a large-scale disruption -- has become a top priority for the power sector. However, the understanding of power grid resilience still stays on the conceptual level mostly or focuses on particular components, yielding no actionable results or revealing few insights on the system level. This study provides a quantitatively measurable definition of power grid resilience, using a statistical model inspired by patterns observed from data and domain knowledge. We analyze a large-scale quarter-hourly historical electricity customer outage data and the corresponding weather records, and draw connections between the model and industry resilience practice. We showcase the resilience analysis using three major service territories on the east coast of the United States. Our analysis suggests that cumulative weather effects play a key role in causing immediate, sustained outages, and these outages can propagate and cause secondary outages in neighboring areas. The proposed model also provides some interesting insights into grid resilience enhancement planning. For example, our simulation results indicate that enhancing the power infrastructure in a small number of critical locations can reduce nearly half of the number of customer power outages in Massachusetts. In addition, we have shown that our model achieves promising accuracy in predicting the progress of customer power outages throughout extreme weather events, which can be very valuable for system operators and federal agencies to prepare disaster response.


[42] 2305.07581

Nonparametric data segmentation in multivariate time series via joint characteristic functions

Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points in the marginal distribution, but also those in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series.


[43] 2306.16765

Estimation and variable selection in high dimension in a causal joint model of survival times and longitudinal outcomes with random effects

We consider a joint survival and mixed-effects model to explain the survival time from longitudinal data and high-dimensional covariates in a population. The longitudinal data is modeled using a nonlinear mixed-effects model to account for the inter-individual variability in the population. The corresponding regression function serves as a link function incorporated into the survival model. In that way, the longitudinal data is related to the survival time. We consider a Cox model that takes into account both high-dimensional covariates and the link function. There are two main objectives: first, identify the relevant covariates that contribute to explaining survival time, and second, estimate all unknown parameters of the joint model. For the first objective, we consider the estimate defined by maximizing the marginal log-likelihood regularized with an $\ell_1$-penalty term. To tackle the optimization problem, we implement an adaptive stochastic gradient algorithm to handle the latent variables of the nonlinear mixed-effects model, combined with a proximal operator to manage the non-differentiability of the penalty. We rely on an eBIC model choice criterion to select an optimal value for the regularization parameter. Once the relevant covariates are selected, we re-estimate the parameters in the reduced model by maximizing the likelihood using an adaptive stochastic gradient descent. We provide relevant simulations that showcase the performance of the proposed variable selection and parameter estimation method in the joint model. We investigate the effect of censoring and of the presence of correlation between the individual parameters in the mixed model.


[44] 2401.14973

Discovering group dynamics in coordinated time series via hierarchical recurrent switching-state models

We seek a computationally efficient model for a collection of time series arising from multiple interacting entities (a.k.a. "agents"). Recent models of temporal patterns across individuals fail to incorporate explicit system-level collective behavior that can influence the trajectories of individual entities. To address this gap in the literature, we present a new hierarchical switching-state model that can be trained in an unsupervised fashion to simultaneously learn both system-level and individual-level dynamics. We employ a latent system-level discrete state Markov chain that provides top-down influence on latent entity-level chains which in turn govern the emission of each observed time series. Recurrent feedback from the observations to the latent chains at both entity and system levels allows recent situational context to inform how dynamics unfold at all levels in bottom-up fashion. We hypothesize that including both top-down and bottom-up influences on group dynamics will improve interpretability of the learned dynamics and reduce error when forecasting. Our hierarchical switching recurrent dynamical model can be learned via closed-form variational coordinate ascent updates to all latent chains that scale linearly in the number of entities. This is asymptotically no more costly than fitting a separate model for each entity. Analysis of both synthetic data and real basketball team movements suggests our lean parametric model can achieve competitive forecasts compared to larger neural network models that require far more computational resources. Further experiments on soldier data as well as a synthetic task with 64 cooperating entities show how our approach can yield interpretable insights about team dynamics over time.


[45] 2404.04455

Tomographic reconstruction of a disease transmission landscape via GPS recorded random paths

Identifying areas in a landscape where individuals have a higher likelihood of disease infection is key to managing diseases. Unlike conventional methods relying on ecological assumptions, we perform a novel epidemiological tomography for the estimation of landscape propensity to disease infection, using GPS animal tracks in a manner analogous to tomographic techniques in positron emission tomography (PET). Treating tracking data as random Radon transforms, we analyze Cervid movements in a game preserve, paired with antibody levels for epizootic hemorrhagic disease virus (EHDV) -- a vector-borne disease transmitted by biting midges. After discretizing the field and building the regression matrix of the time spent by each deer (row) at each point of the lattice (column), we model the binary response (infected or not) as a binomial linear inverse problem where spatial coherence is enforced with a total variation regularization. The smoothness of the reconstructed propensity map is selected by the quantile universal threshold. To address limitations of small sample sizes and evaluate significance of our estimates, we quantify uncertainty using a bootstrap-based data augmentation procedure. Our method outperforms alternative ones when using simulated and real data. This tomographic framework is novel, with no established statistical methods tailored for such data.


[46] 2407.12114

Bounds on causal effects in $2^{K}$ factorial experiments with non-compliance

Factorial experiments are ubiquitous in the social and biomedical sciences, but when units fail to comply with each assigned factor, identification and estimation of the average treatment effects become impossible without strong assumptions. Leveraging an instrumental variables approach, previous studies have shown how to identify and estimate the causal effect of treatment uptake among respondents who comply with treatment. A major caveat is that these identification results rely on strong assumptions about the effect of randomization on treatment uptake. This paper shows how to bound these complier average treatment effects for bounded outcomes under milder assumptions on non-compliance.


[47] 2407.18166

Identification and multiply robust estimation of causal effects via instrumental variables from an auxiliary population

Estimating causal effects in a target population with unmeasured confounders is challenging, especially when instrumental variables (IVs) are unavailable. However, IVs from auxiliary populations with similar problems can help infer causal effects in the target population. While the homogeneous conditional average treatment effect assumption has been widely used for effect transportability, it has not been explored in IV-based data fusion. We include it as a basic approach, though it may be biased when treatment effect heterogeneity exists. As an alternative approach, we introduce the equi-confounding assumption that the unmeasured confounding bias remains the same after adjusting for observed covariates, while allowing conditional average treatment effects to differ across populations. This allows us to identify the confounding bias in the auxiliary population and remove it from the treatment-outcome association in the target population to recover the causal effect. We develop multiply robust estimators under both approaches and demonstrate them through simulation studies and a real data application.


[48] 2408.06760

Stratification in Randomised Clinical Trials and Analysis of Covariance: Some Simple Theory and Recommendations

A simple device for balancing for a continuous covariate in clinical trials is to stratify by whether the covariate is above or below some target value, typically the predicted median. This raises an issue as to which model should be used for modelling the effect of treatment on the outcome variable, $Y$. Should one fit the stratum indicator, $S$, the continuous covariate, $X$, both, or neither? This question has sometimes been investigated using simulations targeting the overall effect on inferences about treatment, in terms, for example, of power for a given alternative hypothesis. However, when a covariate is added to a linear model there are three consequences for inference: 1) the mean square error effect, 2) the variance inflation factor, and 3) second-order precision. We consider that it is valuable to consider these three factors separately, even if, ultimately, it is their joint effect that matters. We present some simple theory, concentrating in particular on the variance inflation factor, that may be used to guide trialists in their choice of model. We also consider the case where the precise form of the relationship between the outcome and the covariate is not known. We conclude by recommending that the continuous covariate should always be in the model but that, depending on circumstances, there may be some justification in fitting the stratum indicator also.
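
A toy simulation, under assumed data-generating values, of the kind of comparison discussed above: the standard error of the treatment estimate when adjusting for nothing, for the stratum indicator S, or for the continuous covariate X. It illustrates only the residual-variance (mean square error) channel; the variance inflation and second-order effects require averaging over repeated randomizations.

```python
import numpy as np

def treatment_se(y, design):
    """OLS standard error of the first (treatment) coefficient."""
    XtX_inv = np.linalg.inv(design.T @ design)
    beta = XtX_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    return np.sqrt(sigma2 * XtX_inv[0, 0])

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=n)                        # continuous covariate
s = (x > np.median(x)).astype(float)          # stratum indicator (above/below median)
t = rng.integers(0, 2, size=n).astype(float)  # randomized treatment
y = 1.0 * t + 2.0 * x + rng.normal(size=n)    # outcome depends on X, not on S directly

one = np.ones(n)
for name, design in [("neither", np.column_stack([t, one])),
                     ("stratum S", np.column_stack([t, one, s])),
                     ("covariate X", np.column_stack([t, one, x]))]:
    print(name, round(treatment_se(y, design), 4))
```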


[49] 2501.02298

Beyond Log-Concavity and Score Regularity: Improved Convergence Bounds for Score-Based Generative Models in W2-distance

Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the $\mathcal{W}_2$-distance rely on stringent assumptions about the data distribution. In this work, we present a novel framework for analyzing $\mathcal{W}_2$-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein--Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton--Jacobi--Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions.
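
For orientation, one common normalization of the forward OU noising dynamics and its time reversal, in which the score is replaced by a learned estimator in practice; the paper's exact parameterization may differ.

```latex
% Forward Ornstein--Uhlenbeck noising dynamics and its time reversal,
% where the score \nabla \log p_t is replaced by an estimator s_\theta in practice.
\begin{align}
  \mathrm{d}X_t &= -X_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,
      \qquad X_0 \sim p_{\mathrm{data}}, \\
  \mathrm{d}Y_t &= \bigl[\,Y_t + 2\,\nabla \log p_{T-t}(Y_t)\,\bigr]\mathrm{d}t
      + \sqrt{2}\,\mathrm{d}\bar{B}_t,
      \qquad Y_0 \sim p_T \approx \mathcal{N}(0, I).
\end{align}
```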


[50] 2501.05584

The Impact of Question Framing on the Performance of Automatic Occupation Coding

Occupational data play a vital role in research, official statistics, and policymaking, yet their collection and accurate classification remain a challenge. This study investigates the effects of occupational question wording on data variability and the performance of automatic coding tools. We conducted and replicated a split-ballot survey experiment in Germany using two common occupational question formats: one focusing on 'job title' (Berufsbezeichnung) and another on 'occupational tasks' (berufliche Tätigkeit). Our analysis reveals that automatic coding tools, such as CASCOT and OccuCoDe, exhibit sensitivity to the form and origin of the data. Specifically, these tools were more efficient when coding responses to the job title question format compared with the occupational task format, suggesting a potential way to improve the respective questions for many German surveys. In a subsequent 'detailed tasks and duties' question, providing a guiding example prompted respondents to give longer answers without broadening the range of unique words they used. These findings highlight the importance of harmonising survey questions and of ensuring that automatic coding tools are robust to differences in question wording. We emphasise the need for further research to optimise question design and coding tools for greater accuracy and applicability in occupational data collection.


[51] 2502.08539

Anytime-valid FDR control with the stopped e-BH procedure

The recent e-Benjamini-Hochberg (e-BH) procedure for multiple hypothesis testing is known to control the false discovery rate (FDR) under arbitrary dependence between the input e-values. This paper points out an important subtlety when applying the e-BH procedure with e-processes, which are sequential generalizations of e-values (where the data are observed sequentially). Since adaptively stopped e-processes are e-values, the e-BH procedure can be repeatedly applied at every time step, and one can continuously monitor the e-processes and the rejection sets obtained. One would hope that the "stopped e-BH procedure" (se-BH) has an FDR guarantee for the rejection set obtained at any stopping time. However, while this is true if the data in different streams are independent, it is not true in full generality, because each stopped e-process is an e-value only for stopping times in its own local filtration, but the se-BH procedure employs a stopping time with respect to a global filtration. This can cause information to leak across time, allowing one stream to know its future by knowing past data of another stream. This paper formulates a simple causal condition under which local e-processes are also global e-processes and thus the se-BH procedure does indeed control the FDR. The condition excludes unobserved confounding from the past and is met under most reasonable scenarios including genomics.
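
For reference, a small implementation of the base e-BH rule applied to a fixed vector of e-values; in the sequential setting above these would be e-processes evaluated at a stopping time, which is exactly where the paper's causal condition matters.

```python
import numpy as np

def e_bh(e_values, alpha):
    """e-Benjamini-Hochberg: reject the hypotheses with the k* largest e-values,
    where k* = max{k : e_(k) >= n / (alpha * k)} and e_(k) is the k-th largest e-value.
    Controls FDR at level alpha under arbitrary dependence between the e-values."""
    e = np.asarray(e_values, dtype=float)
    n = e.size
    order = np.argsort(-e)                 # indices sorted by decreasing e-value
    ks = np.arange(1, n + 1)
    ok = e[order] >= n / (alpha * ks)
    if not ok.any():
        return np.array([], dtype=int)
    k_star = ks[ok].max()
    return np.sort(order[:k_star])

print(e_bh([60.0, 30.0, 2.0, 0.5, 28.0], alpha=0.1))   # rejects hypotheses 0, 1, 4
```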


[52] 2502.19851

Can a calibration metric be both testable and actionable?

Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration$\unicode{x2014}$ensuring forecasted probabilities match empirical frequencies$\unicode{x2014}$is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable, but it is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. To resolve this question, we consider Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable, and we examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
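
As a purely illustrative reading of "calibration over intervals of forecasted probabilities", the sketch below computes a worst-case miscalibration over cutoff intervals; this is a hedged stand-in and not necessarily the paper's exact definition of Cutoff Calibration Error.

```python
import numpy as np

def interval_calibration_error(p, y, cutoffs):
    """Worst-case absolute miscalibration over intervals [a, b) of forecasted
    probabilities -- an illustrative reading of 'calibration over cutoff intervals',
    not necessarily the paper's exact definition."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    worst = 0.0
    for i, a in enumerate(cutoffs[:-1]):
        for b in cutoffs[i + 1:]:
            mask = (p >= a) & (p < b)
            if mask.any():
                gap = abs(np.mean(y[mask] - p[mask])) * mask.mean()
                worst = max(worst, gap)
    return worst

rng = np.random.default_rng(3)
p = rng.uniform(size=1000)
y = rng.binomial(1, np.clip(p + 0.05, 0, 1))   # slightly miscalibrated forecasts
print(interval_calibration_error(p, y, cutoffs=np.linspace(0, 1, 11)))
```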


[53] 2503.20401

Estimation and variable selection in high dimension in nonlinear mixed-effects models

We consider nonlinear mixed-effects models including high-dimensional covariates to model the variability of individual parameters. The objective is to identify relevant covariates among a large set under a sparsity assumption and to estimate the model parameters. To face the high-dimensional setting, we consider a regularized estimator, namely the maximum likelihood estimator penalized with the l1-penalty. We rely on the eBIC model choice criterion to select an optimal reduced model, and then estimate the parameters by maximizing the likelihood of the reduced model. In practice, we compute the l1-penalized maximum likelihood estimator through a weighted proximal stochastic gradient descent algorithm with an adaptive learning rate. This choice allows us to consider very general models, in particular models that do not belong to the curved exponential family. We first demonstrate, through a simulation study in a simple linear toy model, the good convergence properties of this optimization algorithm. We then compare the performance of the proposed methodology with that of the glmmLasso procedure in a linear mixed-effects model in a simulation study, and we also illustrate its performance in a nonlinear mixed-effects logistic growth model through simulation. We finally highlight the benefit of the proposed integrated single-step procedure compared with two alternative two-step approaches to variable selection.
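
A small sketch of the eBIC criterion used above for choosing the regularization parameter, following the standard Chen and Chen (2008) form; the fitting routine in the comment is hypothetical and stands in for the weighted proximal stochastic gradient descent.

```python
import numpy as np
from scipy.special import gammaln

def ebic(loglik, n_selected, n_obs, n_covariates, gamma=0.5):
    """Extended BIC: -2*loglik + k*log(n) + 2*gamma*log(C(p, k)),
    evaluated here to pick the regularization parameter on a grid."""
    log_choose = (gammaln(n_covariates + 1) - gammaln(n_selected + 1)
                  - gammaln(n_covariates - n_selected + 1))
    return -2.0 * loglik + n_selected * np.log(n_obs) + 2.0 * gamma * log_choose

# Hypothetical usage (fit_penalized is a stand-in for the penalized fit above):
#   for lam in lam_grid:
#       loglik, support = fit_penalized(lam)            # support = selected covariates
#       score = ebic(loglik, len(support), n_obs, p)
#   ...keep the lam with the smallest eBIC, then refit the reduced model.
```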


[54] 2504.21120

A Hybrid Mixture of $t$-Factor Analyzers for Clustering High-dimensional Data

This paper develops a novel hybrid approach for estimating the mixture model of $t$-factor analyzers (MtFA) that employs multivariate $t$-distribution and factor model to cluster and characterize grouped data. The traditional estimation method for MtFA faces computational challenges, particularly in high-dimensional settings, where the eigendecomposition of large covariance matrices and the iterative nature of Expectation-Maximization (EM) algorithms lead to scalability issues. We propose a computational scheme that integrates a profile likelihood method into the EM framework to efficiently obtain the model parameter estimates. The effectiveness of our approach is demonstrated through simulations showcasing its superior computational efficiency compared to the existing method, while preserving clustering accuracy and resilience against outliers. Our method is applied to cluster the Gamma-ray bursts, reinforcing several claims in the literature that Gamma-ray bursts have heterogeneous subpopulations and providing characterizations of the estimated groups.


[55] 2505.09596

Design of Experiments for Emulations: A Selective Review from a Modeling Perspective

Space-filling designs are crucial for efficient computer experiments, enabling accurate surrogate modeling and uncertainty quantification in many scientific and engineering applications, such as digital twin systems and cyber-physical systems. In this work, we provide a comprehensive review of key design methodologies, including maximin/minimax designs, Latin hypercubes, and projection-based designs. Moreover, we connect space-filling design criteria such as the fill distance to Gaussian process performance. Numerical studies are conducted to investigate the practical trade-offs among various design types, with a discussion of emerging challenges in high-dimensional and constrained settings. The paper concludes with future directions in adaptive sampling and machine learning integration, providing guidance for improving computational experiments.
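
A short sketch, using scipy's quasi-Monte Carlo module, of two quantities discussed in the review: the maximin (minimal pairwise distance) criterion of a Latin hypercube design and its fill distance approximated on a grid.

```python
import numpy as np
from scipy.stats import qmc
from scipy.spatial.distance import cdist

d, n = 2, 20
lhs = qmc.LatinHypercube(d=d, seed=0).random(n)    # Latin hypercube design in [0,1]^d

# Maximin criterion: a larger minimal pairwise distance means a better-spread design
pairwise = cdist(lhs, lhs)
np.fill_diagonal(pairwise, np.inf)
maximin = pairwise.min()

# Fill distance: worst-case distance from any point of the domain (here a fine grid)
# to its nearest design point; linked in the review to Gaussian process error bounds
grid = np.stack(np.meshgrid(*[np.linspace(0, 1, 50)] * d), axis=-1).reshape(-1, d)
fill_distance = cdist(grid, lhs).min(axis=1).max()
print(maximin, fill_distance)
```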


[56] 2505.12487

Stereographic Multi-Try Metropolis Algorithms for Heavy-tailed Sampling

Markov chain Monte Carlo (MCMC) methods for sampling from heavy-tailed distributions present unique challenges, particularly in high dimensions. Multi-proposal MCMC algorithms have recently gained attention for their potential to improve performance, especially through parallel implementation on modern hardware. This paper introduces a novel family of gradient-free MCMC algorithms that combine the multi-try Metropolis (MTM) with stereographic MCMC framework, specifically designed for efficient sampling from heavy-tailed targets. The proposed stereographic multi-try Metropolis (SMTM) algorithm not only outperforms traditional Euclidean MTM and existing stereographic random-walk Metropolis methods, but also avoids the pathological convergence behavior often observed in MTM and demonstrates strong robustness to tuning. These properties are supported by scaling analysis and extensive simulation studies.
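
The geometric ingredient behind stereographic samplers can be illustrated with the standard stereographic maps between R^d and the unit sphere; the multi-try proposal and weighting scheme of SMTM itself is not shown here.

```python
import numpy as np

def to_sphere(x):
    """Inverse stereographic projection R^d -> unit sphere S^d (minus the north pole).
    Heavy tails in R^d are pulled towards the north pole, so the transformed target
    on the sphere is bounded -- the geometric idea behind stereographic samplers."""
    sq = np.sum(x ** 2)
    return np.append(2.0 * x, sq - 1.0) / (sq + 1.0)

def to_plane(z):
    """Stereographic projection back from the sphere to R^d."""
    return z[:-1] / (1.0 - z[-1])

x = np.array([3.0, -4.0])
z = to_sphere(x)
print(np.allclose(np.sum(z ** 2), 1.0), np.allclose(to_plane(z), x))
```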


[57] 2505.12796

Adaptive Inference through Bayesian and Inverse Bayesian Inference with Symmetry-Bias in Nonstationary Environments

This study proposes a novel inference framework known as Bayesian and inverse Bayesian (BIB) inference, which incorporates symmetry bias into the Bayesian updating process to perform both conventional and inverse Bayesian updates concurrently. The model was evaluated in a sequential estimation task involving observations drawn from a Gaussian distribution with a stochastically time-varying mean. Conventional Bayesian inference is constrained by a fundamental trade-off between adaptability to abrupt environmental changes and accuracy during stable periods. The BIB framework addresses this limitation by dynamically modulating the learning rate via inverse Bayesian updates, thereby enhancing adaptive flexibility. Notably, the BIB model exhibited spontaneous bursts in the learning rate during environmental transitions, transiently entering high-sensitivity states that facilitated rapid adaptation. This burst-relaxation dynamic serves as a mechanism for balancing adaptability and accuracy. Furthermore, avalanche analysis and detrended fluctuation analysis revealed that the BIB system likely operates near a critical state, a property not observed in standard Bayesian inference. This suggests that the BIB model uniquely achieves a coexistence of computational efficiency and critical dynamics, resolving the adaptability-accuracy trade-off while maintaining scale-free behavior. These findings offer a new computational perspective on scale-free dynamics in natural systems and provide valuable insights for the design of adaptive inference systems in nonstationary environments.


[58] 2506.09165

Identifiability and Estimation in High-Dimensional Nonparametric Latent Structure Models

This paper studies the problems of identifiability and estimation in high-dimensional nonparametric latent structure models. We introduce an identifiability theorem that generalizes existing conditions, establishing a unified framework applicable to diverse statistical settings. Our results rigorously demonstrate how increased dimensionality, coupled with diversity in variables, inherently facilitates identifiability. For the estimation problem, we establish near-optimal minimax rate bounds for the high-dimensional nonparametric density estimation under latent structures with smooth marginals. Contrary to the conventional curse of dimensionality, our sample complexity scales only polynomially with the dimension. Additionally, we develop a perturbation theory for component recovery and propose a recovery procedure based on simultaneous diagonalization.


[59] 2506.11369

Filtrated Kinematic Connectivity Analysis for Lower-limb Joint Effective Age Evaluation

To understand and communicate the risk of chronic lower-limb joint diseases associated with aging, it is crucial to investigate the relationship between age and gait dynamics, particularly through angular kinematics. One key challenge is that angular kinematic trajectories are highly interconnected, and the structures of the interconnections vary across different components. Neglecting the interconnections and the variability in the connectivity structures impairs the understanding of age-associated gait coordination. To this end, we develop a novel kinematic connectivity analysis framework, grounded in multiple functional regression, to evaluate lower-limb joint effective age and uncover age-related kinematic features. The proposed approach is built upon the concept of filtration, a widely used tool in network analysis and topological data analysis for multi-resolution exploration. Specifically, we develop a forest-structured covariate grouping framework in which different kinematic trajectories are aggregated hierarchically to capture both (partially) shared and idiosyncratic motion signatures which are strongly associated with aging. We also develop a novel filtrated functional partial least squares approach for model estimation and feature extraction. Compared to existing approaches, our proposed approach demonstrates superior predictive power while providing novel insights into the coordinated evolution of angular kinematics during aging. In addition, the proposed framework is broadly applicable and can be readily extended in other scientific domains.


[60] 2507.10643

TaylorPODA: A Taylor Expansion-Based Method to Improve Post-Hoc Attributions for Opaque Models

Existing post-hoc model-agnostic methods generate external explanations for opaque models, primarily by locally attributing the model output to its input features. However, they often lack an explicit and systematic framework for quantifying the contribution of individual features. Building on the Taylor expansion framework introduced by Deng et al. (2024) to unify existing local attribution methods, we propose a rigorous set of postulates -- "precision", "federation", and "zero-discrepancy" -- to govern Taylor term-specific attribution. Guided by these postulates, we introduce TaylorPODA (Taylor expansion-derived imPortance-Order aDapted Attribution), which incorporates an additional "adaptation" property. This property enables alignment with task-specific goals, especially in post-hoc settings lacking ground-truth explanations. Empirical evaluations demonstrate that TaylorPODA achieves competitive results against baseline methods, providing principled and visualization-friendly explanations. This work enhances the trustworthy deployment of opaque models by offering explanations with stronger theoretical grounding.
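
For context, the sketch below shows plain first-order Taylor attribution relative to a baseline point, which is only the expansion starting point that methods such as TaylorPODA refine with higher-order terms and the postulates above; the toy model and its gradient are assumptions for illustration.

```python
import numpy as np

def first_order_taylor_attribution(grad_f, x, baseline):
    """First-order Taylor attribution relative to a baseline:
    f(x) ~ f(baseline) + grad_f(baseline) . (x - baseline),
    so feature i receives grad_f(baseline)[i] * (x[i] - baseline[i]).
    This is only the first Taylor term, not the full TaylorPODA scheme."""
    return grad_f(baseline) * (x - baseline)

# Toy opaque model: f(x) = x0^2 + 3*x1, with its gradient known analytically here
grad_f = lambda x: np.array([2 * x[0], 3.0])
print(first_order_taylor_attribution(grad_f, x=np.array([1.0, 2.0]),
                                     baseline=np.zeros(2)))
```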


[61] 2508.00411

Predictive information criterion for jump diffusion processes

In this paper, we address a model selection problem for ergodic jump diffusion processes based on high-frequency samples. We evaluate the expected genuine log-likelihood function and derive an Akaike-type information criterion based on the threshold-based quasi-likelihood function. In the derivation, we also give new estimates of the transition density of jump diffusion processes, and we provide the relative selection probability of the proposed information criterion.


[62] 2202.05063

PCENet: High Dimensional Surrogate Modeling for Learning Uncertainty

Learning data representations under uncertainty is an important task that emerges in numerous scientific computing and data analysis applications. However, uncertainty quantification techniques are computationally intensive and become prohibitively expensive for high-dimensional data. In this study, we introduce a dimensionality reduction surrogate modeling (DRSM) approach for representation learning and uncertainty quantification that aims to deal with data of moderate to high dimensions. The approach involves a two-stage learning process: 1) employing a variational autoencoder to learn a low-dimensional representation of the input data distribution; and 2) harnessing a polynomial chaos expansion (PCE) formulation to map the low-dimensional distribution to the output target. The model enables us to (a) capture the system dynamics efficiently in the low-dimensional latent space, (b) learn, under uncertainty, a representation of the data and a mapping between input and output distributions, (c) estimate this uncertainty in the high-dimensional data system, and (d) match high-order moments of the output distribution; without any prior statistical assumptions on the data. Numerical results are presented to illustrate the performance of the proposed method.


[63] 2412.13527

Lyapunov Analysis For Monotonically Forward-Backward Accelerated Algorithms

Nesterov's accelerated gradient method (NAG) achieves faster convergence than gradient descent for convex optimization but lacks monotonicity in function values. To address this, Beck and Teboulle [2009b] proposed a monotonic variant, M-NAG, and extended it to the proximal setting as M-FISTA for composite problems such as Lasso. However, establishing the linear convergence of M-NAG and M-FISTA under strong convexity remains an open problem. In this paper, we analyze M-NAG via the implicit-velocity phase representation and show that an additional assumption, either the position update or the phase-coupling relation, is necessary to fully recover the NAG iterates. The essence of M-NAG lies in controlling an auxiliary sequence to enforce non-increase. We further demonstrate that the M-NAG update alone is sufficient to construct a Lyapunov function guaranteeing linear convergence, without relying on full NAG iterates. By modifying the mixed sequence to incorporate forward-indexed gradients, we develop a new Lyapunov function that removes the kinetic energy term, enabling a direct extension to M-NAG. The required starting index depends only on the momentum parameter and not on problem constants. Finally, leveraging newly developed proximal inequalities, we extend our results to M-FISTA, establishing its linear convergence and deepening the theoretical understanding of monotonic accelerated methods.
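
A compact sketch, on a toy quadratic, of a monotone accelerated gradient iteration in the spirit of M-NAG / M-FISTA: after each accelerated step, the better of the candidate point and the previous iterate is kept so that the function value never increases. The step size and the problem are assumptions for illustration, not the paper's setting.

```python
import numpy as np

def monotone_nag(grad, f, x0, step, n_iter=200):
    """Monotone accelerated gradient iteration in the spirit of M-NAG / M-FISTA:
    take an accelerated gradient step, then keep whichever of the candidate and
    the previous iterate has the smaller objective value."""
    x_prev = x = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        z = y - step * grad(y)                       # forward (gradient) step at y
        x_prev, x = x, (z if f(z) <= f(x) else x)    # enforce monotone decrease
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # momentum combines the candidate z with the monotone iterates
        y = x + (t / t_next) * (z - x) + ((t - 1.0) / t_next) * (x - x_prev)
        t = t_next
    return x

# Quadratic toy problem: f(x) = 0.5 * x'Ax, Lipschitz constant 50
A = np.diag([1.0, 50.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(f(monotone_nag(grad, f, x0=np.array([5.0, 5.0]), step=1.0 / 50.0)))
```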


[64] 2501.11638

Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model

Class imbalance (CI) is a longstanding problem in machine learning, slowing down training and reducing performance. Although empirical remedies exist, it is often unclear which ones work best and when, due to the lack of an overarching theory. We address a common case of imbalance, that of anomaly (or outlier) detection. We provide a theoretical framework to analyze, interpret and address CI. It is based on an exact solution of the teacher-student perceptron model, through replica theory. Within this framework, one can distinguish several sources of CI: either intrinsic, train or test imbalance. Our analysis reveals that the optimal train imbalance is generally different from 50%, with a non-trivial dependence on the intrinsic imbalance, the abundance of data, and the noise in the learning. Moreover, there is a crossover from a small-noise training regime, where results are independent of the noise level, to a high-noise regime where performance quickly degrades with noise. Our results challenge some of the conventional wisdom on CI and offer practical guidelines to address it.


[65] 2502.08808

A First-order Generative Bilevel Optimization Framework for Diffusion Models

Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures, such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics, where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training diffusion model from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.


[66] 2503.08746

In silico clinical trials in drug development: a systematic review

In the context of clinical research, computational models have received increasing attention over the past decades. In this systematic review, we aimed to provide an overview of the role of so-called in silico clinical trials (ISCTs) in medical applications. Exemplary for the broad field of clinical medicine, we focused on in silico (IS) methods applied in drug development, sometimes also referred to as model informed drug development (MIDD). We searched PubMed and a clinical trials registry for published articles and registered clinical trials related to ISCTs. We identified 202 articles and 48 trials, of which 76 articles and 19 trials were directly linked to drug development. We extracted information from all 202 articles and 48 clinical trials and conducted a more detailed review of the methods used in the 76 articles connected to drug development. Regarding application, most articles and trials focused on cancer and imaging-related research, while rare and pediatric diseases were only addressed in 14 articles and 5 trials, respectively. While some models were informed by combining mechanistic knowledge with clinical or preclinical (in-vivo or in-vitro) data, the majority of models were fully data-driven, illustrating that clinical data is a crucial part of the process of generating synthetic data in ISCTs. Regarding reproducibility, a more detailed analysis revealed that only 24% (18 out of 76) of the articles provided an open-source implementation of the applied models, and in only 20% of the articles were the generated synthetic data publicly available. Despite the widely raised interest, we also found that it is still uncommon for ISCTs to be part of a registered clinical trial and that their application is restricted to specific diseases, leaving the potential benefits of ISCTs not fully exploited.


[67] 2504.13134

Energy-Based Reward Models for Robust Language Model Alignment

Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.


[68] 2507.03885

Unraveling the Black-box Magic: An Analysis of Neural Networks' Dynamic Extrema

We point out that neural networks are not black boxes, and that their generalization stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a new algorithm that differs significantly from the back-propagation algorithm and mainly obtains the parameter values by solving a system of linear equations. Difficult situations, such as gradient vanishing and overfitting, can be reasonably explained and dealt with in this framework.


[69] 2507.17748

Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we identify high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods, in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is related to addressing hidden/rare spurious correlations in the training dataset. Our investigation of the mechanisms underlying this phenomenon reveals the importance of confident mispredictions of bias-conflicting samples under large learning rates.