New articles on Statistics


[1] 2509.08102

Recursive Adaptive Importance Sampling with Optimal Replenishment

Increased access to computing resources has led to the development of algorithms that can run efficiently on multi-core processing units or in distributed computing environments. In Bayesian inference, many parallel computing approaches to fitting statistical models have been proposed for Markov chain Monte Carlo methods, but they either offer limited gains due to high latency costs or rely on model-specific decompositions. Alternatively, adaptive importance sampling, sequential Monte Carlo, and recursive Bayesian methods provide a parallel-friendly and asymptotically exact framework with well-developed theory for error estimation. We propose a recursive adaptive importance sampling approach that alternates between fast recursive weight updates and sample replenishment steps to balance computational efficiency and sample quality. We derive theoretical results to determine the optimal allocation of replenishment steps, and demonstrate the efficacy of our method in simulated experiments and an application to sea surface temperature prediction in the Gulf of Mexico using Gaussian processes.
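
The abstract does not spell out implementation details, but the generic pattern it describes (cheap recursive reweighting of an existing particle set, with occasional, more expensive replenishment from an adapted proposal) can be sketched as follows. This is a minimal illustration on an assumed toy model (Gaussian prior and likelihood, data arriving in batches), using an effective-sample-size trigger rather than the paper's optimal allocation rule.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Assumed toy model: theta ~ N(0, 3^2), y_i | theta ~ N(theta, 1), data in batches.
    def log_prior(theta):
        return norm.logpdf(theta, 0.0, 3.0)

    def log_lik(theta, y):
        return norm.logpdf(y[None, :], theta[:, None], 1.0).sum(axis=1)

    def ess(log_w):
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        return 1.0 / np.sum(w ** 2)

    n = 2000
    q_mu, q_sd = 0.0, 3.0                        # current Gaussian proposal
    theta = rng.normal(q_mu, q_sd, n)
    log_w = log_prior(theta) - norm.logpdf(theta, q_mu, q_sd)
    seen = np.empty(0)

    for batch in np.array_split(rng.normal(1.0, 1.0, 400), 8):
        seen = np.concatenate([seen, batch])
        log_w = log_w + log_lik(theta, batch)    # fast recursive weight update
        if ess(log_w) < 0.5 * n:                 # replenishment step (more costly)
            w = np.exp(log_w - log_w.max()); w /= w.sum()
            q_mu = np.sum(w * theta)
            q_sd = np.sqrt(np.sum(w * (theta - q_mu) ** 2)) + 1e-6
            theta = rng.normal(q_mu, q_sd, n)    # fresh draws from the adapted proposal
            log_w = (log_prior(theta) + log_lik(theta, seen)
                     - norm.logpdf(theta, q_mu, q_sd))

    w = np.exp(log_w - log_w.max()); w /= w.sum()
    print("posterior mean estimate:", np.sum(w * theta))

Replenishment requires re-evaluating the likelihood of all data seen so far, which is exactly the cost that the paper's optimal-allocation analysis trades off against weight degeneracy.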


[2] 2509.08155

Contributions to Robust and Efficient Methods for Analysis of High Dimensional Data

A ubiquitous feature of data in our era is their extremely large size and dimensionality. Analyzing such high-dimensional data poses significant challenges, since the feature dimension is often much larger than the sample size. This thesis introduces robust and computationally efficient methods to address several common challenges associated with high-dimensional data. In my first manuscript, I propose a coherent approach to variable screening that accommodates nonlinear associations. I develop a novel variable screening method that transcends traditional linear assumptions by leveraging mutual information, with an intended application in neuroimaging data. This approach allows for accurate identification of important variables by capturing nonlinear as well as linear relationships between the outcome and covariates. Building on this foundation, I develop new optimization methods for sparse estimation using nonconvex penalties in my second manuscript. These methods address notable challenges in current statistical computing practices, facilitating computationally efficient and robust analyses of complex datasets. The proposed method can be applied to a general class of optimization problems. In my third manuscript, I contribute to robust modeling of high-dimensional correlated observations by developing a mixed-effects model based on Tsallis power-law entropy maximization and discuss the theoretical properties of the resulting distribution. This model surpasses the constraints of conventional Gaussian models by accommodating a broader class of distributions with enhanced robustness to outliers. Additionally, I develop a proximal nonlinear conjugate gradient algorithm that accelerates convergence while maintaining numerical stability, and establish rigorous statistical properties for the proposed framework.


[3] 2509.08162

Survival Analysis with Discrete Biomarkers Under a Semiparametric Bayesian Conditional Poisson Model

Discrete biomarkers derived as cell densities or counts from tissue microarrays and immunostaining are widely used to study immune signatures in relation to survival outcomes in cancer. Although routinely collected, these signatures are not measured with exact precision because the sampling mechanism involves examination of small tissue cores from a larger section of interest. We model these error-prone biomarkers as Poisson processes with latent rates, inducing heteroscedasticity in their conditional variance. While critical for tumor histology, such measurement error frameworks remain understudied for conditionally Poisson-distributed covariates. To address this, we propose a Bayesian joint model that incorporates a Dirichlet process (DP) mixture to flexibly characterize the latent covariate distribution. The proposed approach is evaluated in simulation studies, which demonstrate superior bias reduction and robustness to the underlying model in realistic settings compared to existing methods. We further incorporate Bayes factors for hypothesis testing in the Bayesian semiparametric joint model. The methodology is applied to a survival study of high-grade serous carcinoma, where comparisons are made between the proposed and existing approaches. Accompanying R software is available at the GitHub repository listed in the Web Appendices.


[4] 2509.08186

Drinking water contamination as a population-wide determinant of mortality in California

Drinking water contamination, a known determinant of adverse health outcomes, remains widespread and inequitably distributed amidst aging infrastructure. Regulatory oversight is the primary tool for protecting against drinking water-related public health risks, with maximum contaminant levels established to regulate concentrations of contaminants of concern. However, the extent to which existing concentrations of contaminants at the public water system level directly affect population health remains poorly understood. Here we present a large-scale data analysis of 20 million water samples collected from 2012-2022 in California to assess the impact of drinking water quality on all-cause mortality. Surfactants were associated with increases in mortality, potentially serving as proxies for wastewater contaminants. We further find evidence of mixture effects that were unidentifiable through single-contaminant analysis, suggesting that mixtures of toxic metals as well as salinity constituents are associated with mortality. These results could inform public health efforts to mitigate mortality associated with consumption of contaminated drinking water.


[5] 2509.08187

A Comparative Analysis of Multi-Criteria Decision-Making (MCDM) Methods

Multi-Criteria Decision-Making (MCDM) techniques have found widespread application across diverse fields. The rapid evolution of MCDM has led to the development of hundreds of methods, each employing distinct approaches. However, due to inherent algorithmic differences, various MCDM methods often yield divergent results when applied to the same specific problem. This study undertakes a comparative analysis of four particular methods: RAM, MOORA, FUCA, and CURLI, within a defined case study. The evaluation context involves ranking 30 Vietnamese banks based on six criteria: capital adequacy, asset quality, management capability, earnings ability, liquidity, and sensitivity to market risk. Prior to this analysis, these banks had also been ranked by the CAMELS rating system. The CAMELS rankings serve as a benchmark to assess the performance of the RAM, MOORA, FUCA, and CURLI methods. Our findings indicate that FUCA and CURLI are highly suitable methods for this application, demonstrating Spearman's rank correlation coefficients with CAMELS of 0.9996 and 0.9984, respectively. In contrast, both RAM and MOORA proved unsuitable, exhibiting very low Spearman's correlation coefficients of -1.0296 against the CAMELS ranking.


[6] 2509.08214

Comparative Analysis of Global and Local Probabilistic Time Series Forecasting for Contiguous Spatial Demand Regions

This study evaluates three probabilistic forecasting strategies using LightGBM: global pooling, cluster-level pooling, and station-level modeling, across a range of scenarios, from fully homogeneous simulated data to highly heterogeneous real-world Divvy bike-share demand observed during 2023-2024. Clustering was performed using the K-means algorithm applied to principal component analysis (PCA)-transformed covariates, which included time series features, counts of nearby transportation infrastructure, and local demographic characteristics. Forecasting performance was assessed using prediction interval coverage probability (PICP), prediction interval normalized average width (PINAW), and the mean squared error (MSE) of the median forecast. The results show that global LightGBM models incorporating station identifiers consistently outperform both cluster-level and station-level models across most scenarios. These global models effectively leverage the full cross-sectional dataset while enabling local adjustments through the station identifier, resulting in superior prediction interval coverage, sharper intervals, and lower forecast errors. In contrast, cluster-based models often suffer from residual within-group heterogeneity, leading to degraded accuracy. Station-level models capture fine-grained local dynamics in heterogeneous settings. These findings underscore that global LightGBM models with embedded station identifiers provide a robust, scalable, and computationally efficient framework for transportation demand forecasting. By balancing global structure with local specificity, this approach offers a practical and effective solution for real-world mobility applications.
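
The two interval metrics named above are simple to compute from a set of lower/upper quantile forecasts; a small sketch is given below. Normalizing the average interval width by the range of the observed target follows the usual PINAW convention, which is an assumption here; in practice the intervals could come, for example, from two LightGBM models fit with a quantile objective at the 5th and 95th percentiles.

    import numpy as np

    def picp(y, lower, upper):
        # Prediction interval coverage probability: share of observations inside the interval.
        return np.mean((y >= lower) & (y <= upper))

    def pinaw(y, lower, upper):
        # Prediction interval normalized average width: mean width scaled by the target range.
        return np.mean(upper - lower) / (np.max(y) - np.min(y))

    y = np.array([12.0, 30.0, 7.0, 55.0, 21.0])        # observed demand
    lower = np.array([8.0, 22.0, 3.0, 40.0, 15.0])     # lower quantile forecasts
    upper = np.array([20.0, 45.0, 14.0, 70.0, 35.0])   # upper quantile forecasts
    print(picp(y, lower, upper), pinaw(y, lower, upper))   # 1.0 and 0.4 on these toy numbers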


[7] 2509.08231

Deploying Robust Decision Support Systems for Transit Headway Control: Rider Impacts, Human Factors and Recommendations for Scalability

Service reliability is critical to transit service delivery. This paper describes headway control pilots conducted on two high-ridership Chicago bus routes between 2022 and 2023. A decision support system was developed for a bus holding strategy based on a reinforcement learning approach. For the pilots, a user interface enabled supervisors to monitor service and record applied actions. The first pilot tested terminal-based holding on a route affected by missed trips from absenteeism. The analysis found improvements in reliability, and days on which control was applied were shown to outperform days with more service. The second pilot applied en-route holding on a high-ridership bus route in Chicago. The evaluation showed wait time improvements with benefits rippling to stops downstream, and a reduction in transfer times from connecting bus and rail lines. Compliance analysis based on the supervisor logs in the app revealed mixed compliance levels from drivers, which were related to a schedule-adherence mentality and seniority. Recommendations are provided for practitioners to scale similar efforts.


[8] 2509.08237

Convergence and Optimality of the EM Algorithm Under Multi-Component Gaussian Mixture Models

Gaussian mixture models (GMMs) are fundamental statistical tools for modeling heterogeneous data. Due to the nonconcavity of the likelihood function, the Expectation-Maximization (EM) algorithm is widely used for parameter estimation of each Gaussian component. Existing analyses of the EM algorithm's convergence to the true parameter focus primarily on either the two-component case or multi-component settings with both known mixing probabilities and known, isotropic covariance matrices. In this work, we establish the minimax optimal rate of convergence of the EM algorithm for multi-component GMMs in full generality. The required separation condition between Gaussian components for EM to converge is the weakest known to date. We develop two distinct analytical approaches, each tailored to a different regime of separation, reflecting two complementary perspectives on the use of EM: parameter estimation and clustering. As a byproduct of our analysis, we show that the EM algorithm, when used for community detection, also achieves the minimax optimal rate of misclustering error under milder separation conditions than spectral clustering and Lloyd's algorithm, an interesting result in its own right. Our analysis allows the number of components, the minimal mixing probabilities, the separation between Gaussian components as well as the dimension to grow with the sample size. Simulation studies corroborate the theoretical findings.
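
For readers who want the algorithm itself rather than the theory, a minimal EM iteration for a one-dimensional GMM with unknown means, variances, and mixing probabilities is sketched below; the paper analyzes the general multi-component, growing-dimension setting, which this toy code does not attempt to reproduce.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    # Toy data: three well-separated Gaussian components (an illustrative assumption).
    x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(0, 1, 400), rng.normal(5, 1, 300)])

    K = 3
    pi = np.full(K, 1.0 / K)                 # mixing probabilities
    mu = rng.choice(x, K, replace=False)     # initial centers
    sigma = np.full(K, x.std())

    for _ in range(200):
        # E-step: responsibilities (posterior probability of each component per point).
        log_r = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing probabilities, means, and standard deviations.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    print(np.sort(mu))   # close to the true centers -4, 0, 5

Assigning each point to its highest-responsibility component after convergence yields the clustering referred to in the misclustering-error results.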


[9] 2509.08366

kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments demonstrate its effectiveness in recovering the distribution of missing values. The code for kNNSampler is made publicly available (this https URL).
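
The sampling rule described above (impute a missing response by drawing uniformly from the observed responses of the k nearest neighbours in covariate space) can be sketched in a few lines; the released package may differ in details such as the distance metric and tie handling.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_sampler(X_obs, y_obs, X_miss, k=5, n_draws=1, rng=None):
        # Draw imputations for each missing response uniformly at random from the
        # observed responses of its k nearest neighbours in covariate space.
        rng = rng or np.random.default_rng()
        nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
        _, idx = nn.kneighbors(X_miss)                     # (n_miss, k) neighbour indices
        picks = rng.integers(0, k, size=(len(X_miss), n_draws))
        return y_obs[idx[np.arange(len(X_miss))[:, None], picks]]

    rng = np.random.default_rng(0)
    X_obs = rng.uniform(0, 1, (500, 2))
    y_obs = np.sin(4 * X_obs[:, 0]) + rng.normal(0, 0.3, 500)   # toy regression data
    X_miss = rng.uniform(0, 1, (10, 2))
    draws = knn_sampler(X_obs, y_obs, X_miss, k=10, n_draws=20, rng=rng)  # 20 draws per unit

Because each call returns random draws rather than a conditional mean, repeated calls yield the multiple imputations mentioned in the abstract.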


[10] 2509.08451

Comparing Methodologies for Ranking Alternatives: A case study in assessing bank financial performance

Bank financial performance encapsulates an institution's capacity to effectively manage its assets, capital, and operational activities to generate profits and ensure stability. Evaluating this performance necessitates the integration of diverse metrics, including profitability indicators, loan growth rates, capital utilization efficiency, and more. Nevertheless, directly comparing financial performance across different banks presents a complex challenge due to inherent disparities in their specific performance parameters. Multi-criteria decision-making (MCDM) techniques are frequently employed to navigate this intricate assessment. This study undertakes a comparative analysis of various MCDM approaches in evaluating bank financial performance. Our investigation encompasses both a comparison of methods for assigning weights to criteria and a comparison of methodologies for ranking the alternatives (banks). We examine five distinct weighting methods: Equal, Entropy, MEREC, LOPCOW, and SPC. Concurrently, three methods for ranking the alternatives (Probability, TOPSIS, and RAM) are compared. These comparisons are conducted within the context of a case study involving the performance assessment of 19 banks. The findings indicate that the highest degree of stability in ranking bank financial performance is achieved when the Entropy method is utilized for criteria weighting in conjunction with the Probability method for ranking alternatives.
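
Of the weighting schemes listed, the Entropy method is the most self-contained to illustrate: criteria whose normalized values are more dispersed across alternatives receive larger weights. The sketch below implements the textbook entropy weight method on made-up numbers; the paper's exact normalization conventions may differ, and the resulting rankings can then be compared with scipy.stats.spearmanr.

    import numpy as np

    def entropy_weights(X):
        # X: decision matrix (alternatives x criteria) with positive, benefit-type values.
        P = X / X.sum(axis=0)                            # column-wise proportions
        k = 1.0 / np.log(X.shape[0])
        E = -k * np.sum(P * np.log(P + 1e-12), axis=0)   # entropy of each criterion
        d = 1.0 - E                                      # degree of diversification
        return d / d.sum()                               # criteria weights

    X = np.array([[0.12, 0.80, 3.1],
                  [0.10, 0.95, 2.4],
                  [0.15, 0.70, 3.8]])
    print(entropy_weights(X))   # more dispersed criteria receive larger weights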


[11] 2509.08457

Gaussian Process Regression -- Neural Network Hybrid with Optimized Redundant Coordinates

Recently, a Gaussian Process Regression - neural network (GPRNN) hybrid machine learning method was proposed, which is based on additive-kernel GPR in redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823]. The method combines the expressive power of an NN with the robustness of linear regression, in particular with respect to overfitting when the number of neurons is increased beyond the optimal value. We introduce opt-GPRNN, in which the redundant coordinates of GPRNN are optimized with a Monte Carlo algorithm, and show that, with optimized redundant coordinates, GPRNN attains the lowest test set error with far fewer terms/neurons and retains the advantage of avoiding overfitting when the number of neurons is increased beyond the optimal value. The method, opt-GPRNN, possesses an expressive power closer to that of a multilayer NN and could obviate the need for deep NNs in some applications. With optimized redundant coordinates, a dimensionality reduction regime is also possible. Examples of application to machine learning of an interatomic potential and to materials informatics are given.


[12] 2509.08553

PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce $\textit{PEHRT}$, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and designed for streamlined execution across institutions based on our extensive real-world experience. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.


[13] 2509.08591

Functional Regression with Nonstationarity and Error Contamination: Application to the Economic Impact of Climate Change

This paper studies a functional regression model with nonstationary dependent and explanatory functional observations, in which the nonstationary stochastic trends of the dependent variable are explained by those of the explanatory variable, and the functional observations may be error-contaminated. We develop novel autocovariance-based estimation and inference methods for this model. The methodology is broadly applicable to economic and statistical functional time series with nonstationary dynamics. To illustrate our methodology and its usefulness, we apply it to the evaluation of the global economic impact of climate change, an issue of intrinsic importance.


[14] 2509.08619

A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo

The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However, the recent paper of Chen et al. (2024) identifies an intriguing new delocalization effect: For a class of distributions with sparse interactions, the bias between low-dimensional marginals scales only with the lower dimension, not the full dimension. In this work, we strengthen the results of Chen et al. (2024) in the sparse interaction regime by removing a logarithmic factor, measuring distance in relative entropy (a.k.a. KL-divergence), and relaxing the strong log-concavity assumption. In addition, we expand the scope of the delocalization phenomenon by showing that it holds for a class of distributions with weak interactions. Our proofs are based on a hierarchical analysis of the marginal relative entropies, inspired by the authors' recent work on propagation of chaos.


[15] 2509.08647

Assessing Bias in the Variable Bandpass Periodic Block Bootstrap Method

The Variable Bandpass Periodic Block Bootstrap (VBPBB) is an innovative method for time series with periodically correlated (PC) components. This method applies bandpass filters to extract specific PC components from datasets, effectively eliminating unwanted interference such as noise. It then bootstraps the PC components, maintaining their correlation structure while resampling and enabling clearer estimation of the statistical properties of periodic patterns in time series data. While its efficiency has been demonstrated in environmental and epidemiological research, the theoretical properties of VBPBB, particularly the bias of its estimated sampling distributions, remain unexamined. This study investigates biases in VBPBB, including overall mean bias and pointwise mean bias, across a range of time series models of varying complexity, all of which exhibit periodic components. Using the R programming language, we simulate various PC time series and apply VBPBB to assess its bias under different conditions. Our findings provide key insights into the validity of VBPBB for periodic time series analysis and offer practical recommendations for its implementation, as well as directions for future theoretical advancements.
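
The two ingredients named above, a bandpass filter that isolates the periodic component of interest and a block bootstrap that resamples whole periods, can be illustrated as follows. This is a generic sketch in Python (the study itself uses R) with an assumed daily period of 24 observations, not the authors' exact VBPBB procedure.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    rng = np.random.default_rng(0)
    period, n = 24, 24 * 200
    t = np.arange(n)
    x = 2 * np.sin(2 * np.pi * t / period) + rng.normal(0, 1, n)   # daily cycle + noise

    # Bandpass filter around the frequency of the targeted periodic component.
    f0 = 1.0 / period
    sos = butter(4, [0.8 * f0, 1.2 * f0], btype="bandpass", fs=1.0, output="sos")
    pc = sosfiltfilt(sos, x)

    # Block bootstrap: resample whole period-length blocks so the within-period
    # correlation structure is preserved, then summarize the seasonal pattern.
    blocks = pc.reshape(-1, period)                 # (n_blocks, period)
    boot_means = np.empty((500, period))
    for b in range(500):
        idx = rng.integers(0, blocks.shape[0], blocks.shape[0])
        boot_means[b] = blocks[idx].mean(axis=0)
    lo, hi = np.percentile(boot_means, [2.5, 97.5], axis=0)   # pointwise bootstrap band

Comparing such bootstrap summaries to the known truth in simulation is in the spirit of the overall and pointwise mean bias assessments described above.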


[16] 2509.08702

Generative AI as a Safety Net for Survey Question Refinement

Writing survey questions that easily and accurately convey their intent to a variety of respondents is a demanding and high-stakes task. Despite the extensive literature on best practices, the number of considerations to keep in mind is vast and even small errors can render collected data unusable for its intended purpose. The process of drafting initial questions, checking for known sources of error, and developing solutions to those problems requires considerable time, expertise, and financial resources. Given the rising costs of survey implementation and the critical role that polls play in media, policymaking, and research, it is vital that we utilize all available tools to protect the integrity of survey data and the financial investments made to obtain it. Since the launch of ChatGPT in 2022, it and other generative AI platforms have been integrated into everyday processes and workflows, particularly those pertaining to text revision. While many researchers have begun exploring how generative AI may assist with questionnaire design, we implemented a prompt experiment to systematically test what kind of feedback on survey questions an average ChatGPT user can expect. Results from our zero-shot prompt experiment, which randomized the version of ChatGPT and the persona given to the model, show that generative AI is a valuable tool today, even for an average AI user, and suggest that AI will play an increasingly prominent role in the evolution of survey development best practices as precise tools are developed.


[17] 2509.08744

Who has the best probabilities? Luck versus skill in prediction tournaments

An informal and elementary introduction to probability scoring and forecast verification and improvement, slightly extended from Significance 22:3(2025)16, which might be useful for less mathematical readers as a prologue to the classic review by Gneiting and Raftery [Strictly proper scoring rules, prediction, and estimation, Journal of the American Statistical Association 102 (2007): 359].


[18] 2509.08788

Doubly robust average treatment effect estimation for survival data

Accounting for censored outcomes in survival analysis can lead to quite complex results in the causal inference model setting. Causal inference has attracted a great deal of attention over the past few years, but little of this research has addressed survival analysis. In the limited research that does exist, machine learning methods have been considered under a large-sample assumption, which is unsuitable because the actual data are high dimensional with a low sample size (HDLSS). We therefore apply a penalty to the numerous covariates, reflect the relationship between these covariates and the treatment variable through a covariate balancing propensity score (CBPS), and account for censored outcomes. To this end, we address these problems using penalized empirical likelihood, which combines the estimating equation with the penalty. The proposed average treatment effect (ATE) estimator possesses the oracle property, exhibiting key characteristics such as double robustness for unbiasedness, sparsity in model selection, and asymptotic normality.


[19] 2509.08795

On the inclusion of non-concurrent controls in platform trials with a futility interim analysis

The analysis of platform trials can be enhanced by utilizing non-concurrent controls. Since including this data might also introduce bias in the treatment effect estimators if time trends are present, methods for incorporating non-concurrent controls adjusting for time have been proposed. However, so far their behavior has not been systematically investigated in platform trials that include interim analyses. To evaluate the impact of a futility interim analysis in trials utilizing non-concurrent controls, we consider a platform trial featuring two experimental arms and a shared control, with the second experimental arm entering later. We focus on a frequentist regression model that uses non-concurrent controls to estimate the treatment effect of the second arm and adjusts for time using a step function to account for temporal changes. We show that performing a futility interim analysis in Arm 1 may introduce bias in the point estimation of the effect in Arm 2, if the regression model is used without adjustment, and investigate how the marginal bias and bias conditional on the first arm continuing after the interim depend on different trial design parameters. Moreover, we propose a new estimator of the treatment effect in Arm 2, aiming to eliminate the bias introduced by both the interim analysis in Arm 1 and the time trends, and evaluate its performance in a simulation study. The newly proposed estimator is shown to substantially reduce the bias and type I error rate inflation while leading to power gains compared to an analysis using only concurrent controls.
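
For concreteness, the kind of regression model referred to above (treatment indicators plus a step function over calendar-time periods) is often written, in a generic form that may differ from the paper's exact parameterization, as $E(y_j) = \eta_0 + \theta_1 \cdot 1\{j \text{ in Arm 1}\} + \theta_2 \cdot 1\{j \text{ in Arm 2}\} + \sum_{s \geq 2} \tau_s \cdot 1\{j \text{ recruited in period } s\}$, where the periods are the intervals delimited by arms entering or leaving the platform, and $\theta_2$ is the time-adjusted effect of Arm 2 versus control estimated from both concurrent and non-concurrent controls. The marginal and conditional biases studied above concern $\hat\theta_2$ under this kind of model when the futility decision in Arm 1 is ignored.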


[20] 2509.08001

Network Contagion in Financial Labor Markets: Predicting Turnover in Hong Kong

Employee turnover is a critical challenge in financial markets, yet little is known about the role of professional networks in shaping career moves. Using the Hong Kong Securities and Futures Commission (SFC) public register (2007-2024), we construct temporal networks of 121,883 professionals and 4,979 firms to analyze and predict employee departures. We introduce a graph-based feature propagation framework that captures peer influence and organizational stability. Our analysis shows a contagion effect: professionals are 23% more likely to leave when over 30% of their peers depart within six months. Embedding these network signals into machine learning models improves turnover prediction by 30% over baselines. These results highlight the predictive power of temporal network effects in workforce dynamics, and demonstrate how network-based analytics can inform regulatory monitoring, talent management, and systemic risk assessment.
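
The contagion feature at the heart of the prediction model (the share of a professional's firm peers who departed within the preceding six months) is straightforward to derive from a register of licensing spells. The sketch below uses made-up data and assumed column names, not the SFC register's actual schema.

    import pandas as pd

    # Hypothetical register extract; column names are assumptions for illustration.
    spells = pd.DataFrame({
        "person": ["a", "b", "c", "d", "e"],
        "firm":   ["F1", "F1", "F1", "F1", "F2"],
        "end":    pd.to_datetime(["2023-01-15", "2023-03-01", None, None, None]),
    })

    def peer_departure_share(spells, firm, as_of, window_months=6):
        # Share of the firm's registered professionals whose spell ended within the
        # window_months preceding the as_of date (NaT means still registered).
        as_of = pd.Timestamp(as_of)
        peers = spells[spells["firm"] == firm]
        departed = peers["end"].between(as_of - pd.DateOffset(months=window_months), as_of)
        return departed.mean()

    print(peer_departure_share(spells, "F1", "2023-04-01"))   # 2 of 4 peers left recently -> 0.5

Thresholding such a feature (e.g. above 30%, as in the contagion effect quoted above) is one simple way it could enter a turnover model; the paper propagates these signals over the full temporal network rather than a single firm snapshot.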


[21] 2509.08020

Towards Scalable Proteomics: Opportunistic SMC Samplers on HTCondor

Quantitative proteomics plays a central role in uncovering regulatory mechanisms, identifying disease biomarkers, and guiding the development of precision therapies. These insights are often obtained through complex Bayesian models, whose inference procedures are computationally intensive, especially when applied at scale to biological datasets. This limits the accessibility of advanced modelling techniques needed to fully exploit proteomics data. Although Sequential Monte Carlo (SMC) methods offer a parallelisable alternative to traditional Markov Chain Monte Carlo, their high-performance implementations often rely on specialised hardware, increasing both financial and energy costs. We address these challenges by introducing an opportunistic computing framework for SMC samplers, tailored to the demands of large-scale proteomics inference. Our approach leverages idle compute resources at the University of Liverpool via HTCondor, enabling scalable Bayesian inference without dedicated high-performance computing infrastructure. Central to this framework is a novel Coordinator-Manager-Follower architecture that reduces synchronisation overhead and supports robust operation in heterogeneous, unreliable environments. We evaluate the framework on a realistic proteomics model and show that opportunistic SMC delivers accurate inference with weak scaling, increasing samples generated under a fixed time budget as more resources join. To support adoption, we release CondorSMC, an open-source package for deploying SMC samplers in opportunistic computing environments.


[22] 2509.08122

In-Context Learning Enhanced Credibility Transformer

The starting point of our network architecture is the Credibility Transformer which extends the classical Transformer architecture by a credibility mechanism to improve model learning and predictive performance. This Credibility Transformer learns credibilitized CLS tokens that serve as learned representations of the original input features. In this paper we present a new paradigm that augments this architecture by an in-context learning mechanism, i.e., we increase the information set by a context batch consisting of similar instances. This allows the model to enhance the CLS token representations of the instances by additional in-context information and fine-tuning. We empirically verify that this in-context learning enhances predictive accuracy by adapting to similar risk patterns. Moreover, this in-context learning also allows the model to generalize to new instances which, e.g., have feature levels in the categorical covariates that have not been present when the model was trained -- for a relevant example, think of a new vehicle model which has just been developed by a car manufacturer.


[23] 2509.08163

Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks, such as insurance pricing or hiring score assessments, is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called "fairness gerrymandering", thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.
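
The basic building block, the sample distance covariance between model predictions and a protected attribute, can be computed from double-centered pairwise distance matrices; the sketch below shows the univariate case and a schematic penalized objective. It is an illustration only: the paper's JdCov and CCdCov constructions for several protected attributes of mixed type are not reproduced here.

    import numpy as np

    def dcov2(x, z):
        # Squared sample distance covariance (V-statistic) between predictions x and attribute z.
        a = np.abs(x[:, None] - x[None, :])
        b = np.abs(z[:, None] - z[None, :])
        A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
        B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
        return (A * B).mean()

    rng = np.random.default_rng(0)
    z = rng.normal(size=200)                        # continuous protected attribute, e.g. age
    y = 2.0 + 0.5 * z + rng.normal(size=200)        # outcome associated with z
    pred_unfair = 2.0 + 0.5 * z                     # predictor that tracks the attribute
    pred_blind = np.full(200, y.mean())             # predictor independent of the attribute
    print(dcov2(pred_unfair, z), dcov2(pred_blind, z))   # clearly positive vs. exactly zero

    # Schematic regularized training objective (an assumption, not the paper's exact loss):
    # loss(theta) = MSE(y, f_theta(X)) + lam * dcov2(f_theta(X), z)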


[24] 2509.08183

Chaotic Bayesian Inference: Strange Attractors as Risk Models for Black Swan Events

We introduce a new risk modeling framework where chaotic attractors shape the geometry of Bayesian inference. By combining heavy-tailed priors with Lorenz and Rossler dynamics, the models naturally generate volatility clustering, fat tails, and extreme events. We compare two complementary approaches: Model A, which emphasizes geometric stability, and Model B, which highlights rare bursts using Fibonacci diagnostics. Together, they provide a dual perspective for systemic risk analysis, linking Black Swan theory to practical tools for stress testing and volatility monitoring.


[25] 2509.08184

Selective Induction Heads: How Transformers Select Causal Structures In Context

Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers' ability to dynamically handle causal structures. Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To this end, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.


[26] 2509.08194

Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies--arising from different modeling paradigms--exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems--single-stage newsvendor and two-stage shipment planning--PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at this https URL.


[27] 2509.08199

Algorithmic Tradeoffs, Applied NLP, and the State-of-the-Art Fallacy

Computational sociology is growing in popularity, yet the analytic tools employed differ widely in power, transparency, and interpretability. In computer science, methods gain popularity after surpassing benchmarks of predictive accuracy, becoming the "state of the art." Computer scientists favor novelty and innovation for different reasons, but prioritizing technical prestige over methodological fit could unintentionally limit the scope of sociological inquiry. To illustrate, we focus on computational text analysis and revisit a prior study of college admissions essays, comparing analyses with both older and newer methods. These methods vary in flexibility and opacity, allowing us to compare performance across distinct methodological regimes. We find that newer techniques did not outperform prior results in meaningful ways. We also find that using the current state of the art, generative AI and large language models, could introduce bias and confounding that is difficult to extricate. We therefore argue that sociological inquiry benefits from methodological pluralism that aligns analytic choices with theoretical and empirical questions. While we frame this sociologically, scholars in other disciplines may confront what we call the "state-of-the-art fallacy", the belief that the tool computer scientists deem to be the best will work across topics, domains, and questions.


[28] 2509.08483

Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis

We analyze gradient descent with Polyak heavy-ball momentum (HB) whose fixed momentum parameter $\beta \in (0, 1)$ provides exponential decay of memory. Building on Kovachki and Stuart (2021), we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size $h$ is small enough. Although the modified loss does not admit a closed-form expression, we describe it with arbitrary precision and prove global (finite "time" horizon) approximation bounds $O(h^{R})$ for any finite order $R \geq 2$. We then conduct a fine-grained analysis of the combinatorics underlying the memoryless approximations of HB, in particular, finding a rich family of polynomials in $\beta$ hidden inside which contains Eulerian and Narayana polynomials. We derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the HB dynamics, generalizing Rosca et al. (2023). Approximation theorems cover both full-batch and mini-batch HB. Our theoretical results shed new light on the main features of gradient descent with heavy-ball momentum, and outline a road-map for similar analysis of other optimization algorithms.
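
For reference, the heavy-ball iteration analyzed here is, in its standard form (the paper's parameterization may differ), $x_{k+1} = x_k - h\,\nabla L(x_k) + \beta\,(x_k - x_{k-1})$ with fixed $\beta \in (0,1)$. The result described above says that, on an exponentially attractive invariant manifold, these iterates coincide with plain gradient descent $x_{k+1} = x_k - \tilde{h}\,\nabla \tilde{L}(x_k)$ on a modified loss $\tilde{L}$ characterized to any finite order with error $O(h^{R})$, the effective step size being inflated by the usual momentum factor of order $1/(1-\beta)$.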


[29] 2509.08514

Bias in the Loop: How Humans Evaluate AI-Generated Suggestions

Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation. While AI systems promise efficiency gains by providing automated suggestions for human review, these workflows can trigger cognitive biases that degrade performance. We know little about the psychological factors that determine when these collaborations succeed or fail. We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions. Using a controlled annotation task, we manipulated three factors: AI suggestion quality in the first three instances, task burden through required corrections, and performance-based financial incentives. We collected demographics, attitudes toward AI, and behavioral data to assess four performance metrics: accuracy, correction activity, overcorrection, and undercorrection. Two patterns emerged that challenge conventional assumptions about human-AI collaboration. First, requiring corrections for flagged AI errors reduced engagement and increased the tendency to accept incorrect suggestions, demonstrating how cognitive shortcuts influence collaborative outcomes. Second, individual attitudes toward AI emerged as the strongest predictor of performance, surpassing demographic factors. Participants skeptical of AI detected errors more reliably and achieved higher accuracy, while those favorable toward automation exhibited dangerous overreliance on algorithmic suggestions. The findings reveal that successful human-AI collaboration depends not only on algorithmic performance but also on who reviews AI outputs and how review processes are structured. Effective human-AI collaborations require consideration of human psychology: selecting diverse evaluator samples, measuring attitudes, and designing workflows that counteract cognitive biases.


[30] 2509.08560

A transport approach to the cutoff phenomenon

Substantial progress has recently been made in the understanding of the cutoff phenomenon for Markov processes, using an information-theoretic statistic known as varentropy [Sal23; Sal24; Sal25a; PS25]. In the present paper, we propose an alternative approach which bypasses the use of varentropy and exploits instead a new W-TV transport inequality, combined with a classical parabolic regularization estimate [BGL01; OV01]. While currently restricted to non-negatively curved processes on smooth spaces, our argument no longer requires the chain rule, nor any approximate version thereof. As applications, we recover the main result of [Sal25a] establishing cutoff for the log-concave Langevin dynamics, and extend the conclusion to a widely-used discrete-time sampling algorithm known as the Proximal Sampler.


[31] 2509.08593

No-Knowledge Alarms for Misaligned LLMs-as-Judges

If we use LLMs as judges to evaluate the complex decisions of other LLMs, who or what monitors the judges? Infinite monitoring chains are inevitable whenever we do not know the ground truth of the decisions by experts and we do not want to trust them. One way to ameliorate our evaluation uncertainty is to exploit the use of logical consistency between disagreeing experts. By observing how LLM judges agree and disagree while grading other LLMs, we can compute the only possible evaluations of their grading ability. For example, if two LLM judges disagree on which tasks a third one completed correctly, they cannot both be 100\% correct in their judgments. This logic can be formalized as a Linear Programming problem in the space of integer response counts for any finite test. We use it here to develop no-knowledge alarms for misaligned LLM judges. The alarms can detect, with no false positives, that at least one member or more of an ensemble of judges are violating a user specified grading ability requirement.
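
The underlying logic admits a one-line quantitative check: on every item where two judges disagree, at least one of them is wrong, so the number of disagreements can never exceed the judges' combined error budget. The paper formalizes the general ensemble case as a Linear Program; the sketch below covers only this simplest two-judge bound.

    def alarm(n_items, n_disagree, claimed_min_accuracy):
        # If both judges met the claimed accuracy, their combined errors would be at most
        # 2 * (1 - accuracy) * n_items; each disagreement forces at least one error.
        max_combined_errors = 2 * (1.0 - claimed_min_accuracy) * n_items
        return n_disagree > max_combined_errors    # True => at least one judge is misaligned

    print(alarm(n_items=100, n_disagree=30, claimed_min_accuracy=0.9))   # True: alarm fires
    print(alarm(n_items=100, n_disagree=15, claimed_min_accuracy=0.9))   # False: consistent

Because the alarm fires only when the claimed accuracies are logically impossible given the observed disagreement, it cannot produce false positives, mirroring the guarantee stated above.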


[32] 2509.08708

Quantifying model prediction sensitivity to model-form uncertainty

Model-form uncertainty (MFU) in assumptions made during physics-based model development is widely considered a significant source of uncertainty; however, there are limited approaches that can quantify MFU in predictions extrapolating beyond available data. As a result, it is challenging to know how important MFU is in practice, especially relative to other sources of uncertainty in a model, making it difficult to prioritize resources and efforts to drive down error in model predictions. To address these challenges, we present a novel method to quantify the importance of uncertainties associated with model assumptions. We combine parameterized modifications to assumptions (called MFU representations) with grouped variance-based sensitivity analysis to measure the importance of assumptions. We demonstrate how, in contrast to existing methods addressing MFU, our approach can be applied without access to calibration data. However, if calibration data is available, we demonstrate how it can be used to inform the MFU representation, and how variance-based sensitivity analysis can be meaningfully applied even in the presence of dependence between parameters (a common byproduct of calibration).


[33] 2509.08731

Data-driven generative simulation of SDEs using diffusion models

This paper introduces a new approach to generating sample paths of unknown stochastic differential equations (SDEs) using diffusion models, a class of generative AI models commonly employed in image and video applications. Unlike traditional Monte Carlo methods for simulating SDEs, which require explicit specification of the drift and diffusion coefficients, our method takes a model-free, data-driven approach. Given a finite set of sample paths from an SDE, we utilize conditional diffusion models to generate new, synthetic paths of the same SDE. To demonstrate the effectiveness of our approach, we conduct a simulation experiment comparing our method with alternative benchmark methods, including neural SDEs. Furthermore, in an empirical study we leverage these synthetically generated sample paths to enhance the performance of reinforcement learning algorithms for continuous-time mean-variance portfolio selection, hinting at promising applications of diffusion models in financial analysis and decision-making.


[34] 2509.08739

Bregman Douglas-Rachford Splitting Method

In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS) method and its variant, the Bregman Peaceman-Rachford splitting method, for solving the maximal monotone inclusion problem. We show that BDRS is equivalent to a Bregman alternating direction method of multipliers (ADMM) when applied to the dual of the problem. A special case of the Bregman ADMM is an alternating direction version of the exponential multiplier method. To the best of our knowledge, the algorithms proposed in this paper are new to the literature. We also discuss how to use our algorithms to solve the discrete optimal transport (OT) problem. We prove the convergence of the algorithms under certain assumptions, though we point out that one assumption does not apply to the OT problem.


[35] 2509.08765

PCGBandit: One-shot acceleration of transient PDE solvers via online-learned preconditioners

Data-driven acceleration of scientific computing workflows has been a high-profile aim of machine learning (ML) for science, with numerical simulation of transient partial differential equations (PDEs) being one of the main applications. The focus thus far has been on methods that require classical simulations to train, which when combined with the data-hungriness and optimization challenges of neural networks has caused difficulties in demonstrating a convincing advantage against strong classical baselines. We consider an alternative paradigm in which the learner uses a classical solver's own data to accelerate it, enabling a one-shot speedup of the simulation. Concretely, since transient PDEs often require solving a sequence of related linear systems, the feedback from repeated calls to a linear solver such as preconditioned conjugate gradient (PCG) can be used by a bandit algorithm to online-learn an adaptive sequence of solver configurations (e.g. preconditioners). The method we develop, PCGBandit, is implemented directly on top of the popular open source software OpenFOAM, which we use to show its effectiveness on a set of fluid and magnetohydrodynamics (MHD) problems.


[36] 2509.08785

Narrative-Guided Reinforcement Learning: A Platform for Studying Language Model Influence on Decision Making

We present a preliminary experimental platform that explores how narrative elements might shape AI decision-making by combining reinforcement learning (RL) with language model reasoning. While AI systems can now both make decisions and engage in narrative reasoning, these capabilities have mostly been studied separately. Our platform attempts to bridge this gap using a dual-system architecture to examine how narrative frameworks could influence reward-based learning. The system comprises a reinforcement learning policy that suggests actions based on past experience, and a language model that processes these suggestions through different narrative frameworks to guide decisions. This setup enables initial experimentation with narrative elements while maintaining consistent environment and reward structures. We implement this architecture in a configurable gridworld environment, where agents receive both policy suggestions and information about their surroundings. The platform's modular design facilitates controlled testing of environmental complexity, narrative parameters, and the interaction between reinforcement learning and narrative-based decisions. Our logging system captures basic decision metrics, from RL policy values to language model reasoning to action selection patterns. While preliminary, this implementation provides a foundation for studying how different narrative frameworks might affect reward-based decisions and exploring potential interactions between optimization-based learning and symbolic reasoning in AI systems.


[37] 2402.04355

PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation

We propose a likelihood-free method for comparing two distributions given samples from each, with the goal of assessing the quality of generative models. The proposed approach, PQMass, provides a statistically rigorous method for assessing the performance of a single generative model or the comparison of multiple competing models. PQMass divides the sample space into non-overlapping regions and applies chi-squared tests to the number of data samples that fall within each region, giving a p-value for the null hypothesis that the bin counts derived from the two sets of samples are drawn from the same multinomial distribution. PQMass does not depend on assumptions regarding the density of the true distribution, nor does it rely on training or fitting any auxiliary models. We evaluate PQMass on data of various modalities and dimensions, demonstrating its effectiveness in assessing the quality, novelty, and diversity of generated samples. We further show that PQMass scales well to moderately high-dimensional data and thus obviates the need for feature extraction in practical applications.
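
The counting construction described above can be illustrated with a small sketch: partition the space into non-overlapping regions (here, Voronoi cells of random reference points, an assumption for illustration), count how many samples from each set land in each region, and apply a chi-squared test to the resulting 2 x K table. This is not the released PQMass implementation, only the basic idea.

    import numpy as np
    from scipy.stats import chi2_contingency
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, (2000, 5))
    fake = rng.normal(0.1, 1.0, (2000, 5))         # slightly shifted "generated" samples

    refs = rng.normal(0.0, 1.0, (50, 5))           # reference points defining the regions
    tree = cKDTree(refs)
    regions_real = tree.query(real)[1]             # index of the nearest reference point
    regions_fake = tree.query(fake)[1]

    counts = np.vstack([np.bincount(regions_real, minlength=50),
                        np.bincount(regions_fake, minlength=50)])
    counts = counts[:, counts.sum(axis=0) > 0]     # drop empty regions
    chi2, p, dof, _ = chi2_contingency(counts)
    print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")   # small p: the two samples likely differ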


[38] 2405.07109

Bridging Binarization: Causal Inference with Dichotomized Continuous Exposures

The average treatment effect (ATE) is a common parameter estimated in the causal inference literature, but it is only defined for binary exposures. Thus, despite concerns raised by some researchers, many studies seeking to estimate the causal effect of a continuous exposure create a new binary exposure variable by dichotomizing the continuous values into two categories. In this paper, we affirm binarization as a statistically valid method for answering causal questions about continuous exposures by showing the equivalence between the binarized ATE and the difference in the average outcomes of two specific modified treatment policies. These policies impose cut-offs corresponding to the binarized exposure variable and assume preservation of relative self-selection. Relative self-selection is the ratio of the probability density of an individual having an exposure equal to one value of the continuous exposure variable versus another. The policies assume that, for any two values of the exposure variable with non-zero probability density after the cut-off, this ratio will remain unchanged. Through this equivalence, we clarify the assumptions underlying binarization and discuss how to properly interpret the resulting estimator. Additionally, we introduce a new target parameter, computable after binarization, that considers the observed world as a benchmark. We argue that this parameter addresses more relevant causal questions than the traditional binarized ATE parameter. We present a simulation study to illustrate the implications of these assumptions when analyzing data and to demonstrate how to correctly implement estimators of the parameters discussed. Finally, we present an application of this method that evaluates the effect on birth outcomes of a California law seeking to limit exposure to oil and gas wells, further illustrating the underlying assumptions.


[39] 2408.01626

Weighted Brier Score -- an Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration

As advancements in novel biomarker-based algorithms and models accelerate disease risk prediction and stratification in medicine, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output the absolute risk of disease; subsequently, patient counseling and shared decision-making are based on the estimated individual risk and cost-benefit assessment. The overall impact of the application is often referred to as clinical utility, which received significant attention in terms of model assessment lately. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for effectively assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that aligns with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components, examining how weighting influences the overall score and its individual components. Through this decomposition, we link the weighted Brier score to the $H$ measure, which has been proposed as a coherent alternative to the area under the receiver operating characteristic curve. This theoretical link to the $H$ measure further supports our weighting method and underscores the essential elements of discrimination and calibration in risk prediction evaluation. The practical use of the weighted Brier score as an overall summary is demonstrated using data from the Prostate Cancer Active Surveillance Study (PASS).
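
For context, the classical score being generalized is $\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}_i - y_i)^2$ for predicted risks $\hat{p}_i$ and binary outcomes $y_i \in \{0,1\}$. A weighted version can be written schematically as $\frac{1}{n}\sum_{i=1}^{n} w_i\,(\hat{p}_i - y_i)^2$, where, in the decision-theoretic spirit described above, the weights encode how much each region of predicted risk matters for the intended clinical decision (for example, via a distribution over risk thresholds); the paper's precise weight construction, its discrimination/calibration decomposition, and the link to the $H$ measure are not reproduced here.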


[40] 2411.04394

Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators

Models based on recursive adaptive partitioning, such as decision trees and their ensembles, are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over $d$ binary features, showing that when the true regression function $f^*$ does not satisfy Abbe et al. (2022)'s Merged Staircase Property (MSP), greedy training requires $\exp(\Omega(d))$ samples to achieve low estimation error. Conversely, when $f^*$ does satisfy MSP, greedy training can attain small estimation error with only $O(\log d)$ samples. This dichotomy mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with $O(\log d)$ samples irrespective of whether $f^*$ satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.


[41] 2412.13034

Unified calibration and spatial mapping of fine particulate matter data from multiple low-cost air pollution sensor networks in Baltimore, Maryland

Low-cost air pollution sensor networks are increasingly being deployed globally, supplementing sparse regulatory monitoring with localized air quality data. In some areas, like Baltimore, Maryland, there are only a few regulatory (reference) devices but multiple low-cost networks. While there are many available methods to calibrate data from each network individually, separate calibration of each network leads to conflicting air quality predictions. We develop a general Bayesian spatial filtering model combining data from multiple networks and reference devices, providing dynamic calibrations (informed by the latest reference data) and unified predictions (combining information from all available sensors) for the entire region. This method accounts for network-specific bias and noise (observation models), as different networks can use different types of sensors, and uses a Gaussian process (state-space model) to capture spatial correlations. We apply the method to calibrate PM$_{2.5}$ data from Baltimore in June and July 2023 -- a period including days of hazardous concentrations due to wildfire smoke. Our method helps mitigate the effects of preferential sampling of one network in Baltimore and results in better predictions and narrower confidence intervals. Our approach can be used to calibrate low-cost air pollution sensor data in Baltimore and any other areas with multiple low-cost networks.


[42] 2501.10845

A Multi-fidelity Estimator of the Expected Information Gain for Bayesian Optimal Experimental Design

Optimal experimental design (OED) is a framework that leverages a mathematical model of the experiment to identify optimal conditions for conducting the experiment. Under a Bayesian approach, the design objective function is typically chosen to be the expected information gain (EIG). However, EIG is intractable for nonlinear models and must be estimated numerically. Estimating the EIG generally entails some variant of Monte Carlo sampling, requiring repeated data model and likelihood evaluations, each involving solving the governing equations of the experimental physics, under different sample realizations. This computation becomes impractical for high-fidelity models. We introduce a novel multi-fidelity EIG (MF-EIG) estimator under the approximate control variate (ACV) framework. This estimator is unbiased with respect to the high-fidelity mean, and minimizes variance under a given computational budget. We achieve this by first reparameterizing the EIG so that its expectations are independent of the data models, a requirement for compatibility with ACV. We then provide specific examples under different data model forms, as well as practical enhancements of sample size optimization and sample reuse techniques. We demonstrate the MF-EIG estimator in two numerical examples: a nonlinear benchmark and a turbulent flow problem involving the calibration of shear-stress transport turbulence closure model parameters within the Reynolds-averaged Navier-Stokes model. We validate the estimator's unbiasedness and observe one- to two-orders-of-magnitude variance reduction compared to existing single-fidelity EIG estimators.
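
Concretely, for a design $d$, prior $p(\theta)$, and likelihood $p(y \mid \theta, d)$, the quantity being estimated is $\mathrm{EIG}(d) = \mathbb{E}_{p(\theta)\,p(y\mid\theta,d)}\big[\log p(y\mid\theta,d) - \log p(y\mid d)\big]$, and the standard single-fidelity nested Monte Carlo estimator it improves upon is $\widehat{\mathrm{EIG}}(d) = \frac{1}{N}\sum_{n=1}^{N}\log\frac{p(y^{(n)}\mid\theta^{(n)},d)}{\frac{1}{M}\sum_{m=1}^{M} p(y^{(n)}\mid\tilde{\theta}^{(m)},d)}$ with $\theta^{(n)}, \tilde{\theta}^{(m)} \sim p(\theta)$ and $y^{(n)} \sim p(y\mid\theta^{(n)},d)$. Every likelihood evaluation requires a forward solve of the experimental physics, which is the cost the MF-EIG estimator offloads to cheaper low-fidelity models under the ACV framework; the reparameterization used to make this compatible with ACV is described in the paper and not shown here.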


[43] 2504.02547

Outlier-Robust Multi-Group Gaussian Mixture Modeling with Flexible Group Reassignment

Do expert-defined or diagnostically-labeled data groups align with clusters inferred through statistical modeling? If not, where do discrepancies between predefined labels and model-based groupings occur and why? In this work, we show how to address these questions using the multi-group Gaussian mixture model (MG-GMM). This novel model incorporates prior group information while allowing flexibility to reassign observations to alternative groups based on data-driven evidence. We achieve this by modeling the observations of each group as arising not from a single distribution, but from a Gaussian mixture comprising all group-specific distributions. Moreover, our model offers robustness against cellwise outliers that may obscure or distort the underlying group structure. We propose a new penalized likelihood approach, called cellMG-GMM, to jointly estimate mixture probabilities, location and scale parameters of the MG-GMM, and detect outliers through a penalty term on the number of flagged cellwise outliers in the objective function. We show that our estimator has good breakdown properties in the presence of cellwise outliers. We develop a computationally-efficient EM-based algorithm for cellMG-GMM, and demonstrate its strong performance in identifying and diagnosing observations at the intersection of multiple groups through simulations and diverse applications in meteorology, medicine and oenology.


[44] 2506.19958

Introducing RobustiPy: An efficient next generation multiversal library with model selection, averaging, resampling, and explainable artificial intelligence

Scientific inference is often undermined by the vast but rarely explored "multiverse" of defensible modelling choices, which can generate results as variable as the phenomena under study. We introduce RobustiPy, an open-source Python library that systematizes multiverse analysis and model-uncertainty quantification at scale. RobustiPy unifies bootstrap-based inference, combinatorial specification search, model selection and averaging, joint-inference routines, and explainable AI methods within a modular, reproducible framework. Beyond exhaustive specification curves, it supports rigorous out-of-sample validation and quantifies the marginal contribution of each covariate. We demonstrate its utility across five simulation designs and ten empirical case studies spanning economics, sociology, psychology, and medicine, including a re-analysis of widely cited findings with documented discrepancies. Benchmarking on ~672 million simulated regressions shows that RobustiPy delivers state-of-the-art computational efficiency while expanding transparency in empirical research. By standardizing and accelerating robustness analysis, RobustiPy transforms how researchers interrogate sensitivity across the analytical multiverse, offering a practical foundation for more reproducible and interpretable computational science.
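
As a rough illustration of what a specification-curve (multiverse) analysis computes, the sketch below fits an OLS model for every subset of a few hypothetical controls and records the focal coefficient with its confidence interval. It uses plain statsmodels and simulated data; it does not use or reproduce RobustiPy's interface.

# Minimal specification-curve sketch: one focal covariate x, all subsets of
# optional controls, one fitted coefficient per specification.
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
controls = ["z1", "z2", "z3"]                      # hypothetical control names
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=controls)
df["x"] = rng.normal(size=n) + 0.5 * df["z1"]
df["y"] = 1.0 + 0.8 * df["x"] + 0.4 * df["z2"] + rng.normal(size=n)

results = []
for k in range(len(controls) + 1):
    for subset in combinations(controls, k):
        X = sm.add_constant(df[["x", *subset]])
        fit = sm.OLS(df["y"], X).fit()
        results.append({"controls": subset,
                        "beta_x": fit.params["x"],
                        "ci_low": fit.conf_int().loc["x", 0],
                        "ci_high": fit.conf_int().loc["x", 1]})

spec_curve = pd.DataFrame(results).sort_values("beta_x")
print(spec_curve)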


[45] 2507.14652

Accelerating Hamiltonian Monte Carlo for Bayesian Inference in Neural Networks and Neural Operators

Hamiltonian Monte Carlo (HMC) is a powerful and accurate method to sample from the posterior distribution in Bayesian inference. However, HMC techniques are computationally demanding for Bayesian neural networks due to the high dimensionality of the network's parameter space and the non-convexity of their posterior distributions. Therefore, various approximation techniques, such as variational inference (VI) or stochastic gradient MCMC, are often employed to infer the posterior distribution of the network parameters. Such approximations introduce inaccuracies in the inferred distributions, resulting in unreliable uncertainty estimates. In this work, we propose a hybrid approach that combines inexpensive VI and accurate HMC methods to efficiently and accurately quantify uncertainties in neural networks and neural operators. The proposed approach leverages an initial VI training on the full network. We examine the influence of individual parameters on the prediction uncertainty, which shows that a large proportion of the parameters do not contribute substantially to uncertainty in the network predictions. This information is then used to significantly reduce the dimension of the parameter space, and HMC is performed only for the subset of network parameters that strongly influence prediction uncertainties. This yields a framework for accelerating the full batch HMC for posterior inference in neural networks. We demonstrate the efficiency and accuracy of the proposed framework on deep neural networks and operator networks, showing that inference can be performed for large networks with tens to hundreds of thousands of parameters. We show that this method can effectively learn surrogates for complex physical systems by modeling the operator that maps from upstream conditions to wall-pressure data on a cone in hypersonic flow.


[46] 2508.07915

Adding structure to generalized additive models, with applications in ecology

Generalized additive models (GAMs) connecting a set of scalar covariates, each mapping one-to-one to the response, are commonly employed in ecology and beyond. However, covariates are often inherently non-scalar, taking multiple values for each observation of the response. They can sometimes have a temporal structure, e.g., a time series of temperatures, or a spatial structure, e.g., multiple soil pH measurements made at nearby locations. While aggregating or selectively summarizing such covariates to yield a scalar covariate allows the use of standard GAM fitting procedures, exactly how to do so can be problematic and information is necessarily lost. Naively including all $p$ components of a vector-valued covariate as $p$ separate covariates, without recognizing their structure, can lead to problems of multicollinearity, data sets that are excessively wide given the sample size, and difficulty extracting the primary signal provided by the covariate. Here we introduce three useful extensions to GAMs that efficiently and effectively handle vector-valued covariates without requiring one to choose aggregations or selective summarizations. These extensions are varying-coefficient, scalar-on-function and distributed lag models. While these models have existed for some time they remain relatively underused in ecology. This article aims to show when these models can be useful and how to fit them with the popular R package mgcv.


[47] 2508.08167

Variance Estimation for Weighted Average Treatment Effects

Common variance estimation methods for weighted average treatment effects (WATEs) in observational studies include nonparametric bootstrap and model-based, closed-form sandwich variance estimation. However, the computational cost of bootstrap increases with the size of the data at hand. In addition, some replicates may exhibit random violations of the positivity assumption even when the original data do not. Sandwich variance estimation relies on regularity conditions that may be structurally violated. Moreover, the sandwich variance estimate depends on the propensity score model, the outcome model, or both; thus it does not have a unified closed-form expression. Recent studies have explored the use of wild bootstrap to estimate the variance of the average treatment effect on the treated (ATT). This technique adopts a one-dimensional, nonparametric, and computationally efficient resampling strategy. In this article, we propose a "post-weighting" bootstrap approach as an alternative to the conventional bootstrap, which helps avoid random positivity violations in replicates and improves computational efficiency. We also generalize the wild bootstrap algorithm from ATT to the broader class of WATEs by providing new justification for correctly accounting for sampling variability from multiple sources under different weighting functions. We evaluate the performance of all four methods through extensive simulation studies and demonstrate their application using data from the National Health and Nutrition Examination Survey (NHANES). Our findings offer several practical recommendations for the variance estimation of WATE estimators.
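
For concreteness, the sketch below shows the conventional nonparametric bootstrap that the article takes as a point of comparison, applied to a simple inverse-probability-weighted ATE (one member of the WATE class). The data-generating process is illustrative; the paper's post-weighting and wild-bootstrap proposals are not implemented here.

# Conventional bootstrap variance of an IPW (Hajek-style) ATE estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 2))
ps_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, ps_true)
Y = 1.0 + 2.0 * A + X[:, 0] + rng.normal(size=n)

def ipw_ate(X, A, Y):
    ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    w1, w0 = A / ps, (1 - A) / (1 - ps)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

point = ipw_ate(X, A, Y)
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)              # resample rows with replacement
    boot.append(ipw_ate(X[idx], A[idx], Y[idx]))
print(f"ATE estimate {point:.3f}, bootstrap SE {np.std(boot, ddof=1):.3f}")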


[48] 2508.11042

Sample efficient likelihood-free inference for virus dynamics with different types of experiments

This study applies Bayesian optimization likelihood-free inference (BOLFI) to virus dynamics experimental data and efficiently infers the model parameters together with an uncertainty measure. The computational benefit is remarkable compared to existing methodology for the same problem, and no knowledge of the likelihood is needed for the inference. We provide an improvement of the BOLFI algorithm that uses a Gaussian process based classifier to handle extreme values. We also propose discrepancy designs for combining different forms of data arising from completely different experimental processes, test them with synthetic data, and then apply them to real data. Reasonable parameter values are estimated for influenza A virus data.


[49] 2508.14321

piCurve: an R package for modeling photosynthesis-irradiance curves

Photosynthesis-irradiance (PI) curves are foundational for quantifying primary production, parameterizing ecosystem and biogeochemical models, and interpreting physiological acclimation to light. Despite their broad use, researchers lack a unified, reproducible toolkit to fit, compare, and diagnose the many PI formulations that have accumulated over the last century. We introduce piCurve, an R package that standardizes the modeling of PI relationships, with a library of widely used light-limited, light-saturated, and photoinhibited formulations and a consistent statistical framework for estimation and comparison. With a total of 24 PI models, piCurve supports mean squared error (MSE) and maximum likelihood estimation (MLE), provides uncertainty quantification via the information matrix (Hessian), and includes automated, data-informed initialization to improve convergence. Utilities classify PI data into light-limited, light-saturated, and photoinhibited regions, while plotting and 'tidy' helpers streamline workflow and reporting. Together, these features enable reproducible analyses and fair model comparisons, including for curves exhibiting a plateau followed by photoinhibition.
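
As an example of the kind of model piCurve standardizes, the sketch below fits one classic photoinhibited PI formulation (Platt et al. 1980), $P(I) = P_s\,(1 - e^{-\alpha I / P_s})\,e^{-\beta I / P_s}$, by nonlinear least squares with scipy. It does not use or reproduce the piCurve interface, and all parameter values and data are illustrative.

# Least-squares fit of the Platt et al. (1980) photoinhibited PI curve.
import numpy as np
from scipy.optimize import curve_fit

def platt1980(I, Ps, alpha, beta):
    return Ps * (1 - np.exp(-alpha * I / Ps)) * np.exp(-beta * I / Ps)

rng = np.random.default_rng(3)
irradiance = np.linspace(0, 1500, 40)                       # e.g. micromol photons m-2 s-1
true = platt1980(irradiance, Ps=12.0, alpha=0.05, beta=0.004)
production = true + rng.normal(scale=0.4, size=irradiance.size)

p0 = [production.max(), 0.02, 0.001]                        # data-informed starting values
params, cov = curve_fit(platt1980, irradiance, production, p0=p0)
se = np.sqrt(np.diag(cov))                                  # Hessian-based standard errors
for name, est, s in zip(["Ps", "alpha", "beta"], params, se):
    print(f"{name}: {est:.4f} +/- {s:.4f}")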


[50] 2508.20278

Interpretable Scalar-on-Image Linear Regression Models via the Generalized Dantzig Selector

The scalar-on-image regression model examines the association between a scalar response and a bivariate function (e.g., images) through the estimation of a bivariate coefficient function. Existing approaches often impose smoothness constraints to control the bias-variance trade-off, and thus prevent overfitting. However, such assumptions can hinder interpretability, especially when only certain regions of an image influence changes in the response. In such a scenario, interpretability can be better captured by imposing sparsity assumptions on the coefficient function. To address this challenge, we propose the Generalized Dantzig Selector, a novel method that jointly enforces sparsity and smoothness on the coefficient function. The proposed approach enhances interpretability by accurately identifying regions with no contribution to changes in the response, while preserving stability in estimation. Extensive simulation studies and real data applications demonstrate that the new method is highly interpretable and achieves notable improvements over existing approaches. Moreover, we rigorously establish non-asymptotic bounds for the estimation error, providing strong theoretical guarantees for the proposed framework.
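
For reference, the classical (non-generalized) Dantzig selector solves $\min \|\beta\|_1$ subject to $\|X^\top(y - X\beta)\|_\infty \le \lambda$, which is a linear program. The sketch below solves that classical problem with scipy on simulated data; the paper's Generalized Dantzig Selector additionally enforces smoothness of the bivariate coefficient function, which this sketch does not attempt, and the tuning value of $\lambda$ is an illustrative choice.

# Classical Dantzig selector as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, p, noise_sd = 100, 30, 0.5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=noise_sd, size=n)
lam = noise_sd * np.sqrt(2 * n * np.log(p))      # illustrative tuning choice

G, b = X.T @ X, X.T @ y
# Variables z = (beta, u) with |beta_j| <= u_j; minimize sum(u) = ||beta||_1.
c = np.concatenate([np.zeros(p), np.ones(p)])
A_ub = np.vstack([
    np.hstack([ np.eye(p), -np.eye(p)]),         #  beta - u <= 0
    np.hstack([-np.eye(p), -np.eye(p)]),         # -beta - u <= 0
    np.hstack([ G, np.zeros((p, p))]),           #  X^T X beta <= X^T y + lam
    np.hstack([-G, np.zeros((p, p))]),           # -X^T X beta <= lam - X^T y
])
b_ub = np.concatenate([np.zeros(2 * p), b + lam, lam - b])
bounds = [(None, None)] * p + [(0, None)] * p
sol = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = sol.x[:p]
idx = np.flatnonzero(np.abs(beta_hat) > 1e-4)
print("selected coefficients:", dict(zip(idx.tolist(), np.round(beta_hat[idx], 3))))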


[51] 2509.05877

Uncertainty Quantification in Probabilistic Machine Learning Models: Theory, Methods, and Insights

Uncertainty Quantification (UQ) is essential in probabilistic machine learning models, particularly for assessing the reliability of predictions. In this paper, we present a systematic framework for estimating both epistemic and aleatoric uncertainty in probabilistic models. We focus on Gaussian Process Latent Variable Models and employ scalable Random Fourier Features-based Gaussian Processes to approximate predictive distributions efficiently. We derive a theoretical formulation for UQ, propose a Monte Carlo sampling-based estimation method, and conduct experiments to evaluate the impact of uncertainty estimation. Our results provide insights into the sources of predictive uncertainty and illustrate the effectiveness of our approach in quantifying the confidence in the predictions.
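
A minimal sketch of the scalable ingredient mentioned above: a Random Fourier Features approximation to an RBF-kernel GP, fit as Bayesian linear regression so that the predictive variance separates into an epistemic term (weight uncertainty) and an aleatoric term (observation noise). The hyperparameters and toy data are assumptions of the sketch, not the paper's latent-variable setup.

# RFF-based GP regression with an epistemic/aleatoric variance split.
import numpy as np

rng = np.random.default_rng(5)
n, D, lengthscale, noise_sd, prior_sd = 80, 300, 0.5, 0.2, 1.0

x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(2 * x) + noise_sd * rng.normal(size=n)

W = rng.normal(scale=1.0 / lengthscale, size=D)     # spectral frequencies
b = rng.uniform(0, 2 * np.pi, size=D)               # random phases

def features(x):
    return np.sqrt(2.0 / D) * np.cos(np.outer(x, W) + b)

Phi = features(x)
A = Phi.T @ Phi / noise_sd**2 + np.eye(D) / prior_sd**2
A_inv = np.linalg.inv(A)
mu_w = A_inv @ Phi.T @ y / noise_sd**2               # posterior mean of feature weights

x_test = np.linspace(-4, 4, 5)
Phi_t = features(x_test)
mean = Phi_t @ mu_w
epistemic_var = np.einsum("ij,jk,ik->i", Phi_t, A_inv, Phi_t)
aleatoric_var = noise_sd**2 * np.ones_like(mean)
for xt, m, e, a in zip(x_test, mean, epistemic_var, aleatoric_var):
    print(f"x={xt:+.2f}  mean={m:+.3f}  epistemic={e:.4f}  aleatoric={a:.4f}")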


[52] 2303.02444

Calibrating Transformers via Sparse Gaussian Processes

Transformer models have achieved profound success in prediction tasks in a wide range of applications in natural language processing, speech recognition and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.


[53] 2406.00574

On the Sample Complexity of Set Membership Estimation for Linear Systems with Disturbances Bounded by Convex Sets

This paper revisits set membership estimation (SME) for linear control systems and establishes its convergence rates under relaxed assumptions on (i) the persistent excitation requirement and (ii) the system disturbances. In particular, instead of assuming persistent excitation exactly, this paper adopts the block-martingale small-ball condition enabled by randomly perturbed control policies to establish the convergence rates of SME with high probability. Further, we relax the assumptions on the shape of the bounded disturbance set and the boundary-visiting condition. Our convergence rates hold for disturbances bounded by general convex sets, which bridges the gap between the previous convergence analysis for general convex sets and the existing convergence rate analysis for $\ell_\infty$ balls. Further, we validate our convergence rates by several numerical experiments. This manuscript contains supplementary content in the Appendix.
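
To make the estimator concrete, the sketch below runs set membership estimation for a scalar toy system $x_{t+1} = a x_t + u_t + w_t$ with $|w_t| \le W$: each observation constrains $a$ to an interval, and the membership set is the intersection of all such intervals. The true parameter, noise bound, and input distribution are illustrative choices of this sketch, not the paper's general multivariate setting.

# Set membership estimation for a scalar linear system with bounded noise.
import numpy as np

rng = np.random.default_rng(10)
a_true, W, T = 0.8, 0.5, 500
x = np.zeros(T + 1)
u = rng.normal(size=T)                         # randomly perturbed inputs (excitation)
w = rng.uniform(-W, W, size=T)                 # bounded disturbances
for t in range(T):
    x[t + 1] = a_true * x[t] + u[t] + w[t]

lo, hi = -np.inf, np.inf
for t in range(T):
    if abs(x[t]) < 1e-8:
        continue                               # this step carries no information about a
    bound1 = (x[t + 1] - u[t] - W) / x[t]
    bound2 = (x[t + 1] - u[t] + W) / x[t]
    lo = max(lo, min(bound1, bound2))
    hi = min(hi, max(bound1, bound2))
print(f"membership interval for a: [{lo:.4f}, {hi:.4f}] (truth {a_true})")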


[54] 2410.20890

Generative Example-Based Explanations: Bridging the Gap between Generative Modeling and Explainability

Recently, several methods have leveraged deep generative modeling to produce example-based explanations of image classifiers. Despite producing visually stunning results, these methods are largely disconnected from classical explainability literature. This conceptual and communication gap leads to misunderstandings and misalignments in goals and expectations. In this paper, we bridge this gap by proposing a probabilistic framework for example-based explanations, formally defining example-based explanations in a probabilistic manner amenable to modeling via deep generative models while remaining coherent with the critical characteristics and desiderata widely accepted in the explainability community. Our aim is, on the one hand, to provide a constructive framework for the development of well-grounded generative algorithms for example-based explanations and, on the other, to facilitate communication between the generative and explainability research communities, foster rigor and transparency, and improve the quality of peer discussion and research progress in this promising direction.


[55] 2411.17109

On the maximal correlation of some stochastic processes

We study the maximal correlation coefficient $R(X,Y)$ between two stochastic processes $X$ and $Y$. In the case when $(X,Y)$ is a random walk, we find $R(X,Y)$ using the Csáki-Fischer identity and the lower semicontinuity of the map $\text{Law}(X,Y) \to R(X,Y)$. When $(X,Y)$ is a two-dimensional Lévy process, we express $R(X,Y)$ in terms of the Lévy measure of the process and the covariance matrix of the diffusion part of the process. Consequently, for a two-dimensional $\alpha$-stable random vector $(X,Y)$ with $0<\alpha<2$, we express $R(X,Y)$ in terms of $\alpha$ and the spectral measure $\tau$ of the $\alpha$-stable distribution. We also establish analogs and extensions of the Dembo-Kagan-Shepp-Yu inequality and the Madiman-Barron inequality.
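
For readers who want the definition spelled out, $R(X,Y)$ here is the standard (Hirschfeld-Gebelein-Rényi) maximal correlation coefficient,
$$ R(X,Y) \;=\; \sup_{f,\,g} \operatorname{Corr}\bigl(f(X),\, g(Y)\bigr), $$
where the supremum runs over all Borel functions $f$ and $g$ with $0 < \operatorname{Var} f(X) < \infty$ and $0 < \operatorname{Var} g(Y) < \infty$; this definition is supplied here for orientation and is not part of the original abstract.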


[56] 2501.06777

Identification and Estimation of Simultaneous Equation Models Using Higher-Order Cumulant Restrictions

Identifying structural parameters in linear simultaneous-equation models is a longstanding challenge. Recent work exploits information in higher-order moments of non-Gaussian data. In this literature, the structural errors are typically assumed to be uncorrelated so that, after standardizing the covariance matrix of the observables (whitening), the structural parameter matrix becomes orthogonal -- a device that underpins many identification proofs but can be restrictive in econometric applications. We show that neither zero covariance nor whitening is necessary. For any order $h>2$, a simple diagonality condition on the $h$th-order cumulants alone identifies the structural parameter matrix -- up to unknown scaling and permutation -- as the solution to an eigenvector problem; no restrictions on cumulants of other orders are required. This general, single-order result enlarges the class of models covered by our framework and yields a sample-analogue estimator that is $\sqrt{n}$-consistent, asymptotically normal, and easy to compute. Furthermore, when uncorrelatedness is intrinsic -- as in vector autoregressive (VAR) models -- our framework provides a transparent overidentification test. Monte Carlo experiments show favorable finite-sample performance, and two applications -- "Returns to Schooling" and "Uncertainty and the Business Cycle" -- demonstrate its practical value.


[57] 2503.00300

Cauchy Random Features for Operator Learning in Sobolev Space

Operator learning is the approximation of operators between infinite dimensional Banach spaces using machine learning approaches. While most progress in this area has been driven by variants of deep neural networks such as the Deep Operator Network and Fourier Neural Operator, the theoretical guarantees are often in the form of a universal approximation property. However, the existence theorems do not guarantee that an accurate operator network is obtainable in practice. Motivated by the recent kernel-based operator learning framework, we propose a random feature operator learning method with theoretical guarantees and error bounds. The random feature method can be viewed as a randomized approximation of a kernel method, which significantly reduces the computation requirements for training. We provide a generalization error analysis for our proposed random feature operator learning method along with comprehensive numerical results. Compared to kernel-based methods and neural network methods, the proposed method obtains similar or better test errors across benchmark examples with significantly reduced training times. An additional advantage is that our implementation is simple and does not require costly computational resources, such as GPUs.


[58] 2506.06486

A Certified Unlearning Approach without Access to Source Data

With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.


[59] 2506.06656

Rescaled Influence Functions: Accurate Data Attribution in High Dimension

How does the training data affect a model's behavior? This is the question we seek to answer with data attribution. The leading practical approaches to data attribution are based on influence functions (IF). IFs utilize a first-order Taylor approximation to efficiently predict the effect of removing a set of samples from the training set without retraining the model, and are used in a wide variety of machine learning applications. However, especially in the high-dimensional regime (# params $\geq \Omega($# samples$)$), they are often imprecise and tend to underestimate the effect of sample removals, even for simple models such as logistic regression. We present rescaled influence functions (RIF), a new tool for data attribution which can be used as a drop-in replacement for influence functions, with little computational overhead but significant improvement in accuracy. We compare IF and RIF on a range of real-world datasets, showing that RIFs offer significantly better predictions in practice, and present a theoretical analysis explaining this improvement. Finally, we present a simple class of data poisoning attacks that would fool IF-based detections but would be detected by RIF.
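
For orientation, the standard influence-function approximation that RIFs refine predicts the effect of removing one training point as $\hat\theta_{-i} \approx \hat\theta + \tfrac{1}{n} H^{-1} \nabla \ell_i(\hat\theta)$, where $H$ is the Hessian of the training objective. The sketch below applies this to L2-regularized logistic regression on simulated data and compares it with exact retraining; the rescaling step of RIF is not implemented, and all data and hyperparameters are illustrative.

# Standard influence-function prediction of a leave-one-out parameter change.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p, lam = 300, 5, 1e-2
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ theta_true))), 1.0, -1.0)

def objective(theta, X, y):
    # mean logistic loss (labels in {-1, +1}) plus L2 regularization
    return np.mean(np.logaddexp(0.0, -y * (X @ theta))) + 0.5 * lam * theta @ theta

def fit(X, y):
    return minimize(objective, np.zeros(p), args=(X, y), method="BFGS").x

theta_hat = fit(X, y)
m = X @ theta_hat
s = 1 / (1 + np.exp(-m))                                        # fitted probabilities
H = (X * (s * (1 - s))[:, None]).T @ X / n + lam * np.eye(p)    # Hessian of the objective

i = 0                                                           # training point to "remove"
grad_i = -y[i] * X[i] / (1 + np.exp(y[i] * m[i]))               # gradient of its loss
theta_if = theta_hat + np.linalg.solve(H, grad_i) / n           # influence-function prediction
theta_loo = fit(np.delete(X, i, axis=0), np.delete(y, i))       # exact retraining (approx. comparable)

print("IF-predicted change :", np.round(theta_if - theta_hat, 4))
print("actual change       :", np.round(theta_loo - theta_hat, 4))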


[60] 2507.07751

Manifolds with kinks and the asymptotic behavior of the graph Laplacian operator with Gaussian kernel

We introduce manifolds with kinks, a class of manifolds with possibly singular boundary that notably contains manifolds with smooth boundary and corners. We derive the asymptotic behavior of the graph Laplacian operator with Gaussian kernel and its deterministic limit on these spaces as the bandwidth goes to zero. We show that this asymptotic behavior is determined by the inward sector of the tangent space and, as special cases, we derive its behavior near interior and singular points. Lastly, we show the validity of our theoretical results using numerical simulations.
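
As a small numerical companion, the sketch below builds the Gaussian-kernel graph Laplacian on points sampled from a boundaryless manifold (the unit circle) and checks that applying it to $\sin\theta$ is approximately proportional to the Laplace-Beltrami result $-\sin\theta$. The normalization constant depends on the kernel convention, so only proportionality is checked; the sample size and bandwidth are illustrative choices, not the paper's setting with singular boundary.

# Gaussian-kernel graph Laplacian on a point cloud from the unit circle.
import numpy as np

rng = np.random.default_rng(7)
n, eps = 2000, 0.01                              # sample size and squared bandwidth
theta = rng.uniform(0, 2 * np.pi, size=n)
pts = np.column_stack([np.cos(theta), np.sin(theta)])

sq_dists = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (4 * eps))                # Gaussian kernel

f = np.sin(theta)
graph_lap_f = (K @ f - K.sum(axis=1) * f) / (n * eps)   # unnormalized graph Laplacian / eps
target = -np.sin(theta)                                  # Laplace-Beltrami of sin on the circle

scale = np.dot(graph_lap_f, target) / np.dot(target, target)
residual = np.linalg.norm(graph_lap_f - scale * target) / np.linalg.norm(graph_lap_f)
print(f"fitted proportionality constant ~ {scale:.3f}, relative residual ~ {residual:.3f}")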


[61] 2509.06218

The Efficiency Frontier: Classical Shadows versus Quantum Footage

Interfacing quantum and classical processors is an important subroutine in full-stack quantum algorithms. The so-called "classical shadow" method efficiently extracts essential classical information from quantum states, enabling the prediction of many properties of a quantum system from only a few measurements. However, for a small number of highly non-local observables, or when classical post-processing power is limited, the classical shadow method is not always the most efficient choice. Here, we address this issue quantitatively by performing a full-stack resource analysis that compares classical shadows with "quantum footage," which refers to direct quantum measurement. Under certain assumptions, our analysis illustrates a boundary of download efficiency between classical shadows and quantum footage. For observables expressed as linear combinations of Pauli matrices, the classical shadow method outperforms direct measurement when the number of observables is large and the Pauli weight is small. For observables in the form of large Hermitian sparse matrices, the classical shadow method shows an advantage when the number of observables, the sparsity of the matrix, and the number of qubits fall within a certain range. The key parameters influencing this behavior include the number of qubits $n$, observables $M$, sparsity $k$, Pauli weight $w$, accuracy requirement $\epsilon$, and failure tolerance $\delta$. We also compare the resource consumption of the two methods on different types of quantum computers and identify break-even points where the classical shadow method becomes more efficient, which vary depending on the hardware. This paper opens a new avenue for quantitatively designing optimal strategies for hybrid quantum-classical tomography and provides practical insights for selecting the most suitable quantum measurement approach in real-world applications.
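
To fix ideas about the classical shadow side of the comparison, the sketch below runs the textbook single-qubit protocol with random Pauli-basis measurements, where $\langle P \rangle$ is estimated as the average of $3 \times$ (outcome) over the shots whose measurement basis matches $P$ (and 0 otherwise). The state and shot count are illustrative, and none of the paper's multi-qubit resource analysis or hardware comparison is reproduced.

# Single-qubit classical shadows with random Pauli measurements.
import numpy as np

rng = np.random.default_rng(8)
I2 = np.eye(2, dtype=complex)
PAULIS = {"X": np.array([[0, 1], [1, 0]], dtype=complex),
          "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
          "Z": np.array([[1, 0], [0, -1]], dtype=complex)}

psi = np.array([1.0, np.exp(1j * np.pi / 4)]) / np.sqrt(2)   # example pure state

def measure(psi, basis):
    """Sample the +/-1 outcome of measuring Pauli `basis` on state psi."""
    p_plus = np.real(psi.conj() @ ((I2 + PAULIS[basis]) / 2) @ psi)
    return 1 if rng.uniform() < p_plus else -1

n_shots = 20000
bases = rng.choice(list(PAULIS), size=n_shots)               # random basis per shot
outcomes = np.array([measure(psi, b) for b in bases])

for name, P in PAULIS.items():
    exact = np.real(psi.conj() @ P @ psi)
    shadow = np.mean(np.where(bases == name, 3 * outcomes, 0))
    print(f"<{name}>: exact {exact:+.3f}, shadow estimate {shadow:+.3f}")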