New articles on Statistics


[1] 2412.14196

Investigating Central England Temperature Variability: Statistical Analysis of Associations with North Atlantic Oscillation (NAO) and Pacific Decadal Oscillation (PDO)

This study investigates the variability of the Central England Temperature (CET) series in relation to the North Atlantic Oscillation (NAO) and the Pacific Decadal Oscillation (PDO) using advanced time series modeling techniques. Leveraging the world's longest continuous instrumental temperature dataset (1723-2023), this research applies ARIMA and ARIMAX models to quantify the impact of climatic oscillations on regional temperature variability, while also accounting for long-term warming trends. Spectral and coherence analyses further explore the periodic interactions between CET and the oscillations. Results reveal that NAO exerts a stronger influence on CET variability compared to PDO, with significant coherence observed at cycles of 5 to 7.5 years and 2 to 2.5 years for NAO, while PDO shows no statistically significant coherence. The ARIMAX model effectively captures both the upward warming trend and the influence of climatic oscillations, with robust diagnostics confirming its reliability. This study contributes to understanding the interplay between regional temperature variability and large-scale climatic drivers, providing a framework for future research on climatic oscillations and their role in shaping regional climate dynamics. Limitations and potential future directions, including the integration of additional climatic indices and comparative regional analyses, are also discussed.
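
As a rough illustration of the ARIMAX setup described above (not the paper's actual specification), the sketch below fits an ARIMA model with an exogenous oscillation index to a synthetic annual temperature series using statsmodels; the series, the (1,0,1) order, and the linear trend term are placeholder assumptions.

```python
# Illustrative ARIMAX fit (synthetic stand-ins for the CET and NAO series):
# regress an annual temperature series on an exogenous oscillation index
# while modeling autocorrelated errors and a linear warming trend.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 300
nao = rng.normal(size=n)                                 # stand-in oscillation index
temp = 9.0 + 0.005 * np.arange(n) + 0.3 * nao + rng.normal(scale=0.5, size=n)

# ARIMAX: ARMA(1,1) errors, constant plus linear time trend, NAO as exogenous regressor.
fit = SARIMAX(temp, exog=nao, order=(1, 0, 1), trend="ct").fit(disp=False)
print(fit.summary())                                     # inspect the coefficient on the index
```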


[2] 2412.14263

Evaluation of the linear mixing model in fluorescence spectroscopy

Analyses of spectral data often assume a linear mixing hypothesis, which states that the spectrum of a mixed substance is approximately the mixture of the individual spectra of its constituent parts. We evaluate this hypothesis in the context of dissolved organic matter (DOM) fluorescence spectroscopy for endmember abundance recovery from mixtures of three different DOM endmembers. We quantify two key sources of experimental variation, and statistically evaluate the linear mixing hypothesis in the context of this variation. We find that there is not strong statistical evidence against this hypothesis for high-fluorescence readings, and that true abundances of high-fluorescence endmembers are accurately recovered from the excitation-emission fluorescence spectra of mixed samples using linear methods. However, abundances of a low-fluorescence endmember are less well-estimated, in that the abundance coefficient estimates exhibit a high degree of variability across replicate experiments.
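
For intuition about what the linear mixing hypothesis buys: if it holds, endmember abundances can be recovered from a mixed spectrum by (non-negative) least squares against the individual endmember spectra. The toy sketch below, with synthetic spectra, is purely illustrative and not the authors' estimator.

```python
# Toy linear-mixing recovery: a mixed spectrum is modeled as a non-negative
# combination of known endmember spectra, and abundances are recovered by
# non-negative least squares. All spectra here are synthetic.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_wavelengths = 200
endmembers = np.abs(rng.normal(size=(n_wavelengths, 3)))   # columns: three endmember spectra
true_abundance = np.array([0.6, 0.3, 0.1])
mixed = endmembers @ true_abundance + rng.normal(scale=0.01, size=n_wavelengths)

est_abundance, _ = nnls(endmembers, mixed)
print("true:", true_abundance, "estimated:", est_abundance.round(3))
```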


[3] 2412.14284

Optimal design of experiments for functional linear models with dynamic factors

In this work we build optimal experimental designs for precise estimation of the functional coefficient of a function-on-function linear regression model where both the response and the factors are continuous functions of time. After obtaining the variance-covariance matrix of the estimator of the functional coefficient that minimizes the integrated sum of squared errors, we extend the classical definition of optimal design to this estimator and provide expressions for the A-optimal and D-optimal designs. Examples of optimal designs for dynamic experimental factors are then computed through a suitable algorithm, and we discuss different scenarios in terms of the set of basis functions used for their representation. Finally, we present an example with simulated data to illustrate the feasibility of our methodology.


[4] 2412.14315

On the Robustness of Spectral Algorithms for Semirandom Stochastic Block Models

In a graph bisection problem, we are given a graph $G$ with two equally-sized unlabeled communities, and the goal is to recover the vertices in these communities. A popular heuristic, known as spectral clustering, is to output an estimated community assignment based on the eigenvector corresponding to the second smallest eigenvalue of the Laplacian of $G$. Spectral algorithms can be shown to provably recover the cluster structure for graphs generated from certain probabilistic models, such as the Stochastic Block Model (SBM). However, spectral clustering is known to be non-robust to model mis-specification. Techniques based on semidefinite programming have been shown to be more robust, but they incur significant computational overheads. In this work, we study the robustness of spectral algorithms against semirandom adversaries. Informally, a semirandom adversary is allowed to ``helpfully'' change the specification of the model in a way that is consistent with the ground-truth solution. Our semirandom adversaries in particular are allowed to add edges inside clusters or increase the probability that an edge appears inside a cluster. Semirandom adversaries are a useful tool to determine the extent to which an algorithm has overfit to statistical assumptions on the input. On the positive side, we identify classes of semirandom adversaries under which spectral bisection using the _unnormalized_ Laplacian is strongly consistent, i.e., it exactly recovers the planted partitioning. On the negative side, we show that in these classes spectral bisection with the _normalized_ Laplacian outputs a partitioning that makes a classification mistake on a constant fraction of the vertices. Finally, we demonstrate numerical experiments that complement our theoretical findings.
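
A minimal sketch of the spectral bisection heuristic on a two-block SBM, using the unnormalized Laplacian and splitting by the sign of the eigenvector for the second-smallest eigenvalue; the SBM parameters are arbitrary and this does not reproduce the paper's semirandom adversaries.

```python
# Spectral bisection of a two-community SBM using the unnormalized Laplacian:
# split vertices by the sign of the Fiedler vector (eigenvector of the
# second-smallest Laplacian eigenvalue).
import numpy as np

rng = np.random.default_rng(2)
n, p_in, p_out = 200, 0.20, 0.05
labels = np.repeat([0, 1], n // 2)
prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = rng.random((n, n)) < prob
A = np.triu(A, 1)
A = (A + A.T).astype(float)                  # symmetric adjacency, no self-loops

L = np.diag(A.sum(axis=1)) - A               # unnormalized Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                      # eigenvector of the second-smallest eigenvalue
pred = (fiedler > 0).astype(int)

accuracy = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"fraction correctly classified: {accuracy:.3f}")
```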


[5] 2412.14339

Forecasting Influenza Hospitalizations Using a Bayesian Hierarchical Nonlinear Model with Discrepancy

The annual influenza outbreak leads to significant public health and economic burdens, making it desirable to have prompt and accurate probabilistic forecasts of the disease spread. The United States Centers for Disease Control and Prevention (CDC) annually hosts a national flu forecasting competition which has led to the development of a variety of flu forecast modeling methods. Beginning in 2013, the target to be forecast was the weekly percentage of patients with an influenza-like illness (ILI), but in 2021 the target was changed to weekly hospitalizations. Reliable hospitalization data have only been available since 2021, but ILI data have been available since 2010 and have been successfully forecast for several seasons. In this manuscript, we introduce a two-component modeling framework for forecasting hospitalizations utilizing both hospitalization and ILI data. The first component models ILI data using a nonlinear Bayesian model. The second component models hospitalizations as a function of ILI. For hospitalization forecasts, ILI is first forecast and then hospitalizations are forecast with the ILI forecasts used as a predictor. In a simulation study, the hospitalization forecast model is assessed and two previously successful ILI forecast models are compared. Also assessed is the usefulness of including a systematic model discrepancy term in the ILI model. Forecasts of state and national hospitalizations for the 2023-24 flu season are made, and different modeling decisions are compared. We found that including a discrepancy component in the ILI model tends to improve forecasts during certain weeks of the year. We also found that whether other modeling decisions, such as the exact nonlinear function used in the ILI model or the error distribution for the hospitalization model, outperform the alternatives depends on the season, location, or week of the forecast.


[6] 2412.14343

Revisiting the Nowosiółka skull with RMaCzek

One of the first fully quantitative distance matrix visualization methods was proposed by Jan Czekanowski at the beginning of the previous century. Recently, a software package, RMaCzek, was made available that allows for producing such diagrams in R. Here we reanalyze the original data that Czekanowski used for introducing his method, and in the accompanying code show how the user can specify their own custom distance functions in the package.
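
For readers unfamiliar with Czekanowski diagrams: the idea is to display a (possibly custom) pairwise distance matrix as a shaded, reordered grid so that similar objects appear as dark blocks. The sketch below is a bare-bones Python analogue, not the RMaCzek package, and the seriation heuristic is a crude stand-in.

```python
# Bare-bones analogue of a Czekanowski diagram (not the RMaCzek package):
# compute pairwise distances with a user-supplied distance function, seriate
# the matrix with a simple ordering heuristic, and display it as a shaded grid.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))                      # 20 objects, 5 measurements

def manhattan(u, v):                              # custom distance, analogous to the package's user-defined option
    return np.sum(np.abs(u - v))

D = squareform(pdist(X, metric=manhattan))
order = np.argsort(D.sum(axis=0))                 # crude seriation: sort by total distance
plt.imshow(D[np.ix_(order, order)], cmap="Greys_r")  # small distances shown dark
plt.title("Czekanowski-style distance diagram")
plt.show()
```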


[7] 2412.14346

Strong Gaussian approximations with random multipliers

One reason why standard formulations of the central limit theorem are not applicable in high-dimensional and non-stationary regimes is the lack of a suitable limit object. Instead, suitable distributional approximations can be used, where the approximating object is not a fixed limit but itself a sequence. We extend Gaussian approximation results for the partial sum process by allowing each summand to be multiplied by a data-dependent matrix. The results allow for serial dependence of the data, and for high-dimensionality of both the data and the multipliers. In the finite-dimensional and locally-stationary setting, we obtain a functional central limit theorem as a direct consequence. An application to sequential testing in non-stationary environments is described.


[8] 2412.14357

Nonparametric Regression in Dirichlet Spaces: A Random Obstacle Approach

In this paper, we consider nonparametric estimation over general Dirichlet metric measure spaces. Unlike the more commonly studied reproducing kernel Hilbert space, whose elements may be defined pointwise, a Dirichlet space typically contains only equivalence classes, i.e. its elements are unique only almost everywhere. This lack of pointwise definition presents significant challenges in the context of nonparametric estimation; for example, the classical ridge regression problem is ill-posed. In this paper, we develop a new technique for renormalizing the ridge loss by replacing pointwise evaluations with certain \textit{local means} around the boundaries of obstacles centered at each data point. The resulting renormalized empirical risk functional is well-posed and even admits a representer theorem in terms of certain equilibrium potentials, which are truncated versions of the associated Green function, cut off at a data-driven threshold. We study the global, out-of-sample consistency of the sample minimizer, and derive an adaptive upper bound on its convergence rate that highlights the interplay of the analytic, geometric, and probabilistic properties of the Dirichlet form. We also construct a simple regressogram type estimator that achieves the minimax optimal estimation rate over certain $L^p$ subsets of a Dirichlet ball with some knowledge of the geometry of the metric measure space. Our framework notably does not require the smoothness of the underlying space, and is applicable to both manifold and fractal settings. To the best of our knowledge, this is the first paper to obtain out-of-sample convergence guarantees in the framework of general metric measure Dirichlet spaces.


[9] 2412.14391

Randomization Tests for Conditional Group Symmetry

Symmetry plays a central role in the sciences, machine learning, and statistics. While statistical tests for the presence of distributional invariance with respect to groups have a long history, tests for conditional symmetry in the form of equivariance or conditional invariance are absent from the literature. This work initiates the study of nonparametric randomization tests for symmetry (invariance or equivariance) of a conditional distribution under the action of a specified locally compact group. We develop a general framework for randomization tests with finite-sample Type I error control and, using kernel methods, implement tests with finite-sample power lower bounds. We also describe and implement approximate versions of the tests, which are asymptotically consistent. We study their properties empirically on synthetic examples, and on applications to testing for symmetry in two problems from high-energy particle physics.
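
A toy randomization test for distributional invariance under a finite group (sign flips) is sketched below; it is not the paper's kernel-based conditional test, but it illustrates the exchangeability logic that delivers finite-sample Type I error control.

```python
# Toy randomization test for invariance of a distribution under sign flips
# (a finite group action). Under the null, applying random group elements to
# the data leaves the statistic's distribution unchanged, which gives exact
# finite-sample Type I error control. The paper's tests handle general
# locally compact groups and conditional (equivariance) nulls.
import numpy as np

def invariance_pvalue(x, n_rand=999, rng=None):
    rng = rng or np.random.default_rng()
    stat = abs(x.mean())                               # statistic sensitive to sign asymmetry
    rand_stats = []
    for _ in range(n_rand):
        signs = rng.choice([-1.0, 1.0], size=x.shape)  # random group element
        rand_stats.append(abs((signs * x).mean()))
    rand_stats = np.array(rand_stats)
    return (1 + np.sum(rand_stats >= stat)) / (n_rand + 1)

rng = np.random.default_rng(4)
print(invariance_pvalue(rng.normal(size=100), rng=rng))          # symmetric: large p-value
print(invariance_pvalue(rng.normal(0.5, 1, size=100), rng=rng))  # shifted: small p-value
```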


[10] 2412.14423

Cross-Validation with Antithetic Gaussian Randomization

We introduce a method for performing cross-validation without sample splitting. The method is well-suited for problems where traditional sample splitting is infeasible, such as when data are not assumed to be independently and identically distributed. Even in scenarios where sample splitting is possible, our method offers a computationally efficient alternative for estimating prediction error, achieving comparable or even lower error than standard cross-validation at a significantly reduced computational cost. Our approach constructs train-test data pairs using externally generated Gaussian randomization variables, drawing inspiration from recent randomization techniques such as data-fission and data-thinning. The key innovation lies in a carefully designed correlation structure among these randomization variables, referred to as antithetic Gaussian randomization. This correlation is crucial in maintaining a bounded variance while allowing the bias to vanish, offering an additional advantage over standard cross-validation, whose performance depends heavily on the bias-variance tradeoff dictated by the number of folds. We provide a theoretical analysis of the mean squared error of the proposed estimator, proving that as the level of randomization decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions. This analysis highlights the benefits of the antithetic Gaussian randomization over independent randomization. Simulation studies corroborate our theoretical findings, illustrating the robust performance of our cross-validated estimator across various data types and loss functions.
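
To convey the flavor of cross-validation by Gaussian randomization without sample splitting, the sketch below builds train/test copies of the response in the style of data-fission and averages over randomization vectors constrained to sum to zero, as a stand-in for the antithetic correlation structure; the exact construction and risk estimator in the paper may differ.

```python
# Sketch of cross-validation via Gaussian randomization (data-fission style),
# with an "antithetic" twist: the K randomization vectors are centred to sum
# to zero so that their contributions partially cancel when averaged. This
# only illustrates the train/test construction without sample splitting.
import numpy as np

rng = np.random.default_rng(5)
n, sigma, alpha, K = 200, 1.0, 0.5, 10
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=sigma, size=n)

omega = rng.normal(scale=sigma, size=(K, n))
omega -= omega.mean(axis=0)                          # antithetic-style constraint: columns sum to zero

errors = []
for k in range(K):
    y_train = y + np.sqrt(alpha) * omega[k]          # noisier copy used for fitting
    y_test = y - omega[k] / np.sqrt(alpha)           # complementary copy used for testing
    coef = np.polyfit(x, y_train, deg=5)             # any fitting procedure works here
    errors.append(np.mean((np.polyval(coef, x) - y_test) ** 2))
print("randomized CV estimate of prediction error:", np.mean(errors))
```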


[11] 2412.14478

Time-Varying Functional Cox Model

We propose two novel approaches for estimating time-varying effects of functional predictors within a linear functional Cox model framework. This model allows for time-varying associations of a functional predictor observed at baseline, estimated using penalized regression splines for smoothness across the functional domain and event time. The first approach, suitable for small-to-medium datasets, uses the Cox-Poisson likelihood connection for valid estimation and inference. The second, a landmark approach, significantly reduces computational burden for large datasets and high-dimensional functional predictors. Both methods address proportional hazards violations for functional predictors and model associations as a bivariate smooth coefficient. Motivated by analyzing diurnal motor activity patterns and all-cause mortality in NHANES (N=4445, functional predictor dimension=1440), we demonstrate the first method's computational limitations and the landmark approach's efficiency. These methods are implemented in stable, high-quality software using the mgcv package for penalized spline regression with automated smoothing parameter selection. Simulations show both methods achieve high accuracy in estimating functional coefficients, with the landmark approach being computationally faster but slightly less accurate. The Cox-Poisson method provides nominal coverage probabilities, while landmark inference was not assessed due to inherent bias. Sensitivity to landmark modeling choices was evaluated. Application to NHANES reveals an attenuation of diurnal effects on mortality over an 8-year follow-up.


[12] 2412.14503

dapper: Data Augmentation for Private Posterior Estimation in R

This paper serves as a reference and introduction to using the R package dapper. dapper encodes a sampling framework which allows exact Markov chain Monte Carlo simulation of parameters and latent variables in a statistical model given privatized data. The goal of this package is to fill an urgent need by providing applied researchers with a flexible tool to perform valid Bayesian inference on data protected by differential privacy, allowing them to properly account for the noise introduced for privacy protection in their statistical analysis. dapper offers a significant step forward in providing general-purpose statistical inference tools for privatized data.


[13] 2412.14527

Statistical Undersampling with Mutual Information and Support Points

Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
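
A rough sketch of the first idea, mutual-information-guided stratified undersampling: the stratification rule below (quantiles of the single most informative feature) is a simplification for illustration, not the authors' procedure, and the support-points approach is not shown.

```python
# Mutual-information-guided stratified undersampling of the majority class:
# rank features by mutual information with the label, stratify majority-class
# rows by quantiles of the top feature, and sample uniformly within strata.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(6)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
maj_idx = np.flatnonzero(y == 0)
n_keep = np.sum(y == 1)                              # undersample majority class to balance

mi = mutual_info_classif(X, y, random_state=0)
top = np.argmax(mi)                                  # most informative feature
strata = np.digitize(X[maj_idx, top], np.quantile(X[maj_idx, top], [0.25, 0.5, 0.75]))

kept = []
for s in np.unique(strata):
    members = maj_idx[strata == s]
    take = max(1, int(round(n_keep * len(members) / len(maj_idx))))
    kept.extend(rng.choice(members, size=min(take, len(members)), replace=False))
balanced_idx = np.concatenate([np.array(kept), np.flatnonzero(y == 1)])
print("balanced sample size:", len(balanced_idx))
```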


[14] 2412.14563

Transfer Learning Meets Functional Linear Regression: No Negative Transfer under Posterior Drift

Posterior drift refers to changes in the relationship between responses and covariates while the distributions of the covariates remain unchanged. In this work, we explore functional linear regression under posterior drift with transfer learning. Specifically, we investigate when and how auxiliary data can be leveraged to improve the estimation accuracy of the slope function in the target model when posterior drift occurs. We employ an approximate least squares method together with a lasso penalty to construct an estimator that transfers beneficial knowledge from source data. Theoretical analysis indicates that our method avoids negative transfer under posterior drift, even when the contrast between slope functions is quite large. In particular, the estimator is shown to perform at least as well as the classical estimator using only target data, and it enhances the learning of the target model when the source and target models are sufficiently similar. Furthermore, to address scenarios where covariate distributions may change, we propose an adaptive algorithm using aggregation techniques. This algorithm is robust against non-informative source samples and effectively prevents negative transfer. Simulation and real data examples are provided to demonstrate the effectiveness of the proposed algorithm.


[15] 2412.14720

MICG-AI: A multidimensional index of child growth based on digital phenotyping with Bayesian artificial intelligence

This document proposes an algorithm for a mobile application designed to monitor multidimensional child growth through digital phenotyping. Digital phenotyping offers a unique opportunity to collect and analyze high-frequency data in real time, capturing behavioral, psychological, and physiological states of children in naturalistic settings. Traditional models of child growth primarily focus on physical metrics, often overlooking multidimensional aspects such as emotional, social, and cognitive development. In this paper, we introduce a Bayesian artificial intelligence (AI) algorithm that leverages digital phenotyping to create a Multidimensional Index of Child Growth (MICG). This index integrates data from various dimensions of child development, including physical, emotional, cognitive, and environmental factors. By incorporating probabilistic modeling, the proposed algorithm dynamically updates its learning based on data collected by the mobile app used by mothers and children. The app also infers uncertainty from response times, adjusting the importance of each dimension of child growth accordingly. Our contribution applies state-of-the-art technology to track multidimensional child development, enabling families and healthcare providers to make more informed decisions in real time.


[16] 2412.14745

Union-Free Generic Depth for Non-Standard Data

Non-standard data, which fall outside classical statistical data formats, challenge state-of-the-art analysis. Examples of non-standard data include partial orders and mixed categorical-numeric-spatial data. Most statistical methods require representing them in classical statistical spaces. However, this representation can distort their inherent structure and thus the results and their interpretation. For practitioners, this creates a dilemma: using standard statistical methods can risk misrepresenting the data, while preserving their true structure often renders these methods inapplicable. To address this dilemma, we introduce the union-free generic depth (ufg-depth), a novel framework that respects the true structure of non-standard data while enabling robust statistical analysis. The ufg-depth extends the concept of simplicial depth from normed vector spaces to a much broader range of data types by combining formal concept analysis and data depth. We provide a systematic analysis of the theoretical properties of the ufg-depth and demonstrate its application to mixed categorical-numerical-spatial data and hierarchical-nominal data. The ufg-depth is a unified approach that bridges the gap between preserving the data structure and applying statistical methods. With this, we provide a new perspective for non-standard data analysis.


[17] 2412.14800

Asymptotic Equivalence for Nonparametric Regression

We consider a nonparametric model $\mathcal{E}^{n}$, generated by independent observations $X_{i}$, $i=1,\ldots,n$, with densities $p(x,\theta_{i})$, $i=1,\ldots,n$, the parameters of which, $\theta_{i}=f(i/n)\in\Theta$, are driven by the values of an unknown function $f:[0,1]\rightarrow\Theta$ in a smoothness class. The main result of the paper is that, under regularity assumptions, this model can be approximated, in the sense of the Le Cam deficiency pseudodistance, by a nonparametric Gaussian shift model $Y_{i}=\Gamma(f(i/n))+\varepsilon_{i}$, where $\varepsilon_{1},\ldots,\varepsilon_{n}$ are i.i.d. standard normal random variables, the function $\Gamma(\theta):\Theta\rightarrow\mathbb{R}$ satisfies $\Gamma'(\theta)=\sqrt{I(\theta)}$, and $I(\theta)$ is the Fisher information corresponding to the density $p(x,\theta)$.
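
For a concrete instance of the transformation: if $p(x,\theta)$ is the Poisson$(\theta)$ density, then $I(\theta)=1/\theta$ and $\Gamma(\theta)=\int\sqrt{I(\theta)}\,d\theta=2\sqrt{\theta}$, so the approximating Gaussian shift model is $Y_{i}=2\sqrt{f(i/n)}+\varepsilon_{i}$, the familiar variance-stabilizing square-root scale for count data.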


[18] 2412.14916

From Point to probabilistic gradient boosting for claim frequency and severity prediction

Gradient boosting algorithms for decision trees are increasingly used in actuarial applications as they show superior predictive performance over traditional generalized linear models. Many improvements and refinements of the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting algorithms for decision trees: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive numerical study, we compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high-cardinality) categorical variables. We explain how varying exposure-to-risk can be handled with boosting in frequency models. We compare the algorithms on the basis of computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. The fully interpretable EGBM achieves competitive predictive performance compared to the black-box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.
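
As a small illustration of the exposure-to-risk point, one common device is to fit a Poisson frequency model with log-exposure as an offset; the sketch below does this via LightGBM's init_score on synthetic data. It is one possible implementation, not the benchmark setup used in the paper.

```python
# Claim-frequency boosting with exposure as a Poisson offset: supplying
# log(exposure) as the initial score plays the role of an offset, and since
# predictions do not include the init_score, the offset is reapplied manually.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(7)
n = 5000
X = rng.normal(size=(n, 4))
exposure = rng.uniform(0.1, 1.0, size=n)                  # fraction of a policy-year
rate = np.exp(0.2 * X[:, 0] - 0.3 * X[:, 1])              # true claim rate per unit exposure
counts = rng.poisson(rate * exposure)

train = lgb.Dataset(X, label=counts, init_score=np.log(exposure))
booster = lgb.train({"objective": "poisson", "learning_rate": 0.05, "verbosity": -1},
                    train, num_boost_round=200)

log_rate = booster.predict(X, raw_score=True)             # log-rate relative to the offset
expected_counts = np.exp(log_rate) * exposure
print("observed vs fitted mean counts:", counts.mean(), expected_counts.mean().round(3))
```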


[19] 2412.14942

Robust modestly weighted log-rank tests

The introduction of checkpoint inhibitors in immuno-oncology has raised questions about the suitability of the log-rank test as the default primary analysis method in confirmatory studies, particularly when survival curves exhibit non-proportional hazards. The log-rank test, while effective in controlling false positive rates, may lose power in scenarios where survival curves remain similar for extended periods before diverging. To address this, various weighted versions of the log-rank test have been proposed, including the MaxCombo test, which combines multiple weighted log-rank statistics to enhance power across a range of alternative hypotheses. Despite its potential, the MaxCombo test has seen limited adoption, possibly owing to its proneness to produce counterintuitive results in situations where the hazard functions on the two arms cross. In response, the modestly weighted log-rank test was developed to provide a balanced approach, giving greater weight to later event times while avoiding undue influence from early detrimental effects. However, this test also faces limitations, particularly if the possibility of early separation of survival curves cannot be ruled out a priori. We propose a novel test statistic that integrates the strengths of the standard log-rank test, the modestly weighted log-rank test, and the MaxCombo test. By considering the maximum of the standard log-rank statistic and a modestly weighted log-rank statistic, the new test aims to maintain power under delayed effect scenarios while minimizing power loss, relative to the log-rank test, in worst-case scenarios. Simulation studies and a case study demonstrate the efficiency and robustness of this approach, highlighting its potential as a robust alternative for primary analysis in immuno-oncology trials.
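
To make the construction concrete, the sketch below computes a standard log-rank Z-statistic and a modestly weighted one (weight $1/\max\{\hat S(t^-), s^*\}$ with pooled Kaplan-Meier $\hat S$) from scratch and takes their maximum. Calibrating that maximum requires the joint distribution of the two statistics and is omitted here, and the data-generating setup is an arbitrary stand-in.

```python
# Bare-bones two-group weighted log-rank machinery. weight_fn(S) = 1 gives the
# standard log-rank statistic; weight_fn(S) = 1/max(S, s_star) gives a modestly
# weighted statistic, where S is the pooled Kaplan-Meier estimate just before
# each event time. The proposed test combines the two via a maximum.
import numpy as np

def weighted_logrank_z(time, event, group, weight_fn):
    order = np.argsort(time)
    time, event, group = time[order], event[order], group[order]
    surv = 1.0                                     # pooled Kaplan-Meier, left-continuous
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n_j = at_risk.sum()
        n1_j = (at_risk & (group == 1)).sum()
        d_j = ((time == t) & (event == 1)).sum()
        d1_j = ((time == t) & (event == 1) & (group == 1)).sum()
        w = weight_fn(surv)
        num += w * (d1_j - d_j * n1_j / n_j)
        if n_j > 1:
            var += w ** 2 * d_j * (n1_j / n_j) * (1 - n1_j / n_j) * (n_j - d_j) / (n_j - 1)
        surv *= 1 - d_j / n_j                      # update KM after using S(t-)
    return num / np.sqrt(var)

rng = np.random.default_rng(8)
n = 400
group = np.repeat([0, 1], n // 2)
time = rng.exponential(scale=np.where(group == 1, 1.4, 1.0))   # illustrative treatment benefit
cens = rng.exponential(scale=2.0, size=n)
event = (time <= cens).astype(int)
obs = np.minimum(time, cens)

s_star = 0.5
z_logrank = weighted_logrank_z(obs, event, group, lambda s: 1.0)
z_modest = weighted_logrank_z(obs, event, group, lambda s: 1.0 / max(s, s_star))
print("standard:", round(z_logrank, 3), "modest:", round(z_modest, 3),
      "max statistic:", round(max(abs(z_logrank), abs(z_modest)), 3))
```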


[20] 2412.14946

Joint Models for Handling Non-Ignorable Missing Data using Bayesian Additive Regression Trees: Application to Leaf Photosynthetic Traits Data

Dealing with missing data poses significant challenges in predictive analysis, often leading to biased conclusions when oversimplified assumptions about the missing data process are made. In cases where the data are missing not at random (MNAR), jointly modeling the data and missing data indicators is essential. Motivated by a real data application with partially missing multivariate outcomes related to leaf photosynthetic traits and several environmental covariates, we propose two methods under a selection model framework for handling data with missingness in the response variables suitable for recovering various missingness mechanisms. Both approaches use a multivariate extension of Bayesian additive regression trees (BART) to flexibly model the outcomes. The first approach simultaneously uses a probit regression model to jointly model the missingness. In scenarios where the relationship between the missingness and the data is more complex or non-linear, we propose a second approach using a probit BART model to characterize the missing data process, thereby employing two BART models simultaneously. Both models also effectively handle ignorable covariate missingness. The efficacy of both models compared to existing missing data approaches is demonstrated through extensive simulations, in both univariate and multivariate settings, and through the aforementioned application to the leaf photosynthetic trait data.


[21] 2412.15012

Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods

In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and therefore missing on a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare anti-depressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.


[22] 2412.15041

Boosting Distributional Copula Regression for Bivariate Right-Censored Time-to-Event Data

We propose a highly flexible distributional copula regression model for bivariate time-to-event data in the presence of right-censoring. The joint survival function of the response is constructed using parametric copulas, allowing for a separate specification of the dependence structure between the time-to-event outcome variables and their respective marginal survival distributions. The latter are specified using well-known parametric distributions such as the log-Normal, log-Logistic (proportional odds model), or Weibull (proportional hazards model) distributions. Hence, the marginal univariate event times can be specified as parametric (also known as Accelerated Failure Time, AFT) models. Embedding our model into the class of generalized additive models for location, scale and shape, possibly all distribution parameters of the joint survival function can depend on covariates. We develop a component-wise gradient-based boosting algorithm for estimation. This way, our approach is able to conduct data-driven variable selection. To the best of our knowledge, this is the first implementation of multivariate AFT models via distributional copula regression with automatic variable selection via statistical boosting. A special merit of our approach is that it works for high-dimensional (p>>n) settings. We illustrate the practical potential of our method on a high-dimensional application related to semi-competing risks responses in ovarian cancer. All of our methods are implemented in the open source statistical software R as add-on functions of the package gamboostLSS.


[23] 2412.15049

A linear regression model for quantile function data applied to paired pulmonary 3d CT scans

This paper introduces a new objective measure for assessing treatment response in asthmatic patients using computed tomography (CT) imaging data. For each patient, CT scans were obtained before and after one year of monoclonal antibody treatment. Following image segmentation, the Hounsfield unit (HU) values of the voxels were encoded through quantile functions. It is hypothesized that patients with improved conditions after treatment will exhibit better expiration, reflected in higher HU values and an upward shift in the quantile curve. To objectively measure treatment response, a novel linear regression model on quantile functions is developed, drawing inspiration from Verde and Irpino (2010). Unlike their framework, the proposed model is parametric and incorporates distributional assumptions on the errors, enabling statistical inference. The model allows for the explicit calculation of regression coefficient estimators and confidence intervals, similar to conventional linear regression. The corresponding data and R code are available on GitHub to facilitate the reproducibility of the analyses presented.


[24] 2412.15057

Asymptotic Equivalence for Nonparametric Generalized Linear Models

We establish that a non-Gaussian nonparametric regression model is asymptotically equivalent to a regression model with Gaussian noise. The approximation is in the sense of Le Cam's deficiency distance $\Delta$; the models are then asymptotically equivalent for all purposes of statistical decision with bounded loss. Our result concerns a sequence of independent but not identically distributed observations with each distribution in the same real-indexed exponential family. The canonical parameter is a value $f(t_i)$ of a regression function $f$ at a grid point $t_i$ (nonparametric GLM). When $f$ is in a Hölder ball with exponent $\beta > \frac{1}{2}$, we establish global asymptotic equivalence to observations of a signal $\Gamma(f(t))$ in Gaussian white noise, where $\Gamma$ is related to a variance-stabilizing transformation in the exponential family. The result is a regression analog of the recently established Gaussian approximation for the i.i.d. model. The proof is based on a functional version of the Hungarian construction for the partial sum process.


[25] 2412.15076

Digital N-of-1 Trials and their Application in Experimental Physiology

Traditionally, studies in experimental physiology have been conducted in small groups of human participants, animal models or cell lines. Important challenges include achieving sufficient statistical power in statistical hypothesis tests with small sample sizes and identifying optimal study designs. Here, we introduce N-of-1 trials as an innovative study design that is highly relevant for innovating and improving studies in experimental physiology. N-of-1 trials are multi-crossover trials in single participants that allow valid statistical inference on the individual level. In addition, series of N-of-1 trials conducted on multiple study participants can be aggregated for population-level inference and provide a more efficient study design compared to standard randomized controlled trials. In this manuscript, we first introduce the key components and design features of N-of-1 trials. Then we lay out how N-of-1 trials can be analyzed statistically and give different examples of their applicability in experimental physiological studies. In summary, we provide an overview of the main components for designing N-of-1 trials, give direct examples in experimental physiology, and offer practical recommendations on their proper use.


[26] 2412.15128

Estimating Heterogeneous Treatment Effects for Spatio-Temporal Causal Inference: How Economic Assistance Moderates the Effects of Airstrikes on Insurgent Violence

Scholars from diverse fields now increasingly rely on high-frequency spatio-temporal data. Yet, causal inference with these data remains challenging due to the twin threats of spatial spillover and temporal carryover effects. We develop methods to estimate heterogeneous treatment effects by allowing for arbitrary spatial and temporal causal dependencies. We focus on common settings where the treatment and outcomes are time-varying spatial point patterns and where moderators are either spatial or spatio-temporal in nature. We define causal estimands based on stochastic interventions where researchers specify counterfactual distributions of treatment events. We propose the Hajek-type estimator of the conditional average treatment effect (CATE) as a function of spatio-temporal moderator variables, and establish its asymptotic normality as the number of time periods increases. We then introduce a statistical test of no heterogeneous treatment effects. Through simulations, we evaluate the finite-sample performance of the proposed CATE estimator and its inferential properties. Our motivating application examines the heterogeneous effects of US airstrikes on insurgent violence in Iraq. Drawing on declassified spatio-temporal data, we examine how prior aid distributions moderate airstrike effects. Contrary to expectations from counterinsurgency theories, we find that prior aid distribution, along with greater amounts of aid per capita, is associated with increased insurgent attacks following airstrikes.


[27] 2412.14222

A Survey on Large Language Model-based Agents for Statistics and Data Science

In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.


[28] 2412.14226

FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning

Federated learning (FL) is a machine learning methodology that involves the collaborative training of a global model across multiple decentralized clients in a privacy-preserving way. Several FL methods have been introduced to tackle communication inefficiencies, but they do not address how to sample participating clients in each round effectively and in a privacy-preserving manner. In this paper, we propose \textit{FedSTaS}, a client- and data-level sampling method inspired by \textit{FedSTS} and \textit{FedSampling}. In each federated learning round, \textit{FedSTaS} stratifies clients based on their compressed gradients, re-allocates the number of clients to sample using an optimal Neyman allocation, and samples local data from each participating client using a uniform data sampling strategy. Experiments on three datasets show that \textit{FedSTaS} can achieve higher accuracy scores than those of \textit{FedSTS} within a fixed number of training rounds.
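
The Neyman allocation step mentioned above has a simple closed form: the number of clients drawn from stratum $h$ is proportional to $N_h S_h$, the stratum size times its within-stratum dispersion. A toy sketch (not the FedSTaS implementation):

```python
# Toy Neyman allocation across client strata: the number of clients sampled
# from stratum h is proportional to N_h * S_h, where N_h is the stratum size
# and S_h its within-stratum dispersion (e.g., of compressed-gradient norms).
import numpy as np

def neyman_allocation(sizes, stds, total_sample):
    weights = np.asarray(sizes) * np.asarray(stds)
    alloc = total_sample * weights / weights.sum()
    return np.maximum(1, np.round(alloc).astype(int))   # at least one client per stratum

sizes = [40, 25, 35]          # clients per stratum (e.g., from gradient clustering)
stds = [0.8, 2.1, 1.2]        # within-stratum dispersion
print(neyman_allocation(sizes, stds, total_sample=20))   # e.g., [5, 8, 7]
```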


[29] 2412.14291

Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes

We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the "vanilla" PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.
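
A minimal sketch of the auto-conditioning idea: run projected gradient with stepsize $1/\hat{L}_k$, where $\hat{L}_k$ is a running estimate of the Lipschitz constant formed from consecutive gradients and iterates. The estimator, objective, and projection set below are illustrative choices, not the paper's exact AC-PG policy.

```python
# Projected gradient with an "auto-conditioned" stepsize: the Lipschitz
# constant is estimated from consecutive gradients and iterates rather than
# supplied, and each step uses 1 / L_hat. Projection here is onto a Euclidean
# ball and the objective is a simple smooth nonconvex test function.
import numpy as np

def project_ball(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def grad(x):
    # gradient of f(x) = ||x||^2 + sum(sin(x)), smooth but nonconvex
    return 2 * x + np.cos(x)

def acpg(x0, iters=200):
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    L_hat = 1.0                                           # initial curvature guess
    x = project_ball(x_prev - g_prev / L_hat)
    for _ in range(iters):
        g = grad(x)
        step = np.linalg.norm(x - x_prev)
        if step > 0:
            L_hat = max(L_hat, np.linalg.norm(g - g_prev) / step)   # secant-type estimate
        x_prev, g_prev = x, g
        x = project_ball(x - g / L_hat)
    return x

print("approximate stationary point:", acpg(np.array([2.0, -1.5, 0.5])).round(3))
```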


[30] 2412.14297

Distributionally Robust Policy Learning under Concept Drifts

Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem -- robust policy learning under concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated at a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.


[31] 2412.14318

Long-time accuracy of ensemble Kalman filters for chaotic and machine-learned dynamical systems

Filtering is concerned with online estimation of the state of a dynamical system from partial and noisy observations. In applications where the state is high dimensional, ensemble Kalman filters are often the method of choice. This paper establishes long-time accuracy of ensemble Kalman filters. We introduce conditions on the dynamics and the observations under which the estimation error remains small in the long-time horizon. Our theory covers a wide class of partially-observed chaotic dynamical systems, which includes the Navier-Stokes equations and Lorenz models. In addition, we prove long-time accuracy of ensemble Kalman filters with surrogate dynamics, thus validating the use of machine-learned forecast models in ensemble data assimilation.
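
For readers who want the forecast/analysis cycle in code, here is a bare-bones stochastic ensemble Kalman filter on the Lorenz-63 system with a partially observed state; the integrator, noise levels, and ensemble size are arbitrary illustrative choices and do not reflect the paper's assumptions.

```python
# Minimal stochastic ensemble Kalman filter on Lorenz-63 with partial noisy
# observations: forecast each ensemble member with the dynamics, then apply
# the EnKF analysis step using the ensemble covariance.
import numpy as np

def lorenz63_step(x, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    dx = np.array([sigma * (x[1] - x[0]),
                   x[0] * (rho - x[2]) - x[1],
                   x[0] * x[1] - beta * x[2]])
    return x + dt * dx                            # forward Euler, adequate for illustration

rng = np.random.default_rng(9)
H = np.array([[1.0, 0.0, 0.0]])                   # observe only the first coordinate
R = np.array([[1.0]])                             # observation noise variance
n_ens, n_steps, obs_every = 50, 2000, 20

truth = np.array([1.0, 1.0, 1.0])
ens = truth + rng.normal(scale=2.0, size=(n_ens, 3))

for t in range(1, n_steps + 1):
    truth = lorenz63_step(truth)
    ens = np.array([lorenz63_step(m) for m in ens])           # forecast step
    if t % obs_every == 0:
        y = H @ truth + rng.normal(scale=1.0)                  # noisy observation
        X = ens - ens.mean(axis=0)
        P = X.T @ X / (n_ens - 1)                              # ensemble covariance
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)           # Kalman gain
        perturbed = y + rng.normal(scale=1.0, size=(n_ens, 1))
        ens = ens + (perturbed - ens @ H.T) @ K.T              # analysis step
print("final state error:", np.linalg.norm(ens.mean(axis=0) - truth).round(3))
```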


[32] 2412.14421

Comparing noisy neural population dynamics using optimal transport distances

Biological and artificial neural systems form high-dimensional neural representations that underpin their computational capabilities. Methods for quantifying geometric similarity in neural representations have become a popular tool for identifying computational principles that are potentially shared across neural systems. These methods generally assume that neural responses are deterministic and static. However, responses of biological systems, and some artificial systems, are noisy and dynamically unfold over time. Furthermore, these characteristics can have substantial influence on a system's computational capabilities. Here, we demonstrate that existing metrics can fail to capture key differences between neural systems with noisy dynamic responses. We then propose a metric for comparing the geometry of noisy neural trajectories, which can be derived as an optimal transport distance between Gaussian processes. We use the metric to compare models of neural responses in different regions of the motor system and to compare the dynamics of latent diffusion models for text-to-image synthesis.
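
The basic ingredient behind such optimal transport comparisons is the closed-form 2-Wasserstein (Bures) distance between Gaussian distributions, sketched below; extending this to Gaussian processes and to the paper's specific trajectory metric involves further choices not shown here.

```python
# Closed-form 2-Wasserstein (Bures) distance between two Gaussian distributions
# N(m1, C1) and N(m2, C2), a standard building block for optimal-transport
# comparisons of noisy responses.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, C1, m2, C2):
    C1_half = sqrtm(C1)
    cross = sqrtm(C1_half @ C2 @ C1_half)
    bures = np.trace(C1 + C2 - 2 * np.real(cross))     # discard tiny imaginary parts
    return np.sqrt(np.sum((m1 - m2) ** 2) + max(bures, 0.0))

rng = np.random.default_rng(10)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
C1, C2 = A @ A.T + np.eye(3), B @ B.T + np.eye(3)      # two SPD covariances
print(gaussian_w2(np.zeros(3), C1, np.ones(3), C2).round(3))
```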


[33] 2412.14474

Benign Overfitting in Out-of-Distribution Generalization of Linear Models

Benign overfitting refers to the phenomenon where an over-parameterized model fits the training data perfectly, including noise in the data, but still generalizes well to the unseen test data. While prior work provides some theoretical understanding of this phenomenon under the in-distribution setup, modern machine learning often operates in a more challenging Out-of-Distribution (OOD) regime, where the target (test) distribution can be rather different from the source (training) distribution. In this work, we take an initial step towards understanding benign overfitting in the OOD regime by focusing on the basic setup of over-parameterized linear models under covariate shift. We provide non-asymptotic guarantees proving that benign overfitting occurs in standard ridge regression, even under the OOD regime when the target covariance satisfies certain structural conditions. We identify several vital quantities relating to source and target covariance, which govern the performance of OOD generalization. Our result is sharp, which provably recovers prior in-distribution benign overfitting guarantee [Tsigler and Bartlett, 2023], as well as under-parameterized OOD guarantee [Ge et al., 2024] when specializing to each setup. Moreover, we also present theoretical results for a more general family of target covariance matrix, where standard ridge regression only achieves a slow statistical rate of $O(1/\sqrt{n})$ for the excess risk, while Principal Component Regression (PCR) is guaranteed to achieve the fast rate $O(1/n)$, where $n$ is the number of samples.


[34] 2412.14477

Graph-Structured Topic Modeling for Documents with Spatial or Covariate Dependencies

We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and similarities as edges, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of our proposed method by deriving high-probability bounds and develop a specialized cross-validation method to optimize our regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.


[35] 2412.14497

Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder

Estimating individual treatment effect (ITE) from observational data has gained increasing attention across various domains, with a key challenge being the identification of latent confounders affecting both treatment and outcome. Networked observational data offer new opportunities to address this issue by utilizing network information to infer latent confounders. However, most existing approaches assume observed variables and network information serve only as proxy variables for latent confounders, which often fails in practice, as some variables influence treatment but not outcomes, and vice versa. Recent advances in disentangled representation learning, which disentangle latent factors into instrumental, confounding, and adjustment factors, have shown promise for ITE estimation. Building on this, we propose a novel disentangled variational graph autoencoder that learns disentangled factors for treatment effect estimation on networked observational data. Our graph encoder further ensures factor independence using the Hilbert-Schmidt Independence Criterion. Extensive experiments on two semi-synthetic datasets derived from real-world social networks and one synthetic dataset demonstrate that our method achieves state-of-the-art performance.


[36] 2412.14650

Permutation recovery of spikes in noisy high-dimensional tensor estimation

We study the dynamics of gradient flow in high dimensions for the multi-spiked tensor problem, where the goal is to estimate $r$ unknown signal vectors (spikes) from noisy Gaussian tensor observations. Specifically, we analyze the maximum likelihood estimation procedure, which involves optimizing a highly nonconvex random function. We determine the sample complexity required for gradient flow to efficiently recover all spikes, without imposing any assumptions on the separation of the signal-to-noise ratios (SNRs). More precisely, our results provide the sample complexity required to guarantee recovery of the spikes up to a permutation. Our work builds on our companion paper [Ben Arous, Gerbelot, Piccolo 2024], which studies Langevin dynamics and determines the sample complexity and separation conditions for the SNRs necessary for ensuring exact recovery of the spikes (where the recovered permutation matches the identity). During the recovery process, the correlations between the estimators and the hidden vectors increase in a sequential manner. The order in which these correlations become significant depends on their initial values and the corresponding SNRs, which ultimately determines the permutation of the recovered spikes.


[37] 2412.14660

Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models

Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: https://github.com/hfutml/Calibration-MLLM.
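
Of the calibration techniques mentioned, temperature scaling is the simplest: fit a single temperature on held-out logits by minimizing the negative log-likelihood and divide logits by it at prediction time. A generic sketch on synthetic logits, not tied to the paper's models:

```python
# Standard temperature scaling: fit a single temperature T on held-out logits
# by minimizing negative log-likelihood, then divide logits by T at prediction
# time to soften overconfident probabilities.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(11)
labels = rng.integers(0, 5, size=500)
logits = 3.0 * (np.eye(5)[labels] + 0.5 * rng.normal(size=(500, 5)))   # overconfident logits

res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded")
print("fitted temperature:", round(res.x, 3))
```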


[38] 2412.14740

Recovering semipermeable barriers from reflected Brownian motion

We study the recovery of one-dimensional semipermeable barriers for a stochastic process in a planar domain. The considered process acts like Brownian motion when away from the barriers and is reflected upon contact until a sufficient but random amount of interaction has occurred, determined by the permeability, after which it passes through. Given a sequence of samples, we wonder when one can determine the location and shape of the barriers. This paper identifies several different recovery regimes, determined by the available observation period and the time between samples, with qualitatively different behavior. The observation period $T$ dictates if the full barriers or only certain pieces can be recovered, and the sampling rate significantly influences the convergence rate as $T\to \infty$. This rate turns out polynomial for fixed-frequency data, but exponentially fast in a high-frequency regime. Further, the environment's impact on the difficulty of the problem is quantified using interpretable parameters in the recovery guarantees, and is found to also be regime-dependent. For instance, the curvature of the barriers affects the convergence rate for fixed-frequency data, but becomes irrelevant when $T\to \infty$ with high-frequency data. The results are accompanied by explicit algorithms, and we conclude by illustrating the application to real-life data.


[39] 2412.14753

Opportunities and limitations of explaining quantum machine learning

A common trait of many machine learning models is that it is often difficult to understand and explain what caused the model to produce the given output. While the explainability of neural networks has been an active field of research in recent years, comparably little is known for quantum machine learning models. Despite a few recent works analyzing some specific aspects of explainability, as of now there is no clear big-picture perspective on what can be expected from quantum learning models in terms of explainability. In this work, we address this issue by identifying promising research avenues in this direction and outlining the expected future results. We additionally propose two explanation methods designed specifically for quantum machine learning models, which are, to the best of our knowledge, the first of their kind. Alongside our preview of the field, we compare both existing and novel methods to explain the predictions of quantum learning models. By studying explainability in quantum machine learning, we can contribute to the sustainable development of the field, preventing trust issues in the future.


[40] 2412.15063

Graph-neural-network predictions of solid-state NMR parameters from spherical tensor decomposition

Nuclear magnetic resonance (NMR) is a powerful spectroscopic technique that is sensitive to the local atomic structure of matter. Computational predictions of NMR parameters can help to interpret experimental data and validate structural models, and machine learning (ML) has emerged as an efficient route to making such predictions. Here, we systematically study graph-neural-network approaches to representing and learning tensor quantities for solid-state NMR -- specifically, the anisotropic magnetic shielding and the electric field gradient. We assess how the numerical accuracy of different ML models translates into prediction quality for experimentally relevant NMR properties: chemical shifts, quadrupolar coupling constants, tensor orientations, and even static 1D spectra. We apply these ML models to a structurally diverse dataset of amorphous SiO$_2$ configurations, spanning a wide range of density and local order, to larger configurations beyond the reach of traditional first-principles methods, and to the dynamics of the $\alpha\unicode{x2013}\beta$ inversion in cristobalite. Our work marks a step toward streamlining ML-driven NMR predictions for both static and dynamic behavior of complex materials, and toward bridging the gap between first-principles modeling and real-world experimental data.