New articles on Statistics


[1] 2501.07642

fastrerandomize: An R Package for Fast Rerandomization Using Accelerated Computing

The fastrerandomize R package provides hardware-accelerated tools for performing rerandomization and randomization testing in experimental research. Using a JAX backend, the package enables exact rerandomization inference even for large experiments with hundreds of billions of possible randomizations. Key functionalities include generating pools of acceptable rerandomizations based on covariate balance, conducting exact randomization tests, and performing pre-analysis evaluations to determine optimal rerandomization acceptance thresholds. Through batched processing and GPU acceleration, fastrerandomize achieves substantial performance gains compared to existing implementations, making previously intractable designs computationally feasible. The package therefore extends the randomization-based inference toolkit in R, allowing researchers to efficiently implement more stringent rerandomization designs and conduct valid inference even with large sample sizes or in high-dimensional settings.
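
A minimal sketch, in plain Python/NumPy, of the rerandomization loop that the package accelerates (the package itself is an R interface over a JAX backend; the function names, the brute-force loop, and the fixed threshold below are illustrative assumptions, not the package's API):

```python
import numpy as np

def mahalanobis_balance(X, w):
    """Mahalanobis balance statistic between treated and control covariate means."""
    n1, n0 = w.sum(), (1 - w).sum()
    diff = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return float(n1 * n0 / (n1 + n0) * diff @ cov_inv @ diff)

def rerandomize(X, n_treat, threshold, n_draws=10_000, rng=None):
    """Keep only candidate randomizations whose covariate balance beats the threshold."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    accepted = []
    for _ in range(n_draws):
        w = np.zeros(n, dtype=int)
        w[rng.choice(n, size=n_treat, replace=False)] = 1
        if mahalanobis_balance(X, w) <= threshold:
            accepted.append(w)
    return np.array(accepted)

# Example: 100 units, 5 covariates, keep only well-balanced assignments.
X = np.random.default_rng(0).normal(size=(100, 5))
pool = rerandomize(X, n_treat=50, threshold=1.0, rng=1)
print(f"accepted {len(pool)} of 10000 candidate randomizations")
```

The accepted pool would then be reused both for the chosen assignment and for the randomization test, which is exactly the step that benefits from batched GPU evaluation.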


[2] 2501.07645

Coverage errors for Student's t confidence intervals comparable to those in Hall (1988)

Table 1 of Hall (1988) contains asymptotic coverage error formulas for some nonparametric approximate 95% confidence intervals for the mean based on $n$ IID samples. The table includes an entry for an interval based on the central limit theorem using Gaussian quantiles and the Gaussian maximum likelihood variance estimate. It is missing an entry for the very widely used Student $t$ confidence intervals. This note makes a mild numerical correction for the Gaussian entry and provides an entry for the Student $t$ intervals. For skewness $\gamma$ and kurtosis $\kappa$, the corrected Gaussian formula is $0.14\kappa -2.16\gamma^2-3.42$ and the formula for the $t$ intervals is $0.14\kappa -2.16\gamma^2$. The impetus to revisit this estimate arose from the surprisingly robust performance of Student's t statistic in randomized quasi-Monte Carlo sampling.
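
As a rough empirical companion to these formulas, the Python sketch below compares the coverage of the Gaussian-quantile interval (with the ML variance estimate) and the Student $t$ interval on skewed data; it is a simulation illustration only, not a reproduction of Hall's asymptotic expansion.

```python
import numpy as np
from scipy import stats

def coverage(n=20, reps=200_000, dist=stats.expon(), seed=0):
    """Empirical coverage of nominal 95% intervals for the mean of a skewed distribution."""
    rng = np.random.default_rng(seed)
    x = dist.rvs(size=(reps, n), random_state=rng)
    mu = dist.mean()
    xbar = x.mean(axis=1)
    s_ml = x.std(axis=1, ddof=0)              # Gaussian MLE of sigma
    s = x.std(axis=1, ddof=1)                 # usual sample standard deviation
    z = stats.norm.ppf(0.975)
    t = stats.t.ppf(0.975, df=n - 1)
    cover_z = np.abs(xbar - mu) <= z * s_ml / np.sqrt(n)
    cover_t = np.abs(xbar - mu) <= t * s / np.sqrt(n)
    return cover_z.mean(), cover_t.mean()

print(coverage())  # both fall short of 0.95 for skewed data; the t interval is slightly wider
```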


[3] 2501.07665

Nonparametric estimation of the multivariate Spearman's footrule: a further discussion

In this paper, we propose two new estimators of the multivariate rank correlation coefficient Spearman's footrule, based on two general estimators for Average Orthant Dependence measures. We compare the new proposals with an existing estimator in the literature and show that the three estimators are asymptotically equivalent but that, in small samples, one of the proposed estimators outperforms the others. We also analyse the Pitman efficiency of these indices for testing multivariate independence, compared with multivariate versions of Kendall's tau and Spearman's rho.


[4] 2501.07668

Fast sampling and model selection for Bayesian mixture models

We describe two Monte Carlo algorithms for sampling from the integrated posterior distributions of a range of Bayesian mixture models. Both algorithms allow us to directly sample not only the assignment of observations to components but also the number of components, thereby fitting the model and performing model selection over the number of components in a single computation. The first algorithm is a traditional collapsed Gibbs sampler, albeit with an unusual move-set; the second builds on the first, adding rejection-free sampling from the prior over component assignments, to create an algorithm that has excellent mixing time in typical applications and outperforms current state-of-the-art methods, in some cases by a wide margin. We demonstrate our methods with a selection of applications to latent class analysis.


[5] 2501.07685

Adaptive sequential Monte Carlo for automated cross validation in structural Bayesian hierarchical models

Importance sampling (IS) is widely used for approximate Bayesian cross validation (CV) due to its efficiency, requiring only the re-weighting of a single set of posterior draws. With structural Bayesian hierarchical models, vanilla IS can produce unreliable results, as out-of-sample replication may involve non-standard case-deletion schemes which significantly alter the posterior geometry. This inevitably necessitates computationally expensive re-runs of Markov chain Monte Carlo (MCMC), making structural CV impracticable. To address this challenge, we consider sampling from a sequence of posteriors leading to the case-deleted posterior(s) via adaptive sequential Monte Carlo (SMC). We design the sampler to (a) support a broad range of structural CV schemes, (b) enhance efficiency by adaptively selecting Markov kernels, intervening in parallelizable MCMC re-runs only when necessary, and (c) streamline the workflow by automating the design of intermediate bridging distributions. Its practical utility is demonstrated through three real-world applications involving three types of predictive model assessments: leave-group-out CV, group $K$-fold CV, and sequential one-step-ahead validation.
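
For context, the vanilla importance-sampling approach to leave-one-out CV that this abstract takes as its starting point can be sketched as follows (plain NumPy/SciPy; this is the baseline whose breakdown under structural case deletion motivates the adaptive SMC sampler, not the proposed method itself):

```python
import numpy as np
from scipy.special import logsumexp

def is_loo_elpd(loglik):
    """Vanilla IS-LOO from an S x n matrix of pointwise log-likelihoods
    log p(y_i | theta_s) evaluated at S draws from the full-data posterior."""
    # Importance weights for the case-i-deleted posterior are w_s ∝ 1 / p(y_i | theta_s),
    # giving the harmonic-mean form p(y_i | y_{-i}) ≈ S / sum_s exp(-log p(y_i | theta_s)).
    S = loglik.shape[0]
    return float(np.sum(np.log(S) - logsumexp(-loglik, axis=0)))

# loglik would come from MCMC on the full data, e.g. shape (4000, n); the sum of the
# pointwise terms estimates the expected log predictive density under leave-one-out CV.
```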


[6] 2501.07722

ML-assisted Randomization Tests for Detecting Treatment Effects in A/B Experiments

Experimentation is widely utilized for causal inference and data-driven decision-making across disciplines. In an A/B experiment, for example, an online business randomly assigns two different treatments (e.g., website designs) to its customers and then aims to infer which treatment is better. In this paper, we construct randomization tests for complex treatment effects, including heterogeneity and interference. A key feature of our approach is the use of flexible machine learning (ML) models, where the test statistic is defined as the difference between the cross-validation errors from two ML models, one including the treatment variable and the other without it. This approach combines the predictive power of modern ML tools with the finite-sample validity of randomization procedures, enabling a robust and efficient way to detect complex treatment effects in experimental settings. We demonstrate this combined benefit both theoretically and empirically through applied examples.
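
A minimal sketch of the general recipe, using scikit-learn random forests as the flexible ML models; the specific models, CV scheme, and details of the statistic are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_error(X, y):
    """5-fold cross-validation MSE of a random forest."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

def randomization_test(X, w, y, n_perm=500, rng=None):
    """Statistic: CV error without the treatment minus CV error with it."""
    rng = np.random.default_rng(rng)
    base = cv_error(X, y)                       # model excluding the treatment variable
    def stat(w_):
        return base - cv_error(np.column_stack([X, w_]), y)
    t_obs = stat(w)
    # Re-randomize the treatment labels under the sharp null and recompute the statistic.
    t_null = np.array([stat(rng.permutation(w)) for _ in range(n_perm)])
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_perm)   # randomization p-value
```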


[7] 2501.07741

Concentration of Measure for Distributions Generated via Diffusion Models

We show via a combination of mathematical arguments and empirical evidence that data distributions sampled from diffusion models satisfy a Concentration of Measure Property saying that any Lipschitz $1$-dimensional projection of a random vector is not too far from its mean with high probability. This implies that such models are quite restrictive and gives an explanation for a fact previously observed in arXiv:2410.14171 that conventional diffusion models cannot capture "heavy-tailed" data (i.e. data $\mathbf{x}$ for which the norm $\|\mathbf{x}\|_2$ does not possess a subgaussian tail) well. We then proceed to train a generalized linear model using stochastic gradient descent (SGD) on the diffusion-generated data for a multiclass classification task and observe empirically that a Gaussian universality result holds for the test error. In other words, the test error depends only on the first and second order statistics of the diffusion-generated data in the linear setting. Results of such forms are desirable because they allow one to assume the data itself is Gaussian for analyzing performance of the trained classifier. Finally, we note that current approaches to proving universality do not apply to this case as the covariance matrices of the data tend to have vanishing minimum singular values for the diffusion-generated data, while the current proofs assume that this is not the case (see Subsection 3.4 for more details). This leaves extending previous mathematical universality results as an intriguing open question.
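
In the sub-Gaussian form usually meant by such statements (the constants $C, c$ below are schematic, not the paper's), the property reads

$$ \mathbb{P}\Big( \big| f(\mathbf{x}) - \mathbb{E}\,f(\mathbf{x}) \big| > t \Big) \;\le\; C\, e^{-c\, t^{2}} \qquad \text{for all } 1\text{-Lipschitz } f:\mathbb{R}^{d}\to\mathbb{R} \text{ and } t>0; $$

in particular, applying this to the 1-Lipschitz map $f(\mathbf{x}) = \|\mathbf{x}\|_2$ yields the sub-Gaussian norm tail that heavy-tailed data violate.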


[8] 2501.07763

On the Statistical Capacity of Deep Generative Models

Deep generative models are routinely used in generating samples from complex, high-dimensional distributions. Despite their apparent successes, their statistical properties are not well understood. A common assumption is that with enough training data and sufficiently large neural networks, deep generative model samples will have arbitrarily small errors in sampling from any continuous target distribution. We set up a unifying framework that debunks this belief. We demonstrate that broad classes of deep generative models, including variational autoencoders and generative adversarial networks, are not universal generators. Under the predominant case of Gaussian latent variables, these models can only generate concentrated samples that exhibit light tails. Using tools from concentration of measure and convex geometry, we give analogous results for more general log-concave and strongly log-concave latent variable distributions. We extend our results to diffusion models via a reduction argument. We use the Gromov--Levy inequality to give similar guarantees when the latent variables lie on manifolds with positive Ricci curvature. These results shed light on the limited capacity of common deep generative models to handle heavy tails. We illustrate the empirical relevance of our work with simulations and financial data.


[9] 2501.07770

Maximum likelihood estimation in the sparse Rasch model

The Rasch model has been widely used to analyse item response data in psychometrics and educational assessments. When the numbers of individuals and items are large, it may be impractical to provide all possible responses, so it is desirable to study sparse item response experiments. Here, we propose to use the Erd\H{o}s\textendash R\'enyi random sampling design, where an individual responds to an item with low probability $p$. We prove the uniform consistency of the maximum likelihood estimator for the Rasch model when both the number of individuals, $r$, and the number of items, $t$, approach infinity. The sampling probability $p$ can be as small as $\max\{\log r/r, \log t/t\}$ up to a constant factor, which is a fundamental requirement to guarantee the connectivity of the sampling graph by the theory of the Erd\H{o}s\textendash R\'enyi graph. The key technique behind this significant advancement is a powerful leave-one-out method for the Rasch model. We further establish the asymptotic normality of the MLE by using a simple matrix to approximate the inverse of the Fisher information matrix. The theoretical results are corroborated by simulation studies and an analysis of a large item-response dataset.
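
For reference, with individual abilities $\theta_i$ and item difficulties $\beta_j$, the Rasch response probability and the log-likelihood over the sampled pairs take the standard form (stated here as the usual formulation, as an assumption about the paper's notation):

$$ \mathbb{P}(X_{ij}=1) = \frac{e^{\theta_i - \beta_j}}{1 + e^{\theta_i - \beta_j}}, \qquad \ell(\theta,\beta) = \sum_{(i,j):\, a_{ij}=1} \Big[ X_{ij}(\theta_i - \beta_j) - \log\big(1 + e^{\theta_i - \beta_j}\big) \Big], $$

where $a_{ij}\sim\mathrm{Bernoulli}(p)$ indicates whether individual $i$ responds to item $j$ under the Erd\H{o}s\textendash R\'enyi sampling design.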


[10] 2501.07772

Bridging Root-$n$ and Non-standard Asymptotics: Dimension-agnostic Adaptive Inference in M-Estimation

This manuscript studies a general approach to construct confidence sets for the solution of population-level optimization, commonly referred to as M-estimation. Statistical inference for M-estimation poses significant challenges due to the non-standard limiting behaviors of the corresponding estimator, which arise in settings with increasing dimension of parameters, non-smooth objectives, or constraints. We propose a simple and unified method that guarantees validity in both regular and irregular cases. Moreover, we provide a comprehensive width analysis of the proposed confidence set, showing that the convergence rate of the diameter is adaptive to the unknown degree of instance-specific regularity. We apply the proposed method to several high-dimensional and irregular statistical problems.


[11] 2501.07789

Using Statistical Precision Medicine to Identify Optimal Treatments in a Heart Failure Setting

Identifying optimal medical treatments to improve survival has long been a critical goal of pharmacoepidemiology. Traditionally, we use an average treatment effect measure to compare outcomes between treatment plans. However, new methods that combine the advantages of machine learning with the foundational tenets of causal inference offer an alternative to the average treatment effect. Here, we use three distinct precision medicine algorithms (random forests, residual weighted learning, efficient augmentation relaxed learning) to identify optimal treatment rules, under which patients receive the treatment indicated as optimal by their clinical history. First, we present a simple hypothetical example and a real-world application among heart failure patients using Medicare claims data. We next demonstrate how the optimal treatment rule improves the absolute risk in a hypothetical, three-modifier setting. Finally, we identify an optimal treatment rule that optimizes the time to outcome in a real-world heart failure setting. In both examples, we compare the average time to death under the optimized, tailored treatment rule with the average time to death under a universal treatment rule to show the benefit of precision medicine methods. The improvement under the optimal treatment rule in the real-world setting is greatest (an additional ~9 days under the tailored rule) for survival time free of heart failure readmission.


[12] 2501.07790

Sampling from Density power divergence-based Generalized posterior distribution via Stochastic optimization

Robust Bayesian inference using density power divergence (DPD) has emerged as a promising approach for handling outliers in statistical estimation. While the DPD-based posterior offers theoretical guarantees for robustness, its practical implementation faces significant computational challenges, particularly for general parametric models with intractable integral terms. These challenges become especially pronounced in high-dimensional settings where traditional numerical integration methods prove inadequate and computationally expensive. We propose a novel sampling methodology that addresses these limitations by integrating the loss-likelihood bootstrap with a stochastic gradient descent algorithm specifically designed for DPD-based estimation. Our approach enables efficient and scalable sampling from DPD-based posteriors for a broad class of parametric models, including those with intractable integrals, and we further extend it to accommodate generalized linear models. Through comprehensive simulation studies, we demonstrate that our method efficiently samples from DPD-based posteriors, offering superior computational scalability compared to conventional methods, particularly in high-dimensional settings. The results also highlight its ability to handle complex parametric models with intractable integral terms.
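
To fix ideas, the density power divergence objective for a parametric density $f_\theta$ with tuning parameter $\alpha>0$ is $\frac{1}{n}\sum_i\big[\int f_\theta^{1+\alpha}\,dx - (1+\tfrac{1}{\alpha}) f_\theta(x_i)^{\alpha}\big]$. The sketch below runs plain SGD on this loss for a Gaussian location-scale model, where the integral happens to be available in closed form; the paper's interest is precisely the intractable-integral case, so treat this only as an illustration of the loss, not of the proposed bootstrap sampler.

```python
import numpy as np

def dpd_loss_gauss(mu, log_sigma, x, alpha):
    """DPD loss for N(mu, sigma^2); the integral term is closed-form here."""
    sigma = np.exp(log_sigma)
    integral = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    dens = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return integral - (1 + 1 / alpha) * np.mean(dens**alpha)

def sgd_dpd(x, alpha=0.5, lr=0.05, steps=5000, batch=32, eps=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    mu, log_sigma = np.median(x), np.log(x.std())
    for _ in range(steps):
        xb = rng.choice(x, size=batch)
        # Numerical gradients keep the sketch dependency-free.
        g_mu = (dpd_loss_gauss(mu + eps, log_sigma, xb, alpha)
                - dpd_loss_gauss(mu - eps, log_sigma, xb, alpha)) / (2 * eps)
        g_ls = (dpd_loss_gauss(mu, log_sigma + eps, xb, alpha)
                - dpd_loss_gauss(mu, log_sigma - eps, xb, alpha)) / (2 * eps)
        mu, log_sigma = mu - lr * g_mu, log_sigma - lr * g_ls
    return mu, np.exp(log_sigma)

# Robustness check: 5% gross contamination barely moves the DPD estimate of mu.
x = np.concatenate([np.random.default_rng(1).normal(0, 1, 950),
                    np.random.default_rng(2).normal(10, 1, 50)])
print(sgd_dpd(x))
```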


[13] 2501.07795

Black-box Optimization with Simultaneous Statistical Inference for Optimal Performance

Black-box optimization is often encountered in decision-making for complex systems management, where knowledge of the system is limited. Under these circumstances, it is essential to balance the utilization of new information with computational efficiency. In practice, decision-makers often face the dual tasks of optimization and statistical inference on the optimal performance, in order to achieve it with high reliability. Our goal is to address these dual tasks in an online fashion. Wu et al. (2022) [arXiv preprint: 2210.06737] point out that the sample average of performance estimates generated by the optimization algorithm need not admit a central limit theorem. We propose an algorithm that not only tackles this issue but also provides an online consistent estimator for the variance of the performance. Furthermore, we characterize the convergence rate of the coverage probabilities of the asymptotic confidence intervals.


[14] 2501.07841

GeoWarp: Warped spatial processes for inferring subsea sediment properties

For offshore structures like wind turbines, subsea infrastructure, pipelines, and cables, it is crucial to quantify the properties of the seabed sediments at a proposed site. However, data collection offshore is costly, so analysis of the seabed sediments must be made from measurements that are spatially sparse. Adding to this challenge, the structure of the seabed sediments exhibits both nonstationarity and anisotropy. To address these issues, we propose GeoWarp, a hierarchical spatial statistical modeling framework for inferring the 3-D geotechnical properties of subsea sediments. GeoWarp decomposes the seabed properties into a region-wide vertical mean profile (modeled using B-splines), and a nonstationary 3-D spatial Gaussian process. Process nonstationarity and anisotropy are accommodated by warping space in three dimensions and by allowing the process variance to change with depth. We apply GeoWarp to measurements of the seabed made using cone penetrometer tests (CPTs) at six sites on the North West Shelf of Australia. We show that GeoWarp captures the complex spatial distribution of the sediment properties, and produces realistic 3-D simulations suitable for downstream engineering analyses. Through cross-validation, we show that GeoWarp has predictive performance superior to other state-of-the-art methods, demonstrating its value as a tool in offshore geotechnical engineering.


[15] 2501.07842

Prediction Inference Using Generalized Functional Mixed Effects Models

We introduce inferential methods for prediction based on functional random effects in generalized functional mixed effects models. This is similar to inference for random effects in generalized linear mixed effects models (GLMMs), but for functional instead of scalar outcomes. The method combines: (1) local GLMMs to extract initial estimators of the functional random components on the linear predictor scale; (2) structural functional principal components analysis (SFPCA) for dimension reduction; and (3) a global Bayesian multilevel model conditional on the eigenfunctions for inference on the functional random effects. Extensive simulations demonstrate excellent coverage properties of credible intervals for the functional random effects in a variety of scenarios and for different data sizes. To our knowledge, this is the first time such simulations have been conducted and reported, likely because prediction inference was not viewed as a priority and existing methods are too slow to calculate coverage. The methods are implemented in a reproducible R package and demonstrated using the NHANES 2011-2014 accelerometry data.


[16] 2501.07949

One cut-point phase-type distributions in Reliability. An application to Resistive Random Access Memories

A new probability distribution for studying lifetime data in reliability is introduced in this paper. It is a first approach to a non-homogeneous phase-type distribution, built by considering one cut-point on the non-negative half-line of a phase-type distribution. The density function is defined, and the main associated measures, such as the reliability function, hazard rate, cumulative hazard rate, and characteristic function, are worked out. This new class of distributions makes it possible to reduce the number of parameters to be estimated when inference is considered. In addition, the likelihood function is constructed to estimate the model parameters by maximum likelihood. Several applications to Resistive Random Access Memories compare the fit obtained with phase-type distributions and with one cut-point phase-type distributions. The developed methodology has been implemented computationally in R (CRAN).


[17] 2501.07995

MMAPs to model complex multi-state systems with vacation policies in the repair facility

Two complex multi-state systems subject to multiple events are built in an algorithmic and computational way by considering phase-type distributions and Markovian arrival processes with marked arrivals. The internal performance of the system comprises different degradation levels, and both repairable and non-repairable internal failures can occur. The system is also subject to external shocks that may provoke repairable or non-repairable failures. A multiple vacation policy for the repairperson is introduced into the system, and preventive maintenance is included to improve its behaviour. Two types of task may be performed by the repairperson: corrective repair and preventive maintenance. The systems are modelled, the transient and stationary distributions are built, and different performance measures are calculated in matrix-algorithmic form. Costs and rewards are included in the model in matrix-vector form. Several economic measures are worked out, and the net reward per unit of time is used to optimize the system. A numerical example shows that the system can be optimized according to the existence of preventive maintenance and the distribution of the vacation time. The results have been implemented computationally with Matlab and R (packages: expm, optim).


[18] 2501.08033

Semiparametric Skew-Elliptical Distributions For High-Dimensional Graphical Models

We propose a semiparametric approach called elliptical skew-(S)KEPTIC for efficiently and robustly estimating non-Gaussian graphical models. Relaxing the assumption of semiparametric elliptical distributions to the family of \textit{meta skew-elliptical} distributions, which accommodates a skewness component, we derive a new estimator that extends the SKEPTIC estimator of Liu et al. (2012), based on semiparametric Gaussian copula graphical models, to the case of skew-elliptical copula graphical models. Theoretically, we demonstrate that the elliptical skew-(S)KEPTIC estimator achieves robust parametric convergence rates in both graph recovery and parameter estimation. We conduct numerical simulations to demonstrate the reliable graph recovery performance of the elliptical skew-(S)KEPTIC estimator. Finally, the new method is applied to the daily log-returns of the stocks in the S\&P500 index and shows better interpretability than Gaussian copula graphical models.


[19] 2501.08050

On the use of Statistical Learning Theory for model selection in Structural Health Monitoring

Whenever data-based systems are employed in engineering applications, defining an optimal statistical representation is subject to the problem of model selection. This paper focusses on how well models can generalise in Structural Health Monitoring (SHM). Although statistical model validation in this field is often performed heuristically, it is possible to estimate generalisation more rigorously using the bounds provided by Statistical Learning Theory (SLT). Therefore, this paper explores the selection process of a kernel smoother for modelling the impulse response of a linear oscillator from the perspective of SLT. It is demonstrated that incorporating domain knowledge into the regression problem yields a lower guaranteed risk, thereby enhancing generalisation.


[20] 2501.08093

A note on local parameter orthogonality for multivariate data and the Whittle algorithm for multivariate autoregressive models

This article extends the Cox-Reid local parameter orthogonality to a multivariate setting and shows that the extension can lead to the Whittle algorithm for multivariate autoregressive models.


[21] 2501.08201

Globally Convergent Variational Inference

In variational inference (VI), an approximation of the posterior distribution is selected from a family of distributions through numerical optimization. With the most common variational objective function, known as the evidence lower bound (ELBO), only convergence to a local optimum can be guaranteed. In this work, we instead establish the global convergence of a particular VI method. This VI method, which may be considered an instance of neural posterior estimation (NPE), minimizes an expectation of the inclusive (forward) KL divergence to fit a variational distribution that is parameterized by a neural network. Our convergence result relies on the neural tangent kernel (NTK) to characterize the gradient dynamics that arise from considering the variational objective in function space. In the asymptotic regime of a fixed, positive-definite neural tangent kernel, we establish conditions under which the variational objective admits a unique solution in a reproducing kernel Hilbert space (RKHS). Then, we show that the gradient descent dynamics in function space converge to this unique function. In ablation studies and practical problems, we demonstrate that our results explain the behavior of NPE in non-asymptotic finite-neuron settings, and show that NPE outperforms ELBO-based optimization, which often converges to shallow local optima.
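
The objective in question is the expected inclusive KL divergence, which, up to a term not depending on the variational parameters $\phi$, reduces to maximizing the expected log-density of the variational family over joint prior-predictive draws, i.e. the standard NPE loss (stated here in its generic form, as an assumption about the paper's notation):

$$ \min_{\phi} \; \mathbb{E}_{x\sim p(x)}\Big[ \mathrm{KL}\big( p(\theta\mid x) \,\|\, q_\phi(\theta\mid x) \big) \Big] \;\;\Longleftrightarrow\;\; \max_{\phi} \; \mathbb{E}_{(\theta,x)\sim p(\theta)\,p(x\mid\theta)}\big[ \log q_\phi(\theta\mid x) \big]. $$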


[22] 2501.08228

Estimation of survival functions for events based on a continuous outcome: a distributional approach

The limitations resulting from the dichotomisation of continuous outcomes have been extensively described, but the need to present results based on binary outcomes remains, particularly in the health sciences. Alternatives based on the distribution of the continuous outcome have been proposed. Here we explore the possibility of using a distributional approach in the context of time-to-event analysis when the event is defined by values of a continuous outcome above or below a threshold. As a first step, we propose a distributional version of the Kaplan-Meier estimate of the survival function based on repeated truncation of the normal distribution. The method is evaluated with a simulation study and illustrated with a case study.


[23] 2501.08231

Bayesian Shrinkage Priors for Penalized Synthetic Control Estimators in the Presence of Spillovers

Synthetic control (SC) methods are widely used to evaluate the impact of policy interventions, particularly those targeting specific geographic areas or regions, commonly referred to as units. These methods construct an artificial (synthetic) unit from untreated (control) units, intended to mirror the characteristics of the treated region had the intervention not occurred. While neighboring areas are often chosen as controls due to their assumed similarities with the treated, their proximity can introduce spillovers, where the intervention indirectly affects these controls, biasing the estimates. To address this challenge, we propose a Bayesian SC method with distance-based shrinkage priors, designed to estimate causal effects while accounting for spillovers. Modifying traditional penalization techniques, our approach incorporates a weighted distance function that considers both covariate information and spatial proximity to the treated. Rather than simply excluding nearby controls, this framework data-adaptively selects those less likely to be impacted by spillovers, providing a balance between bias and variance reduction. Through simulation studies, we demonstrate the finite-sample properties of our method under varying levels of spillover. We then apply this approach to evaluate the impact of Philadelphia's beverage tax on the sales of sugar-sweetened and artificially sweetened beverages in mass merchandise stores.


[24] 2501.08265

Fast and Cheap Covariance Smoothing

We introduce the Tensorized-and-Restricted Krylov (TReK) method, a simple and efficient algorithm for estimating covariance tensors with large observational sizes. TReK extends the conjugate gradient method to incorporate range restrictions, enabling its use in a variety of covariance smoothing applications. By leveraging matrix-level operations, it achieves significant improvements in both computational speed and memory cost, improving over existing methods by an order of magnitude. TReK ensures finite-step convergence in the absence of rounding errors and converges fast in practice, making it well-suited for large-scale problems. The algorithm is also highly flexible, supporting a wide range of forward and projection tensors.


[25] 2501.08270

Individual causal effect estimation accounting for latent disease state modification among bipolar participants in mobile health studies

Individuals with bipolar disorder tend to cycle through disease states such as depression and mania. The heterogeneous nature of disease across states complicates the evaluation of interventions for bipolar disorder patients, as varied interventional success is observed within and across individuals. In fact, we hypothesize that disease state acts as an effect modifier for the causal effect of a given intervention on health outcomes. To address this dilemma, we propose an N-of-1 approach using an adapted autoregressive hidden Markov model, applied to longitudinal mobile health data collected from individuals with bipolar disorder. This method allows us to identify a latent variable from mobile health data to be treated as an effect modifier between the exposure and outcome of interest while allowing for missing data in the outcome. A counterfactual approach is employed for causal inference and to obtain a g-formula estimator to recover said effect. The performance of the proposed method is compared with a naive approach across extensive simulations and application to a multi-year smartphone study of bipolar patients, evaluating the individual effect of digital social activity on sleep duration across different latent disease states.


[26] 2501.08274

Constructing optimal dynamic monitoring and treatment regimes: An application to hypertension care

Hypertension is a leading cause of cardiovascular disease and morbidity, with antihypertensive drugs and blood pressure management strategies having heterogeneous effects on patients. Previous authors exploited this heterogeneity to construct optimal dynamic treatment regimes for hypertension that input patient characteristics and output the best drug or blood pressure management strategy to prescribe. There is, however, a lack of research on optimizing monitoring schedules for these patients. It is unclear whether different monitoring patterns and drug add-on strategies could lower blood pressure differently across patients. We propose a new consistent methodology to develop optimal dynamic monitoring and add-on regimes that is doubly robust and relies on the theory of Robins' g-methods and dynamic weighted ordinary least squares. We discuss the treatment of longitudinal missing data for that inference. The approach is evaluated in large simulation studies and applied to data from the SPRINT trial in the United States to derive a new optimal rule. This type of rule could be used by patients or physicians to personalize the timing of visits and by physicians to decide whether prescribing an antihypertensive drug is beneficial.


[27] 2501.08288

Avoiding subtraction and division of stochastic signals using normalizing flows: NFdeconvolve

Across the scientific realm, we find ourselves subtracting or dividing stochastic signals. For instance, consider a stochastic realization, $x$, generated from the addition or multiplication of two stochastic signals $a$ and $b$, namely $x=a+b$ or $x = ab$. For the $x=a+b$ example, $a$ can be fluorescence background and $b$ the signal of interest whose statistics are to be learned from the measured $x$. Similarly, when writing $x=ab$, $a$ can be thought of as the illumination intensity and $b$ the density of fluorescent molecules of interest. Yet dividing or subtracting stochastic signals amplifies noise, and we ask instead whether, using the statistics of $a$ and the measurement of $x$ as input, we can recover the statistics of $b$. Here, we show how normalizing flows can generate an approximation of the probability distribution over $b$, thereby avoiding subtraction or division altogether. This method is implemented in our software package, NFdeconvolve, available on GitHub with a tutorial linked in the main text.
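
A compact sketch of the underlying idea for the additive case $x=a+b$: instead of subtracting, maximize the Monte Carlo marginal likelihood $p(x)=\int p_a(a)\,p_b(x-a)\,da$ over the parameters of a model for $b$. Here a Gaussian stands in for the normalizing flow used by NFdeconvolve, so this illustrates the objective rather than the package itself.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
a = rng.normal(2.0, 0.5, size=2000)           # known background, p_a
b = rng.gamma(3.0, 1.0, size=2000)            # signal of interest, p_b (unknown)
x = a + b                                      # what is actually measured

a_mc = rng.normal(2.0, 0.5, size=(1, 500))     # Monte Carlo draws from the known p_a

def neg_log_marginal(params):
    """-log-likelihood with p(x_i) approximated by mean_m p_b(x_i - a_m)."""
    mu, log_sigma = params
    dens = stats.norm.pdf(x[:, None] - a_mc, loc=mu, scale=np.exp(log_sigma))
    return -np.sum(np.log(dens.mean(axis=1) + 1e-300))

res = optimize.minimize(neg_log_marginal, x0=[1.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # compare with b.mean() ≈ 3 and b.std() ≈ 1.73
```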


[28] 2501.08320

COMBO and COMMA: R packages for regression modeling and inference in the presence of misclassified binary mediator or outcome variables

Misclassified binary outcome or mediator variables can cause unpredictable bias in resulting parameter estimates. As more datasets that were not originally collected for research purposes are being used for studies in the social and health sciences, the need for methods that address data quality concerns is growing. In this paper, we describe two R packages, COMBO and COMMA, that implement bias-correction methods for misclassified binary outcome and mediator variables, respectively. These likelihood-based approaches do not require gold standard measures and allow for estimation of sensitivity and specificity rates for the misclassified variable(s). In addition, these R packages automatically apply crucial label switching corrections, allowing researchers to circumvent the inherent permutation invariance of the misclassification model likelihood. We demonstrate COMBO for single-outcome cases using a study of bar exam passage. We develop and evaluate a risk prediction model based on noisy indicators in a pretrial risk assessment study to demonstrate COMBO for multi-outcome cases. In addition, we use COMMA to evaluate the mediating effect of potentially misdiagnosed gestational hypertension on the maternal ethnicity-birthweight relationship.


[29] 2501.07599

Analyzing Spatio-Temporal Dynamics of Dissolved Oxygen for the River Thames using Superstatistical Methods and Machine Learning

By employing superstatistical methods and machine learning, we analyze time series data of water quality indicators for the River Thames, with a specific focus on the dynamics of dissolved oxygen. After detrending, the probability density functions of dissolved oxygen fluctuations exhibit heavy tails that are effectively modeled using $q$-Gaussian distributions. Our findings indicate that the multiplicative Empirical Mode Decomposition method stands out as the most effective detrending technique, yielding the highest log-likelihood in nearly all fittings. We also observe that the optimally fitted width parameter of the $q$-Gaussian shows a negative correlation with the distance to the sea, highlighting the influence of geographical factors on water quality dynamics. In the context of same-time prediction of dissolved oxygen, regression analysis incorporating various water quality indicators and temporal features identifies the Light Gradient Boosting Machine as the best model. SHapley Additive exPlanations reveal that temperature, pH, and time of year play crucial roles in the predictions. Furthermore, we use Transformer-based models to forecast dissolved oxygen concentrations. For long-term forecasting, the Informer model consistently delivers superior performance, achieving the lowest MAE and SMAPE with the 192 historical time steps that we used. This performance is attributed to the Informer's ProbSparse self-attention mechanism, which allows it to capture long-range dependencies in time-series data more effectively than other machine learning models. It effectively recognizes the half-life cycle of dissolved oxygen, with particular attention to key intervals. Our findings provide valuable insights for policymakers involved in ecological health assessments, aiding in accurate predictions of river water quality and the maintenance of healthy aquatic ecosystems.


[30] 2501.07600

Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

Deep learning models, such as Siamese Neural Networks (SNN), have shown great potential in capturing the intricate patterns in behavioral data. However, the impacts of dataset breadth (i.e., the number of subjects) and depth (e.g., the amount of training samples per subject) on the performance of these models are often informally assumed, and remain under-explored. To this end, we have conducted extensive experiments using the concepts of "feature space" and "density" to guide and gain deeper understanding of the impact of dataset breadth and depth on three publicly available keystroke datasets (Aalto, CMU and Clarkson II). By varying the number of training subjects, number of samples per subject, amount of data in each sample, and number of triplets used in training, we found that, when feasible, increasing dataset breadth enables the training of a well-trained model that effectively captures more inter-subject variability. In contrast, we find that the extent of a dataset's depth-wise impact depends on the nature of the dataset. Free-text datasets are influenced by all of the depth-wise factors: inadequate samples per subject, sequence length, training triplets, and gallery sample size may all lead to an under-trained model. Fixed-text datasets are less affected by these factors, and as such make it easier to create a well-trained model. These findings shed light on the importance of dataset breadth and depth in training deep learning models for behavioral biometrics and provide valuable insights for designing more effective authentication systems.


[31] 2501.07602

An Explainable Pipeline for Machine Learning with Functional Data

Machine learning (ML) models have shown success in applications with an objective of prediction, but the algorithmic complexity of some models makes them difficult to interpret. Methods have been proposed to provide insight into these "black-box" models, but there is little research that focuses on supervised ML when the model inputs are functional data. In this work, we consider two applications from high-consequence spaces with objectives of making predictions using functional data inputs. One application aims to classify material types to identify explosive materials given hyperspectral computed tomography scans of the materials. The other application considers the forensic science task of connecting an inkjet printed document to the source printer using color signatures extracted by Raman spectroscopy. An instinctive route to consider for analyzing these data is a data-driven ML model for classification, but due to the high-consequence nature of the applications, we argue it is important to appropriately account for the nature of the data in the analysis so as not to obscure or misrepresent patterns. As such, we propose the Variable importance Explainable Elastic Shape Analysis (VEESA) pipeline for training ML models with functional data that (1) accounts for the vertical and horizontal variability in the functional data and (2) provides an explanation in the original data space of how the model uses variability in the functional data for prediction. The pipeline makes use of elastic functional principal components analysis (efPCA) to generate uncorrelated model inputs and permutation feature importance (PFI) to identify the principal components important for prediction. The variability captured by the important principal components is visualized in the original data space. We ultimately discuss ideas for natural extensions of the VEESA pipeline and challenges for future research.


[32] 2501.07643

A Step Toward Interpretability: Smearing the Likelihood

The problem of interpretability of machine learning architecture in particle physics has no agreed-upon definition, much less any proposed solution. We present a first modest step toward these goals by proposing a definition and corresponding practical method for isolation and identification of relevant physical energy scales exploited by the machine. This is accomplished by smearing or averaging over all input events that lie within a prescribed metric energy distance of one another and correspondingly renders any quantity measured on a finite, discrete dataset continuous over the dataspace. Within this approach, we are able to explicitly demonstrate that (approximate) scaling laws are a consequence of extreme value theory applied to analysis of the distribution of the irreducible minimal distance over which a machine must extrapolate given a finite dataset. As an example, we study quark versus gluon jet identification, construct the smeared likelihood, and show that discrimination power steadily increases as resolution decreases, indicating that the true likelihood for the problem is sensitive to emissions at all scales.


[33] 2501.07652

Finite Sample Identification of Partially Observed Bilinear Dynamical Systems

We consider the problem of learning a realization of a partially observed bilinear dynamical system (BLDS) from noisy input-output data. Given a single trajectory of input-output samples, we provide a finite time analysis for learning the system's Markov-like parameters, from which a balanced realization of the bilinear system can be obtained. Our bilinear system identification algorithm learns the system's Markov-like parameters by regressing the outputs to highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the stability of BLDS depends on the sequence of inputs used to excite the system. These properties, unique to partially observed bilinear dynamical systems, pose significant challenges to the analysis of our algorithm for learning the unknown dynamics. We address these challenges and provide high probability error bounds on our identification algorithm under a uniform stability assumption. Our analysis provides insights into system theoretic quantities that affect learning accuracy and sample complexity. Lastly, we perform numerical experiments with synthetic data to reinforce these insights.


[34] 2501.07681

Dataset Distillation as Pushforward Optimal Quantization

Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose a simple extension of the state-of-the-art data distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation, and state-of-the-art performance in higher image-per-class settings.
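
The optimal quantization problem referred to here is the classical one: choose $m$ synthetic points to minimize the expected projection distance of the data distribution $\mu$ onto those points, written below for the common quadratic cost (a standard formulation, included for orientation; the paper may work with a more general power of the distance):

$$ \min_{y_1,\dots,y_m \in \mathbb{R}^d} \; \mathbb{E}_{X\sim\mu}\Big[\, \min_{1\le j\le m} \|X - y_j\|^2 \,\Big]. $$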


[35] 2501.07738

Mixing time for a noisy SIS model on graphs

We study the mixing time of the noisy SIS (Susceptible-Infected-Susceptible) model on graphs. The noisy SIS model is a variant of the standard SIS model, which allows individuals to become infected not just due to contacts with infected individuals but also due to external noise. We show that, under strong external noise, the mixing time is of order $O(n \log n)$. Additionally, we demonstrate that the mixing time on random graphs, namely Erd\H{o}s--R\'enyi graphs, regular multigraphs, and Galton--Watson trees, is also of order $O(n \log n)$ with high probability.


[36] 2501.07761

Impatient Bandits: Optimizing for the Long-Term Without Delay

Increasingly, recommender systems are tasked with improving users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a bandit problem with delayed rewards. There is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate of learning, whereas using short-term proxy rewards reflects the actual long-term goal only imperfectly. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Rewards as well as shorter-term surrogate outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that quickly learns to identify content aligned with long-term success using this new predictive model. We prove a regret bound for our algorithm that depends on the \textit{Value of Progressive Feedback}, an information theoretic metric that captures the quality of short-term leading indicators that are observed prior to the long-term reward. We apply our approach to a podcast recommendation problem, where we seek to recommend shows that users engage with repeatedly over two months. We empirically validate that our approach significantly outperforms methods that optimize for short-term proxies or rely solely on delayed rewards, as demonstrated by an A/B test in a recommendation system that serves hundreds of millions of users.


[37] 2501.07765

PINN-FEM: A Hybrid Approach for Enforcing Dirichlet Boundary Conditions in Physics-Informed Neural Networks

Physics-Informed Neural Networks (PINNs) solve partial differential equations (PDEs) by embedding governing equations and boundary/initial conditions into the loss function. However, enforcing Dirichlet boundary conditions accurately remains challenging, often leading to soft enforcement that compromises convergence and reliability in complex domains. We propose a hybrid approach, PINN-FEM, which combines PINNs with finite element methods (FEM) to impose strong Dirichlet boundary conditions via domain decomposition. This method incorporates FEM-based representations near the boundary, ensuring exact enforcement without compromising convergence. Through six experiments of increasing complexity, PINN-FEM outperforms standard PINN models, showcasing superior accuracy and robustness. While distance functions and similar techniques have been proposed for boundary condition enforcement, they lack generality for real-world applications. PINN-FEM bridges this gap by leveraging FEM near boundaries, making it well-suited for industrial and scientific problems.


[38] 2501.07879

Distributed Nonparametric Estimation: from Sparse to Dense Samples per Terminal

Consider the communication-constrained problem of nonparametric function estimation, in which each distributed terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.


[39] 2501.07964

Derivation of Output Correlation Inferences for Multi-Output (aka Multi-Task) Gaussian Process

The Gaussian process (GP) is arguably one of the most widely used machine learning algorithms in practice. One of its prominent applications is Bayesian optimization (BO). Although the vanilla GP itself is already a powerful tool for BO, it is often beneficial to be able to consider the dependencies of multiple outputs. To do so, the Multi-task GP (MTGP) is formulated, but it is not trivial to fully understand the derivations of its formulations and their gradients from the previous literature. This paper provides friendly derivations of the MTGP formulations and their gradients.


[40] 2501.07969

Enhanced Sparse Bayesian Learning Methods with Application to Massive MIMO Channel Estimation

We consider the problem of sparse channel estimation in massive multiple-input multiple-output systems. In this context, we propose an enhanced version of the sparse Bayesian learning (SBL) framework, referred to as enhanced SBL (E-SBL), which is based on a reparameterization of the original SBL model. Specifically, we introduce a scale vector that brings extra flexibility to the model, which is estimated along with the other unknowns. Moreover, we introduce a variant of E-SBL, referred to as modified E-SBL (M-E-SBL), which is based on a computationally more efficient parameter estimation. We compare the proposed E-SBL and M-E-SBL with the baseline SBL and with a method based on variational message passing (VMP) in terms of computational complexity and performance. Numerical results show that the proposed E-SBL and M-E-SBL outperform the baseline SBL and VMP in terms of mean squared error of the channel estimation in all the considered scenarios. Furthermore, we show that M-E-SBL produces results comparable with E-SBL with considerably cheaper computations.


[41] 2501.07975

Some observations on the ambivalent role of symmetries in Bayesian inference problems

We collect in this note some observations on the role of symmetries in Bayesian inference problems, that can be useful or detrimental depending on the way they act on the signal and on the observations. We emphasize in particular the need to gauge away unobservable invariances in the definition of a distance between a signal and its estimator, and the consequences this implies for the statistical mechanics treatment of such models, taking as a motivating example the extensive rank matrix factorization problem.


[42] 2501.08040

Convergence Analysis of Real-time Recurrent Learning (RTRL) for a class of Recurrent Neural Networks

Recurrent neural networks (RNNs) are commonly trained with the truncated backpropagation-through-time (TBPTT) algorithm. For the purposes of computational tractability, the TBPTT algorithm truncates the chain rule and calculates the gradient on a finite block of the overall data sequence. Such approximation could lead to significant inaccuracies, as the block length for the truncated backpropagation is typically limited to be much smaller than the overall sequence length. In contrast, Real-time recurrent learning (RTRL) is an online optimization algorithm which asymptotically follows the true gradient of the loss on the data sequence as the number of sequence time steps $t \rightarrow \infty$. RTRL forward propagates the derivatives of the RNN hidden/memory units with respect to the parameters and, using the forward derivatives, performs online updates of the parameters at each time step in the data sequence. RTRL's online forward propagation allows for exact optimization over extremely long data sequences, although it can be computationally costly for models with large numbers of parameters. We prove convergence of the RTRL algorithm for a class of RNNs. The convergence analysis establishes a fixed point for the joint distribution of the data sequence, RNN hidden layer, and the RNN hidden layer forward derivatives as the number of data samples from the sequence and the number of training steps tend to infinity. We prove convergence of the RTRL algorithm to a stationary point of the loss. Numerical studies illustrate our theoretical results. One potential application area for RTRL is the analysis of financial data, which typically involve long time series and models with small to medium numbers of parameters. This makes RTRL computationally tractable and a potentially appealing optimization method for training models. Thus, we include an example of RTRL applied to limit order book data.
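
Concretely, for a recurrent update $h_t = f(h_{t-1}, x_t; \theta)$ the forward derivatives that RTRL propagates satisfy the standard recursion (generic form, not specific to the class of RNNs analysed in the paper):

$$ \frac{\partial h_t}{\partial \theta} \;=\; \frac{\partial f}{\partial h_{t-1}}\big(h_{t-1}, x_t; \theta\big)\,\frac{\partial h_{t-1}}{\partial \theta} \;+\; \frac{\partial f}{\partial \theta}\big(h_{t-1}, x_t; \theta\big), $$

so the parameters can be updated online at every step via the chain rule through the current loss, with no backward pass over the sequence.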


[43] 2501.08149

Multiple-Input Variational Auto-Encoder for Anomaly Detection in Heterogeneous Data

Anomaly detection (AD) plays a pivotal role in AI applications, e.g., in classification, and intrusion/threat detection in cybersecurity. However, most existing methods face challenges from the heterogeneity amongst feature subsets posed by non-independent and identically distributed (non-IID) data. We propose a novel neural network model called Multiple-Input Auto-Encoder for AD (MIAEAD) to address this. MIAEAD assigns an anomaly score to each feature subset of a data sample to indicate its likelihood of being an anomaly, using the reconstruction error of its sub-encoder as the anomaly score. All sub-encoders are then simultaneously trained using unsupervised learning to determine the anomaly scores of feature subsets. The final AUC of MIAEAD is calculated for each sub-dataset, and the maximum AUC obtained among the sub-datasets is selected. To leverage the ability of generative models to model the distribution of normal data for identifying anomalies, we develop a novel neural network architecture/model called Multiple-Input Variational Auto-Encoder (MIVAE). MIVAE processes feature subsets through its sub-encoders before learning the distribution of normal data in the latent space. This allows MIVAE to identify anomalies that deviate from the learned distribution. We theoretically prove that the difference in the average anomaly score between normal samples and anomalies obtained by the proposed MIVAE is greater than that of the Variational Auto-Encoder (VAEAD), resulting in a higher AUC for MIVAE. Extensive experiments on eight real-world anomaly datasets demonstrate the superior performance of MIAEAD and MIVAE over conventional methods and state-of-the-art unsupervised models, by up to 6% in terms of AUC score. Alternatively, MIAEAD and MIVAE have a high AUC when applied to feature subsets with low heterogeneity based on the coefficient of variation (CV) score.


[44] 2501.08150

Evaluating Policy Effects through Network Dynamics and Sampling

In the process of enacting or introducing a new policy, policymakers frequently consider the population's responses. These considerations are critical for effective governance. There are numerous methods to gauge the ground sentiment from a subset of the population; examples include surveys or listening to various feedback channels. Many conventional approaches implicitly assume that opinions are static; however, in reality, the population will discuss and debate these new policies among themselves, and reform new opinions in the process. In this paper, we pose the following questions: Can we quantify the effect of these social dynamics on the broader opinion towards a new policy? Given some information about the relationship network that underlies the population, how does overall opinion change post-discussion? We investigate three different settings in which the policy is revealed: respondents who do not know each other, groups of respondents who all know each other, and respondents chosen randomly. By controlling who the policy is revealed to, we control the degree of discussion among the population. We quantify how these factors affect the changes in policy beliefs via the Wasserstein distance between the empirically observed data post-discussion and its distribution pre-discussion. We also provide several numerical analyses based on generated network and real-life network datasets. Our work aims to address the challenges associated with network topology and social interactions, and provide policymakers with a quantitative lens to assess policy effectiveness in the face of resource constraints and network complexities.


[45] 2501.08202

Data-driven system identification using quadratic embeddings of nonlinear dynamics

We propose a novel data-driven method called QENDy (Quadratic Embedding of Nonlinear Dynamics) that not only allows us to learn quadratic representations of highly nonlinear dynamical systems, but also to identify the governing equations. The approach is based on an embedding of the system into a higher-dimensional feature space in which the dynamics become quadratic. Just like SINDy (Sparse Identification of Nonlinear Dynamics), our method requires trajectory data, time derivatives for the training data points (which can also be estimated using finite difference approximations), and a set of preselected basis functions, called the dictionary. We illustrate the efficacy and accuracy of QENDy with the aid of various benchmark problems and compare its performance with SINDy and a deep learning method for identifying quadratic embeddings. Furthermore, we analyze the convergence of QENDy and SINDy in the infinite data limit, highlight their similarities and main differences, and compare the quadratic embedding with linearization techniques based on the Koopman operator.


[46] 2501.08223

Big Batch Bayesian Active Learning by Considering Predictive Probabilities

We observe that BatchBALD, a popular acquisition function for batch Bayesian active learning for classification, can conflate epistemic and aleatoric uncertainty, leading to suboptimal performance. Motivated by this observation, we propose to focus on the predictive probabilities, which only exhibit epistemic uncertainty. The result is an acquisition function that not only performs better, but is also faster to evaluate, allowing for larger batches than before.


[47] 2501.08263

Multiplayer Federated Learning: Reaching Equilibrium with Less Communication

Traditional Federated Learning (FL) approaches assume collaborative clients with aligned objectives working towards a shared global model. However, in many real-world scenarios, clients act as rational players with individual objectives and strategic behaviors, a concept that existing FL frameworks are not equipped to adequately address. To bridge this gap, we introduce Multiplayer Federated Learning (MpFL), a novel framework that models the clients in the FL environment as players in a game-theoretic context, aiming to reach an equilibrium. In this scenario, each player tries to optimize their own utility function, which may not align with the collective goal. Within MpFL, we propose Per-Player Local Stochastic Gradient Descent (PEARL-SGD), an algorithm in which each player/client performs local updates independently and periodically communicates with other players. We theoretically analyze PEARL-SGD and prove that it reaches a neighborhood of equilibrium with less communication in the stochastic setup compared to its non-local counterpart. Finally, we verify our theoretical findings through numerical experiments.


[48] 2501.08317

A Similarity Measure Between Functions with Applications to Statistical Learning and Optimization

In this note, we present a novel measure of similarity between two functions. It quantifies how the sub-optimality gaps of two functions convert to each other, and unifies several existing notions of functional similarity. We show that it has convenient operation rules, and illustrate its use in empirical risk minimization and non-stationary online optimization.


[49] 2501.08330

Gradient Equilibrium in Online Learning: Theory and Applications

We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by nor implies sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
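
A minimal sketch of the debiasing idea described above, assuming squared loss and a single additive correction updated by constant-step online gradient descent; the black-box predictions and their drifting bias are simulated, so this illustrates the mechanism rather than the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
y = rng.normal(size=T)
f = y + 1.5 + 0.001 * np.arange(T) + rng.normal(scale=0.2, size=T)  # biased black box

eta, c = 0.05, 0.0
residuals = []
for t in range(T):
    pred = f[t] - c                 # post hoc corrected prediction
    grad = pred - y[t]              # gradient of 0.5 * (pred - y)^2 w.r.t. the prediction
    residuals.append(grad)
    c += eta * grad                 # constant-step online gradient descent on the correction

# Gradient equilibrium: the running average of the loss gradients drifts toward zero,
# i.e. the corrected predictions become unbiased on average despite the distribution shift.
print(np.mean(residuals[:200]), np.mean(residuals))
```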