New articles on Statistics


[1] 2504.16185

Behavior of prediction performance metrics with rare events

Area under the receiver operating characteristic curve (AUC) is commonly reported alongside binary prediction models. However, there are concerns that AUC might be a misleading measure of prediction performance in the rare event setting. This setting is common since many events of clinical importance are rare events. We conducted a simulation study to determine when or whether AUC is unstable in the rare event setting. Specifically, we aimed to determine whether the bias and variance of AUC are driven by the number of events or the event rate. We also investigated the behavior of other commonly used measures of prediction performance, including positive predictive value, accuracy, sensitivity, and specificity. Our results indicate that poor AUC behavior -- as measured by empirical bias, variability of cross-validated AUC estimates, and empirical coverage of confidence intervals -- is driven by the minimum class size, not event rate. Performance of sensitivity is driven by the number of events, while that of specificity is driven by the number of non-events. Other measures, including positive predictive value and accuracy, depend on the event rate even in large samples. AUC is reliable in the rare event setting provided that the total number of events is moderately large.
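To make the comparison concrete, here is a minimal simulation sketch in the spirit of the study (not the authors' code): the event rate is held at 2% while the sample size, and hence the number of events, grows, and the variability of cross-validated AUC is tracked. The logistic data-generating model, sample sizes, and learner are illustrative assumptions.

```python
# Minimal sketch: AUC variability at a fixed event rate as the number of events grows.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate_auc(n, event_rate, n_reps=100):
    aucs = []
    for _ in range(n_reps):
        x = rng.normal(size=(n, 1))
        # intercept chosen so the marginal event rate is roughly `event_rate`
        intercept = np.log(event_rate / (1 - event_rate))
        p = 1 / (1 + np.exp(-(intercept + 1.0 * x[:, 0])))
        y = rng.binomial(1, p)
        if y.sum() < 5 or (1 - y).sum() < 5:
            continue  # skip degenerate replicates
        scores = cross_val_predict(LogisticRegression(), x, y, cv=5,
                                   method="predict_proba")[:, 1]
        aucs.append(roc_auc_score(y, scores))
    return np.mean(aucs), np.std(aucs)

for n in (500, 2000, 10000):   # same event rate, growing number of events
    mean_auc, sd_auc = simulate_auc(n, event_rate=0.02)
    print(f"n={n:6d}  mean cvAUC={mean_auc:.3f}  SD of cvAUC={sd_auc:.3f}")
```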


[2] 2504.16186

Analogy making as the basis of statistical inference

Standard statistical theory has arguably proved to be unsuitable as a basis for constructing a satisfactory completely general framework for performing statistical inference. For example, frequentist theory has never come close to providing such a general inferential framework, which is not only attributable to the question surrounding the soundness of this theory, but also to its focus on attempting to address the problem of how to perform statistical inference only in certain special cases. Also, theories of inference that are grounded in the idea of deducing sample-based inferences about populations of interest from a given set of universally acceptable axioms, e.g. many theories that aim to justify Bayesian inference and theories of imprecise probability, suffer from the difficulty of finding such axioms that are weak enough to be widely acceptable, but strong enough to lead to methods of inference that can be regarded as being efficient. These observations justify the need to look for an alternative means by which statistical inference may be performed, and in particular, to explore the one that is offered by analogy making. What is presented here goes down this path. To be clear, this is done in a way that does not simply endorse the common use of analogy making as a supplementary means of understanding how statistical methods work, but formally develops analogy making as the foundation of a general framework for performing statistical inference. In the latter part of the paper, the use of this framework is illustrated by applying some of the most important analogies contained within it to a relatively simple but arguably still unresolved problem of statistical inference, which naturally leads to the proposal of an original way of addressing issues related to Bartlett's and Lindley's paradoxes.


[3] 2504.16216

Cohort Revenue & Retention Analysis: A Bayesian Approach

We present a Bayesian approach to model cohort-level retention rates and revenue over time. We use Bayesian additive regression trees (BART) to model the retention component which we couple with a linear model for the revenue component. This method is flexible enough to allow adding additional covariates to both model components. This Bayesian framework allows us to quantify uncertainty in the estimation, understand the effect of covariates on retention through partial dependence plots (PDP) and individual conditional expectation (ICE) plots, and most importantly, forecast future revenue and retention rates with well-calibrated uncertainty through highest density intervals. We also provide alternative approaches to model the retention component using neural networks and inference through stochastic variational inference.
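As a rough illustration of the model structure (not the paper's implementation), the sketch below couples a Bayesian retention component with a linear revenue component in PyMC, substituting a simple logit-linear predictor for BART; the cohort data, priors, and sampler settings are made up for the example.

```python
# Illustrative sketch only: logit-linear retention (stand-in for BART) + linear revenue.
import numpy as np
import pymc as pm
import arviz as az

# toy cohort data: cohort age in months, users at risk, retained users, cohort revenue
cohort_age = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4])
n_users    = np.array([100, 100, 100, 100, 100, 100, 80, 80, 80, 80])
retained   = np.array([70, 55, 48, 44, 41, 40, 60, 47, 40, 37])
revenue    = np.array([350., 280., 240., 220., 205., 200., 300., 235., 200., 185.])

with pm.Model() as model:
    # retention component (the paper uses BART here; a linear predictor is a simplification)
    a = pm.Normal("a", 0.0, 2.0)
    b = pm.Normal("b", 0.0, 2.0)
    p = pm.Deterministic("p", pm.math.sigmoid(a + b * cohort_age))
    pm.Binomial("retained_obs", n=n_users, p=p, observed=retained)

    # revenue component: linear in the expected number of active users
    gamma = pm.Normal("gamma", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 50.0)
    pm.Normal("revenue_obs", mu=gamma * p * n_users, sigma=sigma, observed=revenue)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# posterior summaries with highest-density intervals
print(az.summary(idata, var_names=["a", "b", "gamma"], hdi_prob=0.94))
```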


[4] 2504.16230

Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria

Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. It is typically the case that patients with incomplete eligibility information are excluded from analysis without consideration of (implicit) assumptions that are being made, leaving study conclusions subject to potential selection bias. In an effort to ascertain eligibility for more patients, researchers may look back further in time prior to study baseline, and, in using outdated values of eligibility-defining covariates, may inappropriately include individuals who, unbeknownst to the researcher, fail to meet eligibility at baseline. To the best of our knowledge, however, very little work has been done to mitigate these concerns. We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference. EHR data from Kaiser Permanente are used as motivation as well as a basis for extensive simulations that verify robustness properties under various degrees of model misspecification. The data are also used to demonstrate the use of the method to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.
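For orientation, the sketch below shows a standard doubly robust (augmented inverse-probability-weighted) estimator of the ATT on simulated, fully observed data; the paper's contribution (handling missingness in eligibility-defining covariates with flexible machine-learning nuisance estimates) is not reproduced here, and the data-generating process is invented.

```python
# Standard doubly robust ATT estimator on simulated complete data (reference point only).
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
e = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))                  # true propensity score
A = rng.binomial(1, e)
Y = 1.0 * A + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)  # true ATT = 1

e_hat = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
mu0_hat = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)   # outcome model among controls

p_treated = A.mean()
att_dr = np.mean(A * (Y - mu0_hat)
                 - (1 - A) * e_hat / (1 - e_hat) * (Y - mu0_hat)) / p_treated
print(f"doubly robust ATT estimate: {att_dr:.3f}  (true 1.0)")
```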


[5] 2504.16244

Accounting for spillover when using the augmented synthetic control method: estimating the effect of localized COVID-19 lockdowns in Chile

The implementation of public health policies, particularly in a single or small set of units (e.g., regions), can create complex dynamics with effects extending beyond the directly treated areas. This paper examines the direct effect of COVID-19 lockdowns in Chile on the comunas where they were enacted and the spillover effect from neighboring comunas. To draw inference about these effects, the Augmented Synthetic Control Method (ASCM) is extended to account for interference between neighboring units by introducing a stratified control framework. Specifically, Ridge ASCM with stratified controls (ASCM-SC) is proposed to partition control units based on treatment exposure. By leveraging control units that are untreated, or treated but outside the treated unit's neighborhood, this method estimates both the direct and spillover effects of intervention on treated and neighboring units. Simulations demonstrate improved bias reduction under various data-generating processes. ASCM-SC is applied to estimate the direct and total (direct + indirect) effects of COVID-19 lockdowns in Chile at the start of the COVID-19 pandemic. This method provides a more flexible approach for estimating the effects of public health interventions in settings with interference.
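A bare-bones sketch of the underlying idea, on made-up data: ridge weights are fit on pre-period outcomes for two separate donor pools (unaffected far-away controls vs. neighbors that receive spillover), and the post-period gaps show how the choice of pool changes what is estimated. This is a simplification for intuition only, not the Ridge ASCM-SC estimator.

```python
# Toy ridge-based synthetic control with two donor pools (far controls vs. neighbors).
import numpy as np

rng = np.random.default_rng(1)
T, T_pre = 30, 20
post = np.r_[np.zeros(T_pre), np.ones(T - T_pre)]
common = rng.normal(size=T).cumsum()                                    # shared trend

Y_far       = common[:, None] + rng.normal(scale=0.3, size=(T, 8))                      # unaffected controls
Y_neighbors = common[:, None] + rng.normal(scale=0.3, size=(T, 5)) - 0.5 * post[:, None]  # spillover of -0.5
Y_treated   = common + rng.normal(scale=0.3, size=T) - 2.0 * post                        # direct effect of -2

def ridge_weights(y_pre, X_pre, lam=1.0):
    """Ridge regression of treated pre-period outcomes on donor pre-period outcomes."""
    n = X_pre.shape[1]
    return np.linalg.solve(X_pre.T @ X_pre + lam * np.eye(n), X_pre.T @ y_pre)

def post_gap(y, X, lam=1.0):
    w = ridge_weights(y[:T_pre], X[:T_pre], lam)
    return (y[T_pre:] - X[T_pre:] @ w).mean()

print("gap vs. far controls (total effect, approx. -2):   ", round(post_gap(Y_treated, Y_far), 2))
print("gap vs. neighbors (spillover absorbed, approx. -1.5):", round(post_gap(Y_treated, Y_neighbors), 2))
```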


[6] 2504.16279

Detecting Correlation between Multiple Unlabeled Gaussian Networks

This paper studies the hypothesis testing problem to determine whether m > 2 unlabeled graphs with Gaussian edge weights are correlated under a latent permutation. Previously, a sharp detection threshold for the correlation parameter $\rho$ was established by Wu, Xu and Yu for this problem when m = 2. Here, their result is leveraged to derive necessary and sufficient conditions for general m. In doing so, an interval for $\rho$ is uncovered for which detection is impossible using 2 graphs alone but becomes possible with m > 2 graphs.


[7] 2504.16356

Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees

Graphical models are widely used in diverse application domains to model the conditional dependencies amongst a collection of random variables. In this paper, we consider settings where the graph structure is covariate-dependent, and investigate a deep neural network-based approach to estimate it. The method allows for flexible functional dependency on the covariate, and fits the data reasonably well in the absence of a Gaussianity assumption. Theoretical results with PAC guarantees are established for the method, under assumptions commonly used in an Empirical Risk Minimization framework. The performance of the proposed method is evaluated on several synthetic data settings and benchmarked against existing approaches. The method is further illustrated on real datasets involving data from neuroscience and finance, respectively, and produces interpretable results.


[8] 2504.16435

Toward a Principled Workflow for Prevalence Mapping Using Household Survey Data

Understanding the prevalence of key demographic and health indicators in small geographic areas and domains is of global interest, especially in low- and middle-income countries (LMICs), where vital registration data is sparse and household surveys are the primary source of information. Recent advances in computation and the increasing availability of spatially detailed datasets have led to much progress in sophisticated statistical modeling of prevalence. As a result, high-resolution prevalence maps for many indicators are routinely produced in the literature. However, statistical and practical guidance for producing prevalence maps in LMICs has been largely lacking. In particular, advice on choosing and evaluating models and interpreting results is needed, especially when data is limited. Software and analysis tools that would allow researchers in low-resource settings to conduct their own analyses or reproduce findings in the literature are also usually inaccessible. In this paper, we propose a general workflow for prevalence mapping using household survey data. We consider all stages of the analysis pipeline, with particular emphasis on model choice and interpretation. We illustrate the proposed workflow using a case study mapping the proportion of pregnant women who had at least four antenatal care visits in Kenya. Reproducible code is provided in the Supplementary Materials and can be readily extended to a broad collection of indicators.


[9] 2504.16531

Pure Error REML for Analyzing Data from Multi-Stratum Designs

Since the dawn of response surface methodology, it has been recommended that designs include replicate points, so that pure error estimates of variance can be obtained and used to provide unbiased estimated standard errors of the effects of factors. In designs with more than one stratum, such as split-plot and split-split-plot designs, it is less obvious how pure error estimates of the variance components should be obtained, and no pure error estimates are given by the popular residual maximum likelihood (REML) method of estimation. We propose a method of pure error REML estimation of the variance components, using the full treatment model, obtained by treating each combination of factor levels as a discrete treatment. Our method is easy to implement using standard software and improved estimated standard errors of the fixed effects estimates can be obtained by applying the Kenward-Roger correction based on the pure error REML estimates. We illustrate the new method using several data sets and compare the performance of pure error REML with the standard REML method. The results are comparable when the assumed response surface model is correct, but the new method is considerably more robust in the case of model misspecification.
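To fix ideas, the classical single-stratum pure-error computation (pooling the within-replicate variability across treatment combinations) can be written in a few lines; the design and responses below are hypothetical, and the paper's REML extension to variance components in multi-stratum designs is not shown.

```python
# Classical pure-error variance estimate: pool within-group variation over replicated
# factor-level combinations, treating each combination as a discrete treatment.
import numpy as np
import pandas as pd

# hypothetical replicated two-factor design
df = pd.DataFrame({
    "A": [-1, -1, -1, 1, 1, 1, 0, 0, 0, 0],
    "B": [-1, -1,  1, -1, 1, 1, 0, 0, 0, 0],
    "y": [5.1, 5.4, 7.2, 6.0, 8.1, 8.3, 6.6, 6.9, 6.5, 6.8],
})

grouped = df.groupby(["A", "B"])["y"]
ss_pe = grouped.apply(lambda y: ((y - y.mean()) ** 2).sum()).sum()   # pure-error sum of squares
df_pe = (grouped.size() - 1).sum()                                   # pure-error degrees of freedom
print("pure-error variance estimate:", ss_pe / df_pe, "on", df_pe, "df")
```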


[10] 2504.16535

Decentralized Quantile Regression for Feature-Distributed Massive Datasets with Privacy Guarantees

In this paper, we introduce a novel decentralized surrogate gradient-based algorithm for quantile regression in a feature-distributed setting, where global features are dispersed across multiple machines within a decentralized network. The proposed algorithm, \texttt{DSG-cqr}, utilizes a convolution-type smoothing approach to address the non-smooth nature of the quantile loss function. \texttt{DSG-cqr} is fully decentralized, conjugate-free, easy to implement, and achieves linear convergence up to statistical precision. To ensure privacy, we adopt the Gaussian mechanism to provide $(\epsilon,\delta)$-differential privacy. To overcome the exact residual calculation problem, we estimate residuals using auxiliary variables and develop a confidence interval construction method based on Wald statistics. Theoretical properties are established, and the practical utility of the methods is also demonstrated through extensive simulations and a real-world data application.
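A minimal, single-machine sketch of convolution-smoothed quantile regression fit by gradient descent, to illustrate why the smoothing matters: the smoothed check loss has a well-defined gradient. The DSG-cqr algorithm additionally splits features across a decentralized network, uses surrogate gradients, and adds Gaussian noise for differential privacy; none of that is reproduced here, and the kernel, bandwidth, and step size below are assumptions.

```python
# Gradient descent on a Gaussian-kernel-smoothed quantile (check) loss.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, tau, h = 2000, 4, 0.3, 0.3
X = np.c_[np.ones(n), rng.normal(size=(n, p))]          # intercept + covariates
beta_true = np.array([0.0, 1.0, -0.5, 0.0, 2.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)        # heavy-tailed errors

def smoothed_grad(beta):
    r = y - X @ beta
    psi = tau - (1.0 - norm.cdf(r / h))                  # smoothed check-loss score function
    return -(X * psi[:, None]).mean(axis=0)

beta = np.zeros(p + 1)
for _ in range(500):
    beta -= 0.5 * smoothed_grad(beta)
# the intercept absorbs the tau-quantile of the noise; the slopes recover beta_true
print("estimated coefficients:", np.round(beta, 2))
```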


[11] 2504.16555

Confidence Sequences for Generalized Linear Models via Regret Analysis

We develop a methodology for constructing confidence sets for parameters of statistical models via a reduction to sequential prediction. Our key observation is that for any generalized linear model (GLM), one can construct an associated game of sequential probability assignment such that achieving low regret in the game implies a high-probability upper bound on the excess likelihood of the true parameter of the GLM. This allows us to develop a scheme that we call online-to-confidence-set conversions, which effectively reduces the problem of proving the desired statistical claim to an algorithmic question. We study two varieties of this conversion scheme: 1) analytical conversions that only require proving the existence of algorithms with low regret and provide confidence sets centered at the maximum-likelihood estimator; and 2) algorithmic conversions that actively leverage the output of the online algorithm to construct confidence sets (and may be centered at other, adaptively constructed point estimators). The resulting methodology recovers all state-of-the-art confidence set constructions within a single framework, and also provides several new types of confidence sets that were previously unknown in the literature.
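The generic conversion can be illustrated schematically (this is the well-known idea, with a placeholder follow-the-leader learner and a grid search; it is not the paper's specific constructions): run any sequential probability assignment, record its cumulative log-loss, and keep every parameter whose cumulative log-loss exceeds the learner's by at most log(1/delta). A low-regret learner then makes the resulting set small.

```python
# Schematic online-to-confidence-set conversion for a 1-D logistic model.
import numpy as np

rng = np.random.default_rng(0)
theta_star, T, delta = 1.2, 500, 0.05
x = rng.uniform(-2, 2, size=T)
y = rng.binomial(1, 1 / (1 + np.exp(-theta_star * x)))

def logloss(theta, xt, yt):
    z = theta * xt
    return np.log1p(np.exp(z)) - yt * z            # per-round logistic log-loss

grid = np.linspace(-4, 4, 401)                     # candidate parameters
cum_loss_grid = np.zeros_like(grid)
cum_loss_alg = 0.0
theta_hat = 0.0                                    # learner's current parameter
for t in range(T):
    cum_loss_alg += logloss(theta_hat, x[t], y[t])   # learner is charged before seeing y[t+1:]
    cum_loss_grid += logloss(grid, x[t], y[t])
    theta_hat = grid[np.argmin(cum_loss_grid)]       # follow-the-leader update (placeholder)

# parameters whose cumulative loss exceeds the learner's by at most log(1/delta)
conf_set = grid[cum_loss_grid <= cum_loss_alg + np.log(1 / delta)]
print(f"confidence interval for theta: [{conf_set.min():.2f}, {conf_set.max():.2f}]  (true {theta_star})")
```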


[12] 2504.16623

Censored lifespans in a double-truncated sample: Maximum likelihood inference for the exponential distribution

The analysis of a truncated sample can be hindered by censoring. Survival information may be lost to follow-up or the birthdate may be missing. The data can still be modeled as a truncated point process and it is close to a Poisson process, in the Hellinger distance, as long as the sample is small relative to the population. We assume an exponential distribution for the lifespan, derive the likelihood and profile out the unobservable sample size. Identification of the exponential parameter is shown, together with consistency and asymptotic normality of its M-estimator. Even though the estimator sequence is indexed in the sample size, both the point estimator and the standard error are observable. Enterprise lifespans in Germany constitute our example.
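As a point of comparison, the closed-form maximum likelihood estimate for the simpler left-truncated, right-censored exponential model is sketched below; the paper's double-truncation setup and the profiling of the unobservable sample size are not reproduced, and the simulated "enterprise" lifespans are invented.

```python
# Left-truncated, right-censored exponential MLE: deaths divided by exposure time.
import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 0.1, 5000
birth = rng.uniform(0, 50, n)                    # e.g. founding time
life  = rng.exponential(1 / lam_true, n)
death = birth + life

# observation window [40, 60]: left truncation at the window start, censoring at the end
in_sample = death > 40
entry = np.maximum(birth[in_sample], 40.0)
exit_ = np.minimum(death[in_sample], 60.0)
event = death[in_sample] <= 60                   # observed death vs. right-censored

lam_hat = event.sum() / (exit_ - entry).sum()    # events / exposure
se = lam_hat / np.sqrt(event.sum())              # standard SE for the exponential rate
print(f"lambda_hat = {lam_hat:.4f}  (true {lam_true}),  SE = {se:.4f}")
```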


[13] 2504.16780

Linear Regression Using Hilbert-Space-Valued Covariates with Unknown Reproducing Kernel

We present a new method of linear regression based on principal components using Hilbert-space-valued covariates with unknown reproducing kernels. We develop a computationally efficient approach to estimation and derive asymptotic theory for the regression parameter estimates under mild assumptions. We demonstrate the approach in simulation studies as well as in data analysis using two-dimensional brain images as predictors.
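A bare-bones principal-component regression sketch with simulated image covariates, to show the estimation pipeline (project onto leading components, then regress); the Hilbert-space formulation with an unknown reproducing kernel and the asymptotic theory are not reproduced, and the low-rank image model is an assumption made for illustration.

```python
# Principal-component regression with vectorised image covariates.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, h, w, k = 300, 16, 16, 5
basis  = rng.normal(size=(k, h, w))                     # latent spatial patterns
latent = rng.normal(size=(n, k))
images = np.tensordot(latent, basis, axes=1) + 0.1 * rng.normal(size=(n, h, w))
y = latent @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

X = images.reshape(n, -1)                               # vectorise each image
scores = PCA(n_components=k).fit_transform(X)           # leading principal component scores
fit = LinearRegression().fit(scores, y)
print("in-sample R^2 of the principal-component regression:", round(fit.score(scores, y), 3))
```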


[14] 2504.16864

Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations

In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates -- or, respectively, to outcomes given covariates -- if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions -- functional ANOVA and Accumulated Local Effects -- can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide a characterization of when functional ANOVA misattributes, as well as a general property that any discrete decomposition must satisfy to avoid misattribution. We show that if the decomposition is independent of its input distribution, it does not misattribute. We further conjecture that misattribution arises in any reasonable additive decomposition that depends on the distribution of the covariates.
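For reference, the linear KOB decomposition that the paper takes as its starting point can be computed in a few lines: the gap in mean outcomes splits exactly into a part explained by covariate differences and a part due to differing coefficients. The data below are simulated for illustration.

```python
# Standard Kitagawa-Oaxaca-Blinder decomposition on simulated data.
import numpy as np

rng = np.random.default_rng(0)
nA, nB = 1000, 1000
XA = np.c_[np.ones(nA), rng.normal(1.0, 1.0, nA)]     # population A: intercept + covariate
XB = np.c_[np.ones(nB), rng.normal(0.0, 1.0, nB)]     # population B
yA = XA @ np.array([1.0, 2.0]) + rng.normal(size=nA)
yB = XB @ np.array([0.5, 1.5]) + rng.normal(size=nB)

betaA, *_ = np.linalg.lstsq(XA, yA, rcond=None)
betaB, *_ = np.linalg.lstsq(XB, yB, rcond=None)
xbarA, xbarB = XA.mean(axis=0), XB.mean(axis=0)

gap         = yA.mean() - yB.mean()
explained   = (xbarA - xbarB) @ betaB                 # attributable to covariate differences
unexplained = xbarA @ (betaA - betaB)                 # attributable to coefficient differences
# with intercepts, gap == explained + unexplained up to floating point
print(f"gap={gap:.2f}  covariates={explained:.2f}  coefficients={unexplained:.2f}")
```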


[15] 2504.16912

Definition, Identification, and Estimation of the Direct and Indirect Number Needed to Treat

The number needed to treat (NNT) is an efficacy and effect size measure commonly used in epidemiological studies and meta-analyses. The NNT was originally defined as the average number of patients needed to be treated to observe one less adverse effect. In this study, we introduce the novel direct and indirect number needed to treat (DNNT and INNT, respectively). The DNNT and the INNT are efficacy measures defined as the average number of patients that need to be treated to benefit from the treatment's direct and indirect effects, respectively. We start by formally defining these measures using nested potential outcomes. Next, we formulate the conditions for the identification of the DNNT and INNT, as well as for the direct and indirect number needed to expose (DNNE and INNE, respectively) and the direct and indirect exposure impact number (DEIN and IEIN, respectively) in observational studies. We then present an estimation method with two analytical examples, followed by a corresponding simulation study. The simulation study illustrates that the estimators of the novel indices are consistent, and their analytical confidence intervals meet the nominal coverage rates.
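For context, the classical NNT is simply the reciprocal of the risk difference, as in the toy calculation below (with invented event probabilities); the direct and indirect versions introduced in the paper rely on nested potential outcomes and are not reproduced here.

```python
# Classical NNT = 1 / risk difference, on a simulated randomized comparison.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treated = rng.binomial(1, 0.5, n)
# hypothetical adverse-event probabilities: 0.20 untreated, 0.12 treated
event = rng.binomial(1, np.where(treated == 1, 0.12, 0.20))

p_treated   = event[treated == 1].mean()
p_untreated = event[treated == 0].mean()
risk_difference = p_untreated - p_treated
nnt = 1.0 / risk_difference
print(f"risk difference = {risk_difference:.3f},  NNT ~ {nnt:.1f}")
```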


[16] 2303.13865

Compositionality in algorithms for smoothing

Backward Filtering Forward Guiding (BFFG) is a bidirectional algorithm proposed in Mider et al. [2021] and studied in more depth, in a general setting, in Van der Meulen and Schauer [2022]. In category theory, optics have been proposed for modelling systems with bidirectional data flow. We connect BFFG with optics by demonstrating that the forward and backward maps together define a functor from a category of Markov kernels into a category of optics, which can furthermore be lax monoidal under further assumptions.


[17] 2504.16100

Towards Accurate Forecasting of Renewable Energy: Building Datasets and Benchmarking Machine Learning Models for Solar and Wind Power in France

Accurate prediction of non-dispatchable renewable energy sources is essential for grid stability and price prediction. Regional power supply forecasts are usually obtained indirectly through a bottom-up aggregation of plant-level forecasts; they incorporate lagged power values and do not exploit the potential of spatially resolved data. This study presents a comprehensive methodology for predicting solar and wind power production at country scale in France using machine learning models trained with spatially explicit weather data combined with spatial information about production site capacity. A dataset is built spanning from 2012 to 2023, using daily power production data from RTE (the national grid operator) as the target variable, with daily weather data from ERA5, production site capacity and location, and electricity prices as input features. Three modeling approaches are explored to handle spatially resolved weather data: spatial averaging over the country, dimension reduction through principal component analysis, and a computer vision architecture to exploit complex spatial relationships. The study benchmarks state-of-the-art machine learning models as well as hyperparameter tuning approaches based on cross-validation methods on daily power production data. Results indicate that cross-validation tailored to time series is best suited to reach low error. We found that neural networks tend to outperform traditional tree-based models, which face challenges in extrapolation due to the increasing renewable capacity over time. Model performance ranges from 4% to 10% in nRMSE for the midterm horizon, achieving error metrics similar to those of local models established at the single-plant level, highlighting the potential of these methods for regional power supply forecasting.
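A compressed sketch of two of the ingredients described above, with placeholder data shapes and an arbitrary model choice rather than the paper's benchmark configuration: PCA reduction of a gridded weather field and evaluation with a time-series-aware cross-validation split.

```python
# Toy pipeline: PCA on gridded weather + gradient boosting, evaluated with TimeSeriesSplit.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_days, grid_cells = 2000, 500
weather = rng.normal(size=(n_days, grid_cells))           # e.g. daily irradiance on a grid
capacity = np.linspace(5, 15, n_days)[:, None]            # growing installed capacity
X = np.hstack([weather, capacity])
y = (weather[:, :50].clip(min=0).mean(axis=1) * capacity[:, 0]
     + rng.normal(scale=0.5, size=n_days))                # toy "national production"

model = make_pipeline(PCA(n_components=30), HistGradientBoostingRegressor())
errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.sqrt(mean_squared_error(y[test_idx], pred)) / y[test_idx].mean())
print("nRMSE per fold:", np.round(errors, 3))
```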


[18] 2504.16172

Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning

High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases SciML predictions during inference by enforcing physical laws. SCaSML leverages newly derived physical laws that quantify systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the predictions. Both numerical and theoretical analyses confirm enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximate solutions to high-dimensional PDEs during inference. Code for SCaSML is available at https://github.com/Francis-Fan-create/SCaSML.
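To make the role of the stochastic representation concrete, here is a toy Feynman-Kac Monte Carlo solve of a one-dimensional heat-type equation, checked against its closed form; this only illustrates the kind of solver invoked in the correction step, not the SCaSML algorithm itself.

```python
# Feynman-Kac toy example: u_t + (1/2) u_xx = 0 with u(T, x) = g(x)
# has solution u(t, x) = E[g(x + W_{T-t})], estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
T, t, x = 1.0, 0.0, 0.5
g = np.sin                                            # terminal condition

def feynman_kac(x_query, n_paths=200_000):
    W = rng.normal(scale=np.sqrt(T - t), size=n_paths)
    return g(x_query + W).mean()

mc_estimate = feynman_kac(x)
exact = np.exp(-0.5 * (T - t)) * np.sin(x)            # closed form for g = sin
print(f"Monte Carlo: {mc_estimate:.4f}   exact: {exact:.4f}")
```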


[19] 2504.16192

Probabilistic Emulation of the Community Radiative Transfer Model Using Machine Learning

The continuous improvement in weather forecast skill over the past several decades is largely due to the increasing quantity of available satellite observations and their assimilation into operational forecast systems. Assimilating these observations requires observation operators in the form of radiative transfer models. Significant efforts have been dedicated to enhancing the computational efficiency of these models. Computational cost remains a bottleneck, and a large fraction of available data goes unused for assimilation. To address this, we used machine learning to build an efficient neural network-based probabilistic emulator of the Community Radiative Transfer Model (CRTM), applied to the GOES Advanced Baseline Imager. The trained NN emulator predicts brightness temperatures output by CRTM and the corresponding error with respect to CRTM. RMSE of the predicted brightness temperature is 0.3 K averaged across all channels. For clear sky conditions, the RMSE is less than 0.1 K for 9 out of 10 infrared channels. The error predictions are generally reliable across a wide range of conditions. Explainable AI methods demonstrate that the trained emulator reproduces the relevant physics, increasing confidence that the model will perform well when presented with new data.


[20] 2504.16270

A Geometric Approach to Problems in Optimization and Data Science

We give new results for problems in computational and statistical machine learning using tools from high-dimensional geometry and probability. We break up our treatment into two parts. In Part I, we focus on computational considerations in optimization. Specifically, we give new algorithms for approximating convex polytopes in a stream, sparsification and robust least squares regression, and dueling optimization. In Part II, we give new statistical guarantees for data science problems. In particular, we formulate a new model in which we analyze statistical properties of backdoor data poisoning attacks, and we study the robustness of graph clustering algorithms to ``helpful'' misspecification.


[21] 2504.16415

Natural Policy Gradient for Average Reward Non-Stationary RL

We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We present a dynamic regret of $\tilde{\mathscr O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon, and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate and changes in the environment.


[22] 2504.16430

MAGIC: Near-Optimal Data Attribution for Deep Learning

The goal of predictive data attribution is to estimate how adding or removing a given set of training datapoints will affect model predictions. In convex settings, this goal is straightforward (i.e., via the infinitesimal jackknife). In large-scale (non-convex) settings, however, existing methods are far less successful -- current methods' estimates often only weakly correlate with ground truth. In this work, we present a new data attribution method (MAGIC) that combines classical methods and recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.


[23] 2504.16450

An Effective Gram Matrix Characterizes Generalization in Deep Networks

We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for the fact that they are trained on different datasets. We analyze this differential equation to compute an ``effective Gram matrix'' that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial ``residual''. Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.


[24] 2504.16580

Hyper-Transforming Latent Diffusion Models

We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming, a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.


[25] 2504.16585

Enhancing Variable Selection in Large-scale Logistic Regression: Leveraging Manual Labeling with Beneficial Noise

In large-scale supervised learning, penalized logistic regression (PLR) effectively addresses the overfitting problem by introducing regularization terms, yet its performance still depends on efficient variable selection strategies. This paper theoretically demonstrates that label noise stemming from manual labeling, which is solely related to classification difficulty, represents a type of beneficial noise for variable selection in PLR. This benefit is reflected in a more accurate estimation of the selected non-zero coefficients when compared with the case where only true labels are used. Under large-scale settings, the sample size for PLR can become very large, making it infeasible to store on a single machine. In such cases, distributed computing methods are required to handle the PLR model with manual labeling. This paper presents a partition-insensitive parallel algorithm founded on the ADMM (alternating direction method of multipliers) algorithm to address PLR by incorporating manual labeling. The partition insensitivity of the proposed algorithm refers to the fact that the solutions obtained by the algorithm will not change with the distributed storage of data. In addition, the algorithm has global convergence and a sublinear convergence rate. Experimental results indicate that, compared with traditional variable selection classification techniques, PLR with manually labeled noisy data achieves higher estimation and classification accuracy across multiple large-scale datasets.


[26] 2504.16667

Representation Learning via Non-Contrastive Mutual Information

Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
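For readers unfamiliar with the starting point, a mini-batch version of the Spectral Contrastive Loss can be sketched as below; this is a generic implementation of that published loss under an assumed batch construction (two augmentations per image), not the MINC objective introduced in the paper.

```python
# Mini-batch Spectral Contrastive Loss: attract augmented pairs, penalize squared
# similarity between embeddings of independent images.
import torch

def spectral_contrastive_loss(z1, z2):
    """z1, z2: (batch, dim) embeddings of two augmentations of the same images."""
    positive = -2.0 * (z1 * z2).sum(dim=1).mean()            # positive-pair term
    gram = z1 @ z2.T                                         # all cross-pair similarities
    off_diag = gram - torch.diag(torch.diagonal(gram))       # keep only independent pairs
    n = z1.shape[0]
    negative = (off_diag ** 2).sum() / (n * (n - 1))         # squared-similarity penalty
    return positive + negative

# toy usage with random embeddings
z1, z2 = torch.randn(256, 128), torch.randn(256, 128)
print(spectral_contrastive_loss(z1, z2))
```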


[27] 2504.16682

Provable wavelet-based neural approximation

In this paper, we develop a wavelet-based theoretical framework for analyzing the universal approximation capabilities of neural networks over a wide range of activation functions. Leveraging wavelet frame theory on the spaces of homogeneous type, we derive sufficient conditions on activation functions to ensure that the associated neural network approximates any function in the given space, along with an error estimate. These sufficient conditions accommodate a variety of smooth activation functions, including those that exhibit oscillatory behavior. Furthermore, by considering the $L^2$-distance between smooth and non-smooth activation functions, we establish a generalized approximation result that is applicable to non-smooth activations, with the error explicitly controlled by this distance. This provides increased flexibility in the design of network architectures.


[28] 2504.16683

MCMC for Bayesian estimation of Differential Privacy from Membership Inference Attacks

We propose a new framework for Bayesian estimation of differential privacy, incorporating evidence from multiple membership inference attacks (MIA). Bayesian estimation is carried out via a Markov chain Monte Carlo (MCMC) algorithm, named MCMC-DP-Est, which provides an estimate of the full posterior distribution of the privacy parameter (rather than just credible intervals). Critically, the proposed method does not assume that privacy auditing is performed with the most powerful attack on the worst-case (dataset, challenge point) pair, which is typically unrealistic. Instead, MCMC-DP-Est jointly estimates the strengths of the MIAs used and the privacy of the training algorithm, yielding a more cautious privacy analysis. We also present an economical way to generate measurements of an MIA's performance for use by the MCMC method to estimate privacy. We demonstrate the methods with numerical examples on both artificial and real data.


[29] 2504.16767

Online model learning with data-assimilated reservoir computers

We propose an online learning framework for forecasting nonlinear spatio-temporal signals (fields). The method integrates (i) dimensionality reduction, here, a simple proper orthogonal decomposition (POD) projection; (ii) a generalized autoregressive model to forecast reduced dynamics, here, a reservoir computer; (iii) online adaptation to update the reservoir computer (the model), here, ensemble sequential data assimilation. We demonstrate the framework on a wake past a cylinder governed by the Navier-Stokes equations, exploring the assimilation of full flow fields (projected onto POD modes) and sparse sensors. Three scenarios are examined: a naïve physical state estimation; a two-fold estimation of physical and reservoir states; and a three-fold estimation that also adjusts the model parameters. The two-fold strategy significantly improves ensemble convergence and reduces reconstruction error compared to the naïve approach. The three-fold approach enables robust online training of partially-trained reservoir computers, overcoming limitations of a priori training. By unifying data-driven reduced order modelling with Bayesian data assimilation, this work opens new opportunities for scalable online model learning for nonlinear time series forecasting.
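Step (i) of the pipeline, the POD projection, can be sketched with a plain SVD as below; the snapshot matrix here is random noise purely for illustration, and the reservoir computer and data-assimilation steps are not shown.

```python
# POD of a snapshot matrix via the SVD, plus projection onto the leading modes.
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_snapshots, r = 1000, 200, 10
snapshots = rng.normal(size=(n_grid, n_snapshots))        # columns = flattened flow fields

mean_field = snapshots.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(snapshots - mean_field, full_matrices=False)
modes = U[:, :r]                                          # leading POD modes
coeffs = modes.T @ (snapshots - mean_field)               # reduced (r x n_snapshots) dynamics

energy = (s[:r] ** 2).sum() / (s ** 2).sum()
print(f"variance captured by {r} modes: {energy:.2%}")
reconstruction = mean_field + modes @ coeffs              # rank-r reconstruction of the fields
```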


[30] 2504.16789

MLOps Monitoring at Scale for Digital Platforms

Machine learning models are widely recognized for their strong performance in forecasting. To keep that performance in streaming data settings, they have to be monitored and frequently re-trained. This can be done with machine learning operations (MLOps) techniques under supervision of an MLOps engineer. However, in digital platform settings where the number of data streams is typically large and unstable, standard monitoring becomes either suboptimal or too labor intensive for the MLOps engineer. As a consequence, companies often fall back on very simple, worse-performing ML models without monitoring. We solve this problem by adopting a design science approach and introducing a new monitoring framework, the Machine Learning Monitoring Agent (MLMA), that is designed to work at scale for any ML model with reasonable labor cost. A key feature of our framework is test-based automated re-training based on a data-adaptive reference loss batch. The MLOps engineer is kept in the loop via key metrics and also acts, proactively or retrospectively, to maintain performance of the ML model in the production stage. We conduct a large-scale test at a last-mile delivery platform to empirically validate our monitoring framework.
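The test-based re-training idea can be illustrated with a toy trigger (the specific test, threshold, and loss distributions below are assumptions, and the MLMA's data-adaptive selection of the reference batch is not reproduced): compare a recent batch of production losses with a reference batch and flag re-training when the recent losses are significantly worse.

```python
# Toy re-training trigger: one-sided two-sample test of recent vs. reference losses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference_losses = rng.gamma(shape=2.0, scale=1.0, size=500)   # losses at deployment time
recent_losses    = rng.gamma(shape=2.0, scale=1.3, size=200)   # drifted production stream

def needs_retraining(reference, recent, alpha=0.01):
    # are recent losses stochastically larger than the reference losses?
    stat, p_value = stats.mannwhitneyu(recent, reference, alternative="greater")
    return p_value < alpha, p_value

flag, p = needs_retraining(reference_losses, recent_losses)
print(f"re-train: {flag}  (p-value = {p:.4f})")
```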