New articles on Statistics

[1] 2407.14630

Identification of changes in gene expression

Evaluating the change in gene expression is a common goal in many research areas, such as in toxicological studies as well as in clinical trials. In practice, the analysis is often based on multiple t-tests evaluated at the observed time points. This severely limits the accuracy of determining the time points at which the gene changes in expression. Even if a parametric approach is chosen, the analysis is often restricted to identifying the onset of an effect. In this paper, we propose a parametric method to identify the time frame where the gene expression significantly changes. This is achieved by fitting a parametric model to the time-response data and constructing a confidence band for its first derivative. The confidence band is derived by a flexible two step bootstrap approach, which can be applied to a wide variety of possible curves. Our method focuses on the first derivative, since it provides an easy to compute and reliable measure for the change in response. It is summarised in terms of a hypothesis test, such that rejecting the null hypothesis means detecting a significant change in gene expression. Furthermore, a method for calculating confidence intervals for time points of interest (e.g. the beginning and end of significant change) is developed. We demonstrate the validity of our approach through a simulation study and present a variety of different applications to mouse gene expression data from a study investigating the effect of a Western diet on the progression of non-alcoholic fatty liver disease.

[2] 2407.14666

A Bayesian workflow for securitizing casualty insurance risk

Casualty insurance-linked securities (ILS) are appealing to investors because the underlying insurance claims, which are directly related to resulting security performance, are uncorrelated with most other asset classes. Conversely, casualty ILS are appealing to insurers as an efficient capital managment tool. However, securitizing casualty insurance risk is non-trivial, as it requires forecasting loss ratios for pools of insurance policies that have not yet been written, in addition to estimating how the underlying losses will develop over time within future accident years. In this paper, we lay out a Bayesian workflow that tackles these complexities by using: (1) theoretically informed time-series and state-space models to capture how loss ratios develop and change over time; (2) historic industry data to inform prior distributions of models fit to individual programs; (3) stacking to combine loss ratio predictions from candidate models, and (4) both prior predictive simulations and simulation-based calibration to aid model specification. Using historic Schedule P filings, we then show how our proposed Bayesian workflow can be used to assess and compare models across a variety of key model performance metrics evaluated on future accident year losses.

[3] 2407.14703

Generalizing and transporting causal inferences from randomized trials in the presence of trial engagement effects

Trial engagement effects are effects of trial participation on the outcome that are not mediated by treatment assignment. Most work on extending (generalizing or transporting) causal inferences from a randomized trial to a target population has, explicitly or implicitly, assumed that trial engagement effects are absent, allowing evidence about the effects of the treatments examined in trials to be applied to non-experimental settings. Here, we define novel causal estimands and present identification results for generalizability and transportability analyses in the presence of trial engagement effects. Our approach allows for trial engagement effects under assumptions of no causal interaction between trial participation and treatment assignment on the absolute or relative scales. We show that under these assumptions, even in the presence of trial engagement effects, the trial data can be combined with covariate data from the target population to identify average treatment effects in the context of usual care as implemented in the target population (i.e., outside the experimental setting). The identifying observed data functionals under these no-interaction assumptions are the same as those obtained under the stronger identifiability conditions that have been invoked in prior work. Therefore, our results suggest a new interpretation for previously proposed generalizability and transportability estimators; this interpretation may be useful in analyses under causal structures where background knowledge suggests that trial engagement effects are present but interactions between trial participation and treatment are negligible.

[4] 2407.14748

Regression models for binary data with scale mixtures of centered skew-normal link functions

For the binary regression, the use of symmetrical link functions are not appropriate when we have evidence that the probability of success increases at a different rate than decreases. In these cases, the use of link functions based on the cumulative distribution function of a skewed and heavy tailed distribution can be useful. The most popular choice is some scale mixtures of skew-normal distribution. This family of distributions can have some identifiability problems, caused by the so-called direct parameterization. Also, in the binary modeling with skewed link functions, we can have another identifiability problem caused by the presence of the intercept and the skewness parameter. To circumvent these issues, in this work we proposed link functions based on the scale mixtures of skew-normal distributions under the centered parameterization. Furthermore, we proposed to fix the sign of the skewness parameter, which is a new perspective in the literature to deal with the identifiability problem in skewed link functions. Bayesian inference using MCMC algorithms and residual analysis are developed. Simulation studies are performed to evaluate the performance of the model. Also, the methodology is applied in a heart disease data.

[5] 2407.14778

Minimax estimation of functionals in sparse vector model with correlated observations

We consider the observations of an unknown $s$-sparse vector ${\boldsymbol \theta}$ corrupted by Gaussian noise with zero mean and unknown covariance matrix ${\boldsymbol \Sigma}$. We propose minimax optimal methods of estimating the $\ell_2$ norm of ${\boldsymbol \theta}$ and testing the hypothesis $H_0: {\boldsymbol \theta}=0$ against sparse alternatives when only partial information about ${\boldsymbol \Sigma}$ is available, such as an upper bound on its Frobenius norm and the values of its diagonal entries to within an unknown scaling factor. We show that the minimax rates of the estimation and testing are leveraged not by the dimension of the problem but by the value of the Frobenius norm of ${\boldsymbol \Sigma}$.

[6] 2407.14781

Bernstein-von Mises theorems for time evolution equations

We consider a class of infinite-dimensional dynamical systems driven by non-linear parabolic partial differential equations with initial condition $\theta$ modelled by a Gaussian process `prior' probability measure. Given discrete samples of the state of the system evolving in space-time, one obtains updated `posterior' measures on a function space containing all possible trajectories. We give a general set of conditions under which these non-Gaussian posterior distributions are approximated, in Wasserstein distance for the supremum-norm metric, by the law of a Gaussian random function. We demonstrate the applicability of our results to periodic non-linear reaction diffusion equations \begin{align*} \frac{\partial}{\partial t} u - \Delta u &= f(u) \\ u(0) &= \theta \end{align*} where $f$ is any smooth and compactly supported reaction function. In this case the limiting Gaussian measure can be characterised as the solution of a time-dependent Schr\"odinger equation with `rough' Gaussian initial conditions whose covariance operator we describe.

[7] 2407.14861

Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes

With the growing access to administrative health databases, retrospective studies have become crucial evidence for medical treatments. Yet, non-randomized studies frequently face selection biases, requiring mitigation strategies. Propensity score matching (PSM) addresses these biases by selecting comparable populations, allowing for analysis without further methodological constraints. However, PSM has several drawbacks. Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria. To prevent cherry-picking the best method, public authorities must involve field experts and engage in extensive discussions with researchers. To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches. A2A constructs artificial matching tasks that mirror the original ones but with known outcomes, assessing each matching method's performance comprehensively from propensity estimation to ATE estimation. When combined with Standardized Mean Difference, A2A enhances the precision of model selection, resulting in a reduction of up to 50% in ATE estimation errors across synthetic tasks and up to 90% in predicted ATE variability across both synthetic and real-world datasets. To our knowledge, A2A is the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection. Computing A2A requires solving hundreds of PSMs, we therefore automate all manual steps of the PSM pipeline. We integrate PSM methods from Python and R, our automated pipeline, a new metric, and reproducible experiments into popmatch, our new Python package, to enhance reproducibility and accessibility to bias correction methods.

[8] 2407.14976

Multiple merger coalescent inference of effective population size

Variation in a sample of molecular sequence data informs about the past evolutionary history of the sample's population. Traditionally, Bayesian modeling coupled with the standard coalescent, is used to infer the sample's bifurcating genealogy and demographic and evolutionary parameters such as effective population size, and mutation rates. However, there are many situations where binary coalescent models do not accurately reflect the true underlying ancestral processes. Here, we propose a Bayesian nonparametric method for inferring effective population size trajectories from a multifurcating genealogy under the $\Lambda-$coalescent. In particular, we jointly estimate the effective population size and model parameters for the Beta-coalescent model, a special type of $\Lambda-$coalescent. Finally, we test our methods on simulations and apply them to study various viral dynamics as well as Japanese sardine population size changes over time. The code and vignettes can be found in the phylodyn package.

[9] 2407.14989

Nonparametric Estimation of Ordinary Differential Equations: Snake and Stubble

We study nonparametric estimation in dynamical systems described by ordinary differential equations (ODEs). Specifically, we focus on estimating the unknown function $f \colon \mathbb{R}^d \to \mathbb{R}^d$ that governs the system dynamics through the ODE $\dot{u}(t) = f(u(t))$, where observations $Y_{j,i} = u_j(t_{j,i}) + \varepsilon_{j,i}$ of solutions $u_j$ of the ODE are made at times $t_{j,i}$ with independent noise $\varepsilon_{j,i}$. We introduce two novel models -- the Stubble model and the Snake model -- to mitigate the issue of observation location dependence on $f$, an inherent difficulty in nonparametric estimation of ODE systems. In the Stubble model, we observe many short solutions with initial conditions that adequately cover the domain of interest. Here, we study an estimator based on multivariate local polynomial regression and univariate polynomial interpolation. In the Snake model we observe few long trajectories that traverse the domain on interest. Here, we study an estimator that combines univariate local polynomial estimation with multivariate polynomial interpolation. For both models, we establish error bounds of order $n^{-\frac{\beta}{2(\beta +1)+d}}$ for $\beta$-smooth functions $f$ in an infinite dimensional function class of H\"older-type and establish minimax optimality for the Stubble model in general and for the Snake model under some conditions via comparison to lower bounds from parallel work.

[10] 2407.14993

Lower Bounds for Nonparametric Estimation of Ordinary Differential Equations

We noisily observe solutions of an ordinary differential equation $\dot u = f(u)$ at given times, where $u$ lives in a $d$-dimensional state space. The model function $f$ is unknown and belongs to a H\"older-type smoothness class with parameter $\beta$. For the nonparametric problem of estimating $f$, we provide lower bounds on the error in two complementary model specifications: the snake model with few, long observed solutions and the stubble model with many short ones. The lower bounds are minimax optimal in some settings. They depend on various parameters, which in the optimal asymptotic regime leads to the same rate for the squared error in both models: it is characterized by the exponent $-2\beta/(2(\beta+1)+d)$ for the total number of observations $n$. To derive these results, we establish a master theorem for lower bounds in general nonparametric regression problems, which makes the proofs more comparable and seems to be a useful tool for future use.

[11] 2407.15084

High-dimensional log contrast models with measurement errors

High-dimensional compositional data are frequently encountered in many fields of modern scientific research. In regression analysis of compositional data, the presence of covariate measurement errors poses grand challenges for existing statistical error-in-variable regression analysis methods since measurement error in one component of the composition has an impact on others. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose a new method named Error-in-composition (Eric) Lasso for regression analysis of corrupted compositional predictors. Estimation error bounds of Eric Lasso and its asymptotic sign-consistent selection properties are established. We then illustrate the finite sample performance of Eric Lasso using simulation studies and demonstrate its potential usefulness in a real data application example.

[12] 2407.15256

Weak-instrument-robust subvector inference in instrumental variables regression: A subvector Lagrange multiplier test and properties of subvector Anderson-Rubin confidence sets

We propose a weak-instrument-robust subvector Lagrange multiplier test for instrumental variables regression. We show that it is asymptotically size-correct under a technical condition. This is the first weak-instrument-robust subvector test for instrumental variables regression to recover the degrees of freedom of the commonly used Wald test, which is not robust to weak instruments. Additionally, we provide a closed-form solution for subvector confidence sets obtained by inverting the subvector Anderson-Rubin test. We show that they are centered around a k-class estimator. Also, we show that the subvector confidence sets for single coefficients of the causal parameter are jointly bounded if and only if Anderson's likelihood-ratio test rejects the hypothesis that the first-stage regression parameter is of reduced rank, that is, that the causal parameter is not identified. Finally, we show that if a confidence set obtained by inverting the Anderson-Rubin test is bounded and nonempty, it is equal to a Wald-based confidence set with a data-dependent confidence level. We explicitly compute this Wald-based confidence test.

[13] 2407.15276

Nonlinear Binscatter Methods

Binned scatter plots are a powerful statistical tool for empirical work in the social, behavioral, and biomedical sciences. Available methods rely on a quantile-based partitioning estimator of the conditional mean regression function to primarily construct flexible yet interpretable visualization methods, but they can also be used to estimate treatment effects, assess uncertainty, and test substantive domain-specific hypotheses. This paper introduces novel binscatter methods based on nonlinear, possibly nonsmooth M-estimation methods, covering generalized linear, robust, and quantile regression models. We provide a host of theoretical results and practical tools for local constant estimation along with piecewise polynomial and spline approximations, including (i) optimal tuning parameter (number of bins) selection, (ii) confidence bands, and (iii) formal statistical tests regarding functional form or shape restrictions. Our main results rely on novel strong approximations for general partitioning-based estimators covering random, data-driven partitions, which may be of independent interest. We demonstrate our methods with an empirical application studying the relation between the percentage of individuals without health insurance and per capita income at the zip-code level. We provide general-purpose software packages implementing our methods in Python, R, and Stata.

[14] 2407.15297

Distributional limits of graph cuts on discretized grids

Graph cuts are among the most prominent tools for clustering and classification analysis. While intensively studied from geometric and algorithmic perspectives, graph cut-based statistical inference still remains elusive to a certain extent. Distributional limits are fundamental in understanding and designing such statistical procedures on randomly sampled data. We provide explicit limiting distributions for balanced graph cuts in general on a fixed but arbitrary discretization. In particular, we show that Minimum Cut, Ratio Cut and Normalized Cut behave asymptotically as the minimum of Gaussians as sample size increases. Interestingly, our results reveal a dichotomy for Cheeger Cut: The limiting distribution of the optimal objective value is the minimum of Gaussians only when the optimal partition yields two sets of unequal volumes, while otherwise the limiting distribution is the minimum of a random mixture of Gaussians. Further, we show the bootstrap consistency for all types of graph cuts by utilizing the directional differentiability of cut functionals. We validate these theoretical findings by Monte Carlo experiments, and examine differences between the cuts and the dependency on the underlying distribution. Additionally, we expand our theoretical findings to the Xist algorithm, a computational surrogate of graph cuts recently proposed in Suchan, Li and Munk (arXiv, 2023), thus demonstrating the practical applicability of our findings e.g. in statistical tests.

[15] 2407.15301

U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the H\'ajek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.

[16] 2407.15340

Random Survival Forest for Censored Functional Data

This paper introduces a Random Survival Forest (RSF) method for functional data. The focus is specifically on defining a new functional data structure, the Censored Functional Data (CFD), for dealing with temporal observations that are censored due to study limitations or incomplete data collection. This approach allows for precise modelling of functional survival trajectories, leading to improved interpretation and prediction of survival dynamics across different groups. A medical survival study on the benchmark SOFA data set is presented. Results show good performance of the proposed approach, particularly in ranking the importance of predicting variables, as captured through dynamic changes in SOFA scores and patient mortality rates.

[17] 2407.15377

Replicable Bandits for Digital Health Interventions

Adaptive treatment assignment algorithms, such as bandit and reinforcement learning algorithms, are increasingly used in digital health intervention clinical trials. Causal inference and related data analyses are critical for evaluating digital health interventions, deciding how to refine the intervention, and deciding whether to roll-out the intervention more broadly. However the replicability of these analyses has received relatively little attention. This work investigates the replicability of statistical analyses from trials deploying adaptive treatment assignment algorithms. We demonstrate that many standard statistical estimators can be inconsistent and fail to be replicable across repetitions of the clinical trial, even as the sample size grows large. We show that this non-replicability is intimately related to properties of the adaptive algorithm itself. We introduce a formal definition of a "replicable bandit algorithm" and prove that under such algorithms, a wide variety of common statistical analyses are guaranteed to be consistent. We present both theoretical results and simulation studies based on a mobile health oral health self-care intervention. Our findings underscore the importance of designing adaptive algorithms with replicability in mind, especially for settings like digital health where deployment decisions rely heavily on replicated evidence. We conclude by discussing open questions on the connections between algorithm design, statistical inference, and experimental replicability.

[18] 2407.15388

A new paradigm of mortality modeling via individual vitality dynamics

The significance of mortality modeling extends across multiple research areas, including life insurance valuation, longevity risk management, life-cycle hypothesis, and retirement income planning. Despite the variety of existing approaches, such as mortality laws and factor-based models, they often lack compatibility or fail to meet specific research needs. To address these shortcomings, this study introduces a novel approach centered on modeling the dynamics of individual vitality and defining mortality as the depletion of vitality level to zero. More specifically, we develop a four-component framework to analyze the initial value, trend, diffusion, and sudden changes in vitality level over an individual's lifetime. We demonstrate the framework's estimation and analytical capabilities in various settings and discuss its practical implications in actuarial problems and other research areas. The broad applicability and interpretability of our vitality-based modeling approach offer an enhanced paradigm for mortality modeling.

[19] 2407.15393

On some recent quasi-copula problems and some new methods

The aim of this paper is to present three construction methods for quasi-copulas based on recent developments: a representation of multivariate quasi-copulas by means of infima and suprema of copulas, an extension of a classical result on shuffles of min to the setting of quasi-copulas, and a construction method for quasi-copulas obeying a given signed mass pattern on a patch.

[20] 2407.15401

Data Space Inversion for Efficient Predictions and Uncertainty Quantification for Geothermal Models

The ability to make accurate predictions with quantified uncertainty provides a crucial foundation for the successful management of a geothermal reservoir. Conventional approaches for making predictions using geothermal reservoir models involve estimating unknown model parameters using field data, then propagating the uncertainty in these estimates through to the predictive quantities of interest. However, the unknown parameters are not always of direct interest; instead, the predictions are of primary importance. Data space inversion (DSI) is an alternative methodology that allows for the efficient estimation of predictive quantities of interest, with quantified uncertainty, that avoids the need to estimate model parameters entirely. In this paper, we evaluate the applicability of DSI to geothermal reservoir modelling. We first review the processes of model calibration, prediction and uncertainty quantification from a Bayesian perspective, and introduce data space inversion as a simple, efficient technique for approximating the posterior predictive distribution. We then apply the DSI framework to two model problems in geothermal reservoir modelling. We evaluate the accuracy and efficiency of DSI relative to other common methods for uncertainty quantification, study how the number of reservoir model simulations affects the resulting approximation to the posterior predictive distribution, and demonstrate how the framework can be enhanced through the use of suitable reparametrisations. Our results support the idea that data space inversion is a simple, robust and efficient technique for making predictions with quantified uncertainty using geothermal reservoir models, providing a useful alternative to more conventional approaches.

[21] 2407.15449

Persistence-based Modes Inference

We consider the estimation of multiple modes of a (multivariate) density. We start by proposing an estimator of the $H_0$ persistence diagram. We then derive from it a procedure to estimate the number of modes, their locations and the associated local maxima. For large classes of piecewise-continuous functions, we show that these estimators achieve nearly minimax rates. These classes involve geometric control over the discontinuities set and differ from commonly considered function classes in mode(s) inference. Although the global regularity assumptions are stronger, we do not suppose regularity (or even continuity) in any neighborhood of the modes.

[22] 2407.15453

Regression under demographic parity constraints via unlabeled post-processing

We address the problem of performing regression while ensuring demographic parity, even without access to sensitive attributes during inference. We present a general-purpose post-processing algorithm that, using accurate estimates of the regression function and a sensitive attribute predictor, generates predictions that meet the demographic parity constraint. Our method involves discretization and stochastic minimization of a smooth convex function. It is suitable for online post-processing and multi-class classification tasks only involving unlabeled data for the post-processing. Unlike prior methods, our approach is fully theory-driven. We require precise control over the gradient norm of the convex function, and thus, we rely on more advanced techniques than standard stochastic gradient descent. Our algorithm is backed by finite-sample analysis and post-processing bounds, with experimental results validating our theoretical findings.

[23] 2407.15455

Score matching for bridges without time-reversals

We propose a new algorithm for learning a bridged diffusion process using score-matching methods. Our method relies on reversing the dynamics of the forward process and using this to learn a score function, which, via Doob's $h$-transform, gives us a bridged diffusion process; that is, a process conditioned on an endpoint. In contrast to prior methods, ours learns the score term $\nabla_x \log p(t, x; T, y)$, for given $t, Y$ directly, completely avoiding the need for first learning a time reversal. We compare the performance of our algorithm with existing methods and see that it outperforms using the (learned) time-reversals to learn the score term. The code can be found at

[24] 2407.15461

Forecasting mortality rates with functional signatures

This study introduces an innovative methodology for mortality forecasting, which integrates signature-based methods within the functional data framework of the Hyndman-Ullah (HU) model. This new approach, termed the Hyndman-Ullah with truncated signatures (HUts) model, aims to enhance the accuracy and robustness of mortality predictions. By utilizing signature regression, the HUts model aims to capture complex, nonlinear dependencies in mortality data which enhances forecasting accuracy across various demographic conditions. The model is applied to mortality data from 12 countries, comparing its forecasting performance against classical models like the Lee-Carter model and variants of the HU models across multiple forecast horizons. Our findings indicate that overall the HUts model not only provides more precise point forecasts but also shows robustness against data irregularities, such as those observed in countries with historical outliers. The integration of signature-based methods enables the HUts model to capture complex patterns in mortality data, making it a powerful tool for actuaries and demographers. Prediction intervals are also constructed using bootstrapping methods.

[25] 2407.15468

Efficient influence functions for Sobol' indices under two designs of experiments

In this note, we are interested in the asymptotic efficiency of Sobol' indices esti-mators. After recalling the basis of asymptotic efficiency, we compute the efficientinfluence functions for Sobol' indices in two different contexts: the Pick-Freeze andthe given-data settings.

[26] 2407.15564

Non-parametric estimation of conditional quantiles for time series with heavy tails

We propose a modified weighted Nadaraya-Watson estimator for the conditional distribution of a time series with heavy tails. We establish the asymptotic normality of the proposed estimator. Simulation study is carried out to assess the performance of the estimator. We illustrate our method using a dataset.

[27] 2407.15610

A two-step model to study the inclusivity's distribution of Italian early childhood education and care services

This study investigates how to define and measure inclusivity in Italy's early childhood education and care (ECEC) services, bringing to light the gap between legislative principles and local/regional applications. The Italian legislative decree n. 65/2017 prescribes inclusivity in ECEC, defined as being open to all children and indicating it as a top priority. To delve into this concept, we propose a two-step model. First, a latent trait model estimates an inclusivity index as a latent variable. Then, a mixed quantile model examines the distribution of this novel latent inclusivity index across Italian regions. Our findings reveal a substantial variation in inclusivity across Italy. In addition, a proper indicator based on the latent inclusivity index defined in the first step is provided at the NUTS-3 level using the empirical best predictor approach. From our analysis, public facilities demonstrate a higher level of inclusivity compared to their private counterparts. Despite these challenges, we are compelled to identify positive scenarios that can serve as models for regions facing more critical situations. Besides its methodological advancement, this paper provides policymakers and stakeholders with an evident call to action, offering valuable insights into the inclusivity landscape of Italian ECEC services. It underscores the urgent need to standardize the accessibility characteristics of ECEC services throughout Italy to ensure equitable access for all children.

[28] 2407.15638

Orderings of the finite mixture with modified proportional hazard rate model

In this paper, we consider finite mixture models with modified proportional hazard rates. Sufficient conditions for the usual stochastic order and the hazard order are established under chain majorization. We study stochastic comparisons under different settings of T-transform for various values of chain majorization. We establish usual stochastic order and hazard rate order between two mixture random variables when a matrix of model parameters and mixing proportions changes to another matrix in some mathematical sense. Sufficient conditions for the star order and Lorenz order are established under weakly supermajorization. The results of this paper are illustrated with numerical examples.

[29] 2407.15662

How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?

We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution q via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, Linearity, etc.) permutation-equivalence naturally appears in machine learning problems, and to benefit from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establish when the number of observations $(n_k)_{k\in \mathcal K}$ are large enough so that the output confidence sets improve over initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of confidence sets shrink at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in K} n_{k})$, respectively for elements inside and outside the support of q, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation equivalence on a reinforcement learning task.

[30] 2407.15666

Particle Based Inference for Continuous-Discrete State Space Models

This article develops a methodology allowing application of the complete machinery of particle-based inference methods upon what we call the class of continuous-discrete State Space Models (CD-SSMs). Such models correspond to a latent continuous-time It\^o diffusion process which is observed with noise at discrete time instances. Due to the continuous-time nature of the hidden signal, standard Feynman-Kac formulations and their accompanying particle-based approximations have to overcome several challenges, arising mainly due to the following considerations: (i) finite-time transition densities of the signal are typically intractable; (ii) ancestors of sampled signals are determined w.p.~1, thus cannot be resampled; (iii) diffusivity parameters given a sampled signal yield Dirac distributions. We overcome all above issues by introducing a framework based on carefully designed proposals and transformations thereof. That is, we obtain new expressions for the Feynman-Kac model that accommodate the effects of a continuous-time signal and overcome induced degeneracies. The constructed formulations will enable use of the full range of particle-based algorithms for CD-SSMs: for filtering/smoothing and parameter inference, whether online or offline. Our framework is compatible with guided proposals in the filtering steps that are essential for efficient algorithmic performance in the presence of informative observations or in higher dimensions, and is applicable for a very general class of CD-SSMs, including the case when the signal is modelled as a hypo-elliptic diffusion. Our methods can be immediately incorporated to available software packages for particle-based algorithms.

[31] 2407.15674

LASSO Estimation in Exponential Random Graph models

The paper demonstrates the use of LASSO-based estimation in network models. Taking the Exponential Random Graph Model (ERGM) as a flexible and widely used model for network data analysis, the paper focuses on the question of how to specify the (sufficient) statistics, that define the model structure. This includes both, endogenous network statistics (e.g. twostars, triangles, etc.) as well as statistics involving exogenous covariates; on the node as well as on the edge level. LASSO estimation is a penalized estimation that shrinks some of the parameter estimates to be equal to zero. As such it allows for model selection by modifying the amount of penalty. The concept is well established in standard regression and we demonstrate its usage in network data analysis, with the advantage of automatically providing a model selection framework.

[32] 2407.15687

SoftCVI: contrastive variational inference with self-generated soft labels

Estimating a distribution given access to its unnormalized density is pivotal in Bayesian inference, where the posterior is generally known only up to an unknown normalizing constant. Variational inference and Markov chain Monte Carlo methods are the predominant tools for this task; however, both methods are often challenging to apply reliably, particularly when the posterior has complex geometry. Here, we introduce Soft Contrastive Variational Inference (SoftCVI), which allows a family of variational objectives to be derived through a contrastive estimation framework. These objectives have zero variance gradient when the variational approximation is exact, without the need for specialized gradient estimators. The approach involves parameterizing a classifier in terms of the variational distribution, which allows the inference task to be reframed as a contrastive estimation problem, aiming to identify a single true posterior sample among a set of samples. Despite this framing, we do not require positive or negative samples, but rather learn by sampling the variational distribution and computing ground truth soft classification labels from the unnormalized posterior itself. We empirically investigate the performance on a variety of Bayesian inference tasks, using both using both simple (e.g. normal) and expressive (normalizing flow) variational distributions. We find that SoftCVI objectives often outperform other commonly used variational objectives.

[33] 2407.15733

Online closed testing with e-values

In contemporary research, data scientists often test an infinite sequence of hypotheses $H_1,H_2,\ldots $ one by one, and are required to make real-time decisions without knowing the future hypotheses or data. In this paper, we consider such an online multiple testing problem with the goal of providing simultaneous lower bounds for the number of true discoveries in data-adaptively chosen rejection sets. Using the (online) closure principle, we show that for this task it is necessary to use an anytime-valid test for each intersection hypothesis. Motivated by this result, we construct a new online closed testing procedure and a corresponding short-cut with a true discovery guarantee based on multiplying sequential e-values. This general but simple procedure gives uniform improvements over existing methods but also allows to construct entirely new and powerful procedures. In addition, we introduce new ideas for hedging and boosting of sequential e-values that provably increase power. Finally, we also propose the first online true discovery procedure for arbitrarily dependent e-values.

[34] 2407.15744

A matrix algebra for graphical statistical models

Directed mixed graphs permit directed and bidirected edges between any two vertices. They were first considered in the path analysis developed by Sewall Wright and play an essential role in statistical modeling. We introduce a matrix algebra for walks on such graphs. Each element of the algebra is a matrix whose entries are sets of walks on the graph from the corresponding row to the corresponding column. The matrix algebra is then generated by applying addition (set union), multiplication (concatenation), and transpose to the two basic matrices consisting of directed and bidirected edges. We use it to formalize, in the context of Gaussian linear systems, the correspondence between important graphical concepts such as latent projection and graph separation with important probabilistic concepts such as marginalization and (conditional) independence. In two further examples regarding confounder adjustment and the augmentation criterion, we illustrate how the algebra allows us to visualize complex graphical proofs. A "dictionary" and LATEX macros for the matrix algebra are provided in the Appendix.

[35] 2407.15764

Huber means on Riemannian manifolds

This article introduces Huber means on Riemannian manifolds, providing a robust alternative to the Frechet mean by integrating elements of both square and absolute loss functions. The Huber means are designed to be highly resistant to outliers while maintaining efficiency, making it a valuable generalization of Huber's M-estimator for manifold-valued data. We comprehensively investigate the statistical and computational aspects of Huber means, demonstrating their utility in manifold-valued data analysis. Specifically, we establish minimal conditions for ensuring the existence and uniqueness of the Huber mean and discuss regularity conditions for unbiasedness. The Huber means are statistically consistent and enjoy the central limit theorem. Additionally, we propose a moment-based estimator for the limiting covariance matrix, which is used to construct a robust one-sample location test procedure and an approximate confidence region for location parameters. Huber means are shown to be highly robust and efficient in the presence of outliers or under heavy-tailed distribution. To be more specific, it achieves a breakdown point of at least 0.5, the highest among all isometric equivariant estimators, and is more efficient than the Frechet mean under heavy-tailed distribution. Numerical examples on spheres and the set of symmetric positive-definite matrices further illustrate the efficiency and reliability of the proposed Huber means on Riemannian manifolds.

[36] 2407.12178

Exploration Unbound

A sequential decision-making agent balances between exploring to gain new knowledge about an environment and exploiting current knowledge to maximize immediate reward. For environments studied in the traditional literature, optimal decisions gravitate over time toward exploitation as the agent accumulates sufficient knowledge and the benefits of further exploration vanish. What if, however, the environment offers an unlimited amount of useful knowledge and there is large benefit to further exploration no matter how much the agent has learned? We offer a simple, quintessential example of such a complex environment. In this environment, rewards are unbounded and an agent can always increase the rate at which rewards accumulate by exploring to learn more. Consequently, an optimal agent forever maintains a propensity to explore.

[37] 2407.12185

Satisficing Exploration for Deep Reinforcement Learning

A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions, obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones that are more data intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function allowing it to both bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart.

[38] 2407.14518

Accurate Analysis of Sparse Random Projections

There has been recently a lot of research on sparse variants of random projections, faster adaptations of the state-of-the-art dimensionality reduction technique originally due to Johsnon and Lindenstrauss. Although the construction is very simple, its analyses are notoriously complicated. Meeting the demand for both simplicity and accuracy, this work establishes sharp sub-poissonian tail bounds for the distribution of sparse random projections. Compared to other works, this analysis provide superior numerical guarantees (exactly matching impossibility results) while being arguably less complicated (the technique resembles Bennet's Inequality and is of independent interest).

[39] 2407.14537

Small but not least changes: The Art of Creating Disruptive Innovations

In the ever-evolving landscape of technology, product innovation thrives on replacing outdated technologies with groundbreaking ones or through the ingenious recombination of existing technologies. Our study embarks on a revolutionary journey by genetically representing products, extracting their chromosomal data, and constructing a comprehensive phylogenetic network of automobiles. We delve deep into the technological features that shape innovation, pinpointing the ancestral roots of products and mapping out intricate product-family triangles. By leveraging the similarities within these triangles, we introduce a pioneering "Product Disruption Index"-inspired by the CD index (Funk and Owen-Smith, 2017)-to quantify a product's disruptiveness. Our approach is rigorously validated against the scientifically recognized trend of decreasing disruptiveness over time (Park et al., 2023) and through compelling case studies. Our statistical analysis reveals a fascinating insight: disruptive product innovations often stem from minor, yet crucial, modifications.

[40] 2407.14554

On the Distributions of Product and Quotient of two Independent $\hat{I}$-function variates

The study of probability distributions for random variables and their algebraic combinations has been a central focus driving the advancement of probability and statistics. Since the 1920s, the challenge of calculating the probability distributions of sums, differences, products, and quotients of independent random variables have drawn the attention of numerous statisticians and mathematicians who studied the algebraic properties and relationships of random variables. Statistical distributions are highly helpful in data science and machine learning, as they provide a range of possible values for the variables, aiding in the development of a deeper understanding of the underlying problem. In this paper, we have presented a new probability distribution based on the $\hat{I}$-function. Also, we have discussed the applications of the $\hat{I}$ function, particularly in deriving the distributions of product and the quotient involving two independent $\hat{I}$ function variates. Additionally, it has been shown that both the product and quotient of two independent $\hat{I}$-function variates also follow the $\hat{I}$-function distribution. Furthermore, the new distribution, known as the $\hat{I}$-function distribution, includes several well-known classical distributions such as the gamma, beta, exponential, normal H-function, and G-function distributions, among others, as special cases. Therefore, the $\hat{I}$-function distribution can be considered a characterization or generalization of the above-mentioned distributions.

[41] 2407.14631

Two new feature selection methods based on learn-heuristic techniques for breast cancer prediction: A comprehensive analysis

Breast cancer is not preventable because of its unknown causes. However, its early diagnosis increases patients' recovery chances. Machine learning (ML) can be utilized to improve treatment outcomes in healthcare operations while diminishing costs and time. In this research, we suggest two novel feature selection (FS) methods based upon an imperialist competitive algorithm (ICA) and a bat algorithm (BA) and their combination with ML algorithms. This study aims to enhance diagnostic models' efficiency and present a comprehensive analysis to help clinical physicians make much more precise and reliable decisions than before. K-nearest neighbors, support vector machine, decision tree, Naive Bayes, AdaBoost, linear discriminant analysis, random forest, logistic regression, and artificial neural network are some of the methods employed. This paper applied a distinctive integration of evaluation measures and ML algorithms using the wrapper feature selection based on ICA (WFSIC) and BA (WFSB) separately. We compared two proposed approaches for the performance of the classifiers. Also, we compared our best diagnostic model with previous works reported in the literature survey. Experimentations were performed on the Wisconsin diagnostic breast cancer dataset. Results reveal that the proposed framework that uses the BA with an accuracy of 99.12\%, surpasses the framework using the ICA and most previous works. Additionally, the RF classifier in the approach of FS based on BA emerges as the best model and outperforms others regarding its criteria. Besides, the results illustrate the role of our techniques in reducing the dataset dimensions up to 90\% and increasing the performance of diagnostic models by over 99\%. Moreover, the result demonstrates that there are more critical features than the optimum dataset obtained by proposed FS approaches that have been selected by most ML models.

[42] 2407.14879

Thompson Sampling Itself is Differentially Private

In this work we first show that the classical Thompson sampling algorithm for multi-arm bandits is differentially private as-is, without any modification. We provide per-round privacy guarantees as a function of problem parameters and show composition over $T$ rounds; since the algorithm is unchanged, existing $O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in performance due to privacy. We then show that simple modifications -- such as pre-pulling all arms a fixed number of times, increasing the sampling variance -- can provide tighter privacy guarantees. We again provide privacy guarantees that now depend on the new parameters introduced in the modification, which allows the analyst to tune the privacy guarantee as desired. We also provide a novel regret analysis for this new algorithm, and show how the new parameters also impact expected regret. Finally, we empirically validate and illustrate our theoretical findings in two parameter regimes and demonstrate that tuning the new parameters substantially improve the privacy-regret tradeoff.

[43] 2407.14942

Concentration and limit of large random matrices with given margins

We study large random matrices with i.i.d. entries conditioned to have prescribed row and column sums (margin). This problem has rich connections to relative entropy minimization, Schr\"{o}dinger bridge, the enumeration of contingency tables, and random graphs with given degree sequences. We show that such margin-constrained random matrix is sharply concentrated around a certain deterministic matrix, which we call the \textit{typical table}. Typical tables have dual characterizations: (1) the expectation of the random matrix ensemble with minimum relative entropy from the base model constrained to have the expected target margin, and (2) the expectation of the maximum likelihood model obtained by rank-one exponential tilting of the base model. The structure of the typical table is dictated by two dual variables, which give the maximum likelihood estimates of the tilting parameters. Based on these results, for a sequence of "tame" margins that converges in \( L^{1} \) to a limiting continuum margin as the size of the matrix diverges, we show that the sequence of margin-constrained random matrices converges in cut norm to a limiting kernel, which is the $L^{2}$-limit of the corresponding rescaled typical tables. The rate of convergence is controlled by how fast the margins converge in $L^{1}$. We derive several new results for random contingency tables from our general framework.

[44] 2407.14960

Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models

The diversity in disease profiles and therapeutic approaches between hospitals and health professionals underscores the need for patient-centric personalized strategies in healthcare. Alongside this, similarities in disease progression across patients can be utilized to improve prediction models in survival analysis. The need for patient privacy and the utility of prediction models can be simultaneously addressed in the framework of Federated Learning (FL). This paper outlines an approach in the domain of federated survival analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a specific focus on mitigating data heterogeneity and elevating model performance. We present an FL approach that employs feature-based clustering to enhance model accuracy across synthetic datasets and real-world applications, including the Surveillance, Epidemiology, and End Results (SEER) database. Furthermore, we consider an event-based reporting strategy that provides a dynamic approach to model adaptation by responding to local data changes. Our experiments show the efficacy of our approach and discuss future directions for a practical application of FL in healthcare.

[45] 2407.15007

Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning

Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and autoregressive text generation. The simplest approach to IL, behavior cloning (BC), is thought to incur sample complexity with unfavorable quadratic dependence on the problem horizon, motivating a variety of different online algorithms that attain improved linear horizon dependence under stronger assumptions on the data and the learner's access to the expert. We revisit the apparent gap between offline and online IL from a learning-theoretic perspective, with a focus on general policy classes up to and including deep neural networks. Through a new analysis of behavior cloning with the logarithmic loss, we show that it is possible to achieve horizon-independent sample complexity in offline IL whenever (i) the range of the cumulative payoffs is controlled, and (ii) an appropriate notion of supervised learning complexity for the policy class is controlled. Specializing our results to deterministic, stationary policies, we show that the gap between offline and online IL is not fundamental: (i) it is possible to achieve linear dependence on horizon in offline IL under dense rewards (matching what was previously only known to be achievable in online IL); and (ii) without further assumptions on the policy class, online IL cannot improve over offline IL with the logarithmic loss, even in benign MDPs. We complement our theoretical results with experiments on standard RL tasks and autoregressive language generation to validate the practical relevance of our findings.

[46] 2407.15020

Integrating Attentional Factors and Spacing in Logistic Knowledge Tracing Models to Explore the Impact of Training Sequences on Category Learning

In category learning, a growing body of literature has increasingly focused on exploring the impacts of interleaving in contrast to blocking. The sequential attention hypothesis posits that interleaving draws attention to the differences between categories while blocking directs attention toward similarities within categories. Although a recent study underscores the joint influence of memory and attentional factors on sequencing effects, there remains a scarcity of effective computational models integrating both attentional and memory considerations to comprehensively understand the effect of training sequences on students' performance. This study introduces a novel integration of attentional factors and spacing into the logistic knowledge tracing (LKT) models to monitor students' performance across different training sequences (interleaving and blocking). Attentional factors were incorporated by recording the counts of comparisons between adjacent trials, considering whether they belong to the same or different category. Several features were employed to account for temporal spacing. We used cross-validations to test the model fit and predictions on the learning session and posttest. Our findings reveal that incorporating both attentional factors and spacing features in the Additive Factors Model (AFM) significantly enhances its capacity to capture the effects of interleaving and blocking and demonstrates superior predictive accuracy for students' learning outcomes. By bridging the gap between attentional factors and memory processes, our computational approach offers a more comprehensive framework for understanding and predicting category learning outcomes in educational settings.

[47] 2407.15028

Statistical Models for Outbreak Detection of Measles in North Cotabato, Philippines

A measles outbreak occurs when the number of cases of measles in the population exceeds the typical level. Outbreaks that are not detected and managed early can increase mortality and morbidity and incur costs from activities responding to these events. The number of measles cases in the Province of North Cotabato, Philippines, was used in this study. Weekly reported cases of measles from January 2016 to December 2021 were provided by the Epidemiology and Surveillance Unit of the North Cotabato Provincial Health Office. Several integer-valued autoregressive (INAR) time series models were used to explore the possibility of detecting and identifying measles outbreaks in the province along with the classical ARIMA model. These models were evaluated based on goodness of fit, measles outbreak detection accuracy, and timeliness. The results of this study confirmed that INAR models have the conceptual advantage over ARIMA since the latter produces non-integer forecasts, which are not realistic for count data such as measles cases. Among the INAR models, the ZINGINAR (1) model was recommended for having a good model fit and timely and accurate detection of outbreaks. Furthermore, policymakers and decision-makers from relevant government agencies can use the ZINGINAR (1) model to improve disease surveillance and implement preventive measures against contagious diseases beforehand.

[48] 2407.15110

Practical multi-fidelity machine learning: fusion of deterministic and Bayesian models

Multi-fidelity machine learning methods address the accuracy-efficiency trade-off by integrating scarce, resource-intensive high-fidelity data with abundant but less accurate low-fidelity data. We propose a practical multi-fidelity strategy for problems spanning low- and high-dimensional domains, integrating a non-probabilistic regression model for the low-fidelity with a Bayesian model for the high-fidelity. The models are trained in a staggered scheme, where the low-fidelity model is transfer-learned to the high-fidelity data and a Bayesian model is trained for the residual. This three-model strategy -- deterministic low-fidelity, transfer learning, and Bayesian residual -- leads to a prediction that includes uncertainty quantification both for noisy and noiseless multi-fidelity data. The strategy is general and unifies the topic, highlighting the expressivity trade-off between the transfer-learning and Bayesian models (a complex transfer-learning model leads to a simpler Bayesian model, and vice versa). We propose modeling choices for two scenarios, and argue in favor of using a linear transfer-learning model that fuses 1) kernel ridge regression for low-fidelity with Gaussian processes for high-fidelity; or 2) deep neural network for low-fidelity with a Bayesian neural network for high-fidelity. We demonstrate the effectiveness and efficiency of the proposed strategies and contrast them with the state-of-the-art based on various numerical examples. The simplicity of these formulations makes them practical for a broad scope of future engineering applications.

[49] 2407.15245

Weyl Calculus and Exactly Solvable Schrödinger Bridges with Quadratic State Cost

Schr\"{o}dinger bridge--a stochastic dynamical generalization of optimal mass transport--exhibits a learning-control duality. Viewed as a stochastic control problem, the Schr\"{o}dinger bridge finds an optimal control policy that steers a given joint state statistics to another while minimizing the total control effort subject to controlled diffusion and deadline constraints. Viewed as a stochastic learning problem, the Schr\"{o}dinger bridge finds the most-likely distribution-valued trajectory connecting endpoint distributional observations, i.e., solves the two point boundary-constrained maximum likelihood problem over the manifold of probability distributions. Recent works have shown that solving the Schr\"{o}dinger bridge problem with state cost requires finding the Markov kernel associated with a reaction-diffusion PDE where the state cost appears as a state-dependent reaction rate. We explain how ideas from Weyl calculus in quantum mechanics, specifically the Weyl operator and the Weyl symbol, can help determine such Markov kernels. We illustrate these ideas by explicitly finding the Markov kernel for the case of quadratic state cost via Weyl calculus, recovering our earlier results but avoiding tedious computation with Hermite polynomials.

[50] 2407.15247

TimeInf: Time Series Data Contribution via Influence Functions

Evaluating the contribution of individual data points to a model's prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and texts; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to time series datasets, the problem of estimating data contribution in such settings remains unexplored, possibly due to challenges associated with handling inherent temporal dependencies. This paper introduces TimeInf, a data contribution estimation method for time-series datasets. TimeInf uses influence functions to attribute model predictions to individual time points while preserving temporal structures. Our extensive empirical results demonstrate that TimeInf outperforms state-of-the-art methods in identifying harmful anomalies and helpful time points for forecasting. Additionally, TimeInf offers intuitive and interpretable attributions of data values, allowing us to easily distinguish diverse anomaly patterns through visualizations.

[51] 2407.15277

Conformal Predictions under Markovian Data

We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. Finally we present $K$-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without really affecting the size of the prediction set. We finally test our algorithms on synthetic and real-world datasets.

[52] 2407.15284

Revisiting Neighborhood Aggregation in Graph Neural Networks for Node Classification using Statistical Signal Processing

We delve into the issue of node classification within graphs, specifically reevaluating the concept of neighborhood aggregation, which is a fundamental component in graph neural networks (GNNs). Our analysis reveals conceptual flaws within certain benchmark GNN models when operating under the assumption of edge-independent node labels, a condition commonly observed in benchmark graphs employed for node classification. Approaching neighborhood aggregation from a statistical signal processing perspective, our investigation provides novel insights which may be used to design more efficient GNN models.

[53] 2407.15425

Empirical Capacity Model for Self-Attention Neural Networks

Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.

[54] 2407.15439

Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing, and online advertising, where immediate feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays where the feedback delays are independent of the rewards, and reward-dependent delays where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.

[55] 2407.15525

Multiple importance sampling for stochastic gradient estimation

We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.

[56] 2407.15532

Large-scale Time-Varying Portfolio Optimisation using Graph Attention Networks

Apart from assessing individual asset performance, investors in financial markets also need to consider how a set of firms performs collectively as a portfolio. Whereas traditional Markowitz-based mean-variance portfolios are widespread, network-based optimisation techniques have built upon these developments. However, most studies do not contain firms at risk of default and remove any firms that drop off indices over a certain time. This is the first study to incorporate risky firms and use all the firms in portfolio optimisation. We propose and empirically test a novel method that leverages Graph Attention networks (GATs), a subclass of Graph Neural Networks (GNNs). GNNs, as deep learning-based models, can exploit network data to uncover nonlinear relationships. Their ability to handle high-dimensional features and accommodate customised layers for specific purposes makes them particularly appealing for large-scale problems such as mid- and small-cap portfolio optimization. This study utilises 30 years of data on mid-cap firms, creating graphs of firms using distance correlation and the Triangulated Maximally Filtered Graph approach. These graphs are the inputs to a GAT model that we train using custom layers which impose weight and allocation constraints and a loss function derived from the Sharpe ratio, thus directly maximising portfolio risk-adjusted returns. This new model is benchmarked against a network characteristic-based portfolio, a mean variance-based portfolio, and an equal-weighted portfolio. The results show that the portfolio produced by the GAT-based model outperforms all benchmarks and is consistently superior to other strategies over a long period while also being informative of market dynamics.

[57] 2407.15580

Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.

[58] 2407.15636

On-the-fly spectral unmixing based on Kalman filtering

This work introduces an on-the-fly (i.e., online) linear unmixing method which is able to sequentially analyze spectral data acquired on a spectrum-by-spectrum basis. After deriving a sequential counterpart of the conventional linear mixing model, the proposed approach recasts the linear unmixing problem into a linear state-space estimation framework. Under Gaussian noise and state models, the estimation of the pure spectra can be efficiently conducted by resorting to Kalman filtering. Interestingly, it is shown that this Kalman filter can operate in a lower-dimensional subspace while ensuring the nonnegativity constraint inherent to pure spectra. This dimensionality reduction allows significantly lightening the computational burden, while leveraging recent advances related to the representation of essential spectral information. The proposed method is evaluated through extensive numerical experiments conducted on synthetic and real Raman data sets. The results show that this Kalman filter-based method offers a convenient trade-off between unmixing accuracy and computational efficiency, which is crucial for operating in an on-the-fly setting. To the best of the authors' knowledge, this is the first operational method which is able to solve the spectral unmixing problem efficiently in a dynamic fashion. It also constitutes a valuable building block for benefiting from acquisition and processing frameworks recently proposed in the microscopy literature, which are motivated by practical issues such as reducing acquisition time and avoiding potential damages being inflicted to photosensitive samples.

[59] 2407.15703

Estimating Probability Densities with Transformer and Denoising Diffusion

Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining full probabilistic outputs is crucial to many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows conditioning the output probability density on arbitrary combinations of inputs and it is thus a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.

[60] 2407.15792

Robust Mixture Learning when Outliers Overwhelm Small Groups

We study the problem of estimating the means of well-separated mixtures when an adversary may add arbitrary outliers. While strong guarantees are available when the outlier fraction is significantly smaller than the minimum mixing weight, much less is known when outliers may crowd out low-weight clusters - a setting we refer to as list-decodable mixture learning (LD-ML). In this case, adversarial outliers can simulate additional spurious mixture components. Hence, if all means of the mixture must be recovered up to a small error in the output list, the list size needs to be larger than the number of (true) components. We propose an algorithm that obtains order-optimal error guarantees for each mixture mean with a minimal list-size overhead, significantly improving upon list-decodable mean estimation, the only existing method that is applicable for LD-ML. Although improvements are observed even when the mixture is non-separated, our algorithm achieves particularly strong guarantees when the mixture is separated: it can leverage the mixture structure to partially cluster the samples before carefully iterating a base learner for list-decodable mean estimation at different scales.