New articles on Statistics


[1] 2208.07898

Collaborative causal inference on distributed data

The development of technologies for causal inference with privacy preservation of distributed data has attracted considerable attention in recent years. To address this challenge, we propose a quasi-experiment based on data collaboration (DC-QE) that enables causal inference from distributed data while preserving privacy. Our method preserves the privacy of private data by sharing only dimensionality-reduced intermediate representations, which are individually constructed by each party. Moreover, our method can reduce both random errors and biases, whereas existing methods can only reduce random errors in the estimation of treatment effects. Through numerical experiments on both artificial and real-world data, we confirmed that our method can lead to better estimation results than individual analyses. As our method spreads, intermediate representations could be published as open data to help researchers discover causal relationships and be accumulated as a knowledge base.
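
To make the sharing mechanism concrete, here is a minimal Python sketch (with hypothetical names and simulated data) of the core idea above: each party independently reduces its own covariates to a low-dimensional intermediate representation, and only these representations are pooled for the downstream treatment-effect analysis. The abstract does not specify the reduction or the alignment step DC-QE uses, so a plain truncated SVD stands in for whichever construction the paper actually employs.

```python
import numpy as np

def local_intermediate_representation(X_local, k=5):
    # Hypothetical stand-in for each party's individually constructed,
    # dimensionality-reduced intermediate representation (here: truncated SVD).
    Xc = X_local - X_local.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T          # n_local x k, shared instead of the raw data

rng = np.random.default_rng(0)
X_party_a = rng.normal(size=(100, 20))   # raw covariates never leave party A
X_party_b = rng.normal(size=(80, 20))    # raw covariates never leave party B

Z = np.vstack([local_intermediate_representation(X_party_a),
               local_intermediate_representation(X_party_b)])
# Z (180 x 5) is what a pooled treatment-effect analysis would operate on;
# DC-QE's alignment of party-specific representations is omitted here.
print(Z.shape)
```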


[2] 2208.07899

Predictions of damages from Atlantic tropical cyclones: a hierarchical Bayesian study on extremes

Bayesian hierarchical models are proposed for modeling tropical cyclone characteristics and their damage potential in the Atlantic basin. We model the joint probability distribution of tropical cyclone characteristics and their damage potential at two different temporal scales, while taking several climate indices into account. First, a predictive model for an entire season is developed that forecasts the number of cyclone events that will take place, the probability of each cyclone causing some amount of damage, and the monetized value of damages. Then, specific characteristics of individual cyclones are considered to predict the monetized value of the damage each will cause. Robustness studies are conducted and excellent prediction power is demonstrated across different data science models and evaluation techniques.


[3] 2208.07900

A unified framework for point-level, areal, and mixed spatial data: the Hausdorff-Gaussian Process

More realistic models can be built by taking spatial dependence into account when analyzing areal data. Most models for areal data employ adjacency matrices to assess the spatial structure of the data. Such methodologies impose some limitations. Notably, spatial polygons of different shapes and sizes are not treated differently, and it becomes difficult, if not impractical, to compute predictions based on these models. Moreover, spatial misalignment (when spatial information is available at different spatial levels) becomes harder to handle. These limitations can be circumvented by formulating models using other structures to quantify spatial dependence. In this paper, we introduce the Hausdorff-Gaussian process (HGP). The HGP relies on the Hausdorff distance, which is valid for both point and areal data, allowing geostatistical and areal models to be accommodated simultaneously under the same modeling framework. We present the benefits of using the HGP as a random effect in Bayesian spatial generalized mixed-effects models and, via a simulation study, compare its performance with the most popular models for areal data. Finally, the HGP is applied to respiratory cancer data observed in Greater Glasgow.
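
As a rough illustration of the distance on which the HGP is built, the Python sketch below (all names, the toy geometries, and the exponential covariance choice are assumptions, not taken from the paper) computes pairwise Hausdorff distances between spatial units given as point sets, which is what lets point-level and areal supports be handled together. Whether a particular kernel of this distance yields a valid covariance depends on the construction in the paper.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(A, B):
    # Symmetric Hausdorff distance between two finite point sets (n x 2 arrays).
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

# Mixed supports: one point-level unit and two areal units given by their vertices.
units = [
    np.array([[0.5, 0.5]]),                                        # point observation
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),                # small polygon
    np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]),    # larger polygon
]

D = np.array([[hausdorff(a, b) for b in units] for a in units])
sigma2, phi = 1.0, 1.0
K = sigma2 * np.exp(-D / phi)   # one possible covariance built from the distance
print(np.round(K, 3))
```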


[4] 2208.07927

Semi-supervised Transfer Learning for Evaluation of Model Classification Performance

In modern machine learning applications, frequent encounters with covariate shift and label scarcity have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt the model itself to some unlabeled target population using existing labeled data in a source population. However, there is a paucity of literature on transferring performance metrics of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier on an unlabeled target population based on receiver operating characteristic (ROC) analysis. We propose $\bf S$emi-supervised $\bf T$ransfer l$\bf E$arning of $\bf A$ccuracy $\bf M$easures (STEAM), an efficient three-step estimation procedure that employs 1) double-index modeling to construct calibrated density ratio weights and 2) robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency. We establish the consistency and asymptotic normality of the proposed estimators under correct specification of either the density ratio model or the outcome model. We also correct for potential overfitting bias in the estimators in finite samples with cross-validation. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method by evaluating the prediction performance of a phenotyping model for Rheumatoid Arthritis (RA) on temporally evolving EHR cohorts.
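
The density-ratio-weighting ingredient can be illustrated with the generic Python recipe below (simulated data and all names are hypothetical, and a plain logistic source-vs-target classifier stands in for STEAM's calibrated double-index weights): labeled source observations are re-weighted by an estimate of p_target(x)/p_source(x) so that source-only ROC summaries better reflect the target population.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Labeled source data, classifier scores to evaluate, and shifted unlabeled target data.
X_src = rng.normal(0.0, 1.0, size=(500, 3))
y_src = (X_src[:, 0] + rng.normal(size=500) > 0).astype(int)
scores = X_src[:, 0] + 0.5 * rng.normal(size=500)
X_tgt = rng.normal(0.5, 1.0, size=(500, 3))          # covariate-shifted target

# Density-ratio weights w(x) ~ p_target(x) / p_source(x) from a source-vs-target model.
Z = np.vstack([X_src, X_tgt])
member = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
clf = LogisticRegression(max_iter=1000).fit(Z, member)
p_tgt = clf.predict_proba(X_src)[:, 1]
w = p_tgt / (1.0 - p_tgt)

# Importance-weighted AUC on labeled source data as a proxy for target-population AUC.
print(round(roc_auc_score(y_src, scores), 3),
      round(roc_auc_score(y_src, scores, sample_weight=w), 3))
```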


[5] 2208.07959

Variable Selection in Latent Regression IRT Models via Knockoffs: An Application to International Large-scale Assessment in Education

International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students' academic performance. This problem has three analytical challenges: 1) students' academic performance is measured by cognitive items under a matrix sampling design; 2) there are often many missing values in the non-cognitive variables; and 3) the large number of non-cognitive variables introduces a multiple-comparisons problem. We consider an application to data from the Programme for International Student Assessment (PISA), aiming to identify non-cognitive variables associated with students' performance in science. We formulate it as a variable selection problem under a latent variable model and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections. Keywords: Model-X knockoffs, item response theory, missing data, variable selection, international large-scale assessment
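
For readers unfamiliar with the knockoff machinery referenced above, the standard Model-X (knockoff+) selection rule takes the form below; the paper's contribution lies in adapting it to the latent regression IRT setting with matrix sampling and missing data, which this generic display does not reflect.

\[
W_j = T_j\big(\mathbf{X}, \tilde{\mathbf{X}}, \mathbf{y}\big), \qquad
\tau = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\max\big(\#\{j : W_j \ge t\},\, 1\big)} \le q \right\}, \qquad
\hat{S} = \{ j : W_j \ge \tau \},
\]

where $\tilde{\mathbf{X}}$ are the knockoff copies, $W_j$ is an antisymmetric statistic contrasting each variable with its knockoff, and $q$ is the target rate of false selections.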


[6] 2208.07961

Online Learning for Mixture of Multivariate Hawkes Processes

Online learning of Hawkes processes has received increasing attention in the last couple of years, especially for modeling a network of actors. However, these works typically model only one of the following: the rich interaction between events, the latent clustering of actors, or the network structure between actors. We propose to model the latent structure of the network of actors as well as their rich interaction across events for real-world settings of medical and financial applications. Experimental results on both synthetic and real-world data showcase the efficacy of our approach.
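
As background for the model class discussed above, a multivariate Hawkes process with exponential kernels has conditional intensities of the form below; the excitation matrix $(\alpha_{ij})$ is what encodes both the actor network and the event interactions, although the paper's specific mixture and latent-cluster parameterization is not given in the abstract.

\[
\lambda_i(t) \;=\; \mu_i \;+\; \sum_{j=1}^{d} \; \sum_{t_{j,k} < t} \alpha_{ij}\, e^{-\beta_{ij}\,(t - t_{j,k})}, \qquad i = 1, \dots, d,
\]

where $\mu_i$ is the baseline rate of actor $i$, $t_{j,k}$ are the past event times of actor $j$, and $\alpha_{ij} \ge 0$ measures how strongly events of actor $j$ excite future events of actor $i$.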


[7] 2208.07991

High-Dimensional Directional Brain Network Analysis for Focal Epileptic Seizures

The brain is a high-dimensional directional network system consisting of many regions as network nodes that influence each other. The directional influence from one region to another is referred to as directional connectivity. Epilepsy is a directional network disorder, as epileptic activity spreads from a seizure onset zone (SOZ) to many other regions after seizure onset. However, directional network studies of epilepsy have been mostly limited to low-dimensional directional networks between the SOZ and contiguous regions due to the lack of efficient methods for analyzing high-dimensional directional brain networks. To address this gap, we study high-dimensional directional networks in epileptic brains by using a clustering-enabled multivariate autoregressive state-space model (MARSS) to analyze multi-channel intracranial EEG recordings of focal seizures. This new MARSS characterizes the SOZ, nearby regions, and many other non-SOZ regions as one integrated high-dimensional directional network system with a clustering feature. Using the new MARSS, we reveal changes in high-dimensional directional brain networks throughout seizure development. We simultaneously identify directional connections and the SOZ cluster, regions most affected by SOZ activity, in different seizure periods. We found that, after seizure onset, the numbers of directional connections of the SOZ and regions in the SOZ cluster increase significantly. We also reveal that many regions outside the SOZ cluster have no changes in directional connections, although these regions' EEG data signal ictal activity. Lastly, we use these high-dimensional network results to localize the SOZ and achieve 100% true positive rates and less than 3% false positive rates for different SOZ locations.
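
For reference, a generic (non-clustered) multivariate autoregressive state-space model of the kind extended here can be written as follows; the clustering-enabled structure the authors impose on the transition matrix is their contribution and is not shown.

\[
\mathbf{x}_t = A\,\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q), \qquad
\mathbf{y}_t = C\,\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R),
\]

where $\mathbf{y}_t$ collects the multi-channel iEEG measurements at time $t$, $\mathbf{x}_t$ is the latent brain state, and the off-diagonal entries of the transition matrix $A$ quantify directional influences between regions.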


[8] 2208.07996

Correcting Convexity Bias in Function and Functional Estimate

A general framework with a series of different methods is proposed to improve estimates of convex function (or functional) values when only noisy observations of the true input are available. Technically, our methods capture the bias introduced by the convexity and remove this bias from a baseline estimate. Theoretical analyses show that the proposed methods can strictly reduce the expected estimation error under mild conditions. When applied, the methods require no specific knowledge about the problem except the convexity and the evaluation of the function. Therefore, they can serve as off-the-shelf tools to obtain good estimates for a wide range of problems, including optimization problems with random objective functions or constraints, and functionals of probability distributions such as the entropy and the Wasserstein distance. Numerical experiments on a wide variety of problems show that our methods can significantly improve the quality of the estimate compared with the baseline method.
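
A one-line example makes the bias being corrected explicit. If $\hat{\theta} = \theta + \varepsilon$ is an unbiased but noisy observation of the true input and $f$ is convex, Jensen's inequality gives $\mathbb{E}[f(\hat{\theta})] \ge f(\theta)$; in the special case $f(x) = x^2$,

\[
\mathbb{E}\big[\hat{\theta}^2\big] \;=\; \theta^2 + \operatorname{Var}(\varepsilon),
\]

so the plug-in (baseline) estimate overshoots by exactly $\operatorname{Var}(\varepsilon)$, which is the kind of convexity-induced bias the proposed methods aim to remove.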


[9] 2208.08065

Revisiting the propensity score's central role: Towards bridging balance and efficiency in the era of causal machine learning

About forty years ago, in a now-seminal contribution, Rosenbaum & Rubin (1983) introduced a critical characterization of the propensity score as a central quantity for drawing causal inferences in observational study settings. In the decades since, much progress has been made across several research fronts in causal inference, notably including the re-weighting and matching paradigms. Focusing on the former and specifically on its intersection with machine learning and semiparametric efficiency theory, we re-examine the role of the propensity score in modern methodological developments. As Rosenbaum & Rubin (1983)'s contribution spurred a focus on the balancing property of the propensity score, we re-examine the degree to which and how this property plays a role in the development of asymptotically efficient estimators of causal effects; moreover, we discuss a connection between the balancing property and efficient estimation in the form of score equations and propose a score test for evaluating whether an estimator achieves balance.
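
For context, the two objects the discussion revolves around are the balancing property of the propensity score $e(X) = \Pr(T = 1 \mid X)$ and the re-weighting estimators it licenses; the displays below are the textbook versions rather than the score test proposed in the paper.

\[
X \perp\!\!\!\perp T \;\big|\; e(X), \qquad
\hat{\tau}_{\mathrm{IPW}} \;=\; \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{T_i\,Y_i}{\hat{e}(X_i)} - \frac{(1 - T_i)\,Y_i}{1 - \hat{e}(X_i)} \right].
\]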


[10] 2208.08068

Quantum Bayes AI

Quantum Bayesian AI (Q-B) is an emerging field that leverages the computational gains available in quantum computing. The promise is an exponential speed-up in many Bayesian algorithms. Our goal is to apply these methods directly to statistical and machine learning problems. We provide a duality between classical and quantum probability for calculating posterior quantities of interest. Our framework unifies MCMC, Deep Learning and Quantum Learning calculations from the viewpoint of von Neumann's principle of quantum measurement. Quantum embeddings and neural gates are also an important part of data encoding and feature selection. There is a natural duality with well-known kernel methods in statistical learning. We illustrate the behaviour of quantum algorithms on two simple classification algorithms. Finally, we conclude with directions for future research.


[11] 2208.08077

Natural cubic splines for the analysis of Alzheimer's clinical trials

Mixed model repeated measures (MMRM) is the most common analysis approach used in clinical trials for Alzheimer's disease and other progressive diseases assessed with continuous outcomes measured over time. The model treats time as a categorical variable, which allows an unconstrained estimate of the mean for each study visit in each randomized group. Categorizing time in this way can be problematic when assessments occur off-schedule, as including off-schedule visits can induce bias, and excluding them ignores valuable information and violates the intention-to-treat principle. This problem has been exacerbated by clinical trial visits that have been delayed due to the COVID-19 pandemic. As an alternative to MMRM, we propose a constrained longitudinal data analysis with natural cubic splines that treats time as continuous and uses test version effects to model the mean over time. The spline model is shown, in practice and in simulation studies, to be superior to categorical-time models like MMRM and to models that assume a proportional treatment effect.


[12] 2208.08108

Characterizing M-estimators

We characterize the full classes of M-estimators for semiparametric models of general functionals by formally connecting the theory of consistent loss functions from forecast evaluation with the theory of M-estimation. This novel characterization result opens up the possibility of theoretical research on efficient and equivariant M-estimation and, more generally, it allows existing results on loss functions from the forecast evaluation literature to be leveraged in estimation theory.


[13] 2208.08138

Shallow neural network representation of polynomials

We show that $d$-variate polynomials of degree $R$ can be represented on $[0,1]^d$ as shallow neural networks (SNNs) of width $d+1+\sum_{r=2}^R\binom{r+d-1}{d-1}[\binom{r+d-1}{d-1}+1]$. Also, via SNN representations of localized Taylor polynomials of univariate $C^\beta$-smooth functions, we derive for shallow networks the minimax optimal rate of convergence, up to a logarithmic factor, to an unknown univariate regression function.
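
A small worked instance of the stated width bound: for $d = 2$ and $R = 3$ we have $\binom{r+d-1}{d-1} = \binom{r+1}{1} = r+1$, so

\[
\text{width} \;=\; d + 1 + \sum_{r=2}^{3} (r+1)\big[(r+1)+1\big] \;=\; 3 + 3 \cdot 4 + 4 \cdot 5 \;=\; 35.
\]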


[14] 2208.08150

Capturing network and dynamic effects in bike sharing system via fused Lasso

Data collected from a bike-sharing system exhibit complex temporal and spatial features. We analyze shared-bike usage data collected in Seoul, South Korea, at the level of individual stations while accounting for station-specific behavior and covariate effects. We adopt a penalized regression approach with a multilayer network fused Lasso penalty. The proposed fusion penalties are imposed on networks which embed spatio-temporal linkages, and capture the homogeneity in bike usage that is attributed to intricate spatio-temporal features without arbitrarily partitioning the data. We demonstrate that the proposed approach yields competitive predictive performance and provides a new interpretation of the data.
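
Schematically (with notation chosen here rather than taken from the paper), a multilayer network fused Lasso of the kind described penalizes differences between station-level coefficient vectors along the edges of several networks:

\[
\hat{\beta} \;=\; \arg\min_{\beta} \; \frac{1}{2} \sum_{i} \big( y_i - x_i^{\top} \beta_{s(i)} \big)^2 \;+\; \sum_{\ell=1}^{L} \lambda_{\ell} \sum_{(j,k) \in E_{\ell}} \big\| \beta_j - \beta_k \big\|_1,
\]

where $s(i)$ is the station of observation $i$ and each edge set $E_{\ell}$ encodes one layer of spatial or temporal linkage, so that stations joined in a layer are encouraged to share coefficients.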


[15] 2208.08230

Two-Stage Robust and Sparse Distributed Statistical Inference for Large-Scale Data

In this paper, we address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers. The high volume and dimensionality of the data require distributed processing and storage solutions. We propose a two-stage distributed and robust statistical inference procedure that copes with high-dimensional models by promoting sparsity. In the first stage, known as model selection, relevant predictors are locally selected by applying robust Lasso estimators to distinct subsets of the data. The variable selections from the computation nodes are then fused by a voting scheme to find the sparse basis for the complete data set, identifying the relevant variables in a robust manner. In the second stage, statistically robust and computationally efficient bootstrap methods are employed to carry out the actual inference: constructing confidence intervals, finding parameter estimates, and quantifying standard deviations. As in the first stage, the results of local inference are communicated to the fusion center and combined there. Using analytical methods, we establish favorable statistical properties of the robust and computationally efficient bootstrap methods, including consistency for a fixed number of predictors and robustness. The proposed two-stage robust and distributed inference procedure demonstrates reliable performance and robustness in variable selection, in finding confidence intervals, and in bootstrap approximations of standard deviations, even when the data are high-dimensional and contaminated by outliers.
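
The first (model selection) stage can be pictured with the toy Python sketch below. An ordinary cross-validated Lasso stands in for the robust Lasso estimators the paper uses, the data split and all names are invented for illustration, and the second-stage bootstrap inference is omitted.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
p = 50
beta = np.zeros(p)
beta[:5] = 2.0                                   # five truly relevant predictors

def node_selection(n=200):
    # One computation node: fit a (non-robust, stand-in) Lasso on its local
    # subset and report which variables it selects.
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.standard_t(df=3, size=n)  # heavy-tailed noise
    fit = LassoCV(cv=5).fit(X, y)
    return (np.abs(fit.coef_) > 1e-8).astype(int)

votes = sum(node_selection() for _ in range(5))  # selections from 5 nodes
fused_support = np.where(votes >= 3)[0]          # majority-vote fusion at the center
print(fused_support)
```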


[16] 2208.08247

Domain Knowledge in A*-Based Causal Discovery

Causal discovery has become a vital tool for scientists and practitioners wanting to discover causal relationships from observational data. While most previous approaches to causal discovery have implicitly assumed that no expert domain knowledge is available, practitioners can often provide such domain knowledge from prior experience. Recent work has incorporated domain knowledge into constraint-based causal discovery. The majority of such constraint-based methods, however, assume causal faithfulness, which has been shown to be frequently violated in practice. Consequently, there has been renewed attention towards exact-search score-based causal discovery methods, which do not assume causal faithfulness, such as A*-based methods. However, there has been no consideration of these methods in the context of domain knowledge. In this work, we focus on efficiently integrating several types of domain knowledge into A*-based causal discovery. In doing so, we discuss and explain how domain knowledge can reduce the graph search space and then provide an analysis of the potential computational gains. We support these findings with experiments on synthetic and real data, showing that even small amounts of domain knowledge can dramatically speed up A*-based causal discovery and improve its performance and practicality.


[17] 2208.08265

Semi-Supervised Anomaly Detection Based on Quadratic Multiform Separation

In this paper we propose a novel method for semi-supervised anomaly detection (SSAD). Our classifier is named QMS22 because it was conceived in 2022 and built upon the framework of quadratic multiform separation (QMS), a recently introduced classification model. QMS22 tackles SSAD by solving a multi-class classification problem involving both the training set and the test set of the original problem. The classification problem intentionally includes classes with overlapping samples. One of the classes contains a mixture of normal samples and outliers, and all other classes contain only normal samples. An outlier score is then calculated for every sample in the test set using the outcome of the classification problem. We also include a performance evaluation of QMS22 against top-performing classifiers using ninety-five benchmark imbalanced datasets from the KEEL repository. These classifiers are BRM (Bagging-Random Miner), OCKRA (One-Class K-means with Randomly-projected features Algorithm), ISOF (Isolation Forest), and ocSVM (One-Class Support Vector Machine). Using the area under the receiver operating characteristic curve as the performance measure, we show that QMS22 significantly outperforms ISOF and ocSVM. Moreover, Wilcoxon signed-rank tests reveal no statistically significant difference between QMS22 and BRM, nor between QMS22 and OCKRA.


[18] 2208.08291

Debiased Inference on Identified Linear Functionals of Underidentified Nuisances via Penalized Minimax Estimation

We study generic inference on identified linear functionals of nonunique nuisances defined as solutions to underidentified conditional moment restrictions. This problem appears in a variety of applications, including nonparametric instrumental variable models, proximal causal inference under unmeasured confounding, and missing-not-at-random data with shadow variables. Although the linear functionals of interest, such as the average treatment effect, are identifiable under suitable conditions, nonuniqueness of nuisances poses serious challenges to statistical inference, since in this setting common nuisance estimators can be unstable and lack fixed limits. In this paper, we propose penalized minimax estimators for the nuisance functions and show that they enable valid inference in this challenging setting. The proposed nuisance estimators can accommodate flexible function classes, and importantly, they can converge to fixed limits determined by the penalization, regardless of whether the nuisances are unique or not. We use the penalized nuisance estimators to form a debiased estimator for the linear functional of interest and prove its asymptotic normality under generic high-level conditions, which provide for asymptotically valid confidence intervals.


[19] 2208.08401

Conformal Inference for Online Prediction with Arbitrary Distribution Shifts

Conformal inference is a flexible methodology for transforming the predictions made by any black-box model (e.g. neural nets, random forests) into valid prediction sets. The only necessary assumption is that the training and test data be exchangeable (e.g. i.i.d.). Unfortunately, this assumption is usually unrealistic in online environments in which the process generating the data may vary over time and consecutive data points are often temporally correlated. In this article, we develop an online algorithm for producing prediction intervals that are robust to these deviations. Our methods build upon conformal inference and thus can be combined with any black-box predictor. We show that the coverage error of our algorithm is controlled by the size of the underlying change in the environment, thus directly connecting the size of the distribution shift with the difficulty of the prediction problem. Finally, we apply our procedure in two real-world settings and find that our method produces robust prediction intervals under real-world dynamics.
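
To make the online mechanism tangible, here is a simplified Python sketch in the spirit of adaptive conformal updates: intervals are centered at the black-box predictions, their width is an empirical residual quantile, and the working miscoverage level is nudged after every observed cover or miss. The update rule, window length, and all names are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def online_adaptive_intervals(y, preds, residual_window=100, alpha=0.1, gamma=0.01):
    # Simplified adaptive-conformal-style sketch: alpha_t is adjusted online so the
    # intervals widen after misses and narrow after covers.
    alpha_t, residuals, intervals = alpha, [], []
    for yhat, yt in zip(preds, y):
        if residuals:
            q = np.quantile(residuals, min(max(1 - alpha_t, 0.0), 1.0))
        else:
            q = np.inf                      # no calibration data yet
        lo, hi = yhat - q, yhat + q
        intervals.append((lo, hi))
        err = 1.0 if (yt < lo or yt > hi) else 0.0
        alpha_t += gamma * (alpha - err)    # nudge the working miscoverage level
        residuals.append(abs(yt - yhat))
        residuals = residuals[-residual_window:]
    return intervals

# Example: a drifting series with a naive "previous value" predictor.
rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(size=500)) + np.linspace(0, 5, 500)
preds = np.r_[0.0, y[:-1]]
print(online_adaptive_intervals(y, preds)[-1])
```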


[20] 2208.08415

Estimation and Specification Test for Diffusion Models with Stochastic Volatility

Given the importance of continuous-time stochastic volatility models in describing the dynamics of interest rates, we propose a goodness-of-fit test for the parametric form of the drift and diffusion functions, based on a marked empirical process of the residuals. The test statistics are constructed using continuous functionals (Kolmogorov-Smirnov and Cram\'er-von Mises) of the empirical processes. To evaluate the proposed tests, we implement a simulation study in which a bootstrap method is considered for the calibration of the tests. As the estimation of diffusion models with stochastic volatility based on discretely sampled data has proven difficult, we address this issue by means of a Monte Carlo study of different estimation procedures. Finally, an application of the procedures to real data is provided.


[21] 2208.08420

A Comparative Review of Specification Tests for Diffusion Models

Diffusion models play an essential role in modeling continuous-time stochastic processes in finance. Several proposals have therefore been developed in recent decades to test the specification of stochastic differential equations. We provide a survey collecting developments on goodness-of-fit tests for diffusion models and implement these methods to illustrate their finite-sample behavior, in terms of size and power, by means of a simulation study. We also apply the ideas of distance correlation for testing independence to propose a test for the parametric specification of diffusion models, comparing its performance with the other methods and analyzing the effect of the curse of dimensionality. As real data examples, treasury securities with different maturities are considered.


[22] 2208.07951

On the generalization of learning algorithms that do not converge

Generalization analyses of deep learning typically assume that training converges to a fixed point. However, recent results indicate that, in practice, the weights of deep neural networks optimized with stochastic gradient descent often oscillate indefinitely. To reduce this discrepancy between theory and practice, this paper focuses on the generalization of neural networks whose training dynamics do not necessarily converge to fixed points. Our main contribution is to propose a notion of statistical algorithmic stability (SAS) that extends classical algorithmic stability to non-convergent algorithms and to study its connection to generalization. This ergodic-theoretic approach leads to new insights when compared to the traditional optimization and learning theory perspectives. We prove that the stability of the time-asymptotic behavior of a learning algorithm relates to its generalization and empirically demonstrate how loss dynamics can provide clues to generalization performance. Our findings provide evidence that networks that "train stably" generalize better, even when training continues indefinitely and the weights do not converge.


[23] 2208.07984

Private Estimation with Public Data

We initiate the study of differentially private (DP) estimation with access to a small amount of public data. For private estimation of d-dimensional Gaussians, we assume that the public data comes from a Gaussian that may have vanishing similarity in total variation distance with the underlying Gaussian of the private data. We show that under the constraints of pure or concentrated DP, d+1 public data samples are sufficient to remove any dependence on the range parameters of the private data distribution from the private sample complexity, which is known to be otherwise necessary without public data. For separated Gaussian mixtures, we assume that the underlying public and private distributions are the same, and we consider two settings: (1) when given a dimension-independent amount of public data, the private sample complexity can be improved polynomially in terms of the number of mixture components, and any dependence on the range parameters of the distribution can be removed in the approximate DP case; (2) when given an amount of public data linear in the dimension, the private sample complexity can be made independent of range parameters even under concentrated DP, and additional improvements can be made to the overall sample complexity.


[24] 2208.08003

Superior generalization of smaller models in the presence of significant label noise

The benefits of over-parameterization in achieving superior generalization performance have been shown in several recent studies, justifying the trend of using larger models in practice. In the context of robust learning, however, the effect of neural network size has not been well studied. In this work, we find that in the presence of a substantial fraction of mislabeled examples, increasing the network size beyond some point can be harmful. In particular, the originally monotonic or `double descent' test loss curve (w.r.t. network width) turns into a U-shaped or a double U-shaped curve when label noise increases, suggesting that the best generalization is achieved by some model of intermediate size. We observe similar test loss behaviour when network size is controlled by density through random pruning. We also take a closer look at both phenomena through bias-variance decomposition and theoretically characterize how label noise shapes the variance term. Similar behavior of the test loss can be observed even when state-of-the-art robust methods are applied, indicating that limiting the network size could further boost existing methods. Finally, we empirically examine the effect of network size on the smoothness of learned functions, and find that the originally negative correlation between size and smoothness is flipped by label noise.


[25] 2208.08006

Sum of independent random variables for Shanker, Akash, Ishita, Pranav, Rani and Ram Awadh distributions

In statistics and probability theory, one of the most important statistics is the sum of random variables. Once a probability distribution has been introduced, determining the distribution of the sum of n independent and identically distributed random variables is a topic of interest. This paper presents the probability density function of the sum of n independent and identically distributed random variables from the Shanker, Akash, Ishita, Rani, Pranav and Ram Awadh distributions. These densities are derived using the change-of-variables technique, and the mth moments of the sums are also calculated. In addition, the reliability and the mean time to failure of a 1-out-of-n cold standby spare system are evaluated under Lindley-distributed component failure times.
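
The basic identity underlying such derivations (whichever of the named distributions is plugged in) is the convolution formula for the density of a sum of independent nonnegative random variables; the paper reaches the same results through a change-of-variables argument.

\[
f_{X_1 + X_2}(s) \;=\; \int_{0}^{s} f_{X_1}(x)\, f_{X_2}(s - x)\, dx, \qquad
f_{S_n} \;=\; f_{X_1} * f_{X_2} * \cdots * f_{X_n}, \quad S_n = \sum_{i=1}^{n} X_i .
\]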


[26] 2208.08175

Expressivity of Hidden Markov Chains vs. Recurrent Neural Networks from a system theoretic viewpoint

Hidden Markov Chains (HMC) and Recurrent Neural Networks (RNN) are two well-known tools for predicting time series. Even though these solutions were developed independently in distinct communities, they share some similarities when considered as probabilistic structures. In this paper, we therefore first consider HMC and RNN as generative models and embed both structures in a common generative unified model (GUM). We next address a comparative study of the expressivity of these models. To that end, we assume that the models are furthermore linear and Gaussian. The probability distributions produced by these models are characterized by structured covariance series, and as a consequence expressivity reduces to comparing sets of structured covariance series, which enables us to draw on stochastic realization theory (SRT). We finally provide conditions under which a given covariance series can be realized by a GUM, an HMC or an RNN.


[27] 2208.08287

Sparse Nonnegative Tucker Decomposition and Completion under Noisy Observations

Tensor decomposition is a powerful tool for extracting physically meaningful latent factors from multi-dimensional nonnegative data, and has attracted increasing interest in a variety of fields such as image processing, machine learning, and computer vision. In this paper, we propose a sparse nonnegative Tucker decomposition and completion method for the recovery of underlying nonnegative data under noisy observations. Here the underlying nonnegative data tensor is decomposed into a core tensor and several factor matrices, with all entries being nonnegative and the factor matrices being sparse. The loss function is derived from the maximum likelihood estimation of the noisy observations, and the $\ell_0$ norm is employed to enhance the sparsity of the factor matrices. We establish the error bound of the estimator of the proposed model under generic noise scenarios, which is then specialized to observations with additive Gaussian noise, additive Laplace noise, and Poisson observations, respectively. Our theoretical results are better than those of existing tensor-based or matrix-based methods. Moreover, the minimax lower bounds are shown to match the derived upper bounds up to logarithmic factors. Numerical examples on both synthetic and real-world data sets demonstrate the superiority of the proposed method for nonnegative tensor data completion.
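
In schematic form (notation chosen here for illustration), the estimator described above solves a penalized maximum-likelihood problem of the following type, where the loss is the negative log-likelihood implied by the assumed noise model (Gaussian, Laplace, or Poisson):

\[
\min_{\mathcal{G} \ge 0,\; A_1 \ge 0, \dots, A_d \ge 0} \;\; \mathcal{L}\!\left( \mathcal{Y};\; \mathcal{G} \times_1 A_1 \times_2 \cdots \times_d A_d \right) \;+\; \lambda \sum_{k=1}^{d} \| A_k \|_0,
\]

with $\mathcal{G}$ the nonnegative core tensor, $A_k$ the nonnegative factor matrices, $\times_k$ the mode-$k$ product, and $\|A_k\|_0$ counting nonzero entries to enforce sparsity.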


[28] 2208.08341

Algorithmic Fairness and Statistical Discrimination

Algorithmic fairness is a new interdisciplinary field of study focused on how to measure whether a process, or algorithm, may unintentionally produce unfair outcomes, as well as whether or how the potential unfairness of such processes can be mitigated. Statistical discrimination describes a set of informational issues that can induce rational (i.e., Bayesian) decision-making to lead to unfair outcomes even in the absence of discriminatory intent. In this article, we provide overviews of these two related literatures and draw connections between them. The comparison illustrates both the conflict between rationality and fairness and the importance of endogeneity (e.g., "rational expectations" and "self-fulfilling prophecies") in defining and pursuing fairness. Taken in concert, we argue that the two traditions suggest a value for considering new fairness notions that explicitly account for how the individual characteristics an algorithm intends to measure may change in response to the algorithm.


[29] 2208.08348

Ban The Box? Information, Incentives, and Statistical Discrimination

"Banning the Box" refers to a policy campaign aimed at prohibiting employers from soliciting applicant information that could be used to statistically discriminate against categories of applicants (in particular, those with criminal records). In this article, we examine how the concealing or revealing of informative features about an applicant's identity affects hiring both directly and, in equilibrium, by possibly changing applicants' incentives to invest in human capital. We show that there exist situations in which an employer and an applicant are in agreement about whether to ban the box. Specifically, depending on the structure of the labor market, banning the box can be (1) Pareto dominant, (2) Pareto dominated, (3) benefit the applicant while harming the employer, or (4) benefit the employer while harming the applicant. Our results have policy implications spanning beyond employment decisions, including the use of credit checks by landlords and standardized tests in college admissions.