New articles on Statistics


[1] 2205.05715

Causal discovery under a confounder blanket

Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering a directed acyclic subgraph of variables known to be causally descended from some (possibly large) set of confounding covariates, i.e. a $\textit{confounder blanket}$. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing causally relevant background information. Under a structural assumption that, we argue, must be satisfied in practice if informative answers are to be found, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We derive a sound and complete algorithm for identifying causal relationships under these conditions and implement testing procedures with provable error control for linear and nonlinear systems. We demonstrate our approach on a range of simulation settings.


[2] 2205.05770

De-biasing "bias" measurement

When a model's performance differs across socially or culturally relevant groups--like race, gender, or the intersections of many such groups--it is often called "biased." While much of the work in algorithmic fairness over the last several years has focused on developing various definitions of model fairness (the absence of group-wise model performance disparities) and eliminating such "bias," much less work has gone into rigorously measuring it. In practice, it is important to have high quality, human digestible measures of model performance disparities and associated uncertainty quantification about them that can serve as inputs into multi-faceted decision-making processes. In this paper, we show both mathematically and through simulation that many of the metrics used to measure group-wise model performance disparities are themselves statistically biased estimators of the underlying quantities they purport to represent. We argue that this can cause misleading conclusions about the relative group-wise model performance disparities along different dimensions, especially in cases where some sensitive variables consist of categories with few members. We propose the "double-corrected" variance estimator, which provides unbiased estimates and uncertainty quantification of the variance of model performance across groups. It is conceptually simple and easily implementable without statistical software packages or numerical optimization. We demonstrate the utility of this approach through simulation and show on a real dataset that while statistically biased estimators of group-wise model performance disparities indicate statistically significant between-group disparities, once the statistical bias in the estimator is accounted for, the estimated group-wise disparities in model performance are no longer statistically significant.
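
The abstract does not spell out the double-corrected variance estimator itself, so the Python sketch below (with made-up group sizes) only illustrates the underlying phenomenon: when some groups are small, the naive variance of estimated per-group accuracies is an upwardly biased estimate of the true between-group variance, and a standard moment correction that subtracts the average within-group sampling variance removes most of that bias.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: 5 groups with identical true accuracy, some very small.
    true_acc = np.full(5, 0.80)            # no real between-group disparity
    group_sizes = np.array([2000, 500, 100, 40, 15])

    def group_variance_estimates(true_acc, sizes, rng):
        # Observed per-group accuracies from finite samples.
        correct = rng.binomial(sizes, true_acc)
        acc_hat = correct / sizes
        naive = acc_hat.var(ddof=1)        # naive between-group variance
        # Standard moment correction: subtract average binomial sampling variance.
        sampling_var = acc_hat * (1 - acc_hat) / (sizes - 1)
        corrected = max(naive - sampling_var.mean(), 0.0)
        return naive, corrected

    naive_runs, corrected_runs = zip(
        *(group_variance_estimates(true_acc, group_sizes, rng) for _ in range(2000))
    )
    print("true between-group variance: 0.0")
    print("naive estimator (mean over runs):    ", np.mean(naive_runs))
    print("corrected estimator (mean over runs):", np.mean(corrected_runs))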


[3] 2205.05777

Efficient estimation of modified treatment policy effects based on the generalized propensity score

Continuous treatments have posed a significant challenge for causal inference, both in the formulation and identification of scientifically meaningful effects and in their robust estimation. Traditionally, focus has been placed on techniques applicable to binary or categorical treatments with few levels, allowing for the application of propensity score-based methodology with relative ease. Efforts to accommodate continuous treatments introduced the generalized propensity score, yet estimators of this nuisance parameter commonly utilize parametric regression strategies that sharply limit the robustness and efficiency of inverse probability weighted estimators of causal effect parameters. We formulate and investigate a novel, flexible estimator of the generalized propensity score based on a nonparametric function estimator that provably converges at a suitably fast rate to the target functional so as to facilitate statistical inference. With this estimator, we demonstrate the construction of nonparametric inverse probability weighted estimators of a class of causal effect estimands tailored to continuous treatments. To ensure the asymptotic efficiency of our proposed estimators, we outline several non-restrictive selection procedures for utilizing a sieve estimation framework to undersmooth estimators of the generalized propensity score. We provide the first characterization of such inverse probability weighted estimators achieving the nonparametric efficiency bound in a setting with continuous treatments, demonstrating this in numerical experiments. We further evaluate the higher-order efficiency of our proposed estimators by deriving and numerically examining the second-order remainder of the corresponding efficient influence function in the nonparametric model. Open source software implementing our proposed estimation techniques, the haldensify R package, is briefly discussed.
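
As a point of reference for the estimand class discussed above, the Python sketch below implements a plain inverse probability weighted estimator for an additive shift intervention $d(a,w) = a + \delta$, with a simple parametric Gaussian conditional density standing in for the generalized propensity score; the paper's haldensify estimator is nonparametric and undersmoothed, so everything here is a simplified stand-in.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: confounder W, continuous treatment A, outcome Y.
    n = 5000
    W = rng.normal(size=n)
    A = 0.5 * W + rng.normal(size=n)       # treatment depends on W
    Y = A + W + rng.normal(size=n)         # outcome depends on A and W
    delta = 0.5                            # additive shift d(a, w) = a + delta

    # Crude stand-in for the generalized propensity score g(a | w): the true
    # conditional law A | W = w is N(0.5 w, 1), which we pretend to have estimated.
    def g(a, w):
        return np.exp(-0.5 * (a - 0.5 * w) ** 2) / np.sqrt(2 * np.pi)

    # IPW estimator of E[Y(A + delta)]: reweight by g(A - delta | W) / g(A | W).
    weights = g(A - delta, W) / g(A, W)
    psi_hat = np.mean(weights * Y)
    print("IPW estimate of E[Y(A + delta)]:", psi_hat)
    print("True value in this design:      ", delta)  # E[Y] = 0, so E[Y(A+delta)] = delta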


[4] 2205.05843

A Survey of Risk-Aware Multi-Armed Bandits

In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we proceed to define risk-aware bandit problems. We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem -- both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research.


[5] 2205.05878

Training Uncertainty-Aware Classifiers with Conformalized Deep Learning

Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We address this problem by developing a novel training algorithm that can lead to more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate that this method leads to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.
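
For context on the conformal machinery this loss builds on, here is a minimal Python sketch of the standard split-conformal calibration step on hold-out data; this is the generic construction, not the paper's training algorithm, and the nonconformity score used is just one common choice.

    import numpy as np

    def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
        """Standard split-conformal sets for a classifier (not the paper's method).

        cal_probs:  (n_cal, K) predicted class probabilities on hold-out data
        cal_labels: (n_cal,)   true labels on hold-out data
        test_probs: (m, K)     predicted class probabilities on new points
        """
        n = len(cal_labels)
        # Nonconformity score: one minus the probability assigned to the true class.
        scores = 1.0 - cal_probs[np.arange(n), cal_labels]
        # Finite-sample-corrected (1 - alpha) quantile of calibration scores.
        q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        q = np.quantile(scores, q_level, method="higher")
        # Include every class whose score does not exceed the threshold.
        return [np.where(1.0 - p <= q)[0] for p in test_probs]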


[6] 2205.05910

Comments on: "Hybrid Semiparametric Bayesian Networks"

Invited discussion on the paper "Hybrid Semiparametric Bayesian Networks" by David Atienza, Pedro Larranaga and Concha Bielza (TEST, 2022).


[7] 2205.05932

The LAN property for McKean-Vlasov models in a mean-field regime

We establish the local asymptotic normality (LAN) property for estimating a multidimensional parameter in the drift of a system of $N$ interacting particles observed over a fixed time horizon in a mean-field regime $N \rightarrow \infty$. By implementing the classical theory of Ibragimov and Hasminski, we obtain in particular sharp results for the maximum likelihood estimator that go beyond its simple asymptotic normality thanks to H\'ajek's convolution theorem and strong controls of the likelihood process that yield asymptotic minimax optimality (up to constants). Our structural results shed some light on the accompanying nonlinear McKean-Vlasov experiment, and enable us to derive simple and explicit criteria to obtain identifiability and non-degeneracy of the Fisher information matrix. These conditions are also of interest for other recent studies on the topic of parametric inference for interacting diffusions.
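
For readers less familiar with the terminology, the LAN property at rate $N^{-1/2}$ amounts to the standard expansion of the log-likelihood ratio, written here in its generic form rather than with the paper's specific Fisher information for the interacting particle system:

$$\log \frac{dP^N_{\theta + N^{-1/2}u}}{dP^N_{\theta}} = u^{\top}\Delta_N(\theta) - \frac{1}{2}\, u^{\top} I(\theta)\, u + o_{P}(1), \qquad \Delta_N(\theta) \xrightarrow{d} \mathcal{N}\big(0, I(\theta)\big),$$

from which the H\'ajek convolution theorem and local asymptotic minimax bounds mentioned above follow by the classical Ibragimov-Hasminski theory.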


[8] 2205.05934

A Non-parametric Bayesian Model for Detecting Differential Item Functioning: An Application to Political Representation in the US

A common approach when studying the quality of representation involves comparing the latent preferences of voters and legislators, commonly obtained by fitting an item-response theory (IRT) model to a common set of stimuli. Despite being exposed to the same stimuli, voters and legislators may not share a common understanding of how these stimuli map onto their latent preferences, leading to differential item functioning (DIF) and incomparability of estimates. We explore the presence of DIF and incomparability of latent preferences obtained through IRT models by re-analyzing an influential survey data set, where survey respondents expressed their preferences on roll call votes that U.S. legislators had previously voted on. To do so, we propose defining a Dirichlet Process prior over item-response functions in standard IRT models. In contrast to typical multi-step approaches to detecting DIF, our strategy allows researchers to fit a single model, automatically identifying incomparable sub-groups with different mappings from latent traits onto observed responses. We find that although there is a group of voters whose estimated positions can be safely compared to those of legislators, a sizeable share of surveyed voters understand stimuli in fundamentally different ways. Ignoring these issues can lead to incorrect conclusions about the quality of representation.


[9] 2205.05955

Bayesian inference for stochastic oscillatory systems using the phase-corrected Linear Noise Approximation

Likelihood-based inference in stochastic non-linear dynamical systems, such as those found in chemical reaction networks and biological clock systems, is inherently complex and has largely been limited to small and unrealistically simple systems. Recent advances in analytically tractable approximations to the underlying conditional probability distributions enable long-term dynamics to be accurately modelled, and make the large number of model evaluations required for exact Bayesian inference much more feasible. We propose a new methodology for inference in stochastic non-linear dynamical systems exhibiting oscillatory behaviour and show the parameters in these models can be realistically estimated from simulated data. Preliminary analyses based on the Fisher Information Matrix of the model can guide the implementation of Bayesian inference. We show that this parameter sensitivity analysis can predict which parameters are practically identifiable. Several Markov chain Monte Carlo algorithms are compared, with our results suggesting a parallel tempering algorithm consistently gives the best approach for these systems, which are shown to frequently exhibit multi-modal posterior distributions.


[10] 2205.05993

On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach

Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions - including when saturated count models are used to synthesize categorical data - single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, this paper considers two ways of analysing multiple categorical data sets: because they admit a contingency table representation, the data sets can be averaged before being analysed, as opposed to the usual approach of averaging estimates post-analysis. This paper also introduces a pair of metrics, $\tau_3(k,d)$ and $\tau_4(k,d)$, that are suited for assessing disclosure risk in multiple categorical synthetic data sets. Finally, the synthesis methods are demonstrated empirically.
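
To make the pre- versus post-analysis averaging concrete, the following Python sketch generates $m$ synthetic 2x2 contingency tables by redrawing each cell from a Poisson distribution centred at the observed count (a simplified stand-in for the saturated count synthesis models mentioned above), then compares averaging the tables before analysis with averaging the estimates afterwards.

    import numpy as np

    rng = np.random.default_rng(1)

    # Original categorical data as a contingency table of counts (toy example).
    original_counts = np.array([[120, 45], [30, 25]])

    def synthesize(counts, m, rng):
        # Saturated count synthesis sketch: each cell redrawn as Poisson(observed count).
        return [rng.poisson(counts) for _ in range(m)]

    m = 5
    synthetic_tables = synthesize(original_counts, m, rng)

    # Pre-analysis averaging: combine the m tables first, then analyse once.
    averaged_table = np.mean(synthetic_tables, axis=0)

    # Post-analysis averaging: analyse each table, then average the estimates,
    # e.g., the log odds ratio in this 2x2 example.
    def log_odds_ratio(t):
        return np.log((t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0]))

    pre = log_odds_ratio(averaged_table)
    post = np.mean([log_odds_ratio(t) for t in synthetic_tables])
    print("pre-analysis averaging: ", pre)
    print("post-analysis averaging:", post)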


[11] 2205.06024

Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling

The Plackett-Luce (PL) model is ubiquitous in learning-to-rank (LTR) because it provides a useful and intuitive probabilistic model for sampling ranked lists. Counterfactual offline evaluation and optimization of ranking metrics are pivotal for using LTR methods in production. When adopting the PL model as a ranking policy, both tasks require the computation of expectations with respect to the model. These are usually approximated via Monte-Carlo (MC) sampling, since the combinatorial scaling in the number of items to be ranked makes their analytical computation intractable. Despite recent advances in improving the computational efficiency of the sampling process via the Gumbel top-k trick, the MC estimates can suffer from high variance. We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model by combining the Gumbel top-k trick with quasi-Monte Carlo (QMC) sampling, a well-established technique for variance reduction. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.
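
A minimal Python sketch of the combination described above: scrambled Sobol' points are pushed through the Gumbel inverse CDF and added to the log-scores, so that sorting yields Plackett-Luce rankings. The item scores, relevances, and the DCG@3 target below are hypothetical; this is the generic Gumbel top-$k$ plus QMC recipe, not the authors' exact estimator.

    import numpy as np
    from scipy.stats import qmc

    def sample_pl_rankings_qmc(log_scores, n_samples, seed=0):
        """Sample rankings from a Plackett-Luce model via the Gumbel top-k trick,
        driving the Gumbel noise with scrambled Sobol' points (a QMC sketch)."""
        d = len(log_scores)
        sobol = qmc.Sobol(d=d, scramble=True, seed=seed)
        u = sobol.random(n_samples)                 # low-discrepancy points in (0, 1)^d
        u = np.clip(u, 1e-12, 1 - 1e-12)
        gumbel = -np.log(-np.log(u))                # inverse-CDF transform to Gumbel(0, 1)
        # Sorting perturbed log-scores in descending order gives a PL-distributed ranking.
        return np.argsort(-(log_scores + gumbel), axis=1)

    # Toy use: estimate expected DCG@3 of a 5-item PL policy with hypothetical relevances.
    log_scores = np.log(np.array([5.0, 3.0, 2.0, 1.0, 0.5]))
    relevance = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
    rankings = sample_pl_rankings_qmc(log_scores, n_samples=1024)
    discounts = 1.0 / np.log2(np.arange(3) + 2)
    dcg = (relevance[rankings[:, :3]] * discounts).sum(axis=1)
    print("QMC estimate of expected DCG@3:", dcg.mean())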


[12] 2205.06087

A single risk approach to the semiparametric copula competing risks model

A typical situation in competing risks analysis is that the researcher is only interested in a subset of risks. This paper considers a dependent competing risks model in which the distribution of one risk follows a parametric or semi-parametric model, while the model for the other risks is left unspecified. Identifiability is shown for popular classes of parametric models and the semiparametric proportional hazards model. The identifiability of the parametric models does not require a covariate, while the semiparametric model requires at least one. Estimation approaches are suggested which are shown to be $\sqrt{n}$-consistent. Applicability and attractive finite sample performance are demonstrated with the help of simulations and data examples.


[13] 2205.06129

Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements

Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the na\"ive Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.
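
The classic BISG step that fBISG extends is a naive Bayes update, $P(R = r \mid \text{surname}, \text{block}) \propto P(R = r \mid \text{surname}) \, P(\text{block} \mid R = r)$, assuming surname and location are conditionally independent given race. A toy Python illustration with made-up probabilities (not real Census figures, and not the fBISG measurement-error extension):

    import numpy as np

    # Toy BISG-style Bayes update (illustrative numbers, not real Census data).
    races = ["white", "black", "hispanic", "asian", "other"]

    # P(race | surname): from a surname list, for a hypothetical surname.
    p_race_given_surname = np.array([0.05, 0.02, 0.03, 0.88, 0.02])

    # P(block | race): share of each race's national population living in this block.
    p_block_given_race = np.array([2e-6, 1e-6, 3e-6, 9e-6, 2e-6])

    # Naive Bayes step assumed by classic BISG: surname and location are
    # conditionally independent given race.
    posterior = p_race_given_surname * p_block_given_race
    posterior /= posterior.sum()

    for race, p in zip(races, posterior):
        print(f"P({race} | surname, block) = {p:.3f}")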


[14] 2205.06131

Framework for inferring empirical causal graphs from binary data to support multidimensional poverty analysis

Poverty is one of the fundamental issues that mankind faces. The Multidimensional Poverty Index (MPI) is deployed for measuring poverty in a population beyond its monetary aspects. However, MPI cannot provide information regarding associations and causal relations among poverty factors. Does education cause income inequality in a specific region? Is lacking education a cause of health issues? Without knowing causal relations, policy makers cannot pinpoint the root causes of poverty in a specific population, and these root causes might differ across populations. Additionally, MPI requires binary data, which cannot be analyzed by most causal inference frameworks. In this work, we propose an exploratory-data-analysis framework for finding possible causal relations, with confidence intervals, among binary data. The proposed framework reports not only how severe the issue of poverty is, but also the causal relations among poverty factors. Moreover, a confidence interval for the degree of causal direction indicates how strong a causal relation is. We evaluated the proposed framework against several baseline approaches on simulation datasets as well as on two real-world datasets as case studies: 1) twin births in the United States, examining the relation between birth weight and twin mortality, and 2) Thailand population surveys from 378k households in Chiang Mai and 353k households in Khon Kaen provinces. Our framework performed better than the baselines in most cases. The first case study reveals that almost all twin mortality cases involve low birth weight, but not all low-birth-weight twins died. The second case study reveals that smoking is associated with drinking alcohol in both provinces, while a causal relation from smoking to drinking alcohol is found only in Chiang Mai province. The framework can be applied beyond the poverty context.


[15] 2205.06163

Gaussian Whittle-Matérn fields on metric graphs

We define a new class of Gaussian processes on compact metric graphs such as street or river networks. The proposed models, the Whittle-Mat\'ern fields, are defined via a fractional stochastic partial differential equation on the compact metric graph and are a natural extension of Gaussian fields with Mat\'ern covariance functions on Euclidean domains to the non-Euclidean metric graph setting. Existence of the processes, as well as their sample path regularity properties are derived. The model class in particular contains differentiable Gaussian processes. To the best of our knowledge, this is the first construction of a valid differentiable Gaussian field on general compact metric graphs. We then focus on a model subclass which we show contains processes with Markov properties. For this case, we show how to evaluate finite dimensional distributions of the process exactly and computationally efficiently. This facilitates using the proposed models for statistical inference without the need for any approximations. Finally, we derive some of the main statistical properties of the model class, such as consistency of maximum likelihood estimators of model parameters and asymptotic optimality properties of linear prediction based on the model with misspecified parameters.
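
For orientation, the Euclidean construction that this work extends is the Whittle-Mat\'ern link between an SPDE and the Mat\'ern covariance: on $\mathbb{R}^d$, the solution $u$ of

$$(\kappa^2 - \Delta)^{\alpha/2} (\tau u) = \mathcal{W}$$

has Mat\'ern covariance $C(h) = \frac{\sigma^2\, 2^{1-\nu}}{\Gamma(\nu)} (\kappa \lVert h \rVert)^{\nu} K_{\nu}(\kappa \lVert h \rVert)$ with smoothness $\nu = \alpha - d/2$; the paper defines the analogous fractional SPDE on a compact metric graph.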


[16] 2205.06256

Real-time Monitoring of Functional Data

With the rise of Industry 4.0, huge amounts of data are now generated that are apt to be modelled as functional data. In this setting, standard profile monitoring methods aim to assess the stability over time of a completely observed functional quality characteristic. However, in some practical situations, evaluating the process state in real-time, i.e., as the process is running, could be of great interest to significantly improve the effectiveness of monitoring. To this aim, we propose a new method, referred to as functional real-time monitoring (FRTM), that is able to account for both phase and amplitude variation through the following steps: (i) registration; (ii) dimensionality reduction; (iii) monitoring of a partially observed functional quality characteristic. An extensive Monte Carlo simulation study is performed to quantify the performance of FRTM with respect to two competing methods. Finally, an example is presented where the proposed method is used to monitor batches from a penicillin production process in real-time.


[17] 2205.05779

Multivariate ordered discrete response models

We introduce multivariate ordered discrete response models that exhibit non-lattice structures. From the perspective of behavioral economics, these models correspond to broad bracketing in decision making, whereas lattice models, which researchers typically estimate in practice, correspond to narrow bracketing. There is also a class of hierarchical models, which nests lattice models and is a special case of non-lattice models. Hierarchical models correspond to sequential decision making and can be represented by binary decision trees. In each of these cases, we specify latent processes as a sum of an index of covariates and an unobserved error, with unobservables for different latent processes potentially correlated. This additional dependence further complicates the identification of model parameters in non-lattice models. We give conditions sufficient to guarantee identification under the independence of errors and covariates, compare these conditions to what is required to attain identification in lattice models and outline an estimation approach. Finally, we provide simulations and empirical examples, through which we discuss the case when unobservables follow a distribution from a known parametric family, focusing on popular probit specifications.


[18] 2205.05800

Stochastic first-order methods for average-reward Markov decision processes

We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
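
For reference, the textbook baseline that the VRTD/EVRTD methods improve upon is differential (average-reward) TD(0) with linear function approximation; a Python sketch follows, with env_step, featurize, and policy as user-supplied stand-ins. This is the plain method, without the paper's variance reduction or exploration scheme.

    import numpy as np

    def differential_td0(env_step, featurize, policy, d, n_steps,
                         alpha=0.01, beta=0.01, init_state=0, seed=0):
        """Plain average-reward (differential) TD(0) with linear value approximation.
        This is the textbook baseline, not the paper's variance-reduced VRTD/EVRTD.

        env_step(s, a) -> (reward, next_state); featurize(s) -> length-d feature vector;
        policy(s, rng) -> action.  All of these are user-supplied stand-ins.
        """
        rng = np.random.default_rng(seed)
        w = np.zeros(d)          # linear value-function weights
        avg_reward = 0.0         # running estimate of the average reward
        s = init_state
        for _ in range(n_steps):
            a = policy(s, rng)
            r, s_next = env_step(s, a)
            x, x_next = featurize(s), featurize(s_next)
            delta = r - avg_reward + w @ x_next - w @ x   # differential TD error
            w += alpha * delta * x
            avg_reward += beta * delta
            s = s_next
        return w, avg_reward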


[19] 2205.05838

Orthogonal Gromov-Wasserstein Discrepancy with Efficient Lower Bound

Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.


[20] 2205.05848

An MMSE Lower Bound via Poincaré Inequality

This paper studies the minimum mean squared error (MMSE) of estimating $\mathbf{X} \in \mathbb{R}^d$ from the noisy observation $\mathbf{Y} \in \mathbb{R}^k$, under the assumption that the noise (i.e., $\mathbf{Y}|\mathbf{X}$) is a member of the exponential family. The paper provides a new lower bound on the MMSE. Towards this end, an alternative representation of the MMSE is first presented, which is argued to be useful in deriving closed-form expressions for the MMSE. This new representation is then used together with the Poincar\'e inequality to provide a new lower bound on the MMSE. Unlike, for example, the Cram\'{e}r-Rao bound, the new bound holds for all possible distributions on the input $\mathbf{X}$. Moreover, the lower bound is shown to be tight in the high-noise regime for the Gaussian noise setting under the assumption that $\mathbf{X}$ is sub-Gaussian. Finally, several numerical examples are shown which demonstrate that the bound performs well in all noise regimes.
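
For completeness, the quantity being bounded is the usual minimum mean squared error, attained by the conditional mean estimator:

$$\mathrm{mmse}(\mathbf{X} \mid \mathbf{Y}) = \mathbb{E}\big[\lVert \mathbf{X} - \mathbb{E}[\mathbf{X} \mid \mathbf{Y}] \rVert^2\big] = \inf_{f}\, \mathbb{E}\big[\lVert \mathbf{X} - f(\mathbf{Y}) \rVert^2\big];$$

the paper's contribution is a lower bound on this quantity via the Poincar\'e inequality, whereas the display above is only the standard definition.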


[21] 2205.05885

Sampling Online Social Networks: Metropolis Hastings Random Walk and Random Walk

As social network analysis (SNA) has drawn much attention in recent years, one bottleneck of SNA is that these network data are too massive to handle. Furthermore, some network data are not accessible due to privacy problems. Therefore, we have to develop sampling methods to draw representative sample graphs from the population graph. In this paper, the Metropolis-Hastings Random Walk (MHRW) and Random Walk with Jumps (RWwJ) sampling strategies are introduced, including the procedure for collecting nodes, the underlying mathematical theory, and the corresponding estimators. We compare our methods with existing research outcomes and find that MHRW performs better when estimating the degree distribution (61% less error than RWwJ) and graph order (0.69% less error than RWwJ), while RWwJ better estimates the average follower-to-following ratio and the proportion of mutual relationships among adjacent nodes, with 13% and 6% less error, respectively, than MHRW. We analyze the reasons for these outcomes and give possible directions for future work.
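
A compact Python sketch of the MHRW sampler described above, on a toy undirected graph stored as an adjacency dictionary: proposing a uniformly random neighbour and accepting a move from the current node v to a proposed node u with probability min(1, deg(v)/deg(u)) makes the stationary distribution uniform over nodes, so plain sample averages estimate node-level quantities such as the average degree. The graph and quantity estimated here are hypothetical.

    import random

    def mhrw_sample(adj, start, n_samples, seed=0):
        """Metropolis-Hastings random walk over a graph given as an adjacency dict."""
        rng = random.Random(seed)
        v = start
        samples = []
        for _ in range(n_samples):
            u = rng.choice(adj[v])                       # propose a random neighbour
            if rng.random() <= len(adj[v]) / len(adj[u]):
                v = u                                    # accept; otherwise stay at v
            samples.append(v)
        return samples

    # Hypothetical usage on a small undirected graph.
    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    nodes = mhrw_sample(adj, start=0, n_samples=10000)
    avg_degree = sum(len(adj[v]) for v in nodes) / len(nodes)
    print("estimated average degree:", avg_degree)   # true uniform average is 2.5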


[22] 2205.06069

Sequential algorithms for testing identity and closeness of distributions

What advantage do \emph{sequential} procedures provide over batch algorithms for testing properties of unknown distributions? Focusing on the problem of testing whether two distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ on $\{1,\dots, n\}$ are equal or $\epsilon$-far, we give several answers to this question. We show that for a small alphabet size $n$, there is a sequential algorithm that outperforms any batch algorithm by a factor of at least $4$ in terms of sample complexity. For a general alphabet size $n$, we give a sequential algorithm that uses no more samples than its batch counterpart, and possibly fewer if the actual distance $TV(\mathcal{D}_1, \mathcal{D}_2)$ between $\mathcal{D}_1$ and $\mathcal{D}_2$ is larger than $\epsilon$. As a corollary, letting $\epsilon$ go to $0$, we obtain a sequential algorithm for testing closeness when no a priori bound on $TV(\mathcal{D}_1, \mathcal{D}_2)$ is given that has a sample complexity $\tilde{\mathcal{O}}(\frac{n^{2/3}}{TV(\mathcal{D}_1, \mathcal{D}_2)^{4/3}})$: this improves over the $\tilde{\mathcal{O}}(\frac{n/\log n}{TV(\mathcal{D}_1, \mathcal{D}_2)^{2} })$ tester of \cite{daskalakis2017optimal} and is optimal up to multiplicative constants. We also establish limitations of sequential algorithms for the problem of testing identity and closeness: they can improve the worst case number of samples by at most a constant factor.


[23] 2205.06127

Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks

A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.


[24] 2205.06226

The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows that the negative term in the contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why, even when trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterize the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. The acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focusing only on the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
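
A small PyTorch sketch of the specific architectural choice highlighted above: a prediction head initialized as the identity matrix whose off-diagonal entries are the only trainable weights. The layer name and shapes are illustrative; this reproduces the described initialization, not the paper's full training setup.

    import torch
    import torch.nn as nn

    class OffDiagonalPredictionHead(nn.Module):
        """Linear prediction head initialized to the identity, with only the
        off-diagonal entries trainable (a sketch of the setting described above)."""

        def __init__(self, dim: int):
            super().__init__()
            self.off = nn.Parameter(torch.zeros(dim, dim))       # trainable part
            self.register_buffer("eye", torch.eye(dim))           # frozen diagonal
            self.register_buffer("mask", 1.0 - torch.eye(dim))    # zero out the diagonal

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            weight = self.eye + self.off * self.mask  # identity + trainable off-diagonal
            return z @ weight.T

    # Usage in a BYOL-style pipeline (hypothetical shapes): the online branch's
    # representation passes through the head before being matched to the
    # stop-gradient target representation.
    head = OffDiagonalPredictionHead(dim=128)
    z_online = torch.randn(32, 128)
    p = head(z_online)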