### Ensemble sampler for infinite-dimensional inverse problems

We introduce a new Markov chain Monte Carlo sampler for infinite-dimensional Bayesian inverse problems. The sampler is based on the affine invariant ensemble sampler, which we extend to function spaces for the first time. The new sampler is more efficient than preconditioned Crank-Nicolson, yet requires no gradient or posterior covariance information, making it broadly applicable.
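
The affine invariant ensemble sampler that the method builds on is easiest to see in finite dimensions. Below is a minimal sketch of the standard Goodman-Weare stretch move (the function-space extension is the paper's contribution and is not reproduced here); the target density, walker count, and stretch scale are illustrative.

```python
import math
import random

def stretch_move_step(walkers, log_post, a=2.0, rng=random):
    """One sweep of the Goodman-Weare affine-invariant stretch move."""
    n, dim = len(walkers), len(walkers[0])
    new = [w[:] for w in walkers]
    for k in range(n):
        j = rng.randrange(n - 1)  # pick a complementary walker j != k
        if j >= k:
            j += 1
        # stretch factor z has density proportional to 1/sqrt(z) on [1/a, a]
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
        proposal = [new[j][d] + z * (new[k][d] - new[j][d]) for d in range(dim)]
        # accept with probability min(1, z^(dim-1) * pi(proposal) / pi(current))
        log_accept = (dim - 1) * math.log(z) + log_post(proposal) - log_post(new[k])
        if math.log(rng.random()) < log_accept:
            new[k] = proposal
    return new

# usage: sample a standard 2-D Gaussian with 10 walkers
rng = random.Random(0)
log_post = lambda x: -0.5 * sum(v * v for v in x)
walkers = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(10)]
for _ in range(100):
    walkers = stretch_move_step(walkers, log_post, rng=rng)
```

Note that the move uses only density evaluations, which is why no gradient or covariance information is required.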

### Space-Time Covid-19 Bayesian SIR modeling in South Carolina

The Covid-19 pandemic has spread across the world since the beginning of 2020, and many regions have experienced its effects. The state of South Carolina in the USA has seen cases since early March 2020 and a first peak in early April 2020. A lockdown was imposed on April 6th, but lifting of restrictions began on April 24th. We analyze the daily case and death data reported by the NCHS (deaths) via the New York Times GitHub repository and present approaches to modeling the data. Prediction is also considered, and the role of asymptomatic transmission is assessed as a latent unobserved effect. Two different time periods are examined, and one-step prediction is provided.

### Intrinsic Sliced Wasserstein Distances for Comparing Collections of Probability Distributions on Manifolds and Graphs

Collections of probability distributions arise in a variety of statistical applications ranging from user activity pattern analysis to brain connectomics. In practice these distributions are represented by histograms over diverse domain types including finite intervals, circles, cylinders, spheres, other manifolds, and graphs. This paper introduces an approach for detecting differences between two collections of histograms over such general domains. To this end, we introduce the intrinsic slicing construction that yields a novel class of Wasserstein distances on manifolds and graphs. These distances are Hilbert embeddable, which allows us to reduce the histogram collection comparison problem to the comparison of means in a high-dimensional Euclidean space. We develop a hypothesis testing procedure based on conducting t-tests on each dimension of this embedding, then combining the resulting p-values using recently proposed p-value combination techniques. Our numerical experiments in a variety of data settings show that the resulting tests are powerful and the p-values are well-calibrated. Example applications to user activity patterns, spatial data, and brain connectomics are provided.
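
The abstract does not name the p-value combiner; one recently proposed technique of the kind described is the Cauchy combination test, sketched here as an illustrative assumption:

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Cauchy combination test (Liu & Xie): pool p-values that may be dependent.

    The statistic sum_i w_i * tan((0.5 - p_i) * pi) is approximately standard
    Cauchy under the global null, which yields the combined p-value below.
    """
    m = len(pvalues)
    if weights is None:
        weights = [1.0 / m] * m
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(stat) / math.pi

# identical p-values combine to themselves
print(round(cauchy_combination([0.05, 0.05, 0.05]), 6))  # → 0.05
```

In the paper's setting, the inputs would be the per-dimension t-test p-values computed on the Hilbert embedding of the histogram collections.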

### CONQ: CONtinuous Quantile Treatment Effects for Large-Scale Online Controlled Experiments

In many industry settings, online controlled experimentation (A/B testing) has been broadly adopted as the gold standard to measure product or feature impacts. Most research has focused on user engagement metrics, measuring treatment effects at the mean (average treatment effects, ATE); only a few studies have focused on performance metrics (e.g. latency), where treatment effects are measured at quantiles. Measuring quantile treatment effects (QTE) is challenging due to difficulties such as the dependency introduced by clustered samples, scalability issues, and density bandwidth choices. In addition, previous literature has mainly focused on QTE at pre-defined locations, such as P50 or P90, which does not always convey the full picture. In this paper, we propose a novel scalable non-parametric solution, which provides a continuous range of QTE with point-wise confidence intervals while circumventing density estimation altogether. Numerical results show high consistency with traditional methods that rely on asymptotic normality. An end-to-end pipeline has been implemented at Snap Inc., providing daily insights on key performance metrics at a distributional level.
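
One standard way to obtain pointwise quantile confidence intervals without density estimation is the order-statistic (binomial) construction; the sketch below illustrates that general idea under i.i.d. sampling and is not Snap's pipeline:

```python
import math

def quantile_ci(sorted_sample, q, z=1.96):
    """Distribution-free CI for the q-th quantile from order statistics.

    The count of observations below the true quantile is Binomial(n, q);
    a normal approximation picks the order-statistic ranks, so no density
    estimation is needed anywhere.
    """
    n = len(sorted_sample)
    center = n * q
    half = z * math.sqrt(n * q * (1.0 - q))
    lo = max(0, int(math.floor(center - half)))
    hi = min(n - 1, int(math.ceil(center + half)))
    return sorted_sample[lo], sorted_sample[hi]

# usage: pointwise 95% interval for the P90 of latencies 1..1000 ms
sample = list(range(1, 1001))
lo, hi = quantile_ci(sample, 0.90)
```

A treatment-vs-control QTE at level q is then the difference of the two sample quantiles; sweeping q over a grid gives a continuous range of QTE with pointwise intervals, in the spirit the abstract describes.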

### Nonparametric estimation of copulas and copula densities by orthogonal projections

In this paper we study nonparametric estimators of copulas and copula densities. We first focus on a copula density estimator based on an orthogonal polynomial projection of the joint density; a new copula estimator is then deduced from it. We study its asymptotic properties: we exhibit a large functional class for which this construction is optimal in the minimax and maxiset senses, and we propose a selection method for the smoothing parameter. An intensive simulation study shows the very good performance of both the copula and copula density estimators, which we compare to a large panel of competitors. A real dataset from actuarial science illustrates the approach.

### Classification Accuracy and Parameter Estimation in Multilevel Contexts: A Study of Conditional Nonparametric Multilevel Latent Class Analysis

The current research has two aims: first, to demonstrate the utility of conditional nonparametric multilevel latent class analysis (NP-MLCA) for multi-site program evaluation using an empirical dataset; second, to investigate how the classification accuracy and parameter estimation of a conditional NP-MLCA are affected by six study factors: the quality of latent class indicators, the number of latent class indicators, level-1 covariate effects, cross-level covariate effects, the number of level-2 units, and the size of level-2 units. A total of 96 conditions were examined in a simulation study. The resulting classification accuracy rates and the power and type-I error of cross-level covariate effects and contextual effects suggest that the nonparametric multilevel latent class model can be applied broadly in multilevel contexts.

### The Performance Analysis of Generalized Margin Maximizer (GMM) on Separable Data

Logistic models are commonly used for binary classification tasks. The success of such models has often been attributed to their connection to maximum-likelihood estimators. It has been shown that the gradient descent algorithm, when applied to the logistic loss, converges to the max-margin classifier (a.k.a. hard-margin SVM), and the performance of the max-margin classifier has recently been analyzed. Inspired by these results, we present and study a more general setting in which the underlying parameters of the logistic model possess certain structures (sparse, block-sparse, low-rank, etc.) and introduce a more general framework, referred to as the "Generalized Margin Maximizer" (GMM). While classical max-margin classifiers minimize the $2$-norm of the parameter vector subject to linearly separating the data, GMM minimizes an arbitrary convex function of the parameter vector. We provide a precise analysis of the performance of GMM via the solution of a system of nonlinear equations. We also provide a detailed study of three special cases: ($1$) $\ell_2$-GMM, which is the max-margin classifier; ($2$) $\ell_1$-GMM, which encourages sparsity; and ($3$) $\ell_{\infty}$-GMM, which is often used when the parameter vector has binary entries. Our theoretical results are validated by extensive simulations across a range of parameter values, problem instances, and model structures.

### An Exact Solution Path Algorithm for SLOPE and Quasi-Spherical OSCAR

The sorted $L_1$ penalization estimator (SLOPE) is a regularization technique for sorted absolute coefficients in high-dimensional regression. By setting its regularization weights $\lambda$ under a monotonicity constraint, SLOPE can attain various feature selection and clustering properties. However, the selected features and their clusters are very sensitive to the tuning of the weights, and exhaustively tracking their changes is difficult with grid search methods. This study presents a solution path algorithm that provides the complete and exact path of solutions for SLOPE as the regularization weights are fine-tuned. A simple optimality condition for SLOPE is derived and used to specify the next breakpoint of the solution path. This study also proposes a new design of the regularization sequence $\lambda$ for feature clustering, called the quasi-spherical and octagonal shrinkage and clustering algorithm for regression (QS-OSCAR), whose regularization term has a contour surface most similar to a sphere. Among several regularization sequence designs, sparsity and clustering performance are compared through simulation studies. The numerical results show that QS-OSCAR performs feature clustering more efficiently than the other designs.
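
Optimality conditions for SLOPE are typically phrased through the proximal operator of the sorted-$L_1$ penalty; here is a minimal pool-adjacent-violators sketch of that prox (standard background, not the paper's path algorithm):

```python
def prox_sorted_l1(v, lam):
    """Prox of the sorted-L1 (SLOPE) penalty via pool-adjacent-violators.

    v: input vector; lam: nonincreasing nonnegative weights, len(lam) == len(v).
    """
    order = sorted(range(len(v)), key=lambda i: -abs(v[i]))
    z = [abs(v[i]) - l for i, l in zip(order, lam)]
    blocks = []  # (start, end, block average), kept strictly decreasing
    for i, zi in enumerate(z):
        blocks.append((i, i, zi))
        while len(blocks) > 1 and blocks[-2][2] <= blocks[-1][2]:
            s2, e2, a2 = blocks.pop()
            s1, e1, a1 = blocks.pop()
            n1, n2 = e1 - s1 + 1, e2 - s2 + 1
            blocks.append((s1, e2, (n1 * a1 + n2 * a2) / (n1 + n2)))
    out = [0.0] * len(v)
    for s, e, a in blocks:
        val = max(a, 0.0)
        for j in range(s, e + 1):
            i = order[j]
            out[i] = val if v[i] >= 0 or val == 0.0 else -val
    return out

# equal weights reduce to plain soft thresholding
print(prox_sorted_l1([3.0, -1.0, 0.5], [1.0, 1.0, 1.0]))  # → [2.0, 0.0, 0.0]
# a decreasing weight sequence clusters tied coefficients
print(prox_sorted_l1([1.0, 1.0], [0.5, 0.2]))  # → [0.65, 0.65]
```

The second call shows the clustering behavior the abstract refers to: distinct weights applied to sorted magnitudes pull coefficients onto common values.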

### Staged trees are curved exponential families

Staged tree models are a discrete generalization of Bayesian networks. We show that these form curved exponential families and derive their natural parameters, sufficient statistic, and cumulant-generating function as functions of their graphical representation. We give necessary graphical criteria for classifying regular subfamilies and discuss implications for model selection.

### On the robustness of kernel-based pairwise learning

It is shown that many results on the statistical robustness of kernel-based pairwise learning can be derived under essentially no assumptions on the input and output spaces. In particular, neither moment conditions on the conditional distribution of Y given X = x nor boundedness of the output space are needed. We obtain results on the existence and boundedness of the influence function and show qualitative robustness of the kernel-based estimator. The present paper generalizes results by Christmann and Zhou (2016) by allowing the prediction function to take two arguments, and can thus be applied in a variety of situations such as ranking.

### Matérn Gaussian Processes on Graphs

Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes - a widely-used model class in the Euclidean setting - to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks.
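
Under the SPDE characterization, a graph Matérn kernel can be written as $K = (\frac{2\nu}{\kappa^2} I + L)^{-\nu}$ for the graph Laplacian $L$ (marginal-variance scaling omitted). A minimal sketch for integer $\nu$ on a small path graph, assuming this form:

```python
def mat_inverse(a):
    """Invert a small dense matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def graph_matern_kernel(laplacian, nu=1, kappa=1.0):
    """Graph Matérn covariance K = (2*nu/kappa**2 * I + L)**(-nu), integer nu."""
    n = len(laplacian)
    alpha = 2.0 * nu / kappa ** 2
    m = [[laplacian[i][j] + (alpha if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    inv = mat_inverse(m)
    k = inv
    for _ in range(nu - 1):
        k = mat_mul(k, inv)
    return k

# path graph on three nodes: covariance decays with graph distance
L = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
K = graph_matern_kernel(L, nu=2, kappa=1.0)
```

In practice one would compute this via an eigendecomposition of the Laplacian; the dense inverse here is purely for illustration on a tiny graph.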

### Investigating the Robustness of Artificial Intelligent Algorithms with Mixture Experiments

Artificial intelligence (AI) algorithms, such as deep learning and XGBoost, are used in numerous applications including computer vision, autonomous driving, and medical diagnostics. The robustness of these AI algorithms is of great interest, as inaccurate predictions could result in safety concerns and limit the adoption of AI systems. In this paper, we propose a framework based on design of experiments to systematically investigate the robustness of AI classification algorithms. A robust classification algorithm is expected to have high accuracy and low variability under different application scenarios. Robustness can be affected by a wide range of factors, such as the imbalance of class labels in the training dataset, the chosen prediction algorithm, the chosen dataset for the application, and a change of distribution between the training and test datasets. To investigate the robustness of AI classification algorithms, we conduct a comprehensive set of mixture experiments to collect prediction performance results. Statistical analyses are then conducted to understand how various factors affect the robustness of AI classification algorithms. We summarize our findings and provide suggestions to practitioners in AI applications.

### Low-Variance Policy Gradient Estimation with World Models

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, the return of the imagined trajectories is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.

### Teaching a GAN What Not to Learn

Generative adversarial networks (GANs) were originally envisioned as unsupervised generative models that learn to follow a target distribution. Variants such as conditional GANs and auxiliary-classifier GANs (ACGANs) project GANs onto supervised and semi-supervised learning frameworks by providing labelled data and using multi-class discriminators. In this paper, we approach the supervised GAN problem from a different perspective, one that is motivated by the philosophy of the famous Persian poet Rumi, who said, "The art of knowing is knowing what to ignore." In the GAN framework, we not only provide the GAN positive data that it must learn to model, but also present it with so-called negative samples that it must learn to avoid - we call this "The Rumi Framework." This formulation allows the discriminator to represent the underlying target distribution better by learning to penalize generated samples that are undesirable - we show that this capability accelerates the learning process of the generator. We present a reformulation of the standard GAN (SGAN) and least-squares GAN (LSGAN) within the Rumi setting. The advantage of the reformulation is demonstrated by means of experiments conducted on MNIST, Fashion MNIST, CelebA, and CIFAR-10 datasets. Finally, we consider an application of the proposed formulation to address the important problem of learning an under-represented class in an unbalanced dataset. The Rumi approach results in substantially lower FID scores than the standard GAN frameworks while possessing better generalization capability.

### Generalization bounds for deep thresholding networks

We consider compressive sensing in the scenario where the sparsity basis (dictionary) is not known in advance, but needs to be learned from examples. Motivated by the well-known iterative soft thresholding algorithm for the reconstruction, we define deep networks parametrized by the dictionary, which we call deep thresholding networks. Based on training samples, we aim at learning the optimal sparsifying dictionary and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. The dictionary learning is performed via minimizing the empirical risk. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks. We obtain estimates of the sample complexity that depend only linearly on the dimensions and on the depth.
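
The iterative soft thresholding algorithm that motivates these networks alternates a gradient step with a soft-thresholding step; here is a minimal sketch on a toy dictionary (illustrative background, not the learned networks of the paper):

```python
def soft_threshold(x, t):
    """Elementwise soft-thresholding, the prox of the l1 norm."""
    return [max(abs(xi) - t, 0.0) * (1.0 if xi >= 0 else -1.0) for xi in x]

def ista(A, y, lam, step, iters=200):
    """Iterative soft thresholding for min_x 0.5*||A x - y||^2 + lam*||x||_1."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        grad = [sum(A[i][j] * (Ax[i] - y[i]) for i in range(m)) for j in range(n)]
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x

# with an orthonormal dictionary, ISTA reduces to soft thresholding of A^T y
A = [[1.0, 0.0], [0.0, 1.0]]
x_hat = ista(A, [3.0, 0.2], lam=0.5, step=1.0)  # → [2.5, 0.0]
```

A deep thresholding network unrolls a fixed number of such iterations and treats the dictionary entries as learnable parameters; the generalization bounds of the paper control the hypothesis class of these unrolled networks.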

### Post-selection inference with HSIC-Lasso

Detecting influential features in complex (non-linear and/or high-dimensional) datasets is key to extracting the relevant information. Most popular selection procedures, however, require assumptions on the underlying data, such as distributional ones, which rarely agree with empirical observations. Feature selection based on nonlinear methods, such as the model-free HSIC-Lasso, is therefore a more relevant approach. In order to ensure valid inference on the chosen features, the selection procedure must be accounted for. In this paper, we propose selective inference with HSIC-Lasso using the framework of truncated Gaussians together with the polyhedral lemma. Based on these theoretical foundations, we develop an algorithm that allows for low computational costs and handles the hyper-parameter selection issue. The relevance of our method is illustrated using artificial and real-world datasets. In particular, our empirical findings emphasise that type-I error control at the considered level can be achieved.

### Independence Tests Without Ground Truth for Noisy Learners

Exact ground truth invariant polynomial systems can be written for arbitrarily correlated binary classifiers. Their solutions give estimates for sample statistics that require knowledge of the ground truth of the correct labels in the sample. Of these polynomial systems, only a few have been solved in closed form. Here we discuss the exact solution for independent binary classifiers - resolving an outstanding problem that has been presented at this conference and others. Its practical applicability is hampered by its sole remaining assumption - the classifiers need to be independent in their sample errors. We discuss how to use the closed form solution to create a self-consistent test that can validate the independence assumption itself absent the correct labels ground truth. It can be cast as an algebraic geometry conjecture for binary classifiers that remains unsolved. A similar conjecture for the ground truth invariant algebraic system for scalar regressors is solvable, and we present the solution here. We also discuss experiments on the Penn ML Benchmark classification tasks that provide further evidence that the conjecture may be true for the polynomial system of binary classifiers.

### A statistical model to assess risk for supporting SARS-CoV-2 quarantine decisions

In February 2020 the first human infection with SARS-CoV-2 was reported in Germany. Since then, the local public health offices have been responsible for monitoring and reacting to the dynamics of the pandemic. One of their major tasks is to contain the spread of the virus after potential spreading events, for example when one or multiple participants have a positive test result after a group meeting (e.g. at school, at a sports event or at work). In this case, contacts of the infected person have to be traced and are potentially quarantined (at home) for a period of time. When all relevant contact persons obtain a negative polymerase chain reaction (PCR) test result, the quarantine may be stopped. However, tracing and testing all contacts is time-consuming, costly and thus not always feasible. This motivates our work, in which we present a statistical model for the probability that no transmission of SARS-CoV-2 occurred given an arbitrary number of test results at potentially different time points. The time-dependent sensitivity and specificity of the conducted PCR tests are taken into account. We employ a parametric Bayesian model which can be adapted to different situations when specific prior knowledge is available. This is illustrated for group events in German school classes and applied to exemplary real-world data from this context. Our approach has the potential to support important quarantine decisions, with the goal of achieving a better balance between necessary containment of the pandemic and preservation of social and economic life. Future work should focus on further refinement and evaluation of quarantine decisions based on our statistical model.

### Learning interaction kernels in mean-field equations of 1st-order systems of interacting particles

We introduce a nonparametric algorithm to learn interaction kernels of mean-field equations for 1st-order systems of interacting particles. The data consist of discrete space-time observations of the solution. By least squares with regularization, the algorithm learns the kernel efficiently on data-adaptive hypothesis spaces. A key ingredient is a probabilistic error functional derived from the likelihood of the mean-field equation's diffusion process. The estimator converges, in a reproducing kernel Hilbert space and an L2 space under an identifiability condition, at a rate that is optimal in the sense that it equals the order of the numerical integrator. We demonstrate our algorithm on three typical examples: opinion dynamics with a piecewise linear kernel, the granular media model with a quadratic kernel, and the aggregation-diffusion equation with a repulsive-attractive kernel.

### Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas

In this paper we review Bernstein and grid-type copulas for arbitrary dimensions and general grid resolutions in connection with discrete random vectors possessing uniform margins. We further suggest a pragmatic way to fit the dependence structure of multivariate data to Bernstein copulas via grid-type copulas and empirical contingency tables. Finally, we discuss a Monte Carlo study for the simulation and PML estimation of aggregate dependent losses from observed windstorm and flooding data.

### Attentive Clustering Processes

Amortized approaches to clustering have recently received renewed attention thanks to novel objective functions that exploit the expressiveness of deep learning models. In this work we revisit a recent proposal for fast amortized probabilistic clustering, the Clusterwise Clustering Process (CCP), which yields samples from the posterior distribution of cluster labels for sets of arbitrary size using only O(K) forward network evaluations, where K is an arbitrary number of clusters. While adequate in simple datasets, we show that the model can severely underfit complex datasets, and hypothesize that this limitation can be traced back to the implicit assumption that the probability of a point joining a cluster is equally sensitive to all the points available to join the same cluster. We propose an improved model, the Attentive Clustering Process (ACP), that selectively pays more attention to relevant points while preserving the invariance properties of the generative model. We illustrate the advantages of the new model in applications to spike-sorting in multi-electrode arrays and community discovery in networks. The latter case combines the ACP model with graph convolutional networks, and to our knowledge is the first deep learning model that handles an arbitrary number of communities.

### Spatiotemporal effects of the causal factors on COVID-19 incidences in the contiguous United States

Since December 2019, the world has been witnessing the enormous effects of the unprecedented global pandemic, COVID-19, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). So far, 38,619,674 confirmed cases and 1,093,522 confirmed deaths due to COVID-19 have been reported. In the United States (US), the cases and deaths are recorded as 7,833,851 and 215,199, respectively. Several timely studies have discussed the local and global effects of confounding factors on COVID-19 casualties in the US. However, most of these studies paid little attention to the time-varying associations between and among these factors, which are crucial for understanding the outbreak of the present pandemic. This study therefore adopts several relevant approaches, including local and global spatial regression models and machine learning, to explore the causal effects of confounding factors on COVID-19 counts in the contiguous US. Five spatial regression models, the spatial lag model (SLM), ordinary least squares (OLS), the spatial error model (SEM), geographically weighted regression (GWR) and multiscale geographically weighted regression (MGWR), are fitted at the county scale to account for scale effects in the modelling. For COVID-19 cases, ethnicity, crime, and income factors are found to be the strongest covariates and explain the maximum model variance. For COVID-19 deaths, both (domestic and international) migration and income factors play a crucial role in explaining the spatial differences in COVID-19 death counts across counties. The local coefficient of determination (R2) values derived from the GWR and MGWR models are found to be very high over the Wisconsin-Indiana-Michigan (Great Lakes) region, as well as several parts of Texas, California, Mississippi and Arkansas.

### Domain adaptation under structural causal models

Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions of our theory are violated.

### COVID-19 incidences and its association with environmental quality: A country-level assessment in India

This study explored the association between five key air pollutants (nitrogen dioxide (NO2), sulphur dioxide (SO2), particulate matter (PM2.5 and PM10), and carbon monoxide (CO)) and COVID-19 incidences in India. COVID-19 confirmed cases, air pollution concentrations and meteorological variables (temperature, wind speed, surface pressure) at the district and city scale were obtained for 2019 and 2020. The location-based air pollution observations were converted to a raster surface using interpolation. The highest numbers of deaths and positive cases reported so far were found in Mumbai (436 and 11394), followed by Ahmedabad (321 and 4991), Pune (129 and 2129), Kolkata (99 and 783), Indore (83 and 1699), Jaipur (53 and 1111), Ujjain (42 and 201), Surat (37 and 799), Vadodara (31 and 400), Chennai (23 and 2647), Bhopal (22 and 652), and Thane (21 and 1889). Unlike other studies, this study found no substantial association between air pollution and COVID-19 incidences at the district level. Considering the number of confirmed cases, the coefficient of determination (R2) values were estimated as 0.003 for PM2.5, 0.002 for PM10 and SO2, 0.001 for CO, and 0.0002 for NO2. This suggests no significant association between air pollution and COVID-19 incidences (both confirmed cases and deaths) in India; the same was observed for the number of deaths. None of the five pollutants exhibited a statistically significant association with COVID-19 confirmed cases. Additionally, except for wind speed, the climate variables produced no statistically significant association with COVID-19 incidences.

### Learning Bayesian Networks from Ordinal Data

Bayesian networks are a powerful framework for studying the dependency structure of variables in a complex system. The problem of learning Bayesian networks is tightly associated with the given data type. Ordinal data, such as stages of cancer, rating scale survey questions, and letter grades for exams, are ubiquitous in applied research. However, existing solutions are mainly for continuous and nominal data. In this work, we propose an iterative score-and-search method - called the Ordinal Structural EM (OSEM) algorithm - for learning Bayesian networks from ordinal data. Unlike traditional approaches designed for nominal data, we explicitly respect the ordering amongst the categories. More precisely, we assume that the ordinal variables originate from marginally discretizing a set of Gaussian variables, whose structural dependence in the latent space follows a directed acyclic graph. Then, we adopt the Structural EM algorithm and derive closed-form scoring functions for efficient graph searching. Through simulation studies, we illustrate the superior performance of the OSEM algorithm compared to the alternatives and analyze various factors that may influence the learning accuracy. Finally, we demonstrate the practicality of our method with a real-world application on psychological survey data from 408 patients with co-morbid symptoms of obsessive-compulsive disorder and depression.
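
The generative assumption - ordinal variables obtained by marginally discretizing latent Gaussians whose dependence follows a DAG - can be sketched as follows; the thresholds, edge weight, and two-node DAG are illustrative assumptions, not values from the paper:

```python
import random

def discretize(z, thresholds):
    """Map a latent Gaussian value to an ordinal category via cut points."""
    return sum(z > t for t in thresholds)  # category in 0..len(thresholds)

rng = random.Random(42)
thresholds = [-0.5, 0.5]  # three ordered categories: 0 < 1 < 2
data = []
for _ in range(1000):
    z1 = rng.gauss(0.0, 1.0)             # latent root X1
    z2 = 0.8 * z1 + rng.gauss(0.0, 0.6)  # latent child X2 given X1 (edge X1 -> X2)
    data.append((discretize(z1, thresholds), discretize(z2, thresholds)))
```

The OSEM algorithm works in the reverse direction: given only the ordinal pairs, it infers the latent Gaussian dependence structure while respecting the category ordering.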

### Group-regularized ridge regression via empirical Bayes noise level cross-validation

Features in predictive models are not exchangeable, yet common supervised models treat them as such. Here we study ridge regression when the analyst can partition the features into $K$ groups based on external side-information. For example, in high-throughput biology, features may represent gene expression, protein abundance or clinical data and so each feature group represents a distinct modality. The analyst's goal is to choose optimal regularization parameters $\lambda = (\lambda_1, \dotsc, \lambda_K)$ -- one for each group. In this work, we study the impact of $\lambda$ on the predictive risk of group-regularized ridge regression by deriving limiting risk formulae under a high-dimensional random effects model with $p\asymp n$ as $n \to \infty$. Furthermore, we propose a data-driven method for choosing $\lambda$ that attains the optimal asymptotic risk: The key idea is to interpret the residual noise variance $\sigma^2$, as a regularization parameter to be chosen through cross-validation. An empirical Bayes construction maps the one-dimensional parameter $\sigma$ to the $K$-dimensional vector of regularization parameters, i.e., $\sigma \mapsto \widehat{\lambda}(\sigma)$. Beyond its theoretical optimality, the proposed method is practical and runs as fast as cross-validated ridge regression without feature groups ($K=1$).
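
The group-regularized ridge estimator itself has the closed form $\widehat{\beta} = (X^\top X/n + \mathrm{diag}(\lambda_{g(j)}))^{-1} X^\top y / n$; a minimal sketch of that estimator follows (the empirical Bayes map $\sigma \mapsto \widehat{\lambda}(\sigma)$ is the paper's contribution and is not reproduced here):

```python
def solve(a, b):
    """Solve a x = b for a small dense matrix a by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def group_ridge(X, y, lam, groups):
    """Group ridge: beta = (X'X/n + diag(lam[groups[j]]))^-1 X'y/n."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) / n for a in range(p)]
    for j in range(p):
        xtx[j][j] += lam[groups[j]]
    return solve(xtx, xty)

# three observations, two features; y = 2*x1 + 3*x2 exactly
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
beta_ols = group_ridge(X, y, [0.0], [0, 0])       # K = 1, no penalty: OLS
beta_grp = group_ridge(X, y, [0.0, 1.0], [0, 1])  # penalize only the second group
```

With a single group and zero penalty the estimator reduces to ordinary least squares, while a nonzero penalty on one group shrinks that group's coefficients relative to the unpenalized fit.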

### Tensor Completion via Tensor Networks with a Tucker Wrapper

In recent years, low-rank tensor completion (LRTC) has received considerable attention due to its applications in image/video inpainting, hyperspectral data recovery, etc. With different notions of tensor rank (e.g., CP, Tucker, tensor train/ring, etc.), various optimization-based numerical methods have been proposed for LRTC. However, tensor-network-based methods have not yet been proposed. In this paper, we propose to solve LRTC via tensor networks with a Tucker wrapper, where by "Tucker wrapper" we mean that the outermost factor matrices of the tensor network are all orthonormal. We formulate LRTC as a problem of solving a system of nonlinear equations, rather than a constrained optimization problem. A two-level alternating least squares method is then employed to update the unknown factors. The computation is dominated by tensor-matrix multiplications and can be performed efficiently. Also, under proper assumptions, it is shown that with high probability the method converges to the exact solution at a linear rate. Numerical simulations show that the proposed algorithm is comparable with state-of-the-art methods.

### Test Set Optimization by Machine Learning Algorithms

Diagnosis results are highly dependent on the volume of the test set. To derive the most efficient test set, we propose several machine learning based methods to predict the minimum amount of test data that produces relatively accurate diagnoses. By collecting outputs from failing circuits, a feature matrix and label vector are generated, which encode the information needed to infer the test termination point. We then develop a prediction model to fit the data and determine when to terminate testing. The considered methods include LASSO and the Support Vector Machine (SVM), where the relationship between the goal (label) and the predictors (feature matrix) is assumed to be linear for LASSO and nonlinear for the SVM. Numerical results show that the SVM reaches a diagnosis accuracy of 90.4% while reducing the volume of the test set by 35.24%.

### Preventing COVID-19 Fatalities: State versus Federal Policies

Are policies implemented in individual states to prevent pandemic deaths effective when there is no policy coordination by the federal government? We answer this question by developing a stochastic extension of a SIRD epidemiological model for a country composed of multiple states. Our model allows for interstate mobility. We consider three policies: mask mandates, stay-at-home orders, and interstate travel bans. We fit our model to daily U.S. state-level COVID-19 death counts and exploit our estimates to produce various policy counterfactuals. While the restrictions imposed by some states prevented a significant number of deaths, we find that more than two-thirds of U.S. COVID-19 deaths could have been prevented by late September 2020 had the federal government imposed federal mandates as early as some of the earliest states did. Our results highlight the need for a coordinated federal approach to the successful containment of a pandemic.
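The deterministic single-state core of a SIRD model can be sketched as follows; the paper's model is stochastic, multi-state, and includes interstate mobility, so this is only the simplest illustrative backbone, with a mask mandate modeled crudely as a lower transmission rate (all parameter values below are illustrative):

```python
import numpy as np

def sird(beta, gamma, mu, S0, I0, days):
    """Discrete-time SIRD dynamics for a single region: susceptibles are
    infected at rate beta*S*I/N, infected recover at rate gamma and die
    at rate mu."""
    S, I, R, D = [S0], [I0], [0.0], [0.0]
    N = S0 + I0
    for _ in range(days):
        new_inf = beta * S[-1] * I[-1] / N
        new_rec = gamma * I[-1]
        new_dead = mu * I[-1]
        S.append(S[-1] - new_inf)
        I.append(I[-1] + new_inf - new_rec - new_dead)
        R.append(R[-1] + new_rec)
        D.append(D[-1] + new_dead)
    return np.array(S), np.array(I), np.array(R), np.array(D)

# A mask mandate modeled as halving beta; compare cumulative deaths.
_, _, _, D_no = sird(beta=0.3, gamma=0.1, mu=0.01, S0=1e6, I0=10, days=180)
_, _, _, D_mask = sird(beta=0.15, gamma=0.1, mu=0.01, S0=1e6, I0=10, days=180)
```

Counterfactuals of the kind described in the abstract amount to re-running such dynamics with policy-adjusted parameters and comparing the death trajectories.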

### Multitask Bandit Learning through Heterogeneous Feedback Aggregation

In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg$(\epsilon)$, that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise similarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise similarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
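The flavor of similarity-aware reward aggregation can be sketched with a single-arm index: pool another player's samples to shrink the confidence width, but inflate the bonus by the dissimilarity bound $\epsilon$. This is only an illustration of the idea; the exact RobustAgg$(\epsilon)$ index is defined in the paper:

```python
import numpy as np

def ucb_with_aggregation(own_rewards, others_rewards, eps, t):
    """Upper confidence index for one arm that pools another player's
    samples.  Pooling shrinks the exploration bonus (more samples), but
    the bias from distribution mismatch, bounded by eps, is added back
    proportionally to the borrowed fraction.  Illustrative sketch only."""
    n_own, n_oth = len(own_rewards), len(others_rewards)
    n = n_own + n_oth
    mean = (sum(own_rewards) + sum(others_rewards)) / n
    bonus = np.sqrt(2.0 * np.log(t) / n) + eps * n_oth / n
    return mean + bonus
```

When $\epsilon$ is small, borrowing is almost free and the index is tighter than the own-samples-only UCB; as $\epsilon$ grows, the mismatch term erodes the benefit, which is the trade-off the regret bounds in the abstract quantify.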

### Off-Policy Interval Estimation with Lipschitz Value Iteration

Off-policy evaluation provides an essential tool for evaluating the effects of different policies or treatments using only observed data. When applied to high-stakes scenarios such as medical diagnosis or financial decision-making, it is crucial to provide end-users with provably correct upper and lower bounds on the expected reward, rather than just a classical single point estimate, as executing a poor policy can be very costly. In this work, we propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting. The idea is to search for the maximum and minimum values of the expected reward among all the Lipschitz Q-functions that are consistent with the observations, which amounts to solving a constrained optimization problem over a Lipschitz function space. We then introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent. We demonstrate the practical efficiency of our method on a range of benchmarks.
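The building block of such interval bounds is easy to state in one dimension: among all $L$-Lipschitz functions agreeing with observed values, the pointwise extremes have a closed form. The sketch below computes them (ignoring the Bellman-consistency constraints the paper's value iteration adds):

```python
import numpy as np

def lipschitz_envelope(x, xs, qs, L):
    """Pointwise extremes over all L-Lipschitz functions f with
    f(xs[i]) = qs[i]: the tightest upper envelope at x is
    min_i (qs[i] + L|x - xs[i]|), the lower one is
    max_i (qs[i] - L|x - xs[i]|).  One-dimensional sketch of the
    consistency set searched over in the abstract."""
    upper = np.min(qs + L * np.abs(x - xs))
    lower = np.max(qs - L * np.abs(x - xs))
    return lower, upper
```

If the assumed Lipschitz constant is tight, the two envelopes can pinch together between observations; looser constants widen the interval, which is exactly the structure the value iteration then tightens monotonically.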

### Multiscale characteristics of the emerging global cryptocurrency market

The review introduces the history of cryptocurrencies and describes the blockchain technology behind them. Differences between cryptocurrencies and the exchanges on which they are traded are outlined. The central part surveys analyses of cryptocurrency price changes on various platforms, and the statistical properties of the fluctuations in the cryptocurrency market are compared to those of traditional markets. With the help of the latest statistical-physics methods, the nonlinear correlations and multiscale characteristics of the cryptocurrency market are analyzed. In the last part, the co-evolution of the correlation structure among the 100 cryptocurrencies with the largest capitalization is retraced. The detailed topology of the cryptocurrency network on the Binance platform, viewed from the bitcoin perspective, is also considered. Finally, an interesting observation on the impact of the Covid-19 pandemic on the cryptocurrency market is presented and discussed: recently we have witnessed a "phase transition" of the cryptocurrencies from being a hedging opportunity for investors fleeing the traditional markets to becoming a part of the global market that is substantially coupled to traditional financial instruments such as currencies, stocks, and commodities. The main contribution is an extensive demonstration that structural self-organization in the cryptocurrency markets has caused them to attain complexity characteristics that are nearly indistinguishable from those of the Forex market at the level of individual time series. However, the cross-correlations between exchange rates on cryptocurrency platforms differ from those on Forex: the cryptocurrency market is less synchronized and information flows more slowly, which results in more frequent arbitrage opportunities. The methodology used in the review allows such opportunities to be detected, and lead-lag relationships to be discovered.

### Sparse Signal Reconstruction for Nonlinear Models via Piecewise Rational Optimization

We propose a method to reconstruct sparse signals degraded by a nonlinear distortion and acquired at a limited sampling rate. Our method formulates the reconstruction problem as a nonconvex minimization of the sum of a data-fitting term and a penalization term. In contrast with most previous works, which settle for approximate local solutions, we seek a global solution to this challenging nonconvex problem. Our global approach relies on the so-called Lasserre relaxation of polynomial optimization. We specifically include in our approach the case of piecewise rational functions, which makes it possible to address a wide class of nonconvex exact and continuous relaxations of the $\ell_0$ penalization function. Additionally, we study the complexity of the optimization problem and show how to exploit its structure to lighten the computational burden efficiently. Finally, numerical simulations illustrate the benefits of our method in terms of both global optimality and signal reconstruction.

### Probabilistic interval predictor based on dissimilarity functions

This work presents a new method to obtain probabilistic interval predictions of a dynamical system. The method uses stored past system measurements to estimate the future evolution of the system. The proposed method relies on the use of dissimilarity functions to estimate the conditional probability density function of the outputs. A family of empirical probability density functions, parameterized by two parameters, is introduced. It is shown that the proposed family encompasses the multivariate normal probability density function as a particular case, and that the proposed method constitutes a generalization of classical estimation methods. A cross-validation scheme is used to tune the two parameters on which the methodology relies. To demonstrate the effectiveness of the presented methodology, numerical examples and comparisons are provided.
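A minimal sketch of the general idea, using squared Euclidean distance as one possible dissimilarity function and weighted empirical quantiles in place of the paper's parameterized density family (all concrete choices here are illustrative):

```python
import numpy as np

def interval_prediction(X_past, y_past, x_new, alpha=0.1, h=0.01):
    """Probabilistic interval prediction from stored measurements:
    weight past outputs by a dissimilarity to the query (here squared
    Euclidean distance through an exponential kernel) and read off
    weighted empirical quantiles.  Sketch only, not the paper's exact
    estimator."""
    d = np.sum((X_past - x_new) ** 2, axis=1)
    w = np.exp(-d / h)
    w /= w.sum()
    order = np.argsort(y_past)          # weighted empirical CDF of outputs
    cdf = np.cumsum(w[order])
    lo = y_past[order][np.searchsorted(cdf, alpha / 2)]
    hi = y_past[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi
```

The kernel bandwidth plays the role of one of the tunable parameters: small values condition tightly on similar past states, large values fall back toward the unconditional output distribution.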

### Rates of convergence for Gibbs sampling in the analysis of almost exchangeable data

Motivated by de Finetti's representation theorem for partially exchangeable arrays, we want to sample $\mathbf p \in [0,1]^d$ from a distribution with density proportional to $\exp(-A^2\sum_{i<j}c_{ij}(p_i-p_j)^2)$. We are particularly interested in the case of an almost exchangeable array ($A$ large). We analyze the rate of convergence of a coordinate Gibbs sampler used to simulate from these measures. We show that for every fixed matrix $C=(c_{ij})$, and large enough $A$, mixing happens in $\Theta(A^2)$ steps in a suitable Wasserstein distance. The upper and lower bounds are explicit and depend on the matrix $C$ through a few relevant spectral parameters.
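The coordinate Gibbs sampler for this density is short to write down, since for symmetric $C$ each full conditional is a Gaussian truncated to $[0,1]$, with mean $\sum_j c_{ij} p_j / \sum_j c_{ij}$ and variance $1/(2A^2\sum_j c_{ij})$. A sketch, using simple rejection sampling for the truncation:

```python
import numpy as np

def trunc_normal(mean, sd, rng):
    """Rejection sampler for N(mean, sd) truncated to [0, 1]; efficient
    here because sd is small and mean lies in [0, 1]."""
    while True:
        x = rng.normal(mean, sd)
        if 0.0 <= x <= 1.0:
            return x

def gibbs_sweep(p, C, A, rng):
    """One sweep of the coordinate Gibbs sampler for the density on
    [0,1]^d proportional to exp(-A^2 * sum_{i<j} c_ij (p_i - p_j)^2),
    assuming C symmetric with positive off-diagonal entries."""
    for i in range(len(p)):
        w = np.delete(C[i], i)
        q = np.delete(p, i)
        mean = w @ q / w.sum()
        sd = 1.0 / (A * np.sqrt(2.0 * w.sum()))
        p[i] = trunc_normal(mean, sd, rng)
    return p
```

For large $A$ the coordinates snap together almost immediately, while the common value diffuses slowly across $[0,1]$, which is the mechanism behind the $\Theta(A^2)$ mixing time.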

### [36] 2010.15550

There has been considerable interest in boosting and bagging, including the combination of the adaptive techniques of AdaBoost with the random selection-with-replacement techniques of Bagging. At the same time, there has been a revisiting of the way we evaluate, with chance-corrected measures like Kappa, Informedness, Correlation, or ROC AUC being advocated. This leads to the question of whether learning algorithms can do better by optimizing an appropriate chance-corrected measure. Indeed, it is possible for a weak learner to optimize Accuracy to the detriment of the more realistic chance-corrected measures, and when this happens the booster can give up too early. This phenomenon is known to occur with conventional Accuracy-based AdaBoost, and the MultiBoost algorithm has been developed to overcome such problems using restart techniques based on bagging. This paper thus complements the theoretical work showing the necessity of using chance-corrected measures for evaluation with empirical work showing how the use of a chance-corrected measure can improve boosting. We show that the early-surrender problem occurs in MultiBoost too, in multiclass situations, so that chance-corrected AdaBook and MultiBook can beat standard MultiBoost or AdaBoost, and we further identify which chance-corrected measures to use when.
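The abstract's central point, that Accuracy can be high while chance-corrected measures reveal a worthless learner, is easy to see numerically. The sketch below computes Accuracy, Cohen's Kappa, and Informedness from a 2x2 confusion matrix: an "always predict the majority class" learner on a 90/10 split scores 90% Accuracy but zero on both chance-corrected measures.

```python
def chance_corrected(cm):
    """Accuracy, Cohen's Kappa, and Informedness (Youden's J) from a
    2x2 confusion matrix [[tp, fn], [fp, tn]] (rows = actual class)."""
    (tp, fn), (fp, tn) = cm
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    # expected agreement under chance, from the row/column marginals
    pe = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (acc - pe) / (1 - pe)
    informedness = tp / (tp + fn) + tn / (tn + fp) - 1
    return acc, kappa, informedness

# Degenerate learner: 90 actual positives all predicted positive,
# 10 actual negatives also predicted positive.
print(chance_corrected([[90, 0], [10, 0]]))
```

A booster that weights rounds by such a degenerate learner's Accuracy sees apparent progress where the chance-corrected view sees none, which is exactly the early-surrender mechanism the paper addresses.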

### Overcoming The Limitations of Neural Networks in Composite-Pattern Learning with Architopes

The effectiveness of neural networks in solving complex problems is well recognized; however, little is known about their limitations. We demonstrate that the feed-forward architecture, for most commonly used activation functions, is incapable of approximating functions comprised of multiple sub-patterns while simultaneously respecting their composite-pattern structure. We overcome this bottleneck with a simple architecture modification that reallocates the neurons of any single feed-forward network across several smaller sub-networks, each specialized on a distinct part of the input-space. The modified architecture, called an Architope, is more expressive on two fronts. First, it is dense in an associated space of piecewise continuous functions in which the feed-forward architecture is not dense. Second, it achieves the same approximation rate as the feed-forward networks while only requiring $\mathscr{O}(N^{-1})$ fewer parameters in its hidden layers. Moreover, the architecture achieves these approximation improvements while preserving the target's composite-pattern structure.

### Probabilistic Transformers

We show that Transformers are Maximum Posterior Probability estimators for Mixtures of Gaussian Models. This brings a probabilistic point of view to Transformers and suggests extensions to other probabilistic cases.

### Impact of (SARS-CoV-2) COVID 19 on the indigenous language-speaking population in Mexico

The value of this working document is that it enables analysis of the information and status of (SARS-CoV-2) COVID-19 cases as open data at the municipal, state, and national levels, with a daily record of patients by age, sex, and comorbidities, classified by COVID-19 status as: a) positive, b) negative, or c) suspected. It also presents information identifying whether a patient is outpatient and/or hospitalized, tracking their medical progress and distinguishing: a) recovered, b) deceased, and c) active cases, in Phases 3 and 4, across the five main indigenous-language-speaking population areas of the State of Veracruz, Mexico. The data analysis is carried out with a data-mining algorithm that provides the fast and timely information required to estimate (SARS-CoV-2) COVID-19 medical-care scenarios and to assess the impact on the indigenous-language-speaking population in Mexico.

### Autoregressive Asymmetric Linear Gaussian Hidden Markov Models

In a real life process evolving over time, the relationship between its relevant variables may change. Therefore, it is advantageous to have different inference models for each state of the process. Asymmetric hidden Markov models fulfil this dynamical requirement and provide a framework where the trend of the process can be expressed as a latent variable. In this paper, we modify these recent asymmetric hidden Markov models to have an asymmetric autoregressive component, allowing the model to choose the order of autoregression that maximizes its penalized likelihood for a given training set. Additionally, we show how inference, hidden states decoding and parameter learning must be adapted to fit the proposed model. Finally, we run experiments with synthetic and real data to show the capabilities of this new model.

### Reliable Graph Neural Networks via Robust Aggregation

Perturbations targeting the graph structure have proven to be extremely effective in reducing the performance of Graph Neural Networks (GNNs), and traditional defenses such as adversarial training do not seem to be able to improve robustness. This work is motivated by the observation that adversarially injected edges can effectively be viewed as additional samples fed to a node's neighborhood aggregation function, resulting in distorted aggregations that accumulate over the layers. Conventional GNN aggregation functions, such as a sum or mean, can be distorted arbitrarily by a single outlier. We propose a robust aggregation function motivated by the field of robust statistics. Our approach exhibits the largest possible breakdown point of 0.5, which means that the bias of the aggregation is bounded as long as the fraction of adversarial edges of a node is less than 50\%. Our novel aggregation function, Soft Medoid, is a fully differentiable generalization of the Medoid and therefore lends itself well to end-to-end deep learning. Equipping a GNN with our aggregation improves the robustness with respect to structure perturbations on Cora ML by a factor of 3 (and 5.5 on Citeseer), and by a factor of 8 for low-degree nodes.
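The core idea of a differentiable, outlier-resistant aggregation can be sketched as a softmax-weighted average in which each neighbor is weighted by its summed distance to the others; the paper's Soft Medoid additionally incorporates edge weights and degree normalization, so this is a simplified illustration:

```python
import numpy as np

def soft_medoid(X, T=1.0):
    """Softmax-weighted average of the rows of X, where each point is
    weighted by minus its summed distance to all points, so outliers
    receive exponentially small weight while the result remains
    differentiable.  Simplified sketch of the Soft Medoid idea."""
    dist_sums = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).sum(axis=1)
    w = np.exp(-(dist_sums - dist_sums.min()) / T)  # shift for stability
    w /= w.sum()
    return w @ X

# Nine neighbor embeddings at the origin plus one adversarial outlier:
# the mean is dragged toward the outlier, the soft medoid barely moves.
X = np.zeros((10, 2))
X[0] = [100.0, 0.0]
```

The temperature `T` interpolates between the hard medoid (small `T`) and the plain mean (large `T`), which is what gives the aggregation both robustness and usable gradients.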

### Analyzing the tree-layer structure of Deep Forests

Random forests on the one hand, and neural networks on the other, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) [25]. In this paper, we investigate the mechanisms at work in DF and show that DF architectures can generally be simplified into simpler, more computationally efficient shallow forest networks. Despite some instability, the latter may outperform standard predictive tree-based methods. In order to precisely quantify the improvement achieved by these light network configurations over standard tree learners, we theoretically study the performance of a shallow tree network made of two layers, each composed of a single centered tree, and provide tight theoretical lower and upper bounds on its excess risk. These theoretical results demonstrate the value of tree-network architectures for well-structured data, provided that the first layer, acting as a data encoder, is rich enough.

### Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks

Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms. In this context, vector quantization is an appealing framework that expresses multiple parameters using a single code, and has recently achieved state-of-the-art network compression on a range of core vision and natural language processing tasks. Key to the success of vector quantization is deciding which parameter groups should be compressed together. Previous work has relied on heuristics that group the spatial dimension of individual convolutional filters, but a general solution remains unaddressed. This is desirable for pointwise convolutions (which dominate modern architectures), linear layers (which have no notion of spatial dimension), and convolutions (when more than one filter is compressed to the same codeword). In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function. We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress. Finally, we rely on an annealed quantization algorithm to better compress the network and achieve higher final accuracy. We show results on image classification, object detection, and segmentation, reducing the gap with the uncompressed model by 40 to 70% with respect to the current state of the art.
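The key observation, that the weights of two adjacent layers can be permuted while expressing the same function, is easy to verify directly: permuting the output channels of one layer and the input channels of the next with the same permutation commutes with the elementwise ReLU (the tiny two-layer network below is an illustrative stand-in, not a model from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))   # layer 1: 4 inputs -> 8 hidden units
W2 = rng.standard_normal((3, 8))   # layer 2: 8 hidden units -> 3 outputs
x = rng.standard_normal(4)
perm = rng.permutation(8)

relu = lambda z: np.maximum(z, 0)

# Permuting W1's rows reorders hidden units; permuting W2's columns the
# same way undoes the reordering, so the composed function is unchanged.
y = W2 @ relu(W1 @ x)
y_perm = W2[:, perm] @ relu(W1[perm] @ x)

assert np.allclose(y, y_perm)
```

Because every such permutation yields a functionally identical network, one can search among them for the weight layout whose parameter groups are cheapest to quantize, which is the degree of freedom the paper exploits.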

### Understanding the Failure Modes of Out-of-Distribution Generalization

Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way {\em even} in easy-to-learn tasks where one would expect these models to succeed. In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature, and another, statistical in nature. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. We also design experiments to isolate the two failure modes when training modern neural networks on these datasets.

### A Local Search Framework for Experimental Design

We present a local search framework to design and analyze both combinatorial algorithms and rounding algorithms for experimental design problems. This framework provides a unifying approach to match and improve all known results in D/A/E-design and to obtain new results in previously unknown settings. For combinatorial algorithms, we provide a new analysis of the classical Fedorov's exchange method. We prove that this simple local search algorithm works well as long as there exists an almost optimal solution with good condition number. Moreover, we design a new combinatorial local search algorithm for E-design using the regret minimization framework. For rounding algorithms, we provide a unified randomized exchange algorithm to match and improve previous results for D/A/E-design. Furthermore, the algorithm works in the more general setting to approximately satisfy multiple knapsack constraints, which can be used for weighted experimental design and for incorporating fairness constraints into experimental design.
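Fedorov's exchange method admits a very small sketch for D-design: keep a size-$k$ subset of candidate experiments and accept any swap that increases the determinant of the information matrix. The first-improvement variant below is a minimal illustration; the paper analyzes when such local search provably succeeds:

```python
import numpy as np

def fedorov_exchange(X, k, iters=100, seed=0):
    """Fedorov's exchange heuristic for D-optimal design: starting from
    a random size-k subset S of the rows of X, repeatedly swap a
    selected row for an unselected one whenever the swap increases
    det(X_S^T X_S).  Minimal first-improvement sketch."""
    n, _ = X.shape
    rng = np.random.default_rng(seed)
    S = set(rng.choice(n, size=k, replace=False).tolist())

    def logdet(idx):
        A = X[sorted(idx)]
        sign, v = np.linalg.slogdet(A.T @ A)
        return v if sign > 0 else -np.inf

    cur = logdet(S)
    for _ in range(iters):
        best = None
        for i in list(S):
            for j in range(n):
                if j in S:
                    continue
                cand = (S - {i}) | {j}
                v = logdet(cand)
                if v > cur + 1e-9:
                    best = (cand, v)
                    break
            if best is not None:
                break
        if best is None:
            break  # no improving swap: locally optimal
        S, cur = best
    return sorted(S), cur
```

On a candidate pool with two informative orthogonal experiments and several near-duplicate uninformative ones, the exchange walk finds the informative pair from any starting subset.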