### High-recall causal discovery for autocorrelated time series with latent confounders

We present a new method for linear and nonlinear, lagged and contemporaneous constraint-based causal discovery from observational time series in the presence of latent confounders. We show that existing causal discovery methods such as FCI and variants suffer from low recall in the autocorrelated time series case and identify low effect size of conditional independence tests as the main reason. Information-theoretical arguments show that effect size can often be increased if causal parents are included in the conditioning sets. To identify parents early on, we suggest an iterative procedure that utilizes novel orientation rules to determine ancestral relationships already during the edge removal phase. We prove that the method is order-independent, and sound and complete in the oracle case. Extensive simulation studies for different numbers of variables, time lags, sample sizes, and further cases demonstrate that our method indeed achieves much higher recall than existing methods while keeping false positives at the desired level. This performance gain grows with stronger autocorrelation. Our method also covers causal discovery for non-time series data as a special case. We provide Python code for all methods involved in the simulation studies.

### Inference on the change point in high dimensional time series models via plug in least square

We study a plug in least squares estimator for the change point parameter where change is in the mean of a high dimensional random vector under subgaussian or subexponential distributions. We obtain sufficient conditions under which this estimator possesses sufficient adaptivity against plug in estimates of mean parameters in order to yield an optimal rate of convergence $O_p(\xi^{-2})$ in the integer scale. This rate is preserved while allowing high dimensionality as well as a potentially diminishing jump size $\xi,$ provided $s\log (p\vee T)=o(\surd(Tl_T))$ or $s\log^{3/2}(p\vee T)=o(\surd(Tl_T))$ in the subgaussian and subexponential cases, respectively. Here $s,p,T$ and $l_T$ represent a sparsity parameter, model dimension, sampling period and the separation of the change point from its parametric boundary. Moreover, since the rate of convergence is free of $s,p$ and logarithmic terms of $T,$ it allows the existence of limiting distributions. These distributions are then derived as the *argmax* of a two sided negative drift Brownian motion or a two sided negative drift random walk under vanishing and non-vanishing jump size regimes, respectively, thereby allowing inference on the change point parameter in the high dimensional setting. Feasible algorithms for implementation of the proposed methodology are provided. Theoretical results are supported with Monte Carlo simulations.

### Model Distillation for Revenue Optimization: Interpretable Personalized Pricing

Data-driven pricing strategies are becoming increasingly common, where customers are offered a personalized price based on features that are predictive of their valuation of a product. It is desirable to have this pricing policy be simple and interpretable, so it can be verified, checked for fairness, and easily implemented. However, efforts to incorporate machine learning into a pricing framework often lead to complex pricing policies which are not interpretable, resulting in mixed results in practice. We present a customized, prescriptive tree-based algorithm that distills knowledge from a complex black box machine learning algorithm, segments customers with similar valuations and prescribes prices in such a way that maximizes revenue while maintaining interpretability. We quantify the regret of a resulting policy and demonstrate its efficacy in applications with both synthetic and real-world datasets.

### Statistical hypothesis testing versus machine-learning binary classification: distinctions and guidelines

Making binary decisions is a common data analytical task in scientific research and industrial applications. In data sciences, there are two related but distinct strategies: hypothesis testing and binary classification. In practice, how to choose between these two strategies can be unclear and rather confusing. Here we summarize key distinctions between these two strategies in three aspects and list five practical guidelines for data analysts to choose the appropriate strategy for specific analysis needs. We demonstrate the use of those guidelines in a cancer driver gene prediction example.

### Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare/Weak Perturbations

Given two samples from possibly different discrete distributions over a common set of size $N$, consider the problem of testing whether these distributions are identical, vs. the following rare/weak perturbation alternative: the frequencies of $N^{1-\beta}$ elements are perturbed by $r(\log N)/2n$ in the Hellinger distance, where $n$ is the size of each sample. We adapt the Higher Criticism (HC) test to this setting using P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between the cases of dense ($N\gg n$) and sparse ($N\ll n$) contingency tables. In the dense case, the phase transition curve matches that of an analogous two-sample normal means model.
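
The testing pipeline described above (per-element exact binomial P-values combined through Higher Criticism) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `gamma` cutoff, and the particular HC normalization (a common Donoho-Jin style form) are assumptions, and the binomial test with success probability 1/2 presumes equal sample sizes.

```python
import math


def binomial_pvalue(x, y):
    """Two-sided exact binomial P-value for counts x, y, assuming x | x+y ~ Bin(x+y, 1/2)."""
    m = x + y
    if m == 0:
        return 1.0
    dev = abs(x - m / 2)
    # Total Bin(m, 1/2) mass of outcomes at least as far from m/2 as the observed x.
    tail = sum(math.comb(m, k) for k in range(m + 1) if abs(k - m / 2) >= dev)
    return tail / 2 ** m


def higher_criticism(pvalues, gamma=0.5):
    """Max standardized gap between i/N and the i-th smallest P-value, over i <= gamma * N."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    best = -math.inf
    for i in range(1, max(1, int(gamma * n)) + 1):
        frac = i / n
        if frac >= 1.0:  # the standardization is undefined at i = N
            break
        score = math.sqrt(n) * (frac - ordered[i - 1]) / math.sqrt(frac * (1 - frac))
        best = max(best, score)
    return best


# Hypothetical counts over a 4-element alphabet; only the first element is perturbed.
counts_a = [10, 5, 5, 5]
counts_b = [0, 5, 5, 5]
pvals = [binomial_pvalue(a, b) for a, b in zip(counts_a, counts_b)]
hc_perturbed = higher_criticism(pvals)
hc_null = higher_criticism([binomial_pvalue(5, 5)] * 4)
```

A large value of the statistic indicates that a few P-values are much smaller than expected under the null, which is the rare/weak signature the test is designed to pick up.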

### Karhunen-Loève Expansions for Axially Symmetric Gaussian Processes: Modeling Strategies and $L^2$ Approximations

Axially symmetric processes on spheres, for which the second-order dependency structure may substantially vary with shifts in latitude, are a prominent alternative to model the spatial uncertainty of natural variables located over large portions of the Earth. In this paper, we focus on Karhunen-Loève expansions of axially symmetric Gaussian processes. First, we investigate a parametric family of Karhunen-Loève coefficients that allows for versatile spatial covariance functions. The isotropy as well as the longitudinal independence can be obtained as limit cases of our proposal. Second, we introduce a strategy to render any longitudinally reversible process irreversible, which means that its covariance function could admit certain types of asymmetries along longitudes. Then, finitely truncated Karhunen-Loève expansions are used to approximate axially symmetric processes. For such approximations, bounds for the $L^2$-error are provided. Numerical experiments are conducted to illustrate our findings.

### Accelerating Nonconvex Learning via Replica Exchange Langevin Diffusion

Langevin diffusion is a powerful method for nonconvex optimization, which enables the escape from local minima by injecting noise into the gradient. In particular, the temperature parameter controlling the noise level gives rise to a tradeoff between "global exploration" and "local exploitation", which correspond to high and low temperatures. To attain the advantages of both regimes, we propose to use replica exchange, which swaps between two Langevin diffusions with different temperatures. We theoretically analyze the acceleration effect of replica exchange from two perspectives: (i) the convergence in $\chi^2$-divergence, and (ii) the large deviation principle. Such an acceleration effect allows us to approach the global minima faster. Furthermore, by discretizing the replica exchange Langevin diffusion, we obtain a discrete-time algorithm. For such an algorithm, we quantify its discretization error in theory and demonstrate its acceleration effect in practice.
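
To make the idea of swapping between a low-temperature and a high-temperature chain concrete, here is a minimal one-dimensional sketch. It assumes an Euler-Maruyama discretization and a standard Metropolis swap rule. The double-well objective, the step size, the temperatures, and all function names are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def swap_log_ratio(e_low, e_high, tau_low, tau_high):
    """Log of the Metropolis swap ratio for replicas at temperatures tau_low < tau_high."""
    return (1.0 / tau_low - 1.0 / tau_high) * (e_low - e_high)


def replica_exchange_langevin(energy, grad, x0, tau_low=0.05, tau_high=2.0,
                              step=0.01, n_steps=5000, seed=0):
    """Run two discretized Langevin chains and swap them with Metropolis probability."""
    rng = random.Random(seed)
    x_low = x_high = best = x0
    for _ in range(n_steps):
        # Euler-Maruyama step: gradient descent plus temperature-scaled Gaussian noise.
        x_low += -step * grad(x_low) + math.sqrt(2 * step * tau_low) * rng.gauss(0, 1)
        x_high += -step * grad(x_high) + math.sqrt(2 * step * tau_high) * rng.gauss(0, 1)
        # Always swap if the hot chain sits at lower energy, otherwise swap at random.
        log_r = swap_log_ratio(energy(x_low), energy(x_high), tau_low, tau_high)
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x_low, x_high = x_high, x_low
        # Track the lowest-energy point visited by the cold chain.
        if energy(x_low) < energy(best):
            best = x_low
    return x_low, x_high, best


# Hypothetical double-well objective: local minimum near x = 1, global minimum near x = -1.
def energy(x):
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


x_low, x_high, best = replica_exchange_langevin(energy, grad, x0=1.0)
```

The hot chain crosses the barrier between wells easily, while the cold chain refines whichever position it receives through a swap; this is the exploration/exploitation split described above.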

### Cluster Prediction for Opinion Dynamics from Partial Observations

We present a Bayesian approach to predict the clustering of opinions for a system of interacting agents from partial observations. The Bayesian formulation overcomes the unobservability of the system and quantifies the uncertainty in the prediction. We characterize the clustering by the posterior of the clusters' sizes and centers, and we represent the posterior by samples. To overcome the challenge in sampling the high-dimensional posterior, we introduce an auxiliary implicit sampling (AIS) algorithm using two-step observations. Numerical results show that the AIS algorithm leads to accurate predictions of the sizes and centers for the leading clusters, in both cases of noiseless and noisy observations. In particular, the centers are predicted with high success rates, but the sizes exhibit a considerable uncertainty that is sensitive to observation noise and the observation ratio.

### Estimating Extreme Value Index by Subsampling for Massive Datasets with Heavy-Tailed Distributions

Modern statistical analyses often encounter datasets with massive sizes and heavy-tailed distributions. For datasets with massive sizes, traditional estimation methods can hardly be used to estimate the extreme value index directly. To address the issue, we propose here a subsampling-based method. Specifically, multiple subsamples are drawn from the whole dataset by using the technique of simple random subsampling with replacement. Based on each subsample, an approximate maximum likelihood estimator can be computed. The resulting estimators are then averaged to form a more accurate one. Under appropriate regularity conditions, we show theoretically that the proposed estimator is consistent and asymptotically normal. With the help of the estimated extreme value index, a normal range can be established for a heavy-tailed random variable. Observations that fall outside the range should be treated as suspected records and can be practically regarded as outliers. Extensive simulation experiments are provided to demonstrate the promising performance of our method. A real data analysis is also presented for illustration purposes.
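
The subsample-then-average procedure can be sketched as below. As an assumption, the classical Hill estimator stands in for the paper's approximate maximum likelihood estimator, and the subsample count, subsample size, and number of upper order statistics `k` are arbitrary illustrative values.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of the extreme value index from the k largest (positive) observations."""
    top = sorted(sample, reverse=True)[:k + 1]
    threshold = top[k]  # the (k+1)-th largest value serves as the tail threshold
    return sum(math.log(x / threshold) for x in top[:k]) / k


def subsampled_evi(data, n_subsamples=50, subsample_size=2000, k=200, seed=0):
    """Average Hill estimates over simple random subsamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_subsamples):
        sub = rng.choices(data, k=subsample_size)  # sampling with replacement
        estimates.append(hill_estimator(sub, k))
    return sum(estimates) / len(estimates)


# Synthetic Pareto data with shape 2, so the true extreme value index is 1/2.
gen = random.Random(1)
data = [gen.paretovariate(2.0) for _ in range(20000)]
evi_hat = subsampled_evi(data)
```

Each subsample fits in memory even when the full dataset does not, which is the practical motivation for averaging many small-sample estimates.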

### Discovering Drug-Drug and Drug-Disease Interactions Inducing Acute Kidney Injury Using Deep Rule Forests

Patients with Acute Kidney Injury (AKI) have increased mortality, morbidity, and long-term adverse events. Therefore, early identification of AKI may improve renal function recovery, decrease comorbidities, and further improve patients' survival. Controlling certain risk factors and developing targeted prevention strategies are important for reducing the risk of AKI. Drug-drug interactions and drug-disease interactions are critical issues for AKI. Typical statistical approaches cannot handle the complexity of drug-drug and drug-disease interactions. In this paper, we propose a novel learning algorithm, Deep Rule Forests (DRF), which discovers rules from multilayer tree models as the combinations of drug usages and disease indications to help identify such interactions. We found that several diseases and drug usages are considered to have a significant impact on the occurrence of AKI. Our experimental results also show that the DRF model performs comparatively better than typical tree-based and other state-of-the-art algorithms in terms of prediction accuracy and model interpretability.

### Prediction Regions for Poisson and Over-Dispersed Poisson Regression Models with Applications to Forecasting Number of Deaths during the COVID-19 Pandemic

Motivated by the current Coronavirus Disease (COVID-19) pandemic, which is due to the SARS-CoV-2 virus, and the important problem of forecasting daily deaths and cumulative deaths, this paper examines the construction of prediction regions or intervals under the Poisson regression model and for an over-dispersed Poisson regression model. For the Poisson regression model, several prediction regions are developed and their performance is compared through simulation studies. The methods are applied to the problem of forecasting daily and cumulative deaths in the United States (US) due to COVID-19. To examine their performance relative to what actually happened, daily deaths data until May 15th were used to forecast cumulative deaths by June 1st. It was observed that there is over-dispersion in the observed data relative to the Poisson regression model. An over-dispersed Poisson regression model is therefore proposed. This new model builds on frailty ideas in Survival Analysis and over-dispersion is quantified through an additional parameter. The Poisson regression model is a hidden model in this over-dispersed Poisson regression model and is obtained as a limiting case when the over-dispersion parameter increases to infinity. A prediction region for the cumulative number of US deaths due to COVID-19 by July 16th, given the data until July 2nd, is presented. Finally, the paper discusses limitations of the proposed procedures and mentions open research problems, as well as the dangers and pitfalls when forecasting on a long horizon, with focus on this pandemic where events, both foreseen and unforeseen, could have huge impacts on point predictions and prediction regions.

### Transfer learning of regression models from a sequence of datasets by penalized estimation

Transfer learning refers to the promising idea of initializing model fits based on pre-training on other data. We particularly consider regression modeling settings where parameter estimates from previous data can be used as anchoring points, yet may not be available for all parameters, thus covariance information cannot be reused. A procedure that updates through targeted penalized estimation, which shrinks the estimator towards a nonzero value, is presented. The parameter estimate from the previous data serves as this nonzero value when an update is sought from novel data. This naturally extends to a sequence of data sets with the same response, but potentially only partial overlap in covariates. The iteratively updated regression parameter estimator is shown to be asymptotically unbiased and consistent. The penalty parameter is chosen through constrained cross-validated loglikelihood optimization. The constraint bounds the amount of shrinkage of the updated estimator toward the current one from below. The bound aims to preserve the (updated) estimator's goodness-of-fit on all-but-the-novel data. The proposed approach is compared to other regression modeling procedures. Finally, it is illustrated on an epidemiological study where the data arrive in batches with different covariate-availability and the model is re-fitted with the availability of a novel batch.
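
Shrinking toward a nonzero target has a simple closed form in the most basic case. The sketch below assumes a linear model with a single covariate, no intercept, a squared-error loss, and a fixed penalty `lam`; the paper instead selects the penalty by constrained cross-validation and covers more general regression settings. The function names are illustrative.

```python
def targeted_ridge(x, y, target, lam):
    """Minimize sum((y_i - x_i * b)^2) + lam * (b - target)^2 over the scalar b."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Setting the derivative to zero gives b * (sxx + lam) = sxy + lam * target.
    return (sxy + lam * target) / (sxx + lam)


def update_over_batches(batches, lam, initial=0.0):
    """Refit on each new batch, shrinking toward the estimate from the previous batch."""
    beta = initial
    path = []
    for x, y in batches:
        beta = targeted_ridge(x, y, beta, lam)
        path.append(beta)
    return path


# Two hypothetical batches arriving in sequence.
batches = [([1.0, 2.0], [2.0, 4.0]), ([1.0], [3.0])]
path = update_over_batches(batches, lam=1.0)
```

With `lam = 0` the update reduces to ordinary least squares on the new batch alone, and as `lam` grows the estimate stays at the previous value, so the penalty controls how much the novel data may move the fit.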

### An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Definite Matrices

The James-Stein estimator is an estimator of the multivariate normal mean and dominates the maximum likelihood estimator (MLE) under squared error loss. The original work inspired great interest in developing shrinkage estimators for a variety of problems. Nonetheless, research on shrinkage estimation for manifold-valued data is scarce. In this paper, we propose shrinkage estimators for the parameters of the Log-Normal distribution defined on the manifold of $N \times N$ symmetric positive-definite matrices. For this manifold, we choose the Log-Euclidean metric as its Riemannian metric since it is easy to compute and is widely used in applications. By using the Log-Euclidean distance in the loss function, we derive a shrinkage estimator in an analytic form and show that it is asymptotically optimal within a large class of estimators including the MLE, which is the sample Fréchet mean of the data. We demonstrate the performance of the proposed shrinkage estimator via several simulated data experiments. Furthermore, we apply the shrinkage estimator to perform statistical inference in diffusion magnetic resonance imaging problems.

### On identifying unobserved heterogeneity in stochastic blockmodel graphs with vertex covariates

Both observed and unobserved vertex heterogeneity can influence block structure in graphs. To assess these effects on block recovery, we present a comparative analysis of two model-based spectral algorithms for clustering vertices in stochastic blockmodel graphs with vertex covariates. The first algorithm directly estimates the induced block assignments by investigating the estimated block connectivity probability matrix including the vertex covariate effect. The second algorithm estimates the vertex covariate effect and then estimates the induced block assignments after accounting for this effect. We employ Chernoff information to analytically compare the algorithms' performance and derive the Chernoff ratio formula for some special models of interest. Analytic results and simulations suggest that, in general, the second algorithm is preferred: we can better estimate the induced block assignments by first estimating the vertex covariate effect. In addition, real data experiments on a diffusion MRI connectome data set indicate that the second algorithm has the advantages of revealing underlying block structure and taking observed vertex heterogeneity into account in real applications. Our findings emphasize the importance of distinguishing between observed and unobserved factors that can affect block structure in graphs.

### Rate-optimality of consistent distribution-free tests of independence based on center-outward ranks and signs

Rank correlations have found many innovative applications in the last decade. In particular, suitable versions of rank correlations have been used for consistent tests of independence between pairs of random variables. The use of ranks is especially appealing for continuous data as tests become distribution-free. However, the traditional concept of ranks relies on ordering data and is, thus, tied to univariate observations. As a result it has long remained unclear how one may construct distribution-free yet consistent tests of independence between multivariate random vectors. This is the problem we address in this paper, in which we lay out a general framework for designing dependence measures that give tests of multivariate independence that are not only consistent and distribution-free but which we also prove to be statistically efficient. Our framework leverages the recently introduced concept of center-outward ranks and signs, a multivariate generalization of traditional ranks, and adopts a common standard form for dependence measures that encompasses many popular measures from the literature. In a unified study, we derive a general asymptotic representation of center-outward test statistics under independence, extending to the multivariate setting the classical Hajek asymptotic representation results. This representation permits a direct calculation of limiting null distributions for the proposed test statistics. Moreover, it facilitates a local power analysis that provides strong support for the center-outward approach to multivariate ranks by establishing, for the first time, the rate-optimality of center-outward tests within families of Konijn alternatives.

### The maximum of the periodogram of Hilbert space valued time series

We are interested in detecting periodic signals in Hilbert space valued time series when the length of the period is unknown. A natural test statistic is the maximum Hilbert-Schmidt norm of the periodogram operator over all fundamental frequencies. In this paper we analyze the asymptotic distribution of this test statistic. We consider the case where the noise variables are independent and then generalize our results to functional linear processes. Details for implementing the test are provided for the class of functional autoregressive processes. We illustrate the usefulness of our approach by examining air quality data from Graz, Austria. The accuracy of the asymptotic theory in finite samples is evaluated in a simulation experiment.
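
For intuition, the test statistic can be computed directly on discretized curves. The sketch below assumes each observation is a finite vector with the Euclidean norm standing in for the Hilbert space norm, takes the fundamental frequencies to be $2\pi j/n$ for $j = 1, \dots, \lfloor (n-1)/2 \rfloor$, and uses the fact that the Hilbert-Schmidt norm of the rank-one operator $D \otimes D$ equals $\|D\|^2$. Centering and the normalization needed for the limiting distribution are omitted.

```python
import cmath
import math


def max_periodogram_norm(series):
    """Max over fundamental frequencies of ||D_j||^2, with D_j the DFT of the curves."""
    n = len(series)
    dim = len(series[0])
    best = 0.0
    for j in range(1, (n - 1) // 2 + 1):
        omega = 2 * math.pi * j / n
        norm_sq = 0.0
        for d in range(dim):
            # Discrete Fourier transform of coordinate d at frequency omega.
            coef = sum(series[t][d] * cmath.exp(-1j * omega * t) for t in range(n))
            norm_sq += abs(coef / math.sqrt(n)) ** 2
        best = max(best, norm_sq)
    return best


# Hypothetical noiseless signal with period 4, observed at n = 8 time points.
shape = [1.0, 2.0]
series = [[math.cos(2 * math.pi * t / 4) * v for v in shape] for t in range(8)]
stat = max_periodogram_norm(series)
```

In this toy case the maximum is attained at the frequency matching the period of the signal, and the other fundamental frequencies contribute essentially nothing.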

### The joint survival signature of coherent systems with shared components

The concept of joint bivariate signature, introduced by Navarro et al. (2013), is a useful tool for studying the dependence between two systems with shared components. As with the univariate signature, introduced by Samaniego (2007), its applications are limited to systems with only one type of component, which restricts its practical use. Coolen and Coolen-Maturi (2012) introduced the survival signature, which is capable of dealing with multiple types of components. In this paper we present a survival signature for systems with shared components, including one or multiple types of components.

### Continuous shrinkage prior revisited: a collapsing behavior and remedy

Modern genomic studies are increasingly focused on identifying more and more genes clinically associated with a health response. Commonly used Bayesian shrinkage priors are designed primarily to detect only a handful of signals when the dimension of the predictors is very high. In this article, we investigate the performance of a popular continuous shrinkage prior in the presence of a relatively large number of true signals. We draw attention to an undesirable phenomenon: the posterior mean is rendered very close to a null vector, caused by a sharp underestimation of the global-scale parameter. The phenomenon is triggered by the absence of a tail-index controlling mechanism in the Bayesian shrinkage priors. We provide a remedy by developing a global-local-tail shrinkage prior which can automatically learn the tail-index and can provide accurate inference even in the presence of a moderately large number of signals. The collapsing behavior of the Horseshoe with its remedy is exemplified in numerical examples and in two gene expression datasets.

### Geographically Weighted Regression Analysis for Spatial Economics Data: a Bayesian Recourse

The geographically weighted regression (GWR) is a well-known statistical approach to explore spatial non-stationarity of the regression relationship in spatial data analysis. In this paper, we discuss a Bayesian recourse of GWR. Bayesian variable selection based on spike-and-slab prior, bandwidth selection based on range prior, and model assessment using a modified deviance information criterion and a modified logarithm of pseudo-marginal likelihood are fully discussed in this paper. Usage of the graph distance in modeling areal data is also introduced. Extensive simulation studies are carried out to examine the empirical performance of the proposed methods with both small and large number of location scenarios, and comparison with the classical frequentist GWR is made. The performance of variable selection and estimation of the proposed methodology under different circumstances is satisfactory. We further apply the proposed methodology to the analysis of province-level macroeconomic data from 30 selected provinces in China. The estimation and variable selection results reveal insights about China's economy that are convincing and agree with previous studies and facts.

### Bayesian Hierarchical Spatial Regression Models for Spatial Data in the Presence of Missing Covariates with Applications

In many applications, survey data are collected from different survey centers in different regions. It happens that in some circumstances, response variables are completely observed while the covariates have missing values. In this paper, we propose a joint spatial regression model for the response variable and missing covariates via a sequence of one-dimensional conditional spatial regression models. We further construct a joint spatial model for missing covariate data mechanisms. The properties of the proposed models are examined and a Markov chain Monte Carlo sampling algorithm is used to sample from the posterior distribution. In addition, the Bayesian model comparison criteria, the modified Deviance Information Criterion (mDIC) and the modified Logarithm of the Pseudo-Marginal Likelihood (mLPML), are developed to assess the fit of spatial regression models for spatial data. Extensive simulation studies are carried out to examine empirical performance of the proposed methods. We further apply the proposed methodology to analyze a real data set from a Chinese Health and Nutrition Survey (CHNS) conducted in 2011.

### Adverse event enrichment tests using VAERS

Vaccination safety is critical for individual and public health. Many existing methods have been used to conduct safety studies with the VAERS (Vaccine Adverse Event Reporting System) database. However, these methods frequently identify many adverse event (AE) signals and they are often hard to interpret in a biological context. The AE ontology introduces biologically meaningful structures to the VAERS database by connecting similar AEs, which provides meaningful interpretation for the underlying safety issues. In this paper, we develop rigorous statistical methods to identify "interesting" AE groups by performing AE enrichment analysis. We extend existing gene enrichment tests to perform AE enrichment analysis. Unlike the continuous gene expression data, AE data are counts. Therefore, AE data have many zeros and ties. We propose two enrichment tests, AEFisher and AEKS. AEFisher is a modified Fisher's exact test based on pre-selected significant AEs, while AEKS is based on a modified Kolmogorov-Smirnov statistic. Both tests incorporate the special features of the AE data. The proposed methods were evaluated using simulation studies and were further illustrated on two studies using VAERS data. By appropriately addressing the issues of ties and excessive zeros in AE count data, our enrichment tests performed well as demonstrated by simulation studies and analyses of VAERS data. The proposed methods were implemented in the R package AEenrich, which can be installed from the Comprehensive R Archive Network (CRAN).

### Semiparametric transformation model for competing risks data with cure fraction

Modelling and analysis of competing risks data with long-term survivors has been an important area of research in recent years. For example, in the study of cancer patients treated for soft tissue sarcoma, patients may die due to different causes. A considerable portion of the patients may remain cancer free after the treatment. Accordingly, it is important to incorporate long-term survivors in the analysis of competing risks data. Motivated by this, we propose a new method for the analysis of competing risks data with long-term survivors. The new method enables us to estimate the overall survival probability without estimating the cure fraction. We formulate the effects of covariates on sub-distribution (cumulative incidence) functions using a linear transformation model. Estimating equations based on counting processes are developed to find the estimators of regression coefficients. The asymptotic properties of the estimators are studied using martingale theory. An extensive Monte Carlo simulation study is carried out to assess the finite sample performance of the proposed estimators. Finally, we illustrate our method using a real data set.

### SMIM: a unified framework of Survival sensitivity analysis using Multiple Imputation and Martingale

Censored survival data are common in clinical trial studies. We propose a unified framework for sensitivity analysis to censoring at random in survival data using multiple imputation and martingale, called SMIM. The proposed framework adopts the $\delta$-adjusted and control-based models, indexed by the sensitivity parameter, entailing censoring at random and a wide collection of censoring not at random assumptions. Also, it targets a broad class of treatment effect estimands defined as functionals of treatment-specific survival functions, taking into account missing data due to censoring. Multiple imputation facilitates the use of simple full-sample estimation; however, the standard Rubin's combining rule may overestimate the variance for inference in the sensitivity analysis framework. We decompose the multiple imputation estimator into a martingale series based on the sequential construction of the estimator and propose the wild bootstrap inference by resampling the martingale series. The new bootstrap inference has a theoretical guarantee for consistency and is computationally efficient compared to the non-parametric bootstrap counterpart. We evaluate the finite-sample performance of the proposed SMIM through simulation and applications on HIV and cardiovascular clinical trials.

### The network structure of cultural distances

This paper proposes a novel measure of cultural distances between countries. Making use of the information coming from the World Values Survey (Wave 6), and considering the interdependence among cultural traits, the paper proposes a methodology to define the cultural distance between countries that takes into account the network structure of national cultural traits. Exploiting the possibilities offered by Copula graphical models for ordinal and categorical data, the paper infers the network structure of 54 countries and proposes a new summary measure of national cultural distances. The DBRV Cultural Distance index shows that, as of 2010-2014 and compared to Inglehart and Welzel (2005), the world appears to be more culturally heterogeneous than was previously thought.

### Semiparametric Tensor Factor Analysis by Iteratively Projected SVD

This paper introduces a general framework of Semiparametric TEnsor FActor analysis (STEFA) that focuses on the methodology and theory of low-rank tensor decomposition with auxiliary covariates. STEFA models extend tensor factor models by incorporating instrumental covariates in the loading matrices. We propose an algorithm of Iteratively Projected SVD (IP-SVD) for the semiparametric estimations. It iteratively projects tensor data onto the linear space spanned by covariates and applies SVD on matricized tensors over each mode. We establish the convergence rates of the loading matrices and the core tensor factor. Compared with the Tucker decomposition, IP-SVD yields more accurate estimates with a faster convergence rate. Besides estimation, we show several prediction methods with newly observed covariates based on the STEFA model. On both real and synthetic tensor data, we demonstrate the efficacy of the STEFA model and the IP-SVD algorithm on both the estimation and prediction tasks.

### Robust Causal Inference Under Covariate Shift via Worst-Case Subpopulation Treatment Effects

We propose the worst-case treatment effect (WTE) across all subpopulations of a given size, a conservative notion of topline treatment effect. Compared to the average treatment effect (ATE) that solely relies on the covariate distribution of collected data, WTE is robust to unanticipated covariate shifts, and ensures positive findings guarantee uniformly valid treatment effects over underrepresented minority groups. We develop a semiparametrically efficient estimator for the WTE, leveraging machine learning-based estimates of heterogeneous treatment effects and propensity scores. By virtue of satisfying a key (Neyman) orthogonality property, our estimator enjoys central limit behavior---oracle rates with true nuisance parameters---even when estimates of nuisance parameters converge at slower rates. For both observational and randomized studies, we prove that our estimator achieves the optimal asymptotic variance, by establishing a semiparametric efficiency lower bound. On real datasets where robustness to covariate shift is of core concern, we illustrate the non-robustness of ATE under even mild distributional shift, and demonstrate that the WTE guards against brittle findings that are invalidated by unanticipated covariate shifts.

### Piecewise Linear Regression via a Difference of Convex Functions

We present a new piecewise linear regression methodology that fits a difference of convex functions (DC functions) to the data. These are functions $f$ that may be represented as the difference $\phi_1 - \phi_2$ for a choice of convex functions $\phi_1, \phi_2$. The method proceeds by estimating piecewise-linear convex functions, in a manner similar to max-affine regression, whose difference approximates the data. The choice of the function is regularised by a new seminorm over the class of DC functions that controls the $\ell_\infty$ Lipschitz constant of the estimate. The resulting methodology can be efficiently implemented via quadratic programming even in high dimensions, and is shown to have close to minimax statistical risk. We empirically validate the method, showing it to be practically implementable, and to have comparable performance to existing regression/classification methods on real-world datasets.
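
The representation $f = \phi_1 - \phi_2$ with max-affine $\phi_1, \phi_2$ can be illustrated on a scalar input. The sketch below only evaluates a given DC function; it does not perform the regularised fitting or the quadratic program described above, and the "hat" example and function names are hypothetical.

```python
def max_affine(x, pieces):
    """Convex piecewise-linear function: max over (slope, intercept) pairs."""
    return max(a * x + b for a, b in pieces)


def dc_function(x, phi1_pieces, phi2_pieces):
    """Difference of two max-affine functions, f(x) = phi1(x) - phi2(x)."""
    return max_affine(x, phi1_pieces) - max_affine(x, phi2_pieces)


# The hat function max(0, 1 - |x|) is neither convex nor concave,
# yet it equals max(-x, 1, x) - |x|, a difference of two convex functions.
phi1 = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]  # max(-x, 1, x)
phi2 = [(-1.0, 0.0), (1.0, 0.0)]              # |x|
values = [dc_function(x, phi1, phi2) for x in (-1.0, 0.0, 0.5, 2.0)]
```

Because each component is a maximum of affine pieces, the difference is itself piecewise linear, which is why fitting the slopes and intercepts yields a piecewise linear regression function.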

### Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates

Researchers are usually interested in examining the impact of covariates when uncovering sample heterogeneity. The majority of theoretical and empirical studies with such aims focus on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, those covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE). This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors of clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly and precisely, and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of MoE models and illustrates how to select covariates and construct an MoE with longitudinal mathematics achievement data.

### Pseudo-Rehearsal for Continual Learning with Normalizing Flows

Catastrophic forgetting (CF) happens whenever a neural network overwrites past knowledge while being trained on new tasks. Common techniques to handle CF include regularization of the weights (using, e.g., their importance on past tasks), and rehearsal strategies, where the network is constantly re-trained on past data. Generative models have also been applied for the latter, in order to have endless sources of data. In this paper, we propose a novel method that combines the strengths of regularization and generative-based rehearsal approaches. Our generative model consists of a normalizing flow (NF), a probabilistic and invertible neural network, trained on the internal embeddings of the network. By keeping a single NF conditioned on the task, we show that our memory overhead remains constant. In addition, exploiting the invertibility of the NF, we propose a simple approach to regularize the network's embeddings with respect to past tasks. We show that our method performs favorably with respect to state-of-the-art approaches in the literature, with bounded computational power and memory overheads.

### Handling highly correlated genes of Single-Cell RNA sequencing data in prediction models

Motivation: Selecting feature genes and predicting cells' phenotype are typical tasks in the analysis of scRNA-seq data. Many algorithms were developed for these tasks, but high correlations among genes create challenges specifically in scRNA-seq analysis, which are not well addressed. Highly correlated genes lead to collinearity and unreliable model fitting. Highly correlated genes compete with each other in feature selection, which causes underestimation of their importance. Most importantly, when a causal gene is highly correlated with other genes, most algorithms select one of them in a data-driven manner. The correlation structure among genes could change substantially. Hence, it is critical to build a prediction model based on causal genes but not their highly correlated genes. Results: To address the issues discussed above, we propose a grouping algorithm which can be integrated in prediction models. Using real benchmark scRNA-seq data sets and simulated cell phenotypes, we show our novel method significantly outperforms standard prediction models in the performance of both prediction and feature selection. Our algorithm reports the whole group of correlated genes, which allows researchers to conduct additional studies to identify the causal genes from the group. Availability: An R package is being developed and will be made available on the Comprehensive R Archive Network (CRAN). In the meantime, R code can be requested by email.

### Online Regularization for High-Dimensional Dynamic Pricing Algorithms

We propose a novel \textit{online regularization} scheme for revenue-maximization in high-dimensional dynamic pricing algorithms. The online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (\texttt{OORMLP}) algorithm with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop prediction error process with time-uniform non-asymptotic oracle inequalities. Non-asymptotic inference results of this type allow us to design safer and more robust dynamic pricing algorithms in practice. In theory, the proposed \texttt{OORMLP} algorithm exploits the sparsity structure of high-dimensional models and obtains a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online LASSO procedure that resolves dynamic pricing problems at the \textit{process} level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate \texttt{OORMLP} in different synthetic pricing problem settings and observe that \texttt{OORMLP} performs better than \texttt{RMLP} proposed in \cite{javanmard2019dynamic}.

### Adjusted Logistic Propensity Weighting Methods for Population Inference using Nonprobability Volunteer-Based Epidemiologic Cohorts

Many epidemiologic studies forgo probability sampling and turn to nonprobability volunteer-based samples because of cost, response burden, and invasiveness of biological samples. However, finite population inference is difficult to make from the nonprobability samples due to the lack of population representativeness. To make inferences at the population level using nonprobability samples, various inverse propensity score weighting (IPSW) methods have been studied with the propensity defined by the participation rate of population units in the nonprobability sample. In this paper, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates for nonprobability sample units. Compared to existing IPSW methods, the proposed ALP method is easy to implement by ready-to-use software while producing approximately unbiased estimators for population quantities regardless of the nonprobability sample rate. The efficiency of the ALP estimator can be further improved by scaling the survey sample weights in propensity estimation. Taylor linearization variance estimators are proposed for ALP estimators of finite population means that account for all sources of variability. The proposed ALP methods are evaluated numerically via simulation studies and empirically using the na\"ive unweighted National Health and Nutrition Examination Survey III sample, while taking the 1997 National Health Interview Survey as the reference, to estimate the 15-year mortality rates.

### Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noise. We establish a lower bound on the $L_2$ estimation error with respect to the GD iteration, which is away from zero without a delicate choice of early stopping. In turn, through a comprehensive analysis of $\ell_2$-regularized GD trajectories, we prove that for an overparametrized one-hidden-layer ReLU neural network with the $\ell_2$ regularization: (1) the output is close to that of the kernel ridge regression with the corresponding neural tangent kernel; (2) the minimax optimal rate of $L_2$ estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the $\ell_2$ regularization approach improves the training robustness and works for a wider range of neural networks.

### An Application of Newsboy Problem in Supply Chain Optimisation of Online Fashion E-Commerce

We describe a supply chain optimization model deployed in an online fashion e-commerce company in India called Myntra. Our model is simple, elegant and easy to put into service. The model utilizes historic data and predicts the quantity of Stock Keeping Units (SKUs) to hold so that the metrics "Fulfilment Index" and "Utilization Index" are optimized. We present the mathematics central to our model and compare its performance with baseline regression-based solutions.
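
For readers unfamiliar with the newsboy (newsvendor) problem named in the title, the sketch below shows its textbook solution: stock the demand quantile at the critical ratio $c_u / (c_u + c_o)$, where $c_u$ is the per-unit cost of understocking and $c_o$ the per-unit cost of overstocking. The function name, the cost parameters, and the use of an empirical quantile of past demand are assumptions for illustration. The abstract does not describe how the deployed model or its "Fulfilment Index" and "Utilization Index" metrics are defined, so none of that is reproduced here.

```python
import math


def newsvendor_quantity(demand_history, underage_cost, overage_cost):
    """Textbook newsvendor order quantity from an empirical demand distribution.

    Returns the smallest observed demand d whose empirical CDF value is at least
    the critical ratio underage_cost / (underage_cost + overage_cost).
    """
    if underage_cost <= 0 or overage_cost <= 0:
        raise ValueError("costs must be positive")
    if not demand_history:
        raise ValueError("demand_history must be non-empty")
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(demand_history)
    rank = max(1, math.ceil(critical_ratio * len(ordered)))  # 1-based rank of the quantile
    return ordered[rank - 1]


past_demand = [10, 20, 30, 40, 50]
# Lost sales cost three times as much as leftover stock, so the critical ratio is 0.75.
print(newsvendor_quantity(past_demand, underage_cost=3.0, overage_cost=1.0))
```

Raising the understocking cost relative to the overstocking cost pushes the recommended quantity toward the upper tail of historical demand, which is the basic trade-off between fulfilment and utilization.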

### Treatment effect bias from sample snooping: blinding outcomes is neither necessary nor sufficient

Popular guidance on observational data analysis states that outcomes should be blinded when determining matching criteria or propensity scores. Such a blinding is informally said to maintain the "objectivity" of the analysis (Rubin et al., 2008). To explore these issues, we begin by proposing a definition of objectivity based on the worst-case bias that can occur without blinding, which we call "added variable bias." This bias is indeed severe, and can diverge towards infinity as the sample size grows. However, we also show that bias of the same order of magnitude can occur even with a delineated, blinded design stage, so long as some prior knowledge is available that links covariates to outcomes. Finally, we outline an alternative sample partitioning procedure for estimating the average treatment effect on the controls, or the average treatment effect on the treated, while avoiding added variable bias. This procedure allows for the analysis to not be fully prespecified; uses all of the outcome data from all partitions in the final analysis step; and does not require blinding. Together, these results illustrate that outcome blinding is neither necessary nor sufficient for preventing added variable bias, and should not be considered a requirement when evaluating novel causal inference methods.

### Second-order matrix extension of Beta distribution and its high order moments

In this article, we consider a second-order matrix extension of the Beta distribution, that is, a distribution on second-order random matrices. We give an analytical formula for its high order moments, which is superior to general numerical integration methods.

### Closed-form variance estimators for weighted and stratified dose-response function estimators using generalized propensity score

Propensity score methods are widely used in observational studies for evaluating marginal treatment effects. The generalized propensity score (GPS) is an extension of the propensity score framework, historically developed in the case of binary exposures, for use with quantitative or continuous exposures. In this paper, we proposed variance estimators for treatment effect estimators on continuous outcomes. Dose-response functions (DRF) were estimated through weighting on the inverse of the GPS, or using stratification. Variance estimators were evaluated using Monte Carlo simulations. Despite the use of stabilized weights, the variability of the weighted estimator of the DRF was particularly high, and none of the variance estimators (a bootstrap-based estimator, a closed-form estimator especially developed to take into account the estimation step of the GPS, and a sandwich estimator) were able to adequately capture this variability, resulting in coverages below the nominal value, particularly when the proportion of the variation in the quantitative exposure explained by the covariates was large. The stratified estimator was more stable, and variance estimators (a bootstrap-based estimator, a pooled linearized estimator, and a pooled model-based estimator) were more efficient at capturing the empirical variability of the parameters of the DRF. The pooled variance estimators tended to overestimate the variance, whereas the bootstrap estimator, which intrinsically takes into account the estimation step of the GPS, resulted in correct variance estimations and coverage rates. These methods were applied to a real data set with the aim of assessing the effect of maternal body mass index on newborn birth weight.

### Testing normality in any dimension by Fourier methods in a multivariate Stein equation

We study a novel class of affine invariant and consistent tests for multivariate normality. The tests are based on a characterization of the standard $d$-variate normal distribution by means of the unique solution of an initial value problem connected to a partial differential equation, which is motivated by a multivariate Stein equation. The test criterion is a suitably weighted $L^2$-statistic. We derive the limit distribution of the test statistic under the null hypothesis as well as under contiguous and fixed alternatives to normality. A consistent estimator of the limiting variance under fixed alternatives as well as an asymptotic confidence interval of the distance of an underlying alternative with respect to the multivariate normal law is derived. In simulation studies, we show that the tests are strong in comparison with prominent competitors, and that the empirical coverage rate of the asymptotic confidence interval converges to the nominal level. We present a real data example, and we outline topics for further research.

### Surprise sampling: improving and extending the local case-control sampling

Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertaining to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai, et al. (2018), the proposed one has several advantages. It adaptively gives out the optimal forms to a variety of objectives, including the LCC and Ai et al. (2018)'s sampling as special cases. Under the same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on full data. We present theoretical justifications of the claimed advantages and optimality of the estimation and the sampling design. Different from Ai, et al. (2018), our large sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially only requires a specific loss function and no response-covariate structure of data is needed. Numerical studies are carried out and the evidence in support of the theory is shown.

### Consistency analysis of bilevel data-driven learning in inverse problems

One fundamental problem when solving inverse problems is how to find regularization parameters. This article considers solving this problem using data-driven bilevel optimization, i.e. we consider the adaptive learning of the regularization parameter from data by means of optimization. This approach can be interpreted as solving an empirical risk minimization problem, and we analyze its performance in the large data sample size limit for general nonlinear problems. To reduce the associated computational cost, online numerical schemes are derived using the stochastic gradient method. We prove convergence of these numerical schemes under suitable assumptions on the forward problem. Numerical experiments are presented illustrating the theoretical results and demonstrating the applicability and efficiency of the proposed approaches for various linear and nonlinear inverse problems, including Darcy flow, the eikonal equation, and an image denoising example.

### A review of spatial causal inference methods for environmental and epidemiological applications

The scientific rigor and computational methods of causal inference have had great impacts on many disciplines, but have only recently begun to take hold in spatial applications. Spatial causal inference poses analytic challenges due to complex correlation structures and interference between the treatment at one location and the outcomes at others. In this paper, we review the current literature on spatial causal inference and identify areas of future work. We first discuss methods that exploit spatial structure to account for unmeasured confounding variables. We then discuss causal analysis in the presence of spatial interference including several common assumptions used to reduce the complexity of the interference patterns under consideration. These methods are extended to the spatiotemporal case where we compare and contrast the potential outcomes framework with Granger causality, and to geostatistical analyses involving spatial random fields of treatments and responses. The methods are introduced in the context of observational environmental and epidemiological studies, and are compared using both a simulation study and analysis of the effect of ambient air pollution on COVID-19 mortality rate. Code to implement many of the methods using the popular Bayesian software OpenBUGS is provided.

### The FMRIB Variational Bayesian Inference Tutorial II: Stochastic Variational Bayes

Bayesian methods have proved powerful in many applications for the inference of model parameters from data. These methods are based on Bayes' theorem, which itself is deceptively simple. However, in practice the computations required are intractable even for simple cases. Hence methods for Bayesian inference have historically either been significantly approximate, e.g., the Laplace approximation, or achieve samples from the exact solution at significant computational expense, e.g., Markov Chain Monte Carlo methods. Since around the year 2000 so-called Variational approaches to Bayesian inference have been increasingly deployed. In its most general form Variational Bayes (VB) involves approximating the true posterior probability distribution via another more 'manageable' distribution, the aim being to achieve as good an approximation as possible. In the original FMRIB Variational Bayes tutorial we documented an approach to VB that took a 'mean field' approach to forming the approximate posterior, required the conjugacy of prior and likelihood, and exploited the Calculus of Variations to derive an iterative series of update equations, akin to Expectation Maximisation. In this tutorial we revisit VB, but now take a stochastic approach to the problem that potentially circumvents some of the limitations imposed by the earlier methodology. This new approach bears a lot of similarity to, and has benefited from, computational methods applied to machine learning algorithms. However, what we document here is still recognisably Bayesian inference in the classic sense, and not an attempt to use machine learning as a black-box to solve the inference problem.

### Non-Gaussian component analysis: testing the dimension of the signal subspace

Dimension reduction is a common strategy in multivariate data analysis which seeks a subspace which contains all interesting features needed for the subsequent analysis. Non-Gaussian component analysis attempts for this purpose to divide the data into a non-Gaussian part, the signal, and a Gaussian part, the noise. We will show that the simultaneous use of two scatter functionals can be used for this purpose and suggest a bootstrap test to test the dimension of the non-Gaussian subspace. Sequential application of the test can then for example be used to estimate the signal dimension.

### Yield curve and macroeconomy interaction: evidence from the non-parametric functional lagged regression approach

Viewing a yield curve as a sparse collection of measurements on a latent continuous random function allows us to model it statistically as a sparsely observed functional time series. Doing so, we use the state-of-the-art methods in non-parametric statistical inference for sparsely observed functional time series to analyse the lagged regression dependence of the US Treasury yield curve on US macroeconomic variables. Our non-parametric analysis confirms previous findings established under parametric assumptions, namely a strong impact of the federal funds rate on the short end of the yield curve and a moderate effect of the annual inflation on the longer end of the yield curve.

### Comparing representational geometries using the unbiased distance correlation

Representational similarity analysis (RSA) tests models of brain computation by investigating how neural activity patterns change in response to different experimental conditions. Instead of predicting activity patterns directly, the models predict the geometry of the representation, i.e. to what extent experimental conditions are associated with similar or dissimilar activity patterns. RSA therefore first quantifies the representational geometry by calculating a dissimilarity measure for all pairs of conditions, and then compares the estimated representational dissimilarities to those predicted by the model. Here we address two central challenges of RSA: First, dissimilarity measures such as the Euclidean, Mahalanobis, and correlation distance, are biased by measurement noise, which can lead to incorrect inferences. Unbiased dissimilarity estimates can be obtained by crossvalidation, at the price of increased variance. Second, the pairwise dissimilarity estimates are not statistically independent. Ignoring the dependency makes model comparison with RSA statistically suboptimal. We present an analytical expression for the mean and (co-)variance of both biased and unbiased estimators of Euclidean and Mahalanobis distance, allowing us to exactly quantify the bias-variance trade-off. We then use the analytical expression of the co-variance of the dissimilarity estimates to derive a simple method correcting for this covariance. Combining unbiased distance estimates with this correction leads to a novel criterion for comparing representational geometries, the unbiased distance correlation, which, as we show, allows for near optimal model comparison.

### Towards Efficient Connected and Automated Driving System via Multi-agent Graph Reinforcement Learning

Connected and automated vehicles (CAVs) have attracted more and more attention recently. Their fast actuation time gives them the potential to promote the efficiency and safety of the whole transportation system. Due to technical challenges, there will be a proportion of vehicles that can be equipped with automation while other vehicles are without automation. Instead of learning a reliable behavior for the ego automated vehicle, we focus on how to improve the outcomes of the total transportation system by allowing each automated vehicle to learn to cooperate with the others and regulate human-driven traffic flow. One state-of-the-art approach uses reinforcement learning to learn an intelligent decision-making policy. However, a direct reinforcement learning framework cannot improve the performance of the whole system. In this article, we demonstrate that considering the problem in a multi-agent setting with a shared policy can help achieve better system performance than a non-shared policy in a single-agent setting. Furthermore, we find that utilizing an attention mechanism on interaction features can capture the interplay between agents in order to boost cooperation. To the best of our knowledge, while previous automated driving studies mainly focus on enhancing an individual's driving performance, this work serves as a starting point for research on system-level multi-agent cooperation performance using graph information sharing. We conduct extensive experiments in car-following and unsignalized intersection settings. The results demonstrate that CAVs controlled by our method can achieve the best performance against several state-of-the-art baselines.

### Meta Learning for Causal Direction

The inaccessibility of controlled randomized trials due to inherent constraints in many fields of science has been a fundamental issue in causal inference. In this paper, we focus on distinguishing the cause from effect in the bivariate setting under limited observational data. Based on recent developments in meta learning as well as in causal inference, we introduce a novel generative model that allows distinguishing cause and effect in the small data setting. Using a learnt task variable that contains distributional information of each dataset, we propose an end-to-end algorithm that makes use of similar training datasets at test time. We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.

### Solving Bayesian Network Structure Learning Problem with Integer Linear Programming

This dissertation investigates the integer linear programming (ILP) formulation of the Bayesian network structure learning problem. We review the definition and key properties of Bayesian networks and explain score metrics used to measure how well a given Bayesian network structure fits the dataset. We outline the integer linear programming formulation based on the decomposability of score metrics. In order to ensure acyclicity of the structure, we add "cluster constraints" developed specifically for Bayesian networks, in addition to cycle constraints applicable to directed acyclic graphs in general. Since there would be an exponential number of these constraints if we specified them fully, we explain the methods to add them as cutting planes without declaring them all in the initial model. Also, we develop a heuristic algorithm that finds a feasible solution based on the idea of sink nodes in directed acyclic graphs. We implemented the ILP formulation and cutting planes as a \textsf{Python} package, and present the results of experiments with different settings on reference datasets.

### Does imputation matter? Benchmark for predictive models

Incomplete data are common in practical applications. Most predictive machine learning models do not handle missing values, so they require some preprocessing. Although many algorithms are used for data imputation, we do not understand the impact of the different methods on the predictive models' performance. This paper is the first to systematically evaluate the empirical effectiveness of data imputation algorithms for predictive models. The main contributions are (1) the recommendation of a general method for empirical benchmarking based on real-life classification tasks and (2) the comparative analysis of different imputation methods for a collection of data sets and a collection of ML algorithms.

### On optimal two-stage testing of multiple mediators

Mediation analysis in high-dimensional settings often involves identifying potential mediators among a large number of measured variables. For this purpose, a two-step familywise error rate procedure called ScreenMin has been recently proposed (Djordjilovi\'c et al. 2019). In ScreenMin, variables are first screened and only those that pass the screening are tested. The proposed threshold for selection has been shown to guarantee asymptotic familywise error rate control. In this work, we investigate the impact of the selection threshold on the finite sample familywise error rate. We derive a power maximizing selection threshold and show that it is well approximated by an adaptive threshold of Wang et al. (2016). We illustrate the investigated procedures on a case-control study examining the effect of fish intake on the risk of colorectal adenoma.
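
The screen-then-test structure can be sketched as follows. The details are assumptions that go beyond the abstract: each candidate mediator is taken to carry a pair of p-values (exposure-mediator and mediator-outcome), screening keeps candidates whose smaller p-value is at most a threshold, and the retained candidates are tested on their larger p-value with a Bonferroni correction over the number retained. The function name and arguments are hypothetical, and the power-maximizing and adaptive thresholds studied in the paper are not implemented.

```python
def screen_then_test(p_value_pairs, threshold, alpha=0.05):
    """Two-step screening and testing of candidate mediators (illustrative sketch).

    p_value_pairs: list of (p1, p2) tuples, one pair per candidate mediator.
    threshold: screening cutoff applied to min(p1, p2).
    alpha: target familywise error rate for the testing step.
    Returns the indices of candidates declared to be mediators.
    """
    # Step 1: screening on the smaller of the two p-values.
    selected = [i for i, (p1, p2) in enumerate(p_value_pairs) if min(p1, p2) <= threshold]
    if not selected:
        return []
    # Step 2: Bonferroni-corrected test on the larger p-value, among selected candidates only.
    cutoff = alpha / len(selected)
    return [i for i in selected if max(p_value_pairs[i]) <= cutoff]


pairs = [(0.001, 0.002), (0.001, 0.5), (0.3, 0.4), (0.0001, 0.02)]
print(screen_then_test(pairs, threshold=0.01, alpha=0.05))
```

The sketch makes the role of the threshold visible: a stricter threshold retains fewer candidates and therefore relaxes the Bonferroni cutoff in the second step, at the risk of screening out true mediators.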

### Cross-Fitting and Averaging for Machine Learning Estimation of Heterogeneous Treatment Effects

We investigate the finite sample performance of sample splitting, cross-fitting and averaging for the estimation of the conditional average treatment effect. Recently proposed methods, so-called meta-learners, make use of machine learning to estimate different nuisance functions and hence allow for fewer restrictions on the underlying structure of the data. To limit a potential overfitting bias that may result when using machine learning methods, cross-fitting estimators have been proposed. These involve splitting the data into different folds. To the best of our knowledge, it is not yet clear how exactly the data should be split and averaged. We employ a simulation study with different data generation processes and consider different estimators that vary in sample-splitting, cross-fitting and averaging procedures. We investigate the performance of each estimator independently on four different meta-learners: the doubly-robust-learner, the R-learner, the T-learner and the X-learner. We find that the performance of all meta-learners heavily depends on the procedure of splitting and averaging. The best performance in terms of mean squared error (MSE) could be achieved when using a 5-fold cross-fitting estimator which is averaged by the median over multiple different sample-splittings.
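
The combination of cross-fitting and median averaging can be sketched as below. This is a minimal illustration under stated assumptions, not the simulation design of the paper: it uses a T-learner with a single covariate, ordinary least squares lines in place of machine learning nuisance estimators, and hypothetical function names. It also assumes every training split contains units from both treatment arms.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def t_learner_cross_fit(x, t, y, n_folds, rng):
    """One cross-fitted T-learner pass: each unit is predicted by models fit on the other folds."""
    indices = list(range(len(x)))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    tau = [0.0] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        treated = [i for i in train if t[i] == 1]
        control = [i for i in train if t[i] == 0]
        a1, b1 = fit_line([x[i] for i in treated], [y[i] for i in treated])
        a0, b0 = fit_line([x[i] for i in control], [y[i] for i in control])
        for i in held_out:
            tau[i] = (a1 + b1 * x[i]) - (a0 + b0 * x[i])
    return tau


def median_cross_fit_cate(x, t, y, n_folds=5, n_splits=11, seed=0):
    """Repeat the cross-fitted pass over several random splits; take the per-unit median."""
    rng = random.Random(seed)
    runs = [t_learner_cross_fit(x, t, y, n_folds, rng) for _ in range(n_splits)]
    return [statistics.median(unit_estimates) for unit_estimates in zip(*runs)]
```

Taking the median rather than the mean across splits limits the influence of any single unlucky partition, which is the averaging choice the abstract reports as performing best in terms of MSE.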

### Stochastic Stein Discrepancies

Stein discrepancies (SDs) monitor convergence and non-convergence in approximate inference when exact integration and sampling are intractable. However, the computation of a Stein discrepancy can be prohibitive if the Stein operator - often a sum over likelihood terms or potentials - is expensive to evaluate. To address this deficiency, we show that stochastic Stein discrepancies (SSDs) based on subsampled approximations of the Stein operator inherit the convergence control properties of standard SDs with probability 1. In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter tuning, approximate MCMC sampler selection, and stochastic Stein variational gradient descent, SSDs deliver comparable inferences to standard SDs with orders of magnitude fewer likelihood evaluations.

### A Mathematical Theory of Attention

Attention is a powerful component of modern neural networks across a wide variety of domains. However, despite its ubiquity in machine learning, there is a gap in our understanding of attention from a theoretical point of view. We propose a framework to fill this gap by building a mathematically equivalent model of attention using measure theory. With this model, we are able to interpret self-attention as a system of self-interacting particles, we shed light on self-attention from a maximum entropy perspective, and we show that attention is actually Lipschitz-continuous (with an appropriate metric) under suitable assumptions. We then apply these insights to the problem of mis-specified input data; infinitely-deep, weight-sharing self-attention networks; and more general Lipschitz estimates for a specific type of attention studied in concurrent work.

### On the minmax regret for statistical manifolds: the role of curvature

Model complexity plays an essential role in model selection, namely, by choosing a model that fits the data and is also succinct. Two-part codes and the minimum description length have been successful in delivering procedures to single out the best models, avoiding overfitting. In this work, we pursue this approach and complement it by making further assumptions on the parameter space. Concretely, we assume that the parameter space is a smooth manifold, and by using tools of Riemannian geometry, we derive a sharper expression than the standard one given by the stochastic complexity, where the scalar curvature of the Fisher information metric plays a dominant role. Furthermore, we derive the minmax regret for general statistical manifolds and apply our results to derive optimal dimensional reduction in the context of principal component analysis.

### Descent-to-Delete: Gradient-Based Methods for Machine Unlearning

We study the data deletion problem for convex models. By leveraging techniques from convex optimization and reservoir sampling, we give the first data deletion algorithms that are able to handle an arbitrarily long sequence of adversarial updates while promising both per-deletion run-time and steady-state error that do not grow with the length of the update sequence. We also introduce several new conceptual distinctions: for example, we can ask that after a deletion, the entire state maintained by the optimization algorithm is statistically indistinguishable from the state that would have resulted had we retrained, or we can ask for the weaker condition that only the observable output is statistically indistinguishable from the observable output that would have resulted from retraining. We are able to give more efficient deletion algorithms under this weaker deletion criterion.

### A Unifying View of Optimism in Episodic Reinforcement Learning

The principle of optimism in the face of uncertainty underpins many theoretically successful reinforcement learning algorithms. In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs an optimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. Typically, it was thought that these two classes of algorithms were distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis while value-optimistic algorithms are easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address large-scale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods.

### Examining Redundancy in the Context of Safe Machine Learning

This paper describes a set of experiments with neural network classifiers on the MNIST database of digits. The purpose is to investigate naive implementations of redundant architectures as a first step towards safe and dependable machine learning. We report on a set of measurements using the MNIST database which ultimately serve to underline the expected difficulties in using NN classifiers in safe and dependable systems.

### The Effect of Class Imbalance on Precision-Recall Curves

In this note I study how the precision of a classifier depends on the ratio $r$ of positive to negative cases in the test set, as well as the classifier's true and false positive rates. This relationship allows prediction of how the precision-recall curve will change with $r$, which seems not to be well known. It also allows prediction of how $F_{\beta}$ and the Precision Gain and Recall Gain measures of Flach and Kull (2015) vary with $r$.
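
The dependence described above follows from the definitions: with $P$ positives and $N$ negatives, the true positives number $\mathrm{TPR}\cdot P$ and the false positives $\mathrm{FPR}\cdot N$, so precision equals $r\,\mathrm{TPR}/(r\,\mathrm{TPR}+\mathrm{FPR})$ with $r=P/N$. A minimal sketch of that identity (the function name `precision_from_rates` and the example operating point are illustrative assumptions, not taken from the note):

```python
def precision_from_rates(tpr: float, fpr: float, r: float) -> float:
    """Precision implied by TPR, FPR and r = (#positives / #negatives)."""
    # With P positives and N negatives: TP = tpr * P and FP = fpr * N.
    # Dividing TP / (TP + FP) through by N gives r*tpr / (r*tpr + fpr).
    denom = r * tpr + fpr
    if denom == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return r * tpr / denom


if __name__ == "__main__":
    # Same classifier operating point, increasingly rare positives.
    for r in (1.0, 0.1, 0.01):
        print(f"r={r:<5} precision={precision_from_rates(0.8, 0.1, r):.3f}")
```

Because recall (the TPR) does not depend on $r$, applying this function to every point of a ROC curve shows how the whole precision-recall curve shifts as the class ratio changes.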

### Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity

Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging, due to the need for marginalization over large or combinatorial sets. To circumvent this issue, one typically resorts to sampling-based approximations of the true marginal, requiring noisy gradient estimators (e.g., score function estimator) or continuous relaxations with lower-variance reparameterized gradients (e.g., Gumbel-Softmax). In this paper, we propose a new training strategy which replaces these estimators by an exact yet efficient marginalization. To achieve this, we parameterize discrete distributions over latent assignments using differentiable sparse mappings: sparsemax and its structured counterparts. In effect, the support of these distributions is greatly reduced, which enables efficient marginalization. We report successful results in three tasks covering a range of latent variable modeling applications: a semisupervised deep generative model, a latent communication game, and a generative model with a bit vector latent representation. In all cases, we obtain good performance while still achieving the practicality of sampling-based approximations.
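
To illustrate why a sparse mapping makes exact marginalization cheap, here is a minimal sketch of the standard sort-based sparsemax projection for a single categorical latent variable, followed by an expectation that sums only over the nonzero-probability assignments. The helper `exact_expectation` and the toy scores are assumptions for illustration; the paper's structured variants and training setup are not reproduced here.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    z_sorted = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # The support consists of the largest k entries satisfying this test.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for the current support size
    return [max(z - tau, 0.0) for z in scores]


def exact_expectation(probs, loss_fn):
    """Sum p(z) * loss(z) over the support only; zero-probability z are skipped."""
    return sum(p * loss_fn(z) for z, p in enumerate(probs) if p > 0.0)


if __name__ == "__main__":
    probs = sparsemax([1.0, 0.8, 0.1, -2.0])
    support = [z for z, p in enumerate(probs) if p > 0.0]
    print("probabilities:", probs)
    print("assignments evaluated:", support)
    print("expected loss:", exact_expectation(probs, lambda z: float(z) ** 2))
```

Unlike softmax, which assigns positive mass to every assignment, the projection can return exact zeros, so the downstream loss only needs to be evaluated on the surviving assignments.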

### Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation, which mimic different aspects of the teacher such as the representation space, decision boundary, or intra-data relationship. Some methods replace the one-way knowledge distillation from a static teacher with collaborative learning between a cohort of students. Despite the recent advances, a clear understanding of where knowledge resides in a deep neural network and an optimal method for capturing knowledge from the teacher and transferring it to the student remain open questions. In this work, we present an extensive study of nine different KD methods that cover a broad spectrum of approaches to capturing and transferring knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the design of more effective KD methods. Furthermore, our study shows the effectiveness of the KD framework in learning efficiently under varying severity levels of label noise and class imbalance, consistently providing generalization gains over standard training. We emphasize that the efficacy of KD goes well beyond model compression, and that it should be considered a general-purpose training paradigm that offers more robustness to common challenges in real-world datasets than the standard training procedure.

### Unsupervised Learning of Lagrangian Dynamics from Images for Prediction and Control

Recent approaches for modelling dynamics of physical systems with neural networks enforce Lagrangian or Hamiltonian structure to improve prediction and generalization. However, these approaches fail to handle the case when coordinates are embedded in high-dimensional data such as images. We introduce a new unsupervised neural network model that learns Lagrangian dynamics from images, with interpretability that benefits prediction and control. The model infers Lagrangian dynamics on generalized coordinates that are simultaneously learned with a coordinate-aware variational autoencoder (VAE). The VAE is designed to account for the geometry of physical systems composed of multiple rigid bodies in the plane. By inferring interpretable Lagrangian dynamics, the model learns physical system properties, such as kinetic and potential energy, which enables long-term prediction of dynamics in the image space and synthesis of energy-based controllers.

### A Coupled Manifold Optimization Framework to Jointly Model the Functional Connectomics and Behavioral Data Spaces

The problem of linking functional connectomics to behavior is extremely challenging due to the complex interactions between the two distinct, but related, data domains. We propose a coupled manifold optimization framework which projects fMRI data onto a low dimensional matrix manifold common to the cohort. The patient specific loadings simultaneously map onto a behavioral measure of interest via a second, non-linear, manifold. By leveraging the kernel trick, we can optimize over a potentially infinite dimensional space without explicitly computing the embeddings. As opposed to conventional manifold learning, which assumes a fixed input representation, our framework directly optimizes for embedding directions that predict behavior. Our optimization algorithm combines proximal gradient descent with the trust region method, which has good convergence guarantees. We validate our framework on resting state fMRI from fifty-eight patients with Autism Spectrum Disorder using three distinct measures of clinical severity. Our method outperforms traditional representation learning techniques in a cross validated setting, thus demonstrating the predictive power of our coupled objective.

### Integrating Neural Networks and Dictionary Learning for Multidimensional Clinical Characterizations from Functional Connectomics Data

We propose a unified optimization framework that combines neural networks with dictionary learning to model complex interactions between resting state functional MRI and behavioral data. The dictionary learning objective decomposes patient correlation matrices into a collection of shared basis networks and subject-specific loadings. These subject-specific features are simultaneously input into a neural network that predicts multidimensional clinical information. Our novel optimization framework combines the gradient information from the neural network with that of a conventional matrix factorization objective. This procedure collectively estimates the basis networks, subject loadings, and neural network weights most informative of clinical severity. We evaluate our combined model on a multi-score prediction task using 52 patients diagnosed with Autism Spectrum Disorder (ASD). Our integrated framework outperforms state-of-the-art methods in a ten-fold cross validated setting to predict three different measures of clinical severity.

### A Deep-Generative Hybrid Model to Integrate Multimodal and Dynamic Connectivity for Predicting Spectrum-Level Deficits in Autism

We propose an integrated deep-generative framework that jointly models complementary information from resting-state functional MRI (rs-fMRI) connectivity and diffusion tensor imaging (DTI) tractography to extract predictive biomarkers of a disease. The generative part of our framework is a structurally-regularized Dynamic Dictionary Learning (sr-DDL) model that decomposes the dynamic rs-fMRI correlation matrices into a collection of shared basis networks and time varying patient-specific loadings. This matrix factorization is guided by the DTI tractography matrices to learn anatomically informed connectivity profiles. The deep part of our framework is an LSTM-ANN block, which models the temporal evolution of the patient sr-DDL loadings to predict multidimensional clinical severity. Our coupled optimization procedure collectively estimates the basis networks, the patient-specific dynamic loadings, and the neural network weights. We validate our framework on a multi-score prediction task in 57 patients diagnosed with Autism Spectrum Disorder (ASD). Our hybrid model outperforms state-of-the-art baselines in a five-fold cross validated setting and extracts interpretable multimodal neural signatures of brain dysfunction in ASD.

### Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient

The exploration-exploitation dilemma has long been a crucial issue in reinforcement learning. In this paper, we propose a new approach to automatically balance the two. Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances the original task reward and the policy entropy, and hence controls the trade-off between exploitation and exploration. It has been shown empirically that SAC is very sensitive to this hyperparameter, and the follow-up work (SAC-v2), which uses constrained optimization for automatic adjustment, has some limitations. The core of our method, Meta-SAC, is to use metagradients along with a novel meta objective to automatically tune the entropy temperature in SAC. We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks, humanoid-v2.

### CICLAD: A Fast and Memory-efficient Closed Itemset Miner for Streams

Mining association rules from data streams is a challenging task due to the (typically) limited resources available vs. the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal in resource consumption, e.g. they store a large number of extra itemsets at an additional cost. In search of a better storage-efficiency trade-off, we designed Ciclad, an intersection-based sliding-window FCI miner. Leveraging in-depth insights into FCI evolution, it combines minimal storage with quick access. Experimental results indicate that Ciclad's memory footprint is much lower and its overall performance better than that of competing methods.

### On the application of transfer learning in prognostics and health management

Advancements in sensing and computing technologies, the development of human and computer interaction frameworks, big data storage capabilities, and the emergence of cloud storage and cloud computing have resulted in an abundance of data in modern industry. This data availability has encouraged researchers and industry practitioners to rely more than ever on data-based machine learning models, especially deep learning models, for fault diagnostics and prognostics. These models provide unique advantages; however, their performance is heavily dependent on the training data and how well that data represents the test data. This issue mandates fine-tuning and even training the models from scratch when there is a slight change in operating conditions or equipment. Transfer learning is an approach that can remedy this issue by keeping portions of what is learned from previous training and transferring them to the new application. In this paper, a unified definition of transfer learning and its different types is provided, Prognostics and Health Management (PHM) studies that have used transfer learning are reviewed in detail, and finally, a discussion of transfer learning application considerations and gaps is provided for improving the applicability of transfer learning in PHM.

### Structure-Aware Human-Action Generation

Generating long-range skeleton-based human actions has been a challenging problem since small deviations in one frame can cause a malformed action sequence. Most existing methods borrow ideas from video generation and naively treat skeleton nodes/joints as pixels of images without considering the rich inter-frame and intra-frame structure information, leading to potentially distorted actions. Graph convolutional networks (GCNs) are a promising way to leverage structure information to learn structure representations. However, directly adopting GCNs to tackle such continuous action sequences in both spatial and temporal spaces is challenging, as the action graph could be huge. To overcome this issue, we propose a variant of GCNs that leverages the powerful self-attention mechanism to adaptively sparsify a complete action graph in the temporal space. Our method can dynamically attend to important past frames and construct a sparse graph to apply in the GCN framework, capturing the structure information in action sequences well. Extensive experimental results demonstrate the superiority of our method on two standard human action datasets compared with existing methods.

### Building a Competitive Associative Classifier

With the huge success of deep learning, other machine learning paradigms have had to take a back seat. Yet other models, particularly rule-based ones, are more readable and explainable and can even be competitive when labelled data is not abundant. However, most of the existing rule-based classifiers suffer from the production of a large number of classification rules, affecting the model readability. This hampers the classification accuracy, as noisy rules might not add any useful information for classification, and also leads to longer classification time. In this study, we propose SigD2, which uses a novel two-stage pruning strategy that prunes most of the noisy, redundant and uninteresting rules and makes the classification model more accurate and readable. To make SigDirect more competitive with the most prevalent but uninterpretable machine learning-based classifiers like neural networks and support vector machines, we propose bagging and boosting on the ensemble of the SigDirect classifier. The results of the proposed algorithms are quite promising and we are able to obtain a minimal set of statistically significant rules for classification without jeopardizing the classification accuracy. We use 15 UCI datasets and compare our approach with eight existing systems. The SigD2 and boosted SigDirect (ACboost) ensemble models outperform various state-of-the-art classifiers not only in terms of classification accuracy but also in terms of the number of rules.

### Excess deaths hidden 100 days after the quarantine in Peru by COVID-19

Objective: To estimate the excess deaths caused by COVID-19 in the non-violent mortality of Peru, controlling for the effect of quarantine. Methods: Analysis of longitudinal data from the departments of Peru using official public information from the National Death Information System and the Ministry of Health of Peru. The analysis covers the period between January 1, 2018 and June 23, 2020 (100 days of quarantine). The daily death rate per million inhabitants was used. The days on which the departments were quarantined with the accumulated number of COVID-19 cases below a given limit were used to estimate the impact of quarantine. Three limits were established for cases: fewer than 1, 10 and 100 cases. Results: In Peru, the daily death rate per million inhabitants changed by -1.89 (95% CI: -2.70; -1.07) on quarantine days without COVID-19 cases. When comparing this result with the total number of non-violent deaths, the excess deaths during the first 100 days of quarantine amount to 36,230. This estimate is 1.12 times the estimate based on data from 2019 and 4.2 times the officially reported COVID-19 deaths. Conclusion: Quarantine reduced non-violent deaths; however, this reduction is overshadowed by the increase in deaths caused directly or indirectly by the pandemic. Therefore, the difference between the number of current deaths and that of past years underestimates the real excess of deaths.

### Linear Bandits with Limited Adaptivity and Learning Distributional Optimal Design

Motivated by practical needs such as large-scale learning, we study the impact of adaptivity constraints on linear contextual bandits, a central problem in online active learning. We consider two popular limited adaptivity models in the literature: batch learning and rare policy switches. We show that, when the context vectors are adversarially chosen in $d$-dimensional linear contextual bandits, the learner needs $\Omega(d \log T/ \log (d \log T))$ policy switches to achieve the minimax-optimal expected regret, almost matching the $O(d \log T)$ upper bound by Abbasi-Yadkori et al. [2011]; for stochastic context vectors, even in the more restricted batch learning model, only $O(\log \log T)$ batches are needed to achieve the optimal regret. Together with the known results in the literature, our results present a complete picture of the adaptivity constraints in linear contextual bandits. Along the way, we propose \emph{distributional optimal design}, a natural extension of the optimal experiment design, and provide a sample-efficient learning algorithm for the problem, which may be of independent interest.

### Bidirectional Model-based Policy Optimization

Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound of return discrepancy, which shows the superiority of BMPO against the one using merely the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.

### Nested Subspace Arrangement for Representation of Relational Data

Studies on acquiring appropriate continuous representations of discrete objects, such as graphs and knowledge base data, have been conducted by many researchers in the field of machine learning. In this study, we introduce Nested SubSpace (NSS) arrangement, a comprehensive framework for representation learning. We show that existing embedding techniques can be regarded as special cases of the NSS arrangement. Based on the concept of the NSS arrangement, we implement a Disk-ANChor ARrangement (DANCAR), a representation learning method specialized to reproducing general graphs. Numerical experiments have shown that DANCAR has successfully embedded WordNet in ${\mathbb R}^{20}$ with an F1 score of 0.993 in the reconstruction task. DANCAR is also suitable for visualization in understanding the characteristics of graphs.

### DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths

Over-parameterization is ubiquitous nowadays in training neural networks to benefit both optimization in seeking global optima and generalization in reducing prediction error. However, compressive networks are desired in many real world applications and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on differential inclusions of inverse scale spaces. Specifically, it generates a family of models from simple to complex ones that couples a pair of parameters to simultaneously train over-parameterized deep models and structural sparsity on weights of fully connected and convolutional layers. Such a differential inclusion scheme has a simple discretization, proposed as Deep structurally splitting Linearized Bregman Iteration (DessiLBI), for which we establish a global convergence analysis in deep learning: from any initialization, the algorithmic iterations converge to a critical point of the empirical risk. Experimental evidence shows that DessiLBI achieves comparable and even better performance than competitive optimizers in exploring the structural sparsity of several widely used backbones on benchmark datasets. Remarkably, with early stopping, DessiLBI unveils "winning tickets" in early epochs: effective sparse structures with test accuracy comparable to fully trained over-parameterized models.

### Humans-as-a-sensor for buildings: Intensive longitudinal indoor comfort models

Evaluating and optimising human comfort within the built environment is challenging due to the large number of physiological, psychological and environmental variables that affect occupant comfort preference. Humans are often better than sensors at capturing all of these disparate phenomena and interpreting their impact; the challenge is collecting spatially and temporally diverse subjective feedback in a scalable way. This paper presents a methodology to collect intensive longitudinal subjective feedback of comfort-based preference using micro ecological momentary assessments on a smartwatch platform. An experiment with 30 occupants over two weeks produced 4,378 field-based surveys for thermal, light, and noise preference. The occupants and the spaces in which they left feedback were then clustered according to these preference tendencies. These groups were used to create different feature sets with combinations of environmental and physiological variables, for use in a multi-class classification task. These classification models were trained on a feature set that was developed from time-series attributes, environmental and near-body sensors, heart rate, and the historical preferences of both the individual and the comfort group assigned. The most accurate model did not use environmental sensor data and yet had multi-class classification F1 micro scores of 64%, 80% and 86% for thermal, light, and noise preference, respectively. The discussion outlines how these models provide comfort preference prediction as good as or better than installed sensors, even in situations when some occupants are not willing or able to wear smartwatches. The approach presented prompts reflection on how the building analysis community evaluates, controls, and designs indoor environments.

### Discount Factor as a Regularizer in Reinforcement Learning

Specifying a Reinforcement Learning (RL) task involves choosing a suitable planning horizon, which is typically modeled by a discount factor. It is known that applying RL algorithms with a lower discount factor can act as a regularizer, improving performance in the limited data regime. Yet the exact nature of this regularizer has not been investigated. In this work, we fill in this gap. For several Temporal-Difference (TD) learning methods, we show an explicit equivalence between using a reduced discount factor and adding an explicit regularization term to the algorithm's loss. Motivated by the equivalence, we empirically study this technique compared to standard $L_2$ regularization by extensive experiments in discrete and continuous domains, using tabular and functional representations. Our experiments suggest the regularization effectiveness is strongly related to properties of the available data, such as size, distribution, and mixing rate.

### Relationship between manifold smoothness and adversarial vulnerability in deep learning with local errors

Artificial neural networks can achieve impressive performances, and even outperform humans in some specific tasks. Nevertheless, unlike biological brains, the artificial neural networks suffer from tiny perturbations in sensory input, under various kinds of adversarial attacks. It is therefore necessary to study the origin of the adversarial vulnerability. Here, we establish a fundamental relationship between geometry of hidden representations (manifold perspective) and the generalization capability of the deep networks. For this purpose, we choose a deep neural network trained by local errors, and then analyze emergent properties of trained networks through the manifold dimensionality, manifold smoothness, and the generalization capability. To explore effects of adversarial examples, we consider independent Gaussian noise attacks and fast-gradient-sign-method (FGSM) attacks. Our study reveals that a high generalization accuracy requires a relatively fast power-law decay of the eigen-spectrum of hidden representations. Under Gaussian attacks, the relationship between generalization accuracy and power-law exponent is monotonic, while a non-monotonic behavior is observed for FGSM attacks. Our empirical study provides a route towards a final mechanistic interpretation of adversarial vulnerability under adversarial attacks.

### RDP-GAN: A Rényi-Differential Privacy based Generative Adversarial Network

Generative adversarial networks (GANs) have attracted increasing attention recently owing to their impressive ability to generate realistic samples with high privacy protection. Without directly interacting with training examples, the generative model can be fully used to estimate the underlying distribution of an original dataset while the discriminative model can examine the quality of the generated samples by comparing the label values with the training examples. However, when GANs are applied to sensitive or private training examples, such as medical or financial records, they may still divulge individuals' sensitive and private information. To mitigate this information leakage and construct a private GAN, in this work we propose a Rényi-differentially private GAN (RDP-GAN), which achieves differential privacy (DP) in a GAN by carefully adding random noise to the value of the loss function during training. Moreover, we derive analytical results for the total privacy loss under the subsampling method and cumulated iterations, which show its effectiveness in privacy budget allocation. In addition, in order to mitigate the negative impact of the injected noise, we enhance the proposed algorithm by adding an adaptive noise tuning step, which changes the volume of added noise according to the testing accuracy. Through extensive experimental results, we verify that the proposed algorithm can achieve a better privacy level while producing high-quality samples compared with a benchmark DP-GAN scheme based on noise perturbation of training gradients.

### Deep Graph Random Process for Relational-Thinking-Based Speech Recognition

Lying at the core of human intelligence, relational thinking is characterized by initially relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge, consequently becoming a recognizable concept or object through coupling and transformation of these percepts. Such mental processes are difficult to model in real-world problems such as in conversational automatic speech recognition (ASR), as the percepts (if they are modelled as graphs indicating relationships among utterances) are supposed to be innumerable and not directly observable. In this paper, we present a Bayesian nonparametric deep learning method called deep graph random process (DGP) that can generate an infinite number of probabilistic graphs representing percepts. We further provide a closed-form solution for coupling and transformation of these percept graphs for acoustic modeling. Our approach is able to successfully infer relations among utterances without using any relational data during training. Experimental evaluations on ASR tasks including CHiME-2 and CHiME-5 demonstrate the effectiveness and benefits of our method.

### Simple and Deep Graph Convolutional Networks

Graph convolutional networks (GCNs) are a powerful deep learning approach for graph-structured data. Recently, GCNs and subsequent variants have shown superior performance in various application areas on real-world datasets. Despite their success, most of the current GCN models are shallow, due to the {\em over-smoothing} problem. In this paper, we study the problem of designing and analyzing deep graph convolutional networks. We propose GCNII, an extension of the vanilla GCN model with two simple yet effective techniques: {\em Initial residual} and {\em Identity mapping}. We provide theoretical and empirical evidence that the two techniques effectively relieve the problem of over-smoothing. Our experiments show that the deep GCNII model outperforms the state-of-the-art methods on various semi- and full-supervised tasks. Code is available at https://github.com/chennnM/GCNII .
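
For orientation, the two techniques can be written as a single layer update. The form below is a sketch as commonly stated for GCNII and is not given in the abstract itself; the symbols $\tilde{P}$ (normalized adjacency with self-loops), $\alpha_\ell$ and $\beta_\ell$ (per-layer mixing weights), and $\sigma$ (the activation) are notation assumed here:

$$
H^{(\ell+1)} = \sigma\Big( \big( (1-\alpha_\ell)\,\tilde{P} H^{(\ell)} + \alpha_\ell H^{(0)} \big) \big( (1-\beta_\ell)\, I_n + \beta_\ell W^{(\ell)} \big) \Big)
$$

The term $\alpha_\ell H^{(0)}$ is the initial residual, which keeps a fraction of the input representation at every depth, and the factor $(1-\beta_\ell) I_n + \beta_\ell W^{(\ell)}$ is the identity mapping, which keeps each layer's weight transform close to the identity when $\beta_\ell$ is small.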

### Off-Policy Exploitability-Evaluation and Equilibrium-Learning in Two-Player Zero-Sum Markov Games

Off-policy evaluation (OPE) is the problem of evaluating new policies using historical data obtained from a different policy. Off-policy learning (OPL), on the other hand, is the problem of finding an optimal policy using historical data. Most recent studies of OPE and OPL have focused on one-player cases, and not on cases with two or more players. In this study, we propose methods for OPE and OPL in two-player zero-sum Markov games. For OPE, we estimate exploitability, which is often used as a metric for determining how close a strategy profile is to a Nash equilibrium in two-player zero-sum games. For OPL, we calculate maximin policies as Nash equilibrium strategies over the historical data. We prove exploitability estimation error bounds for OPE and regret bounds for OPL based on the doubly robust and double reinforcement learning estimators. Finally, we demonstrate the effectiveness and performance of the proposed methods through experiments.
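
As a point of reference for the exploitability metric, the sketch below computes it exactly in the simplest setting: a single-state zero-sum matrix game with a known payoff matrix. It does not implement the off-policy estimators from the paper; the function name `exploitability`, the payoff convention (entries are payoffs to the row player), and the matching-pennies example are assumptions for illustration.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives its negative.
    The result is zero exactly when the strategy pair is a Nash equilibrium.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Row player's value for each pure action against the column mix.
    row_values = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Row player's value when the column player answers the row mix with pure action j.
    col_values = [
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # Best response of the row player minus the worst case inflicted by the column player.
    return max(row_values) - min(col_values)


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    print(exploitability(matching_pennies, [0.5, 0.5], [0.5, 0.5]))
    print(exploitability(matching_pennies, [1.0, 0.0], [0.5, 0.5]))
```

In a Markov game the same quantity is defined through best-response values over policies, which is what has to be estimated from historical data when the game model is unknown.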

### Variational Policy Gradient Method for Reinforcement Learning with General Utilities

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.

### Scalable Differentiable Physics for Learning and Control

Differentiable physics is a powerful approach to learning and control problems that involve physical objects and environments. While notable progress has been made, the capabilities of differentiable physics solvers remain limited. We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. To accommodate objects with arbitrary geometry and topology, we adopt meshes as our representation and leverage the sparsity of contacts for scalable differentiable collision handling. Collisions are resolved in localized regions to minimize the number of optimization variables even when the number of simulated objects is high. We further accelerate implicit differentiation of optimization with nonlinear constraints. Experiments demonstrate that the presented framework requires up to two orders of magnitude less memory and computation in comparison to recent particle-based methods. We further validate the approach on inverse problems and control scenarios, where it outperforms derivative-free and model-free baselines by at least an order of magnitude.

### Deep Active Learning via Open Set Recognition

In many applications, data is easy to acquire but expensive and time-consuming to label; prominent examples include medical imaging and NLP. This disparity has only grown in recent years as our ability to collect data improves. Under these constraints, it makes sense to select only the most informative instances from the unlabeled pool and request an oracle (e.g., a human expert) to provide labels for those samples. The goal of active learning is to infer the informativeness of unlabeled samples so as to minimize the number of requests to the oracle. Here, we formulate active learning as an open-set recognition problem. In this latter paradigm, only some of the inputs belong to known classes; the classifier must identify the rest as unknown. More specifically, we leverage variational neural networks (VNNs), which produce high-confidence (i.e., low-entropy) predictions only for inputs that closely resemble the training data. We use the inverse of this confidence measure to select the samples that the oracle should label. Intuitively, unlabeled samples that the VNN is uncertain about are more informative for future training. We carried out an extensive evaluation of our novel, probabilistic formulation of active learning, achieving state-of-the-art results on CIFAR-10 and CIFAR-100. In addition, unlike current active learning methods, our algorithm can learn tasks with non-i.i.d. distributions, without the need for task labels. As our experiments show, when the unlabeled pool consists of a mixture of samples from multiple tasks, our approach can automatically distinguish between samples from seen vs. unseen tasks.

### Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels

In weakly supervised learning, the unbiased risk estimator (URE) is a powerful tool for training classifiers when training and test data are drawn from different distributions. Nevertheless, UREs lead to overfitting in many problem settings when the models are complex, like deep networks. In this paper, we investigate reasons for such overfitting by studying a weakly supervised problem called learning with complementary labels. We argue that the quality of gradient estimation matters more in risk minimization. Theoretically, we show that a URE gives an unbiased gradient estimator (UGE). Practically, however, UGEs may suffer from huge variance, which causes empirical gradients to be usually far away from true gradients during minimization. To this end, we propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance and makes empirical gradients more aligned with true gradients in direction. Thanks to this characteristic, SCL successfully mitigates the overfitting issue and improves URE-based methods.

### Faster algorithms for Markov equivalence

Maximal ancestral graphs (MAGs) have many desirable properties; in particular they can fully describe conditional independences from directed acyclic graphs (DAGs) in the presence of latent and selection variables. However, different MAGs may encode the same conditional independences, and are said to be *Markov equivalent*. Thus identifying necessary and sufficient conditions for equivalence is essential for structure learning. Several criteria for this already exist, but in this paper we give a new non-parametric characterization in terms of the heads and tails that arise in the parameterization for discrete models. We also provide a polynomial time algorithm ($O(ne^{2})$, where $n$ and $e$ are the number of vertices and edges respectively) to verify equivalence. Moreover, we extend our criterion to ADMGs and summary graphs and propose an algorithm that converts an ADMG or summary graph to an equivalent MAG in polynomial time ($O(n^{2}e)$). Hence by combining both algorithms, we can also verify equivalence between two summary graphs or ADMGs.

### Modelling Human Mobility considering Spatial, Temporal and Social Dimensions

Modelling human mobility is crucial in several areas, from urban planning to epidemic modeling, traffic forecasting, and what-if analysis. On the one hand, existing models focus mainly on reproducing the spatial and temporal dimensions of human mobility, while the social aspect, though it influences human movements significantly, is often neglected. On the other hand, those models that capture some social aspects of human mobility have trivial and unrealistic spatial and temporal mechanisms. In this paper, we propose STS-EPR, a modeling framework that embeds mechanisms to capture the spatial, temporal, and social aspects together. Our experiments show that STS-EPR outperforms existing spatial-temporal or social models on a set of standard mobility metrics and that it can be used with a limited amount of information without any significant loss of realism. STS-EPR, which is open-source and tested on open data, is a step towards the design of mechanistic models that can capture all the aspects of human mobility in a comprehensive way.

### Block Model Guided Unsupervised Feature Selection

Feature selection is a core area of data mining with a recent innovation of graph-driven unsupervised feature selection for linked data. In this setting we have a dataset $\mathbf{Y}$ consisting of $n$ instances each with $m$ features and a corresponding $n$ node graph (whose adjacency matrix is $\mathbf{A}$) with an edge indicating that the two instances are similar. Existing efforts for unsupervised feature selection on attributed networks have explored either directly regenerating the links by solving for $f$ such that $f(\mathbf{y}_i,\mathbf{y}_j) \approx \mathbf{A}_{i,j}$ or finding community structure in $\mathbf{A}$ and using the features in $\mathbf{Y}$ to predict these communities. However, graph-driven unsupervised feature selection remains an understudied area with respect to exploring more complex guidance. Here we take the novel approach of first building a block model on the graph and then using the block model for feature selection. That is, we discover $\mathbf{F}\mathbf{M}\mathbf{F}^T \approx \mathbf{A}$ and then find a subset of features $\mathcal{S}$ that induces another graph to preserve both $\mathbf{F}$ and $\mathbf{M}$. We call our approach Block Model Guided Unsupervised Feature Selection (BMGUFS). Experimental results show that our method outperforms the state of the art on several real-world public datasets in finding high-quality features for clustering.

### Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

This paper seeks to establish a framework for directing a society of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems. What makes it challenging to use a decentralized approach to collectively optimize a central objective is the difficulty in characterizing the equilibrium strategy profile of non-cooperative games. To overcome this challenge, we design a mechanism for defining the learning environment of each agent for which we know that the optimal solution for the global objective coincides with a Nash equilibrium strategy profile of the agents optimizing their own local objectives. The society functions as an economy of agents that learn the credit assignment process itself by buying and selling to each other the right to operate on the environment state. We derive a class of decentralized reinforcement learning algorithms that are broadly applicable not only to standard reinforcement learning but also for selecting options in semi-MDPs and dynamically composing computation graphs. Lastly, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning.

### Few-shot Relation Extraction via Bayesian Meta-learning on Relation Graphs

This paper studies few-shot relation extraction, which aims at predicting the relation for a pair of entities in a sentence by training with a few labeled examples in each relation. To more effectively generalize to new relations, in this paper we study the relationships between different relations and propose to leverage a global relation graph. We propose a novel Bayesian meta-learning approach to effectively learn the posterior distribution of the prototype vectors of relations, where the initial prior of the prototype vectors is parameterized with a graph neural network on the global relation graph. Moreover, to effectively optimize the posterior distribution of the prototype vectors, we propose to use the stochastic gradient Langevin dynamics, which is related to the MAML algorithm but is able to handle the uncertainty of the prototype vectors. The whole framework can be effectively and efficiently optimized in an end-to-end fashion. Experiments on two benchmark datasets prove the effectiveness of our proposed approach against competitive baselines in both the few-shot and zero-shot settings.

### Efficient Parameter Estimation of Truncated Boolean Product Distributions

We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions, when the samples are truncated by a set $S \subset \{0, 1\}^d$ accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples is considered in a discrete setting. We introduce a natural notion of fatness of the truncation set $S$, under which truncated samples reveal enough information about the true distribution. We show that if the truncation set is sufficiently fat, samples from the true distribution can be generated from truncated samples. A stunning consequence is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, uniformity or identity testing) that can be performed efficiently for Boolean product distributions, can also be performed from truncated samples, with a small increase in sample complexity. We generalize our approach to ranking distributions over $d$ alternatives, where we show how fatness implies efficient parameter estimation of Mallows models from truncated samples. Exploring the limits of learning discrete models from truncated samples, we identify three natural conditions that are necessary for efficient identifiability: (i) the truncation set $S$ should be rich enough; (ii) $S$ should be accessible through membership queries; and (iii) the truncation by $S$ should leave enough randomness in all directions. By carefully adapting the Stochastic Gradient Descent approach of (Daskalakis et al., FOCS 2018), we show that these conditions are also sufficient for efficient learning of truncated Boolean product distributions.

### Meta-Semi: A Meta-learning Approach for Semi-supervised Learning

Deep learning based semi-supervised learning (SSL) algorithms have led to promising results in recent years. However, they tend to introduce multiple tunable hyper-parameters, making them less practical in real SSL scenarios where the labeled data is scarce for extensive hyper-parameter search. In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning only one additional hyper-parameter, compared with a standard supervised deep learning algorithm, to achieve competitive performance under various conditions of SSL. We start by defining a meta optimization problem that minimizes the loss on labeled data through dynamically reweighting the loss on unlabeled samples, which are associated with soft pseudo labels during training. As the meta problem is computationally intensive to solve directly, we propose an efficient algorithm to dynamically obtain the approximate solutions. We show theoretically that Meta-Semi converges to the stationary point of the loss function on labeled data under mild conditions. Empirically, Meta-Semi outperforms state-of-the-art SSL algorithms significantly on the challenging semi-supervised CIFAR-100 and STL-10 tasks, and achieves competitive performance on CIFAR-10 and SVHN.

### Selective Dyna-style Planning Under Limited Model Capacity

In model-based reinforcement learning, planning with an imperfect model of the environment has the potential to harm learning progress. But even when a model is imperfect, it may still contain information that is useful for planning. In this paper, we investigate the idea of using an imperfect model selectively. The agent should plan in parts of the state space where the model would be helpful but refrain from using the model where it would be harmful. An effective selective planning mechanism requires estimating predictive uncertainty, which arises out of aleatoric uncertainty, parameter uncertainty, and model inadequacy, among other sources. Prior work has focused on parameter uncertainty for selective planning. In this work, we emphasize the importance of model inadequacy. We show that heteroscedastic regression can signal predictive uncertainty arising from model inadequacy that is complementary to that which is detected by methods designed for parameter uncertainty, indicating that considering both parameter uncertainty and model inadequacy may be a more promising direction for effective selective planning than either in isolation.
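
As a small illustration of the heteroscedastic regression idea, the sketch below evaluates a Gaussian negative log-likelihood in which the model predicts a variance alongside the mean. This loss is a standard choice for heteroscedastic regression, assumed here for illustration; the abstract does not state the exact objective, and the function name is hypothetical.

```python
import math


def gaussian_nll(y, mean, variance):
    """Negative log-likelihood of y under N(mean, variance), up to a constant."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(variance) + (y - mean) ** 2 / variance)


# Where the model's mean is far off, a large predicted variance is cheaper ...
poor_fit_confident = gaussian_nll(3.0, 0.0, 1.0)
poor_fit_uncertain = gaussian_nll(3.0, 0.0, 9.0)
# ... but where the mean is accurate, a large predicted variance is penalised.
good_fit_confident = gaussian_nll(0.0, 0.0, 1.0)
good_fit_uncertain = gaussian_nll(0.0, 0.0, 9.0)
```

Because the loss rewards a large variance only where the mean prediction is poor, the learned variance can act as a signal of where the model is inadequate. A planner could then skip model rollouts in those regions.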

### Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Clusters for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMTC) is a task for tagging a given text with the most relevant labels from an extremely large label set. We propose a novel deep learning method called APLC-XLNet. Our approach fine-tunes the recently released generalized autoregressive pretrained model (XLNet) to learn a dense representation for the input text. We propose Adaptive Probabilistic Label Clusters (APLC) to approximate the cross entropy loss by exploiting the unbalanced label distribution to form clusters that explicitly reduce the computational time. Our experiments, carried out on five benchmark datasets, show that our approach significantly outperforms existing state-of-the-art methods. Our source code is available publicly at https://github.com/huiyegit/APLC_XLNet.

### Overlaying Spaces and Practical Applicability of Complex Geometries

Recently, non-Euclidean spaces became popular for embedding structured data. Following hyperbolic and spherical spaces, more general product spaces have been proposed. However, searching for the best configuration of a product space is a resource-intensive procedure, which reduces the practical applicability of the idea. We introduce a novel concept of overlaying spaces that does not have the problem of configuration search and outperforms the competitors in structured data embedding tasks, when the aim is to preserve all distances. On the other hand, for local loss functions (e.g., for ranking losses), the dot-product similarity, which is often overlooked in graph embedding literature since it cannot be converted to a metric, outperforms all metric spaces. We discuss advantages of the dot product over proper metric spaces.

### Novel min-max reformulations of Linear Inverse Problems

In this article, we delve into the class of so-called ill-posed Linear Inverse Problems (LIP), which simply refers to the task of recovering the entire signal from its relatively few random linear measurements. Such problems arise in a variety of settings, with applications ranging from medical image processing to recommender systems. We propose a slightly generalized version of the error constrained linear inverse problem and obtain a novel and equivalent convex-concave min-max reformulation by providing an exposition of its convex geometry. Saddle points of the min-max problem are completely characterized in terms of a solution to the LIP, and vice versa. Applying simple saddle point seeking ascent-descent type algorithms to solve the min-max problems provides novel and simple algorithms to find a solution to the LIP. Moreover, the reformulation of an LIP as the min-max problem provided in this article is crucial in developing methods to solve the dictionary learning problem with almost sure recovery constraints.

### Momentum Accelerates Evolutionary Dynamics

We combine momentum from machine learning with evolutionary dynamics, where momentum can be viewed as a simple mechanism of intergenerational memory. Using information divergences as Lyapunov functions, we show that momentum accelerates the convergence of evolutionary dynamics including the replicator equation and Euclidean gradient descent on populations. When evolutionarily stable states are present, these methods prove convergence for small learning rates or small momentum, and yield an analytic determination of the relative decrease in time to converge that agrees well with computations. The main results apply even when the evolutionary dynamic is not a gradient flow. We also show that momentum can alter the convergence properties of these dynamics, for example by breaking the cycling associated to the rock-paper-scissors landscape, leading to either convergence to the ordinarily non-absorbing equilibrium, or divergence, depending on the value and mechanism of momentum.

### Can Un-trained Neural Networks Compete with Trained Neural Networks at Image Reconstruction?

Convolutional Neural Networks (CNNs) are highly effective for image reconstruction problems. Typically, CNNs are trained on large amounts of training images. Recently, however, un-trained neural networks such as the Deep Image Prior and Deep Decoder have achieved excellent image reconstruction performance for standard image reconstruction problems such as image denoising and image inpainting, without using any training data. This success raises the question whether un-trained neural networks can compete with trained ones for practical imaging tasks. To address this question, we consider accelerated magnetic resonance imaging (MRI), an important medical imaging problem, which has received significant attention from the deep-learning community, and for which a dedicated training set exists. We study and optimize un-trained architectures, and as a result, propose a variation of the architectures of the deep image prior and deep decoder. We show that the resulting convolutional decoder out-performs other un-trained methods and---most importantly---achieves on-par performance with a standard trained baseline, the U-net, on the FastMRI dataset, a new dataset for benchmarking deep learning based reconstruction methods. Besides achieving on-par reconstruction performance relative to trained methods, we demonstrate that a key advantage over trained methods is robustness to out-of-distribution examples.

### Deep Learning for Anomaly Detection: A Review

Anomaly detection, a.k.a. outlier detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This paper reviews the research of deep anomaly detection with a comprehensive taxonomy of detection methods, covering advancements in three high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages and disadvantages, and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges.

### Explaining Fast Improvement in Online Policy Optimization

Online policy optimization (OPO) views policy optimization for sequential decision making as an online learning problem. In this framework, the algorithm designer defines a sequence of online loss functions such that the regret rate in online learning implies the policy convergence rate and the minimal loss witnessed by the policy class determines the policy performance bias. This reduction technique has been successfully applied to solving various policy optimization problems, including imitation learning, structured prediction, and system identification. Interestingly, the policy improvement speed observed in practice is usually much faster than existing theory suggests. In this work, we provide an explanation of this fast policy improvement phenomenon. Let $\epsilon$ denote the policy class bias and assume the online loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of OPO with stochastic feedback, the policy converges in $\tilde{O}(1/N + \sqrt{\epsilon/N})$ in both expectation and high probability. In other words, we show that adopting a sufficiently expressive policy class in OPO has two benefits: both the convergence rate increases and the performance bias decreases, as the policy class becomes reasonably rich. This new theoretical insight is further verified in an online imitation learning experiment.

### Covariate Distribution Aware Meta-learning

Meta-learning has proven to be successful at few-shot learning across the regression, classification and reinforcement learning paradigms. Recent approaches have adopted Bayesian interpretations to improve gradient based meta-learners by quantifying the uncertainty of the post-adaptation estimates. Most of these works almost completely ignore the latent relationship between the covariate distribution (p(x)) of a task and the corresponding conditional distribution p(y|x). In this paper, we identify the need to explicitly model the meta-distribution over the task covariates in a hierarchical Bayesian framework. We begin by introducing a graphical model that explicitly leverages very few samples drawn from p(x) to better infer the posterior over the optimal parameters of the conditional distribution (p(y|x)) for each task. Based on this model we provide an inference strategy and a corresponding meta-algorithm that explicitly accounts for the meta-distribution over task covariates. Finally, we demonstrate the significant gains of our proposed algorithm on a synthetic regression dataset.

### Learning Implicit Credit Assignment for Multi-Agent Actor-Critic

We present a new policy-based multi-agent reinforcement learning algorithm that implicitly addresses the credit assignment problem under fully cooperative settings. Our key motivation is that credit assignment may not require an explicit formulation as long as (1) the policy gradients of a trained, centralized critic carry sufficient information for the decentralized agents to maximize the critic estimate through optimal cooperation and (2) a sustained level of agent exploration is enforced throughout training. In this work, we achieve the former by formulating the centralized critic as a hypernetwork such that the latent state representation is now fused into the policy gradients through its multiplicative association with the agent policies, and we show that this is key to learning optimal joint actions that may otherwise require explicit credit assignment. To achieve the latter, we further propose a practical technique called adaptive entropy regularization where magnitudes of the policy gradients from the entropy term are dynamically rescaled to sustain consistent levels of exploration throughout training. Our final algorithm, which we call LICA, is evaluated on several benchmarks including the multi-agent particle environments and a set of challenging StarCraft II micromanagement tasks, and we show that LICA significantly outperforms previous methods.

### Tensor Convolutional Sparse Coding with Low-Rank activations, an application to EEG analysis

Recently, there has been growing interest in the analysis of spectrograms of ElectroEncephaloGram (EEG), particularly to study the neural correlates of (un)-consciousness during General Anesthesia (GA). Indeed, it has been shown that order three tensors (channels x frequencies x times) are a natural and useful representation of these signals. However, this encoding entails significant difficulties, especially for convolutional sparse coding (CSC), as existing methods do not take advantage of the particularities of tensor representation, such as rank structures, and are vulnerable to the high level of noise and perturbations that are inherent to EEG during medical acts. To address this issue, in this paper we introduce a new CSC model, named Kruskal CSC (K-CSC), that uses the Kruskal decomposition of the activation tensors to leverage the intrinsic low rank nature of these representations in order to extract relevant and interpretable encodings. Our main contribution, TC-FISTA, uses multiple tools to efficiently solve the resulting optimization problem despite the increasing complexity induced by the tensor representation. We then evaluate TC-FISTA on both a synthetic dataset and real EEG recorded during GA. The results show that TC-FISTA is robust to noise and perturbations, resulting in accurate, sparse and interpretable encoding of the signals.

### Learning from Failure: Training Debiased Classifier from Biased Classifier

Neural networks often learn to make predictions that overly rely on spurious correlations existing in the dataset, which causes the model to be biased. While previous work tackles this issue with domain-specific knowledge or explicit supervision on the spuriously correlated attributes, we instead tackle a more challenging setting where such information is unavailable. To this end, we first observe that neural networks learn to rely on the spurious correlation only when it is "easier" to learn than the desired knowledge, and such reliance is most prominent during the early phase of training. Based on these observations, we propose a failure-based debiasing scheme by training a pair of neural networks simultaneously. Our main idea is twofold: (a) we intentionally train the first network to be biased by repeatedly amplifying its "prejudice", and (b) we debias the training of the second network by focusing on samples that go against the prejudice of the biased network in (a). Extensive experiments demonstrate that our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets. Surprisingly, our framework even occasionally outperforms the debiasing methods requiring explicit supervision of the spuriously correlated attributes.

### A Novel Random Forest Dissimilarity Measure for Multi-View Learning

Multi-view learning is a learning task in which data is described by several concurrent representations. Its main challenge is most often to exploit the complementarities between these representations to help solve a classification/regression task. This is a challenge that can be met nowadays if there is a large amount of data available for learning. However, this is not necessarily true for all real-world problems, where data are sometimes scarce (e.g. problems related to the medical environment). In these situations, an effective strategy is to use intermediate representations based on the dissimilarities between instances. This work presents new ways of constructing these dissimilarity representations, learning them from data with Random Forest classifiers. More precisely, two methods are proposed, which modify the Random Forest proximity measure, to adapt it to the context of High Dimension Low Sample Size (HDLSS) multi-view classification problems. The second method, based on an Instance Hardness measurement, is significantly more accurate than other state-of-the-art measurements including the original RF Proximity measurement and the Large Margin Nearest Neighbor (LMNN) metric learning measurement.
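
For orientation, the classical Random Forest proximity that the two proposed methods modify can be sketched as follows: two instances are close when many trees route them to the same leaf. The sketch covers only this baseline measure, not the paper's variants. The function names are hypothetical, and the leaf indices are assumed to be precomputed (for example with scikit-learn's `RandomForestClassifier.apply`).

```python
def rf_proximity(leaves_a, leaves_b):
    """Fraction of trees in which two instances fall into the same leaf."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("need one leaf index per tree for both instances")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def rf_dissimilarity_matrix(leaf_indices):
    """Pairwise dissimilarity 1 - proximity; leaf_indices[i][t] is the leaf
    reached by instance i in tree t."""
    n = len(leaf_indices)
    return [[1.0 - rf_proximity(leaf_indices[i], leaf_indices[j])
             for j in range(n)] for i in range(n)]


# Three instances routed through a toy forest of four trees.
leaves = [
    [1, 4, 2, 7],
    [1, 4, 3, 7],
    [5, 6, 3, 8],
]
D = rf_dissimilarity_matrix(leaves)
```

Each row of `D` can then serve as the dissimilarity representation of one instance for a given view. In a multi-view setting, one such matrix would be built per view before the views are combined.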

### Multi-Kernel Fusion for RBF Neural Networks

A simple yet effective architectural design of radial basis function neural networks (RBFNN) makes them amongst the most popular conventional neural networks. The current generation of radial basis function neural networks is equipped with multiple kernels, which provide significant performance benefits compared to the previous generation using only a single kernel. In existing multi-kernel RBF algorithms, the multi-kernel is formed by a convex combination of the base/primary kernels. In this paper, we propose a novel multi-kernel RBFNN in which every base kernel has its own (local) weight. This novel flexibility in the network provides better performance, such as a faster convergence rate, better local minima, and resilience against getting stuck in poor local minima. These performance gains are achieved at a competitive computational complexity compared to the contemporary multi-kernel RBF algorithms. The proposed algorithm is thoroughly analysed for performance gain using mathematical and graphical illustrations and is also evaluated on three different types of problems, namely (i) pattern classification, (ii) system identification and (iii) function approximation. Empirical results clearly show the superiority of the proposed algorithm compared to the existing state-of-the-art multi-kernel approaches.

### [109] 2007.02613

Adversarial risk analysis (ARA) is a relatively new area of research that informs decision-making when facing intelligent opponents and uncertain outcomes. It enables an analyst to express her Bayesian beliefs about an opponent's utilities, capabilities, probabilities and the type of strategic calculation that the opponent is using. Within that framework, the analyst then solves the problem from the perspective of the opponent while placing subjective probability distributions on all unknown quantities. This produces a distribution over the actions of the opponent that permits the analyst to maximize her expected utility. This overview covers conceptual, modeling, computational and applied issues in ARA.

### Understanding and Improving Fast Adversarial Training

A recent line of work focused on making adversarial training computationally efficient for deep learning models. In particular, Wong et al. (2020) showed that $\ell_\infty$-adversarial training with fast gradient sign method (FGSM) can fail due to a phenomenon called "catastrophic overfitting", when the model quickly loses its robustness over a single epoch of training. We show that adding a random step to FGSM, as proposed in Wong et al. (2020), does not prevent catastrophic overfitting, and that randomness is not important per se -- its main role being simply to reduce the magnitude of the perturbation. Moreover, we show that catastrophic overfitting is not inherent to deep and overparametrized networks, but can occur in a single-layer convolutional network with a few filters. In an extreme case, even a single filter can make the network highly non-linear locally, which is the main reason why FGSM training fails. Based on this observation, we propose a new regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and improves the quality of the FGSM solution. As a result, GradAlign allows FGSM training to be applied successfully also for larger $\ell_\infty$-perturbations and reduces the gap to multi-step adversarial training. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training.
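The two quantities the abstract refers to, the FGSM step and the gradient alignment inside the perturbation set, can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation: it assumes a linear logistic model with hypothetical weights `w` and bias `b`, and it measures alignment as the cosine similarity between the input gradient at `x` and at a random point of the $\ell_\infty$ ball, which is an assumption of this sketch rather than something stated in the abstract.

```python
import math
import random

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x (linear model)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the sign of the gradient."""
    g = input_gradient(w, b, x, y)
    return [xi + eps * math.copysign(1.0, gi) if gi != 0 else xi
            for xi, gi in zip(x, g)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def gradient_alignment(w, b, x, y, eps, rng):
    """Cosine similarity between gradients at x and at a random point in the ball."""
    x_pert = [xi + rng.uniform(-eps, eps) for xi in x]
    return cosine(input_gradient(w, b, x, y), input_gradient(w, b, x_pert, y))

# Hypothetical toy values, chosen only for illustration.
w, b = [1.0, -2.0], 0.0
x, y, eps = [0.5, 0.5], 1, 0.1
x_adv = fgsm(w, b, x, y, eps)
alignment = gradient_alignment(w, b, x, y, eps, random.Random(0))
```

For this linear model the input gradient always points along `w`, so the alignment equals 1 up to rounding and the single FGSM step is the best $\ell_\infty$ perturbation. A locally non-linear network can give much lower alignment, which is the regime the abstract associates with FGSM failure.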

### Dynamic memory to alleviate catastrophic forgetting in continuous learning settings

In medical imaging, technical progress or changes in diagnostic procedures lead to a continuous change in image appearance. Scanner manufacturer, reconstruction kernel, dose, other protocol specific settings or administering of contrast agents are examples that influence image content independent of the scanned biology. Such domain and task shifts limit the applicability of machine learning algorithms in the clinical routine by rendering models obsolete over time. Here, we address the problem of data shifts in a continuous learning scenario by adapting a model to unseen variations in the source domain while counteracting catastrophic forgetting effects. Our method uses a dynamic memory to facilitate rehearsal of a diverse training data subset to mitigate forgetting. We evaluated our approach on routine clinical CT data obtained with two different scanner protocols and synthetic classification tasks. Experiments show that dynamic memory counters catastrophic forgetting in a setting with multiple data shifts without the necessity for explicit knowledge about when these shifts occur.

### On Data Augmentation and Adversarial Risk: An Empirical Analysis

Data augmentation techniques have become standard practice in deep learning, as they have been shown to greatly improve the generalisation abilities of models. These techniques rely on different ideas such as invariance-preserving transformations (e.g., expert-defined augmentation), statistical heuristics (e.g., Mixup), and learning the data distribution (e.g., GANs). However, in adversarial settings it remains unclear under what conditions such data augmentation methods reduce or even worsen the misclassification risk. In this paper, we therefore analyse the effect of different data augmentation techniques on the adversarial risk by three measures: (a) the well-known risk under adversarial attacks, (b) a new measure of prediction-change stress based on the Laplacian operator, and (c) the influence of training examples on prediction. The results of our empirical analysis disprove the hypothesis that an improvement in the classification performance induced by a data augmentation is always accompanied by an improvement in the risk under adversarial attack. Further, our results reveal that the augmented data has more influence on the resulting models than the non-augmented data. Taken together, our results suggest that general-purpose data augmentations that do not take into account the characteristics of the data and the task must be applied with care.

### Impact of COVID-19 on Forecasting Stock Prices: An Integration of Stationary Wavelet Transform and Bidirectional Long Short-Term Memory

COVID-19 is an infectious disease that mostly affects the respiratory system. At the time this research was performed, there were more than 1.4 million cases of COVID-19, and one of the biggest anxieties is not just our health, but our livelihoods, too. In this research, the authors investigate the impact of COVID-19 on the global economy, more specifically, the impact of COVID-19 on the price movement of Crude Oil and three U.S. stock indexes: DJI, S&P 500 and NASDAQ Composite. The proposed system for predicting commodity and stock prices integrates the Stationary Wavelet Transform (SWT) and Bidirectional Long Short-Term Memory (BDLSTM) networks. First, SWT is used to decompose the data into approximation and detail coefficients. After decomposition, the Crude Oil price and stock market index data, along with COVID-19 confirmed cases, were used as input variables for future price movement forecasting. As a result, the proposed system BDLSTM+WT-ADA achieved satisfactory results in terms of five-day Crude Oil price forecast.

### Auxiliary Learning by Implicit Differentiation

Training with multiple auxiliary tasks is a common practice used in deep learning for improving the performance on the main task of interest. Two main challenges arise in this multi-task learning setting: (i) Designing useful auxiliary tasks; and (ii) Combining auxiliary tasks into a single coherent loss. We propose a novel framework, \textit{AuxiLearn}, that targets both challenges, based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn \textit{non-linear} interactions between auxiliary tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes. We find that AuxiLearn consistently improves accuracy compared with competing methods.

### Scaling Imitation Learning in Minecraft

Imitation learning is a powerful family of techniques for learning sensorimotor coordination in immersive environments. We apply imitation learning to attain state-of-the-art performance on hard exploration problems in the Minecraft environment. We report experiments that highlight the influence of network architecture, loss function, and data augmentation. An early version of our approach reached second place in the MineRL competition at NeurIPS 2019. Here we report stronger results that can be used as a starting point for future competition entries and related research. Our code is available at https://github.com/amiranas/minerl_imitation_learning.

### Splintering with distributions: A stochastic decoy scheme for private computation

Performing computations while maintaining privacy is an important problem in today's distributed machine learning solutions. Consider the following two setups between a client and a server, where in setup i) the client has a public data vector $\mathbf{x}$, the server has a large private database of data vectors $\mathcal{B}$ and the client wants to find the inner products $\langle \mathbf{x,y_k} \rangle, \forall \mathbf{y_k} \in \mathcal{B}$. The client does not want the server to learn $\mathbf{x}$ while the server does not want the client to learn the records in its database. This is in contrast to another setup ii) where the client would like to perform an operation solely on its data, such as computation of a matrix inverse on its data matrix $\mathbf{M}$, but would like to use the superior computing ability of the server to do so without having to leak $\mathbf{M}$ to the server. We present a stochastic scheme for splitting the client data into privatized shares that are transmitted to the server in such settings. The server performs the requested operations on these shares instead of on the raw client data at the server. The obtained intermediate results are sent back to the client where they are assembled by the client to obtain the final result.

### Weak error analysis for stochastic gradient descent optimization algorithms

Stochastic gradient descent (SGD) type optimization schemes are fundamental ingredients in a large number of machine learning based algorithms. In particular, SGD type optimization schemes are frequently employed in applications involving natural language processing, object and face recognition, fraud detection, computational advertisement, and numerical approximations of partial differential equations. In mathematical convergence results for SGD type optimization schemes there are usually two types of error criteria studied in the scientific literature, that is, the error in the strong sense and the error with respect to the objective function. In applications one is often not only interested in the size of the error with respect to the objective function but also in the size of the error with respect to a test function which is possibly different from the objective function. The analysis of the size of this error is the subject of this article. In particular, the main result of this article proves under suitable assumptions that the size of this error decays at the same speed as in the special case where the test function coincides with the objective function.

### Bridging the COVID-19 Data and the Epidemiological Model using Time Varying Parameter SIRD Model

This paper extends the canonical model of epidemiology, the SIRD model, to allow for time varying parameters for real-time measurement of the stance of the COVID-19 pandemic. Time variation in model parameters is captured using the generalized autoregressive score modelling structure designed for the typically daily count data related to the pandemic. The resulting specification permits a flexible yet parsimonious model structure with a very low computational cost. This is especially crucial at the onset of the pandemic when the data is scarce and the uncertainty is abundant. Full sample results show that countries including the US, Brazil and Russia are still not able to contain the pandemic, with the US having the worst performance. Furthermore, Iran and South Korea are likely to experience the second wave of the pandemic. A real-time exercise shows that the proposed structure delivers timely and precise information on the current stance of the pandemic ahead of the competitors that use a rolling window. This, in turn, transforms into accurate short-term predictions of the active cases. We further modify the model to allow for unreported cases. Results suggest that the effects of the presence of these cases on the estimation results diminish towards the end of the sample with the increasing number of tests.
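To make the role of a time varying parameter concrete, the following is a minimal sketch of a discrete-time SIRD recursion in which the infection rate changes from day to day. It is not the paper's specification: the infection-rate path `beta_path` is supplied by hand here, whereas the paper drives the parameters with a generalized autoregressive score update, and all numerical values are hypothetical.

```python
def simulate_sird(beta_path, gamma, mu, s0, i0):
    """Simulate daily S, I, R, D compartments with a time varying infection rate.

    beta_path: one infection rate per day (hypothetical input of this sketch).
    gamma, mu: constant daily recovery and death rates.
    """
    n = float(s0 + i0)  # total population, constant across compartments
    s, i, r, d = float(s0), float(i0), 0.0, 0.0
    history = [(s, i, r, d)]
    for beta in beta_path:
        new_infected = beta * s * i / n
        new_recovered = gamma * i
        new_dead = mu * i
        s -= new_infected
        i += new_infected - new_recovered - new_dead
        r += new_recovered
        d += new_dead
        history.append((s, i, r, d))
    return history

# Hypothetical scenario: the infection rate drops after 50 days.
beta_path = [0.3] * 50 + [0.1] * 50
history = simulate_sird(beta_path, gamma=0.05, mu=0.01, s0=9990, i0=10)
```

Each step only moves people between compartments, so the four counts always sum to the initial population, which is a convenient sanity check for any variant of the recursion.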

### SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows

Normalizing flows and variational autoencoders are powerful generative models that can represent complicated density functions. However, they both impose constraints on the models: Normalizing flows use bijective transformations to model densities whereas VAEs learn stochastic transformations that are non-invertible and thus typically do not provide tractable estimates of the marginal likelihood. In this paper, we introduce SurVAE Flows: A modular framework of composable transformations that encompasses VAEs and normalizing flows. SurVAE Flows bridge the gap between normalizing flows and VAEs with surjective transformations, wherein the transformations are deterministic in one direction -- thereby allowing exact likelihood computation, and stochastic in the reverse direction -- hence providing a lower bound on the corresponding likelihood. We show that several recently proposed methods, including dequantization and augmented normalizing flows, can be expressed as SurVAE Flows. Finally, we introduce common operations such as the max value, the absolute value, sorting and stochastic permutation as composable layers in SurVAE Flows.

### Probabilistic Prediction of Geomagnetic Storms and the K$_{\textrm{p}}$ Index

Geomagnetic activity is often described using summary indices to summarize the likelihood of space weather impacts, as well as when parameterizing space weather models. The geomagnetic index $\text{K}_\text{p}$ in particular, is widely used for these purposes. Current state-of-the-art forecast models provide deterministic $\text{K}_\text{p}$ predictions using a variety of methods -- including empirically-derived functions, physics-based models, and neural networks -- but do not provide uncertainty estimates associated with the forecast. This paper provides a sample methodology to generate a 3-hour-ahead $\text{K}_\text{p}$ prediction with uncertainty bounds and from this provide a probabilistic geomagnetic storm forecast. Specifically, we have used a two-layered architecture to separately predict storm ($\text{K}_\text{p}\geq 5^-$) and non-storm cases. As solar wind-driven models are limited in their ability to predict the onset of transient-driven activity we also introduce a model variant using solar X-ray flux to assess whether simple models including proxies for solar activity can improve the predictions of geomagnetic storm activity with lead times longer than the L1-to-Earth propagation time. By comparing the performance of these models we show that including operationally-available information about solar irradiance enhances the ability of predictive models to capture the onset of geomagnetic storms and that this can be achieved while also enabling probabilistic forecasts.

### Black-box Adversarial Example Generation with Normalizing Flows

Deep neural network classifiers suffer from adversarial vulnerability: well-crafted, unnoticeable changes to the input data can affect the classifier decision. In this regard, the study of powerful adversarial attacks can help shed light on sources of this malicious behavior. In this paper, we propose a novel black-box adversarial attack using normalizing flows. We show how an adversary can be found by searching over a pre-trained flow-based model base distribution. This way, we can generate adversaries that resemble the original data closely as the perturbations are in the shape of the data. We then demonstrate the competitive performance of the proposed approach against well-known black-box adversarial attack methods.

### Optimization from Structured Samples for Coverage Functions

We revisit the optimization from samples (OPS) model, which studies the problem of optimizing objective functions directly from the sample data. Previous results showed that we cannot obtain a constant approximation ratio for the maximum coverage problem using polynomially many independent samples of the form $\{S_i, f(S_i)\}_{i=1}^t$ (Balkanski et al., 2017), even if coverage functions are $(1 - \epsilon)$-PMAC learnable using these samples (Badanidiyuru et al., 2012), which means most of the function values can be approximately learned very well with high probability. In this work, to circumvent the impossibility result of OPS, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, where the data samples encode the structural information of the functions. We show that under three general assumptions on the sample distributions, we can design efficient OPSS algorithms that achieve a constant approximation for the maximum coverage problem. We further prove a constant lower bound under these assumptions, which is tight when not considering computational efficiency. Moreover, we also show that if we remove any one of the three assumptions, OPSS for the maximum coverage problem has no constant approximation.

### Semi-nonparametric Latent Class Choice Model with a Flexible Class Membership Component: A Mixture Model Approach

This study presents a semi-nonparametric Latent Class Choice Model (LCCM) with a flexible class membership component. The proposed model formulates the latent classes using mixture models as an alternative approach to the traditional random utility specification with the aim of comparing the two approaches on various measures including prediction accuracy and representation of heterogeneity in the choice process. Mixture models are parametric model-based clustering techniques that have been widely used in areas such as machine learning, data mining and pattern recognition for clustering and classification problems. An Expectation-Maximization (EM) algorithm is derived for the estimation of the proposed model. Using two different case studies on travel mode choice behavior, the proposed model is compared to traditional discrete choice models on the basis of parameter estimates' signs, value of time, statistical goodness-of-fit measures, and cross-validation tests. Results show that mixture models improve the overall performance of latent class choice models by providing better out-of-sample prediction accuracy in addition to better representations of heterogeneity without weakening the behavioral and economic interpretability of the choice models.

### Learning the Prediction Distribution for Semi-Supervised Learning with Normalising Flows

As data volumes continue to grow, the labelling process increasingly becomes a bottleneck, creating demand for methods that leverage information from unlabelled data. Impressive results have been achieved in semi-supervised learning (SSL) for image classification, nearing fully supervised performance, with only a fraction of the data labelled. In this work, we propose a probabilistically principled general approach to SSL that considers the distribution over label predictions, for labels of different complexity, from "one-hot" vectors to binary vectors and images. Our method regularises an underlying supervised model, using a normalising flow that learns the posterior distribution over predictions for labelled data, to serve as a prior over the predictions on unlabelled data. We demonstrate the general applicability of this approach on a range of computer vision tasks with varying output complexity: classification, attribute prediction and image-to-image translation.

### Certifying Decision Trees Against Evasion Attacks by Program Analysis

Machine learning has proved invaluable for a range of different tasks, yet it also proved vulnerable to evasion attacks, i.e., maliciously crafted perturbations of input data designed to force mispredictions. In this paper we propose a novel technique to verify the security of decision tree models against evasion attacks with respect to an expressive threat model, where the attacker can be represented by an arbitrary imperative program. Our approach exploits the interpretability property of decision trees to transform them into imperative programs, which are amenable for traditional program analysis techniques. By leveraging the abstract interpretation framework, we are able to soundly verify the security guarantees of decision tree models trained over publicly available datasets. Our experiments show that our technique is both precise and efficient, yielding only a minimal number of false positives and scaling up to cases which are intractable for a competitor approach.

### Parametric machines: a fresh approach to architecture search

Using tools from category theory, we provide a framework where artificial neural networks, and their architectures, can be formally described. We first define the notion of machine in a general categorical context, and show how simple machines can be combined into more complex ones. We explore finite- and infinite-depth machines, which generalize neural networks and neural ordinary differential equations. Borrowing ideas from functional analysis and kernel methods, we build complete, normed, infinite-dimensional spaces of machines, and discuss how to find optimal architectures and parameters -- within those spaces -- to solve a given computational problem. In our numerical experiments, these kernel-inspired networks can outperform classical neural networks when the training dataset is small.

### TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?

We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost performance of adaptive optimizers. Our method, TDprop, computes a per parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both $n$-step returns and TD($\lambda$). Our theoretical findings demonstrate that including this additional preconditioning information is, surprisingly, comparable to normal semi-gradient TD if the optimal learning rate is found for both via a hyperparameter search. In Deep RL experiments using Expected SARSA, TDprop meets or exceeds the performance of Adam in all tested games under near-optimal learning rates, but a well-tuned SGD can yield similar improvements -- matching our theory. Our findings suggest that Jacobi preconditioning may improve upon typical adaptive optimization methods in Deep RL, but despite incorporating additional information from the TD bootstrap term, may not always be better than SGD.

### Online Learning of Facility Locations

In this paper, we provide a rigorous theoretical investigation of an online learning version of the Facility Location problem which is motivated by emerging problems in real-world applications. In our formulation, we are given a set of sites and an online sequence of user requests. At each trial, the learner selects a subset of sites and then incurs a cost for each selected site and an additional cost which is the price of the user's connection to the nearest site in the selected subset. The problem may be solved by an application of the well-known Hedge algorithm. This would, however, require time and space exponential in the number of the given sites, which motivates our design of a novel quasi-linear time algorithm for this problem, with good theoretical guarantees on its performance.
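The baseline mentioned above, Hedge run with one expert per subset of sites, can be sketched as follows to show where the exponential cost comes from. This is a minimal illustration under assumptions of this sketch: the site positions, opening cost, learning rate `eta` and user requests are hypothetical, and the code implements only the standard Hedge baseline, not the paper's quasi-linear algorithm.

```python
import math
from itertools import combinations

def hedge(loss_rounds, eta):
    """Standard Hedge: exponential weights over a fixed set of experts."""
    weights = [1.0] * len(loss_rounds[0])
    expected_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of sampling an expert from the current distribution.
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights], expected_loss

# Hypothetical instance: three sites on a line, a fixed cost per opened site.
sites = [0.0, 5.0, 10.0]
open_cost = 0.1
# One expert per non-empty subset of sites: 2^n - 1 experts for n sites.
experts = [s for r in range(1, len(sites) + 1)
           for s in combinations(range(len(sites)), r)]

def trial_loss(subset, user):
    """Opening cost of the subset plus scaled distance to the nearest open site."""
    connection = min(abs(sites[i] - user) for i in subset) / 10.0
    return open_cost * len(subset) + connection

users = [0.5, 0.2, 0.1, 0.4]  # requests arriving near the first site
loss_rounds = [[trial_loss(s, u) for s in experts] for u in users]
final_probs, expected_loss = hedge(loss_rounds, eta=2.0)
best = experts[final_probs.index(max(final_probs))]
```

With three sites there are already seven experts, and the list doubles with every additional site, so both the weight vector and the per-trial update grow exponentially in the number of sites.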

### Complex Human Action Recognition in Live Videos Using Hybrid FR-DL Method

Automated human action recognition is one of the most attractive and practical research fields in computer vision, in spite of its high computational costs. In such systems, the human action labelling is based on the appearance and patterns of the motions in the video sequences; however, the conventional methodologies and classic neural networks cannot use temporal information for action recognition prediction in the upcoming frames in a video sequence. On the other hand, the computational cost of the preprocessing stage is high. In this paper, we address challenges of the preprocessing phase, by an automated selection of representative frames among the input sequences. Furthermore, we extract the key features of the representative frame rather than the entire features. We propose a hybrid technique using background subtraction and HOG, followed by application of a deep neural network and skeletal modelling method. The combination of a CNN and the LSTM recursive network is considered for feature selection and maintaining the previous information, and finally, a Softmax-KNN classifier is used for labelling human activities. We name our model as Feature Reduction & Deep Learning based action recognition method, or FR-DL in short. To evaluate the proposed method, we use the UCF dataset for the benchmarking which is widely-used among researchers in action recognition research. The dataset includes 101 complicated activities in the wild. Experimental results show a significant improvement in terms of accuracy and speed in comparison with six state-of-the-art articles.

### Run2Survive: A Decision-theoretic Approach to Algorithm Selection based on Survival Analysis

Algorithm selection (AS) deals with the automatic selection of an algorithm from a fixed set of candidate algorithms most suitable for a specific instance of an algorithmic problem class, where "suitability" often refers to an algorithm's runtime. Due to possibly extremely long runtimes of candidate algorithms, training data for algorithm selection models is usually generated under time constraints in the sense that not all algorithms are run to completion on all instances. Thus, training data usually comprises censored information, as the true runtime of algorithms timed out remains unknown. However, many standard AS approaches are not able to handle such information in a proper way. On the other side, survival analysis (SA) naturally supports censored data and offers appropriate ways to use such data for learning distributional models of algorithm runtime, as we demonstrate in this work. We leverage such models as a basis of a sophisticated decision-theoretic approach to algorithm selection, which we dub Run2Survive. Moreover, taking advantage of a framework of this kind, we advocate a risk-averse approach to algorithm selection, in which the avoidance of a timeout is given high priority. In an extensive experimental study with the standard benchmark ASlib, our approach is shown to be highly competitive and in many cases even superior to state-of-the-art AS approaches.
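How survival analysis uses timed-out runs instead of discarding them can be shown with the classical Kaplan-Meier estimator. This is a minimal sketch for illustration only: the abstract does not say which survival model Run2Survive uses, and the runtimes and the cutoff of 10 seconds below are hypothetical.

```python
def kaplan_meier(runtimes, finished):
    """Kaplan-Meier estimate of P(algorithm still running after time t).

    runtimes: observed times; for a timed-out run this is the cutoff.
    finished: True if the run completed, False if it was censored by the timeout.
    Returns a list of (t, survival) pairs at each distinct completion time.
    """
    data = sorted(zip(runtimes, finished))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Group all runs sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored runs leave the risk set without counting as completions.
        at_risk -= removed
    return curve

# Hypothetical data: two runs finish, two hit the 10 second cutoff.
runtimes = [2.0, 4.0, 10.0, 10.0]
finished = [True, True, False, False]
curve = kaplan_meier(runtimes, finished)
```

The two censored runs still count in the risk set at times 2 and 4, so the estimate says that half of the runs are expected to exceed 4 seconds, information that would be lost if timeouts were dropped or replaced by the cutoff value.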

### Faster Graph Embeddings via Coarsening

Graph embeddings are a ubiquitous tool for machine learning tasks, such as node classification and link prediction, on graph-structured data. However, computing the embeddings for large-scale graphs is prohibitively inefficient even if we are interested only in a small subset of relevant vertices. To address this, we present an efficient graph coarsening approach, based on Schur complements, for computing the embedding of the relevant vertices. We prove that these embeddings are preserved exactly by the Schur complement graph that is obtained via Gaussian elimination on the non-relevant vertices. As computing Schur complements is expensive, we give a nearly-linear time algorithm that generates a coarsened graph on the relevant vertices that provably matches the Schur complement in expectation in each iteration. Our experiments involving prediction tasks on graphs demonstrate that computing embeddings on the coarsened graph, rather than the entire graph, leads to significant time savings without sacrificing accuracy.
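The exact Schur complement that the coarsening algorithm approximates can be computed by eliminating the non-relevant vertices one at a time. The sketch below shows this exact but expensive computation on a dense graph Laplacian. It is not the paper's nearly-linear time algorithm, and it assumes a connected graph so that every pivot is positive.

```python
def schur_complement(laplacian, keep):
    """Eliminate every vertex not in `keep` from a graph Laplacian.

    laplacian: dense symmetric Laplacian as a list of lists.
    keep: indices of the relevant vertices, in the desired output order.
    """
    n = len(laplacian)
    L = [row[:] for row in laplacian]  # work on a copy
    active = list(range(n))
    for v in range(n):
        if v in keep:
            continue
        pivot = L[v][v]
        active.remove(v)
        # One step of Gaussian elimination: remove v, reweight its neighbours.
        for i in active:
            for j in active:
                L[i][j] -= L[i][v] * L[v][j] / pivot
    return [[L[i][j] for j in keep] for i in keep]

# Path graph 0 - 1 - 2 with unit weights; vertex 1 is not relevant.
path = [[1.0, -1.0, 0.0],
        [-1.0, 2.0, -1.0],
        [0.0, -1.0, 1.0]]
coarse = schur_complement(path, keep=[0, 2])
```

Eliminating the middle vertex of the path leaves a single edge of weight 0.5 between the two endpoints, and the result is again a Laplacian with zero row sums, which is what allows the coarsened object to be treated as a graph.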

### Online NEAT for Credit Evaluation -- a Dynamic Problem with Sequential Data

In this paper, we describe an application of Neuroevolution to a P2P lending problem in which a credit evaluation model is updated based on streaming data. We apply the algorithm Neuroevolution of Augmenting Topologies (NEAT), which has not been widely applied in the credit evaluation domain. In addition to comparing the methodology with other widely applied machine learning techniques, we develop and evaluate several enhancements to the algorithm which make it suitable for the particular aspects of online learning that are relevant in the problem. These include handling unbalanced streaming data, high computation costs, and maintaining model similarity over time, that is, training the stochastic learning algorithm with new data but minimizing model change except where there is a clear benefit for model performance.

### Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning

What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks? When the desired (test time) goal distribution is too distant to offer a useful learning signal, we argue that the agent should not pursue unobtainable goals. Instead, it should set its own intrinsic goals that maximize the entropy of the historical achieved goal distribution. We propose to optimize this objective by having the agent pursue past achieved goals in sparsely explored areas of the goal space, which focuses exploration on the frontier of the achievable goal set. We show that our strategy achieves an order of magnitude better sample efficiency than the prior state of the art on long-horizon multi-goal tasks including maze navigation and block stacking.

### Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting

Modeling complex spatial and temporal correlations in the correlated time series data is indispensable for understanding the traffic dynamics and predicting the future status of an evolving traffic system. Recent works focus on designing complicated graph neural network architectures to capture shared patterns with the help of pre-defined graphs. In this paper, we argue that learning node-specific patterns is essential for traffic forecasting while the pre-defined graph is avoidable. To this end, we propose two adaptive modules for enhancing Graph Convolutional Network (GCN) with new capabilities: 1) a Node Adaptive Parameter Learning (NAPL) module to capture node-specific patterns; 2) a Data Adaptive Graph Generation (DAGG) module to infer the inter-dependencies among different traffic series automatically. We further propose an Adaptive Graph Convolutional Recurrent Network (AGCRN) to capture fine-grained spatial and temporal correlations in traffic series automatically based on the two modules and recurrent networks. Our experiments on two real-world traffic datasets show AGCRN outperforms state-of-the-art by a significant margin without pre-defined graphs about spatial connections.

### Partially Conditioned Generative Adversarial Networks

Generative models are undoubtedly a hot topic in Artificial Intelligence, among which the most common type is Generative Adversarial Networks (GANs). These architectures let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset. With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset. From a practical standpoint, however, one might desire to generate data conditioned on partial information. That is, only a subset of the ancillary conditioning variables might be of interest when synthesising data. In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy to deal with the ensuing problems. Experiments illustrating the value of the proposed approach in digit and face image synthesis under partial conditioning information are presented, showing that the proposed method can effectively outperform the standard approach under these circumstances.

### Weak SINDy For Partial Differential Equations

We extend the WSINDy (Weak SINDy) method of sparse recovery introduced previously by the authors (arXiv:2005.04339) to the setting of partial differential equations (PDEs). As in the case of ODE discovery, the weak form replaces pointwise approximation of derivatives with local integrations against test functions and achieves effective machine-precision recovery of weights from noise-free data (i.e. below the tolerance of the simulation scheme) as well as natural robustness to noise without the use of noise filtering. The resulting WSINDy_PDE algorithm uses separable test functions implemented efficiently via convolutions for discovery of PDE models with computational complexity $O(NM)$ from data points with $M = N^{D+1}$ points, or $N$ points in each of $D+1$ dimensions. We demonstrate on several notoriously challenging PDEs the speed and accuracy with which WSINDy_PDE recovers the correct models from datasets with surprisingly large levels of noise (often much greater than 10%).

### Learning the Markov order of paths in a network

We study the problem of learning the Markov order in categorical sequences that represent paths in a network, i.e. sequences of variable lengths where transitions between states are constrained to a known graph. Such data pose challenges for standard Markov order detection methods and demand modelling techniques that explicitly account for the graph constraint. Adopting a multi-order modelling framework for paths, we develop a Bayesian learning technique that (i) more reliably detects the correct Markov order compared to a competing method based on the likelihood ratio test, (ii) requires considerably less data compared to methods using AIC or BIC, and (iii) is robust against partial knowledge of the underlying constraints. We further show that a recently published method that uses a likelihood ratio test has a tendency to overfit the true Markov order of paths, which is not the case for our Bayesian technique. Our method is important for data scientists analyzing patterns in categorical sequence data that are subject to (partially) known constraints, e.g. sequences with forbidden words, mobility trajectories and click stream data, or sequence data in bioinformatics. Addressing the key challenge of model selection, our work is further relevant for the growing body of research that emphasizes the need for higher-order models in network analysis.

### Counterfactual Data Augmentation using Locally Factored Dynamics

Many dynamic processes, including common scenarios in robotic control and reinforcement learning (RL), involve a set of interacting subprocesses. Though the subprocesses are not independent, their interactions are often sparse, and the dynamics at any given time step can often be decomposed into locally independent causal mechanisms. Such local causal structures can be leveraged to improve the sample efficiency of sequence prediction and off-policy reinforcement learning. We formalize this by introducing local causal models (LCMs), which are induced from a global causal model by conditioning on a subset of the state space. We propose an approach to inferring these structures given an object-oriented state representation, as well as a novel algorithm for model-free Counterfactual Data Augmentation (CoDA). CoDA uses local structures and an experience replay to generate counterfactual experiences that are causally valid in the global model. We find that CoDA significantly improves the performance of RL agents in locally factored tasks, including the batch-constrained and goal-conditioned settings.
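
A minimal sketch of the swapping idea behind counterfactual augmentation follows. It is a simplification under stated assumptions, not the CoDA algorithm itself: states are dictionaries keyed by object name, each transition carries a hypothetical `independent` annotation listing objects whose next state depended only on their own current state, and actions are omitted. The names `Transition` and `coda_swap` are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple

State = Dict[str, Tuple[float, ...]]

@dataclass
class Transition:
    state: State
    next_state: State
    # Hypothetical annotation: objects that evolved on their own in this step.
    independent: FrozenSet[str]

def coda_swap(t1: Transition, t2: Transition,
              objects: FrozenSet[str]) -> Optional[Transition]:
    """Replace `objects` in t1 by their values from t2.

    Returns None unless the objects evolved independently in both transitions.
    """
    if not (objects <= t1.independent and objects <= t2.independent):
        return None
    state = {k: (t2.state[k] if k in objects else v)
             for k, v in t1.state.items()}
    next_state = {k: (t2.next_state[k] if k in objects else v)
                  for k, v in t1.next_state.items()}
    return Transition(state, next_state, t1.independent)

both = frozenset({"agent", "ball"})
t1 = Transition({"agent": (0.0, 0.0), "ball": (5.0, 5.0)},
                {"agent": (1.0, 0.0), "ball": (5.0, 5.0)}, both)
t2 = Transition({"agent": (2.0, 2.0), "ball": (7.0, 1.0)},
                {"agent": (2.0, 3.0), "ball": (8.0, 1.0)}, both)
# In t3 the agent kicks the ball, so neither object evolved independently.
t3 = Transition({"agent": (4.0, 4.0), "ball": (4.0, 5.0)},
                {"agent": (4.0, 4.0), "ball": (4.0, 7.0)}, frozenset())

counterfactual = coda_swap(t1, t2, frozenset({"ball"}))
print(counterfactual)
print(coda_swap(t1, t3, frozenset({"ball"})))
```

The sketch assumes the swapped objects stay independent in the recombined state; a fuller treatment would re-check the local structure of the counterfactual sample before adding it to the replay buffer.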

### Multi-Objective DNN-based Precoder for MIMO Communications

This paper introduces a unified deep neural network (DNN)-based precoder for two-user multiple-input multiple-output (MIMO) networks with five objectives: data transmission, energy harvesting, simultaneous wireless information and power transfer, physical layer (PHY) security, and multicasting. First, a rotation-based precoding is developed to solve the above problems independently. Rotation-based precoding is a new precoding and power allocation scheme that outperforms existing solutions in PHY security and multicasting and is reliable in different antenna settings. Next, a DNN-based precoder is designed to unify the solution for all objectives. The proposed DNN concurrently learns the solutions given by conventional methods, i.e., analytical or rotation-based solutions. A binary vector is designed as an input feature to distinguish the objectives. Numerical results demonstrate that, compared to the conventional solutions, the proposed DNN-based precoder reduces on-the-fly computational complexity by more than an order of magnitude while reaching near-optimal performance (99.45\% of the averaged optimal solutions). The new precoder is also more robust to variations in the number of antennas at the receivers.

### Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks

We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .

### Meta-Learning for Variational Inference

Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and broad applicability. Crucial to the performance of VI is the selection of the associated divergence measure, as VI approximates the intractable distribution by minimizing this divergence. In this paper we propose a meta-learning algorithm to learn the divergence metric suited for the task of interest, automating the design of VI methods. In addition, we learn the initialization of the variational parameters without additional cost when our method is deployed in few-shot learning scenarios. We demonstrate that our approach outperforms standard VI on Gaussian mixture distribution approximation, Bayesian neural network regression, image generation with variational autoencoders and recommender systems with a partial variational autoencoder.

### Node Classification on Graphs with Few-Shot Novel Labels via Meta Transformed Network Embedding

We study the problem of node classification on graphs with few-shot novel labels, which has two distinctive properties: (1) novel labels emerge in the graph; (2) the novel labels have only a few representative nodes for training a classifier. The study of this problem is instructive and corresponds to many applications such as recommendations for newly formed groups with only a few users in online social networks. To cope with this problem, we propose a novel Meta Transformed Network Embedding framework (MetaTNE), which consists of three modules: (1) A \emph{structural module} provides each node a latent representation according to the graph structure. (2) A \emph{meta-learning module} captures the relationships between the graph structure and the node labels as prior knowledge in a meta-learning manner. Additionally, we introduce an \emph{embedding transformation function} that remedies the deficiency of the straightforward use of meta-learning. Inherently, the meta-learned prior knowledge can be used to facilitate the learning of few-shot novel labels. (3) An \emph{optimization module} employs a simple yet effective scheduling strategy to train the above two modules with a balance between graph structure learning and meta-learning. Experiments on four real-world datasets show that MetaTNE brings a huge improvement over the state-of-the-art methods.

### INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark, specifically designed to test agents' generalization ability. INT is based on a procedure for generating theorems and proofs; this procedure's knobs allow us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic to automated theorem proving. In addition, unlike prior benchmarks for learning-assisted theorem proving, INT provides a lightweight and user-friendly theorem proving environment with fast simulations, conducive to performing learning-based and search-based research. We introduce learning-based baselines and evaluate them across 6 dimensions of generalization with the benchmark. We then evaluate the same agents augmented with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help to prove new theorems.

### Preintegrated IMU Features For Efficient Deep Inertial Odometry

MEMS Inertial Measurement Units (IMUs) are inexpensive and effective sensors that provide proprioceptive motion measurements for many robots and consumer devices. However, their noise characteristics and manufacturing imperfections lead to complex ramifications in classical fusion pipelines. While deep learning models provide the required flexibility to model these complexities from data, they have higher computation and memory requirements, making them impractical choices for low-power and embedded applications. This paper attempts to address this conflict by proposing a computationally efficient inertial representation for deep inertial odometry. Replacing the raw IMU data in deep inertial models, preintegrated features improve the model's efficiency. The effectiveness of this method has been demonstrated for the task of pedestrian inertial odometry, and its efficiency has been shown through its embedded implementation on a microcontroller with restricted resources.
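
As a rough illustration of what preintegration computes, the sketch below accumulates a window of raw gyroscope and accelerometer samples into a single relative rotation, velocity change, and position change. It follows the standard textbook recursion and omits bias, noise, and gravity handling; the abstract does not specify the exact features used in the paper, so the function names and the feature layout here are assumptions.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ v equals cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(phi):
    """Rodrigues formula: rotation matrix for the rotation vector phi."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3) + hat(phi)  # first-order approximation near zero
    K = hat(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """Collapse a window of IMU samples into (delta_R, delta_v, delta_p).

    gyro and accel have shape (n, 3); dt is the sampling period in seconds.
    """
    delta_R = np.eye(3)
    delta_v = np.zeros(3)
    delta_p = np.zeros(3)
    for w, a in zip(gyro, accel):
        a_rot = delta_R @ a  # acceleration expressed in the window's first frame
        delta_p = delta_p + delta_v * dt + 0.5 * a_rot * dt**2
        delta_v = delta_v + a_rot * dt
        delta_R = delta_R @ so3_exp(w * dt)
    return delta_R, delta_v, delta_p

# One second of data at 100 Hz: constant yaw rate and constant forward acceleration.
n, dt = 100, 0.01
gyro = np.tile([0.0, 0.0, np.pi / 2], (n, 1))
accel = np.tile([1.0, 0.0, 0.0], (n, 1))
delta_R, delta_v, delta_p = preintegrate(gyro, accel, dt)
print(delta_R.round(3), delta_v.round(3), delta_p.round(3))
```

A downstream network would then consume one such triple per window rather than `n` raw samples, which is where the reduction in input size comes from.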

### Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift

A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested on data that are structurally different from the training set, either due to temporal correlations, particular end users, or other factors. In this work, we consider the setting where test examples are not drawn from the training distribution. Prior work has approached this problem by attempting to be robust to all possible test time distributions, which may degrade average performance, or by "peeking" at the test examples during training, which is not always feasible. In contrast, we propose to learn models that are adaptable, such that they can adapt to distribution shift at test time using a batch of unlabeled test data points. We acquire such models by learning to adapt to training batches sampled according to different sub-distributions, which simulate structural distribution shifts that may occur at test time. We introduce the problem of adaptive risk minimization (ARM), a formalization of this setting that lends itself to meta-learning methods. Compared to a variety of methods under the paradigms of empirical risk minimization and robust optimization, our approach provides substantial empirical gains on image classification problems in the presence of distribution shift.

### Meta-Learning Symmetries by Reparameterization

Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know the symmetries of the task a priori and can manually construct an architecture with the corresponding equivariances. Our goal is a general approach for learning equivariances from data, without needing prior knowledge of a task's symmetries or custom task-specific architectures. We present a method for learning and encoding equivariances into networks by learning corresponding parameter sharing patterns from data. Our method can provably encode equivariance-inducing parameter sharing for any finite group of symmetry transformations, and we find experimentally that it can automatically learn a variety of equivariances from symmetries in data. We provide our experiment code and pre-trained models at https://github.com/AllanYangZhou/metalearning-symmetries.
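
To make the link between parameter sharing and equivariance concrete, the sketch below hand-builds a sharing pattern for cyclic shifts and checks that the resulting linear layer commutes with shifting the input. It is an illustration under assumptions, not the paper's method: the factorization `W = reshape(U @ v)` and the name `cyclic_sharing_matrix` are choices made here, and the sharing matrix is fixed by hand rather than learned from data.

```python
import numpy as np

def cyclic_sharing_matrix(n):
    """Binary matrix U of shape (n*n, n) tying W[i, j] to v[(i - j) % n]."""
    U = np.zeros((n * n, n))
    for i in range(n):
        for j in range(n):
            U[i * n + j, (i - j) % n] = 1.0
    return U

n = 6
rng = np.random.default_rng(0)
v = rng.standard_normal(n)   # the n free parameters (a filter)
x = rng.standard_normal(n)   # an input signal

U = cyclic_sharing_matrix(n)
W_shared = (U @ v).reshape(n, n)    # circulant weight matrix built from v
W_free = rng.standard_normal((n, n))  # unconstrained layer for comparison

def is_shift_equivariant(W, x):
    """Check whether shifting the input then applying W equals applying W then shifting."""
    return np.allclose(W @ np.roll(x, 1), np.roll(W @ x, 1))

print("shared weights equivariant:", is_shift_equivariant(W_shared, x))
print("free weights equivariant:", is_shift_equivariant(W_free, x))
```

The shared layer has `n` free parameters instead of `n * n`, which is the parameter saving the abstract refers to; a different sharing matrix would encode a different symmetry group.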