### A Survival Mediation Model with Bayesian Model Averaging

Determining the extent to which a patient is benefiting from cancer therapy is challenging. Criteria for quantifying the extent of "tumor response" observed within a few cycles of treatment have been established for various types of solid as well as hematologic malignancies. These measures comprise the primary endpoints of phase II trials. Regulatory approvals of new cancer therapies, however, are usually contingent upon the demonstration of superior overall survival with randomized evidence acquired with a phase III trial comparing the novel therapy to an appropriate standard of care treatment. With nearly two thirds of phase III oncology trials failing to achieve statistically significant results, researchers continue to refine and propose new surrogate endpoints. This article presents a Bayesian framework for studying relationships among treatment, patient subgroups, tumor response and survival. Combining classical components of mediation analysis with Bayesian model averaging (BMA), the methodology is robust to model mis-specification among various possible relationships among the observable entities. Posterior inference is demonstrated via application to a randomized controlled phase III trial in metastatic colorectal cancer. Moreover, the article details posterior predictive distributions of survival and statistical metrics for quantifying the extent of direct and indirect, or tumor response mediated, treatment effects.

### Network Analysis of Orchestral Concert Programming

Orchestral concert programming is a challenging, yet critical task for expanding audience engagement and is usually driven by qualitative heuristics and common musical practices. Quantitative analysis of orchestral programming has been limited, but has become more possible as many orchestras archive their performance history online. The contribution of this work is to use statistical network models to quantitatively explore orchestral concert programming, focusing on which factors determine if two composers are programmed together in the same concert by the Boston Symphony Orchestra. We find that the type of composition is the most important covariate in determining which composers are performed together and the additive and multiplicative effects are logical from an orchestral programming perspective. These results suggest that a network analysis is a promising approach for the analysis of concert programming, with several directions for future extensions.

### Characteristic and Necessary Minutiae in Fingerprints

Fingerprints feature a ridge pattern with moderately varying ridge frequency (RF), following an orientation field (OF), which usually features some singularities. Additionally, at some points, called minutiae, ridge lines end or fork and this point pattern is usually used for fingerprint identification and authentication. Whenever the OF features divergent ridge lines (e.g. near singularities), a nearly constant RF necessitates the generation of more ridge lines, originating at minutiae. We call these the necessary minutiae. It turns out that fingerprints feature additional minutiae which occur at rather arbitrary locations. We call these the random minutiae or, since they may convey fingerprint individuality beyond the OF, the characteristic minutiae. In consequence, the minutiae point pattern is the superposition of two stochastic point processes: a Strauss point process (whose intensity is given by the divergence field) with an additional hard core and a homogeneous Poisson point process modelling the necessary and the characteristic minutiae, respectively. We perform Bayesian inference using an MCMC-based minutiae separating algorithm (MiSeal). In simulations, it provides good mixing and good estimation of underlying parameters. In application to fingerprints, we can separate the two minutiae processes and verify by example of two different prints with similar OF that characteristic minutiae convey fingerprint individuality.

### A semi-analytical solution to the maximum likelihood fit of Poisson data to a linear model using the Cash statistic

[ABRIDGED] The Cash statistic, also known as the C stat, is commonly used for the analysis of low-count Poisson data, including data with null counts for certain values of the independent variable. The use of this statistic is especially attractive for low-count data that cannot be combined, or re-binned, without loss of resolution. This paper presents a new maximum-likelihood solution for the best-fit parameters of a linear model using the Poisson-based Cash statistic. The solution presented in this paper provides a new and simple method to measure the best-fit parameters of a linear model for any Poisson-based data, including data with null counts. In particular, the method enforces the requirement that the best-fit linear model be non-negative throughout the support of the independent variable. The method is summarized in a simple algorithm to fit Poisson counting data of any size and counting rate with a linear model, by-passing entirely the use of the traditional $\chi^2$ statistic.
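As a minimal sketch of the quantity being minimized (not the paper's semi-analytical solution), the Cash statistic for observed counts $n_i$ and model values $m_i = a + b x_i$ is commonly written as $C = 2 \sum_i \left[ m_i - n_i + n_i \ln(n_i / m_i) \right]$, where the logarithmic term is taken to be zero for null counts. The function names, data, and parameter values below are hypothetical and chosen only for illustration.

```python
import math


def cash_statistic(counts, model):
    """Cash statistic C = 2 * sum(m - n + n * ln(n / m)) for Poisson counts."""
    total = 0.0
    for n, m in zip(counts, model):
        if m <= 0:
            # Assumption of this sketch: the model is strictly positive
            # at every point where it is evaluated.
            raise ValueError("model values must be positive")
        total += m - n
        if n > 0:  # the n * ln(n / m) term vanishes for null counts
            total += n * math.log(n / m)
    return 2.0 * total


def linear_model(xs, a, b):
    """Evaluate the linear model m(x) = a + b * x at each x."""
    return [a + b * x for x in xs]


# Hypothetical low-count data, including a null count at x = 0.
xs = [0.0, 1.0, 2.0, 3.0]
counts = [0, 2, 1, 4]

c_sloped = cash_statistic(counts, linear_model(xs, 0.5, 1.0))
c_flat = cash_statistic(counts, linear_model(xs, 1.75, 0.0))
print(c_sloped, c_flat)  # the smaller value indicates the better fit
```

Comparing the statistic across candidate parameter pairs in this way is what a numerical minimizer would do; the abstract describes a method that instead obtains the best-fit parameters semi-analytically.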

### Computationally Efficient Deep Bayesian Unit-Level Modeling of Survey Data under Informative Sampling for Small Area Estimation

The topic of deep learning has seen a surge of interest in recent years both within and outside of the field of Statistics. Deep models leverage both nonlinearity and interaction effects to provide superior predictions in many cases when compared to linear or generalized linear models. However, one of the main challenges with deep modeling approaches is quantification of uncertainty. The use of random weight models, such as the popularized "Extreme Learning Machine," offers a potential solution in this regard. In addition to uncertainty quantification, these models are extremely computationally efficient as they do not require optimization through stochastic gradient descent, which is what is typically done for deep learning. We show how the use of random weights in a deep model can fit into a likelihood based framework to allow for uncertainty quantification of the model parameters and any desired estimates. Furthermore, we show how this approach can be used to account for informative sampling of survey data through the use of a pseudo-likelihood. We illustrate the effectiveness of this methodology through simulation and with a real survey data application involving American National Election Studies data.

### Discovering causal factors of drought in Ethiopia

Drought is a costly natural hazard, many aspects of which remain poorly understood. It has many contributory factors, driving its onset, duration, and severity, including land surface, anthropogenic activities, and, most importantly, meteorological anomalies. Prediction plays a crucial role in drought preparedness and risk mitigation. However, this is a challenging task at socio-economically critical lead times (1-2 years), because meteorological anomalies operate at a wide range of temporal and spatial scales. Among them, past studies have shown a correlation between the Sea Surface Temperature (SST) anomaly and the amount of precipitation in various locations in Africa. In its Eastern part, the cooling phase of El Niño-Southern Oscillation (ENSO) and SST anomaly in the Indian Ocean are correlated with the lack of rainfall. Given the intrinsic shortcomings of correlation coefficients, we investigate, in terms of causality, the association between SST modes of variability and the monthly fraction of grid points in Ethiopia that are in drought conditions. Using the empirical extreme quantiles of the precipitation distribution as a proxy for drought, we show that the level of the second SST mode of variability in the prior year influences the occurrence of drought in Ethiopia. The causal link between these two variables has a negative coefficient that verifies the conclusion of past studies that rainfall deficiency in the Horn of Africa is associated with ENSO's cooling phase.

### Assessing the contagiousness of mass shootings with nonparametric Hawkes processes

Gun violence and mass shootings are high-profile epidemiological issues facing the United States with questions regarding their contagiousness gaining prevalence in news media. Through the use of nonparametric Hawkes processes, we examine the evidence for the existence of contagiousness within a catalog of mass shootings and highlight the broader benefits of using such nonparametric point process models in modeling the occurrence of such events.
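To make the notion of contagiousness concrete: in a Hawkes process, each past event temporarily raises the rate of future events, so the conditional intensity is $\lambda(t) = \mu + \sum_{t_i < t} g(t - t_i)$. The abstract describes a nonparametric estimate of the triggering function $g$; the sketch below instead assumes, purely for illustration, a parametric exponential kernel $g(u) = \alpha \beta e^{-\beta u}$. The event times and parameter values are hypothetical.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity with an (assumed) exponential triggering kernel.

    mu    : background rate of events
    alpha : expected number of events triggered by one event
    beta  : decay rate of the excitation
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - ti))
        for ti in event_times
        if ti < t  # only strictly earlier events contribute
    )
    return mu + excitation


# Hypothetical event times (e.g. in months).
events = [1.0, 1.5, 4.0]

before_any_event = hawkes_intensity(0.5, events, mu=0.2, alpha=0.5, beta=1.0)
just_after_cluster = hawkes_intensity(1.6, events, mu=0.2, alpha=0.5, beta=1.0)
print(before_any_event, just_after_cluster)
```

With this kernel, `alpha` integrates to the expected number of triggered events per event, so a value near zero corresponds to little evidence of contagion under the assumed model.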

### Statistical Inference for High-Dimensional Vector Autoregression with Measurement Error

High-dimensional vector autoregression with measurement error is frequently encountered in a large variety of scientific and business applications. In this article, we study statistical inference of the transition matrix under this model. While there has been a large body of literature studying sparse estimation of the transition matrix, there is a paucity of inference solutions, especially in the high-dimensional scenario. We develop inferential procedures for both the global and simultaneous testing of the transition matrix. We first develop a new sparse expectation-maximization algorithm to estimate the model parameters, and carefully characterize their estimation precisions. We then construct a Gaussian matrix, after proper bias and variance corrections, from which we derive the test statistics. Finally, we develop the testing procedures and establish their asymptotic guarantees. We study the finite-sample performance of our tests through intensive simulations, and illustrate with a brain connectivity analysis example.

### Ridge Regression Revisited: Debiasing, Thresholding and Bootstrap

In the high-dimensional setting, classical ridge regression cannot perform model selection on its own and introduces large bias, which makes it an unsatisfactory tool for analyzing high-dimensional linear models. In this paper, we propose the debiased and threshold ridge regression method, which addresses these drawbacks. In addition, focusing on statistical inference and prediction for linear combinations of parameters, we provide a normal approximation theorem for the estimator and propose two bootstrap algorithms which provide joint confidence regions and prediction regions for the linear combinations. In the statistical inference part, apart from the dimension of parameters, we allow the number of linear combinations to grow as the sample size increases. Numerical experiments show that the proposed regression method is robust to fluctuations in the ridge parameter and reduces estimation errors compared to classical and threshold ridge regression methods. Apart from theoretical interest, the proposed algorithms can be applied to disciplines such as econometrics and biology.

### Automatic Forecasting using Gaussian Processes

Automatic forecasting is the task of receiving a time series and returning a forecast for the next time steps without any human intervention. We propose an approach for automatic forecasting based on Gaussian Processes (GPs). So far, the main limits of GPs on this task have been the lack of a criterion for the selection of the kernel and the long times required for training different competing kernels. We design a fixed additive kernel, which contains the components needed to model most time series. During training the unnecessary components are made irrelevant by automatic relevance determination. We assign priors to each hyperparameter. We design the priors by analyzing a separate set of time series through a hierarchical GP. The resulting model performs very well on different types of time series, being competitive with or outperforming the state-of-the-art approaches. Thanks to the priors, we reliably estimate the parameters with a single restart; this speedup makes the model efficient to train and suitable for processing a large number of time series.

### On mixtures of extremal copulas and attainability of concordance signatures

The concordance signature of a random vector or its distribution is defined to be the set of concordance probabilities for margins of all orders. It is proved that the concordance signature of a copula is always equal to the concordance signature of some unique mixture of the extremal copulas. Applications of the result include a characterization of the set of Kendall rank correlation matrices as the cut polytope as well as a method for determining whether sets of concordance probabilities are attainable. The elliptical copulas are shown to yield a strict subset of the attainable concordance signatures as well as a strict subset of the attainable Kendall rank correlation matrices; the Student $t$ copula is shown to converge to a mixture of extremal copulas sharing its concordance signature with all elliptical distributions that have the same correlation matrix. A method of estimating an attainable concordance signature from data is derived and shown to correspond to using standard estimates of bivariate and multivariate Kendall's tau in the absence of ties.
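For the bivariate case, the link between a concordance probability and Kendall's tau can be illustrated with a short sketch. In the absence of ties, the empirical concordance probability $\hat{\kappa}$ is the fraction of concordant pairs of observations, and Kendall's tau is $\hat{\tau} = 2\hat{\kappa} - 1$. The function names and data are hypothetical, and the sketch covers only pairs of variables, not the higher-order margins of the full signature.

```python
from itertools import combinations


def concordance_probability(xs, ys):
    """Fraction of observation pairs ordered the same way in xs and ys.

    Assumes there are no ties in either sample.
    """
    pairs = list(combinations(range(len(xs)), 2))
    concordant = sum(
        1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0
    )
    return concordant / len(pairs)


def kendall_tau(xs, ys):
    """Kendall's tau from the concordance probability (no ties)."""
    return 2.0 * concordance_probability(xs, ys) - 1.0


# Hypothetical tie-free sample: two of the three pairs are concordant.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0]
kappa = concordance_probability(xs, ys)
tau = kendall_tau(xs, ys)
print(kappa, tau)
```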

### Multidimensional Scaling, Sammon Mapping, and Isomap: Tutorial and Survey

Multidimensional Scaling (MDS) is one of the first fundamental manifold learning methods. It can be categorized into several methods, i.e., classical MDS, kernel classical MDS, metric MDS, and non-metric MDS. Sammon mapping and Isomap can be considered as special cases of metric MDS and kernel classical MDS, respectively. In this tutorial and survey paper, we review the theory of MDS, Sammon mapping, and Isomap in detail. We explain all the mentioned categories of MDS. Then, Sammon mapping, Isomap, and kernel Isomap are explained. Out-of-sample embedding for MDS and Isomap using eigenfunctions and kernel mapping are introduced. Then, Nystrom approximation and its use in landmark MDS and landmark Isomap are introduced for big data embedding. We also provide some simulations for illustrating the embedding by these methods.

### Indoor Environment Data Time-Series Reconstruction Using Autoencoder Neural Networks

As the number of installed meters in buildings increases, there is a growing number of data time-series that could be used to develop data-driven models to support and optimize building operation. However, building data sets are often characterized by errors and missing values, which recent research considers to be among the main limiting factors on the performance of the proposed models. Motivated by the need to address the problem of missing data in building operation, this work presents a data-driven approach to fill these gaps. In this study, three different autoencoder neural networks are trained to reconstruct missing indoor environment data time-series in a data set collected in an office building in Aachen, Germany. The models are applicable for different time-series obtained from room automation, such as indoor air temperature, relative humidity and CO$_2$ data streams. The results prove that the proposed methods outperform classic numerical approaches and they result in reconstructing the corresponding variables with average RMSEs of 0.42 °C, 1.30% and 78.41 ppm, respectively.

### Random autoregressive models: A structured overview

Models characterized by autoregressive structure and random coefficients are powerful tools for the analysis of high-frequency, high-dimensional and volatile time series. The available literature on such models is broad, but also sectorial, overlapping, and confusing. Most models focus on one property of the data, while much can be gained by combining the strength of various models and their sources of heterogeneity. We present a structured overview of the literature on autoregressive models with random coefficients. We describe hierarchy and analogies among models, and for each we systematically list properties, estimation methods, tests, software packages and typical applications.

### Mean-Variance Analysis in Bayesian Optimization under Uncertainty

We consider active learning (AL) in an uncertain environment in which the trade-off between multiple risk measures needs to be considered. As an AL problem in such an uncertain environment, we study the Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on the BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by a Gaussian Process (GP), we derive the bounds of the two risk measures and propose an AL algorithm for each of the above three problems based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.

### Integration of AI and mechanistic modeling in generative adversarial networks for stochastic inverse problems

The problem of finding distributions of input parameters for deterministic mechanistic models to match distributions of model outputs to stochastic observations, i.e., the "Stochastic Inverse Problem" (SIP), encompasses a range of common tasks across a variety of scientific disciplines. Here, we demonstrate that SIP could be reformulated as a constrained optimization problem and adapted for applications in intervention studies to simultaneously infer model input parameters for two sets of observations, under control conditions and under an intervention. In the constrained optimization problem, the solution of SIP is enforced to accommodate the prior knowledge on the model input parameters and to produce outputs consistent with given observations by minimizing the divergence between the inferred distribution of input parameters and the prior. Unlike in standard SIP, the prior incorporates not only knowledge about model input parameters for objects in each set, but also information on the joint distribution or the deterministic map between the model input parameters in two sets of observations. To solve standard and intervention SIP, we employed conditional generative adversarial networks (GANs) and designed novel GANs that incorporate multiple generators and discriminators and have structures that reflect the underlying constrained optimization problems. This reformulation allows us to build computationally scalable solutions to tackle complex model input parameter inference scenarios, which appear routinely in physics, biophysics, economics and other areas, and which currently could not be handled with existing methods.

### Graph representation forecasting of patient's medical conditions: towards a digital twin

Objective: Modern medicine needs to shift from a wait-and-react, curative discipline to a preventative, interdisciplinary science aiming at providing personalised, systemic and precise treatment plans to patients. The aim of this work is to present how the integration of machine learning approaches with mechanistic computational modelling could yield a reliable infrastructure to run probabilistic simulations where the entire organism is considered as a whole. Methods: We propose a general framework that composes advanced AI approaches and integrates mathematical modelling in order to provide a panoramic view over current and future physiological conditions. The proposed architecture is based on a graph neural network (GNN) forecasting clinically relevant endpoints (such as blood pressure) and a generative adversarial network (GAN) providing a proof of concept of transcriptomic integrability. Results: We show the results of the investigation of pathological effects of overexpression of ACE2 across different signalling pathways in multiple tissues on cardiovascular functions. We provide a proof of concept of integrating a large set of composable clinical models using molecular data to drive local and global clinical parameters and derive future trajectories representing the evolution of the physiological state of the patient. Significance: We argue that the graph representation of a computational patient has potential to solve important technological challenges in integrating multiscale computational modelling with AI. We believe that this work represents a step forward towards a healthcare digital twin.

### Complex-Valued vs. Real-Valued Neural Networks for Classification Perspectives: An Example on Non-Circular Data

The contributions of this paper are twofold. First, we show the potential interest of Complex-Valued Neural Network (CVNN) on classification tasks for complex-valued datasets. To highlight this assertion, we investigate an example of complex-valued data in which the real and imaginary parts are statistically dependent through the property of non-circularity. In this context, the performance of fully connected feed-forward CVNNs is compared against a real-valued equivalent model. The results show that CVNN performs better for a wide variety of architectures and data structures. CVNN accuracy presents a statistically higher mean and median and lower variance than Real-Valued Neural Network (RVNN). Furthermore, if no regularization technique is used, CVNN exhibits lower overfitting. The second contribution is the release of a Python library (Barrachina 2019) using Tensorflow as back-end that enables the implementation and training of CVNNs in the hopes of motivating further research on this area.

### Functional data analysis: An application to COVID-19 data in the United States

The COVID-19 pandemic so far has caused huge negative impacts on different areas all over the world, and the United States (US) is one of the most affected countries. In this paper, we use methods from the functional data analysis to look into the COVID-19 data in the US. We explore the modes of variation of the data through a functional principal component analysis (FPCA), and study the canonical correlation between confirmed and death cases. In addition, we run a cluster analysis at the state level so as to investigate the relation between geographical locations and the clustering structure. Lastly, we consider a functional time series model fitted to the cumulative confirmed cases in the US, and make forecasts based on the dynamic FPCA. Both point and interval forecasts are provided, and the methods for assessing the accuracy of the forecasts are also included.

### A Principle of Least Action for the Training of Neural Networks

Neural networks have been achieving high generalization performance on many tasks despite being highly over-parameterized. Since classical statistical learning theory struggles to explain this behavior, much effort has recently been focused on uncovering the mechanisms behind it, in the hope of developing a more adequate theoretical framework and having a better control over the trained models. In this work, we adopt an alternate perspective, viewing the neural network as a dynamical system displacing input particles over time. We conduct a series of experiments and, by analyzing the network's behavior through its displacements, we show the presence of a low kinetic energy displacement bias in the transport map of the network, and link this bias with generalization performance. From this observation, we reformulate the learning problem as follows: finding neural networks which solve the task while transporting the data as efficiently as possible. This offers a novel formulation of the learning problem which allows us to provide regularity results for the solution network, based on Optimal Transport theory. From a practical viewpoint, this allows us to propose a new learning algorithm, which automatically adapts to the complexity of the given task, and leads to networks with a high generalization ability even in low data regimes.

### Bayesian Matrix Completion for Hypothesis Testing

The United States Environmental Protection Agency (EPA) screens thousands of chemicals primarily to differentiate those that are active vs inactive for different types of biological endpoints. However, it is not feasible to test all possible combinations of chemicals, assay endpoints, and concentrations, resulting in a majority of missing combinations. Our goal is to derive posterior probabilities of activity for each chemical by assay endpoint combination. Therefore, we are faced with a task of matrix completion in the context of hypothesis testing for sparse functional data. We propose a Bayesian hierarchical framework, which borrows information across different chemicals and assay endpoints. Our model predicts bioactivity profiles of whether the dose-response curve is constant or not, using low-dimensional latent attributes of chemicals and of assay endpoints. This framework facilitates out-of-sample prediction of bioactivity potential for new chemicals not yet tested, while capturing heteroscedastic residuals. We demonstrate the performance via extensive simulation studies and an application to data from the EPA's ToxCast/Tox21 program. Our approach allows more realistic and stable estimation of potential toxicity as shown for two disease outcomes: neurodevelopmental disorders and obesity.

### A Time To Event Framework For Multi-touch Attribution

Multi-touch attribution (MTA) estimates the relative contributions of the multiple ads a user may see prior to any observed conversions. Increasingly, advertisers also want to base budget and bidding decisions on these attributions, spending more on ads that drive more conversions. We describe two requirements for an MTA system to be suitable for this application: First, it must be able to handle continuously updated and incomplete data. Second, it must be sufficiently flexible to capture that an ad's effect will change over time. We describe an MTA system, consisting of a model for user conversion behavior and a credit assignment algorithm, that satisfies these requirements. Our model for user conversion behavior treats conversions as occurrences in an inhomogeneous Poisson process, while our attribution algorithm is based on iteratively removing the last ad in the path.

### Transfer Learning in Deep Reinforcement Learning: A Survey

This paper surveys the field of transfer learning in the problem setting of Reinforcement Learning (RL). RL has been the key solution to sequential decision-making problems. Along with the fast advance of RL in various domains, including robotics and game-playing, transfer learning arises as an important technique to assist RL by leveraging and transferring external expertise to boost the learning process. In this survey, we review the central issues of transfer learning in the RL domain, providing a systematic categorization of its state-of-the-art techniques. We analyze their goals, methodologies, applications, and the RL frameworks under which these transfer learning techniques would be approachable. We discuss the relationship between transfer learning and other relevant topics from an RL perspective and also explore the potential challenges as well as future development directions for transfer learning in RL.

### Captum: A unified and generic model interpretability library for PyTorch

In this paper we introduce a novel, unified, open-source model interpretability library for PyTorch [12]. The library contains generic implementations of a number of gradient and perturbation-based attribution algorithms, also known as feature, neuron and layer importance algorithms, as well as a set of evaluation metrics for these algorithms. It can be used for both classification and non-classification models including graph-structured models built on Neural Networks (NN). In this paper we give a high-level overview of supported attribution algorithms and show how to perform memory-efficient and scalable computations. We emphasize that the three main characteristics of the library are multimodality, extensibility and ease of use. Multimodality supports different input modalities such as image, text, audio or video. Extensibility allows adding new algorithms and features. The library is also designed for easy understanding and use. In addition, we introduce an interactive visualization tool called Captum Insights that is built on top of the Captum library and allows sample-based model debugging and visualization using feature importance metrics.

### Comparison Lift: Bandit-based Experimentation System for Online Advertising

Comparison Lift is an experimentation-as-a-service (EaaS) application for testing online advertising audiences and creatives at JD.com. Unlike many other EaaS tools that focus primarily on fixed sample A/B testing, Comparison Lift deploys a custom bandit-based experimentation algorithm. The advantages of the bandit-based approach are two-fold. First, it aligns the randomization induced in the test with the advertiser's goals from testing. Second, by adapting experimental design to information acquired during the test, it reduces substantially the cost of experimentation to the advertiser. Since launch in May 2019, Comparison Lift has been utilized in over 1,500 experiments. We estimate that utilization of the product has helped increase click-through rates of participating advertising campaigns by 46% on average. We estimate that the adaptive design in the product has generated 27% more clicks on average during testing compared to a fixed sample A/B design. Both suggest significant value generation and cost savings to advertisers from the product.

### Matrix Profile XXII: Exact Discovery of Time Series Motifs under DTW

Over the last decade, time series motif discovery has emerged as a useful primitive for many downstream analytical tasks, including clustering, classification, rule discovery, segmentation, and summarization. In parallel, there has been an increased understanding that Dynamic Time Warping (DTW) is the best time series similarity measure in a host of settings. Surprisingly however, there has been virtually no work on using DTW to discover motifs. The most obvious explanation of this is the fact that both motif discovery and the use of DTW can be computationally challenging, and the current best mechanisms to address their lethargy are mutually incompatible. In this work, we present the first scalable exact method to discover time series motifs under DTW. Our method automatically performs the best trade-off between time-to-compute and tightness-of-lower-bounds for a novel hierarchy of lower bounds representation we introduce. We show that under realistic settings, our algorithm can admissibly prune up to 99.99% of the DTW computations.
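For readers unfamiliar with DTW, the following is a minimal sketch of the textbook dynamic-programming recurrence for the unconstrained DTW distance between two sequences. It is not the paper's method: it uses no warping window, no lower bounds, and no pruning, and therefore shows the quadratic cost per comparison that makes exact motif discovery under DTW expensive. The function name and example sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance with squared pointwise cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched b[j-1]'s successor path
                cost[i][j - 1],      # b[j-1] matched again to a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m] ** 0.5


# Two hypothetical sequences with the same shape, shifted by one step.
a = [0.0, 0.0, 1.0, 0.0]
b = [0.0, 1.0, 0.0, 0.0]

euclidean = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
warped = dtw_distance(a, b)
print(euclidean, warped)
```

Because warping can absorb the shift, the DTW distance here is no larger than the Euclidean distance, which is the property that makes DTW attractive as a similarity measure.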

### Improving Delay Based Reservoir Computing via Eigenvalue Analysis

We analyze the reservoir computation capability of the Lang-Kobayashi system by comparing the numerically computed recall capabilities and the eigenvalue spectrum. We show that these two quantities are deeply connected, and thus the reservoir computing performance is predictable by analyzing the eigenvalue spectrum. Our results suggest that any dynamical system used as a reservoir can be analyzed in this way as long as the reservoir perturbations are sufficiently small. Optimal performance is found for a system with the eigenvalues having real parts close to zero and off-resonant imaginary parts.

### Type-augmented Relation Prediction in Knowledge Graphs

Knowledge graphs (KGs) are of great importance to many real world applications, but they generally suffer from incomplete information in the form of missing relations between entities. Knowledge graph completion (also known as relation prediction) is the task of inferring missing facts given existing ones. Most of the existing work is proposed by maximizing the likelihood of observed instance-level triples. Not much attention, however, is paid to the ontological information, such as type information of entities and relations. In this work, we propose a type-augmented relation prediction (TaRP) method, where we apply both the type information and instance-level information for relation prediction. In particular, type information and instance-level information are encoded as prior probabilities and likelihoods of relations respectively, and are combined by following Bayes' rule. Our proposed TaRP method achieves significantly better performance than state-of-the-art methods on three benchmark datasets: FB15K, YAGO26K-906, and DB111K-174. In addition, we show that TaRP achieves significantly improved data efficiency. More importantly, the type information extracted from a specific dataset can generalize well to other datasets through the proposed TaRP model.
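The combination step described above, with type information acting as a prior and instance-level evidence acting as a likelihood, can be sketched with Bayes' rule: the posterior over candidate relations is proportional to prior times likelihood. The relation names and all probabilities below are hypothetical, and the sketch does not reproduce how TaRP actually derives either quantity.

```python
def combine_by_bayes(prior, likelihood):
    """Posterior over relations: normalized product of prior and likelihood."""
    unnormalized = {r: prior[r] * likelihood[r] for r in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("prior and likelihood have no common support")
    return {r: v / total for r, v in unnormalized.items()}


# Hypothetical type-based prior for a (person, ?, place) query.
prior = {"born_in": 0.6, "works_for": 0.3, "capital_of": 0.1}
# Hypothetical instance-level likelihoods, e.g. from an embedding model.
likelihood = {"born_in": 0.2, "works_for": 0.5, "capital_of": 0.3}

posterior = combine_by_bayes(prior, likelihood)
best_relation = max(posterior, key=posterior.get)
print(posterior, best_relation)
```

In this toy case the likelihood overturns the prior's favourite, which shows how the two sources of information trade off.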

### An analysis of deep neural networks for predicting trends in time series data

The emergence of small and portable smart sensors has opened up new opportunities for many applications, including automated factories, smart cities and connected healthcare, broadly referred to as the "Internet of Things (IoT)". These devices produce time series data. While deep neural networks (DNNs) have been widely applied to computer vision, natural language processing and speech recognition, there is limited research on DNNs for time series prediction. Machine learning (ML) applications for time series prediction have traditionally involved predicting the next value in the series. However, in certain applications, segmenting the time series into a sequence of trends and predicting the next trend is preferred. Recently, a hybrid DNN algorithm, TreNet, was proposed for trend prediction. TreNet, which combines an LSTM that takes in trendlines and a CNN that takes in point data, was shown to have superior performance for trend prediction when compared to other approaches. However, the study used a standard cross-validation method, which does not take into account the sequential nature of time series. In this work, we reproduce TreNet using a walk-forward validation method, which is more appropriate for time series data. We compare the performance of the hybrid TreNet algorithm, on the same three data sets used in the original study, to vanilla MLP, LSTM, and CNN models that take in point data, and also to traditional ML algorithms, i.e., Random Forest (RF), Support Vector Regression and Gradient Boosting Machine. Our results differ significantly from those reported for the original TreNet. In general, TreNet still performs better than the vanilla DNN models, but not as substantially as reported for the original TreNet. Furthermore, our results show that the RF algorithm performed substantially better than TreNet on the methane data set.
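
To make the contrast with standard cross-validation concrete, here is a minimal sketch of walk-forward validation. It assumes an expanding training window and equally sized test blocks; the abstract does not state the exact scheme, and the function name and parameters are hypothetical.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs in temporal order.

    Every test block lies strictly after its training window, so the model is
    never evaluated on data that precedes what it was trained on.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test


splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
```

With ten samples, the three folds train on the first 4, 6, and 8 points and test on the two points that follow each window.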

### A Network-Based High-Level Data Classification Algorithm Using Betweenness Centrality

Data classification is a major machine learning paradigm, which has been widely applied to solve a large number of real-world problems. Traditional data classification techniques consider only physical features (e.g., distance, similarity, or distribution) of the input data. For this reason, those are called *low-level* classification. On the other hand, the human (animal) brain performs both low and high orders of learning and it has a facility in identifying patterns according to the semantic meaning of the input data. Data classification that considers not only physical attributes but also the pattern formation is referred to as *high-level* classification. Several high-level classification techniques have been developed, which make use of complex networks to characterize data patterns and have obtained promising results. In this paper, we propose a pure network-based high-level classification technique that uses the betweenness centrality measure. We test this model on nine different real datasets and compare it with nine other traditional and well-known classification models. The results show competent classification performance.

### Analysis of Generalizability of Deep Neural Networks Based on the Complexity of Decision Boundary

For supervised learning models, the analysis of generalization ability (generalizability) is vital because the generalizability expresses how well a model will perform on unseen data. Traditional generalization methods, such as the VC dimension, do not apply to deep neural network (DNN) models. Thus, new theories to explain the generalizability of DNNs are required. In this study, we hypothesize that the DNN with a simpler decision boundary has better generalizability by the law of parsimony (Occam's Razor). We create the decision boundary complexity (DBC) score to define and measure the complexity of the decision boundary of DNNs. The idea of the DBC score is to generate data points (called adversarial examples) on or near the decision boundary. Our new approach then measures the complexity of the boundary using the entropy of eigenvalues of these data. The method works equally well for high-dimensional data. We use training data and the trained model to compute the DBC score. The ground truth for a model's generalizability is its test accuracy. Experiments based on the DBC score have verified our hypothesis. The DBC is shown to provide an effective method to measure the complexity of a decision boundary and gives a quantitative measure of the generalizability of DNNs.
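
The abstract does not give the DBC formula, so the following is only a sketch of one plausible reading of "entropy of eigenvalues": the Shannon entropy of a spectrum normalized to sum to one. It assumes the eigenvalues (for example, of a covariance matrix of the boundary points) have already been computed elsewhere; the function name is hypothetical.

```python
import math


def eigenvalue_entropy(eigenvalues):
    """Shannon entropy (in nats) of eigenvalues normalized to sum to one.

    Assumption: non-positive eigenvalues are dropped before normalizing.
    """
    positive = [value for value in eigenvalues if value > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one positive eigenvalue is required")
    probabilities = [value / total for value in positive]
    return -sum(p * math.log(p) for p in probabilities)


concentrated = eigenvalue_entropy([5.0, 0.0, 0.0])   # one dominant direction
spread = eigenvalue_entropy([1.0, 1.0, 1.0, 1.0])    # evenly spread spectrum
```

A spectrum dominated by a single eigenvalue gives entropy near zero, while an evenly spread spectrum over four directions gives `log(4)`.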

### Deep Collective Learning: Learning Optimal Inputs and Weights Jointly in Deep Neural Networks

It is well observed that in the deep learning and computer vision literature, visual data are always represented in a manually designed coding scheme (e.g., RGB images are represented as integers ranging from 0 to 255 for each channel) when they are input to an end-to-end deep neural network (DNN) for any learning task. We boldly question whether the manually designed inputs are good for DNN training for different tasks and study whether the input to a DNN can be optimally learned end-to-end together with learning the weights of the DNN. In this paper, we propose the paradigm of *deep collective learning*, which aims to learn the weights of DNNs and the inputs to DNNs simultaneously for given tasks. We note that collective learning has been implicitly but widely used in natural language processing while it has almost never been studied in computer vision. Consequently, we propose the lookup vision networks (Lookup-VNets) as a solution to deep collective learning in computer vision. This is achieved by associating each color in each channel with a vector in lookup tables. As learning inputs in computer vision has almost never been studied in the existing literature, we explore several aspects of this question through a variety of experiments on image classification tasks. Experimental results on four benchmark datasets, i.e., CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet (ILSVRC2012), have shown several surprising characteristics of Lookup-VNets and have demonstrated the advantages and promise of Lookup-VNets and deep collective learning.
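
The lookup idea can be sketched as follows: one table per channel, with one vector per intensity value 0 to 255. The embedding size, the random initialization, and the choice to concatenate the per-channel vectors are assumptions made for illustration; in an actual Lookup-VNet the table entries would be trained jointly with the network weights, which is omitted here.

```python
import random


def make_lookup_tables(num_channels=3, num_values=256, embed_dim=4, seed=0):
    """Create one table per channel mapping each intensity to a vector."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(num_values)]
        for _ in range(num_channels)
    ]


def embed_pixel(pixel, tables):
    """Replace each channel's integer intensity with its table vector.

    Assumption: per-channel vectors are concatenated into one pixel feature.
    """
    features = []
    for channel, value in enumerate(pixel):
        features.extend(tables[channel][value])
    return features


tables = make_lookup_tables()
pixel_features = embed_pixel((255, 0, 128), tables)
```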

### Distilled One-Shot Federated Learning

Current federated learning algorithms take tens of communication rounds transmitting unwieldy model weights under ideal circumstances and hundreds when data is poorly distributed. Inspired by recent work on dataset distillation and distributed one-shot learning, we propose Distilled One-Shot Federated Learning, which reduces the number of communication rounds required to train a performant model to only one. Each client distills their private dataset and sends the synthetic data (e.g. images or sentences) to the server. The distilled data look like noise and become useless after model fitting. We empirically show that, in only one round of communication, our method can achieve 96% test accuracy on federated MNIST with LeNet (centralized 99%), 81% on federated IMDB with a customized CNN (centralized 86%), and 84% on federated TREC-6 with a Bi-LSTM (centralized 89%). Using only a few rounds, DOSFL can match the centralized baseline on all three tasks. By evading the need for model-wise updates (i.e., weights, gradients, loss, etc.), the total communication cost of DOSFL is reduced by over an order of magnitude. We believe that DOSFL represents a new direction orthogonal to previous work, towards weight-less and gradient-less federated learning.

### Location-based Behavioral Authentication Using GPS Distance Coherence

Most current user authentication systems are based on PIN codes, passwords, or biometric traits, which can have limitations in usage and security. Lifestyle authentication has become a new research approach. A promising idea for it is to use the location history since it is relatively unique. Even when people are living in the same area or have occasional travel, it does not vary from day to day. For Global Positioning System (GPS) data, previous work used the longitude, the latitude, and the timestamp as the features for classification. In this paper, we investigate a new approach utilizing the distance coherence, which can be extracted from the GPS data itself without requiring other information. We applied three ensemble classification algorithms (RandomForest, ExtraTrees, and Bagging), and the experimental results showed that the approach can achieve accuracies of 99.42%, 99.12%, and 99.25%, respectively.

### Discond-VAE: Disentangling Continuous Factors from the Discrete

We propose a variant of VAE capable of disentangling both variations within each class and variations shared across all classes. To represent these generative factors of data, we introduce two sets of continuous latent variables, a private variable and a public variable. Our proposed framework models the private variable as a mixture of Gaussians and the public variable as a Gaussian. Each mode of the private variable is responsible for a class of the discrete variable. Most previous attempts to integrate discrete generative factors into disentanglement assume statistical independence between the continuous and discrete variables. However, this assumption does not hold in general. Our proposed model, which we call Discond-VAE, DISentangles the class-dependent CONtinuous factors from the Discrete factors by introducing the private variables. The experiments show that Discond-VAE can discover the private and public factors from data qualitatively and quantitatively.

### GeneraLight: Improving Environment Generalization of Traffic Signal Control via Meta Reinforcement Learning

The heavy traffic congestion problem has always been a concern for modern cities. To alleviate traffic congestion, researchers have used reinforcement learning (RL) to develop better traffic signal control (TSC) algorithms in recent years. However, most RL models are trained and tested in the same traffic flow environment, which results in a serious overfitting problem. Since the traffic flow environment in the real world keeps varying, these models can hardly be applied due to the lack of generalization ability. Besides, the limited amount of accessible traffic flow data brings extra difficulty in testing the generalization ability of the models. In this paper, we design a novel traffic flow generator based on a Wasserstein generative adversarial network to generate sufficiently diverse, high-quality traffic flows and use them to build proper training and testing environments. Then we propose a meta-RL TSC framework, GeneraLight, to improve the generalization ability of TSC models. GeneraLight boosts the generalization performance by combining the ideas of flow clustering and model-agnostic meta-learning. We conduct extensive experiments on multiple real-world datasets to show the superior performance of GeneraLight in generalizing to different traffic flows.

### Certifying Confidence via Randomized Smoothing

Randomized smoothing has been shown to provide good certified-robustness guarantees for high-dimensional classification problems. It uses the probabilities of predicting the top two most-likely classes around an input point under a smoothing distribution to generate a certified radius for a classifier's prediction. However, most smoothing methods do not give us any information about the \emph{confidence} with which the underlying classifier (e.g., deep neural network) makes a prediction. In this work, we propose a method to generate certified radii for the prediction confidence of the smoothed classifier. We consider two notions for quantifying confidence: average prediction score of a class and the margin by which the average prediction score of one class exceeds that of another. We modify the Neyman-Pearson lemma (a key theorem in randomized smoothing) to design a procedure for computing the certified radius where the confidence is guaranteed to stay above a certain threshold. Our experimental results on CIFAR-10 and ImageNet datasets show that using information about the distribution of the confidence scores allows us to achieve a significantly better certified radius than ignoring it. Thus, we demonstrate that extra information about the base classifier at the input point can help improve certified guarantees for the smoothed classifier.

### Spectral Flow on the Manifold of SPD Matrices for Multimodal Data Processing

In this paper, we consider data acquired by multimodal sensors capturing complementary aspects and features of a measured phenomenon. We focus on a scenario in which the measurements share mutual sources of variability but might also be contaminated by other measurement-specific sources such as interferences or noise. Our approach combines manifold learning, which is a class of nonlinear data-driven dimension reduction methods, with the well-known Riemannian geometry of symmetric and positive-definite (SPD) matrices. Manifold learning typically includes the spectral analysis of a kernel built from the measurements. Here, we take a different approach, utilizing the Riemannian geometry of the kernels. In particular, we study the way the spectrum of the kernels changes along geodesic paths on the manifold of SPD matrices. We show that this change enables us, in a purely unsupervised manner, to derive a compact, yet informative, description of the relations between the measurements, in terms of their underlying components. Based on this result, we present new algorithms for extracting the common latent components and for identifying common and measurement-specific components.

### FLAME: Differentially Private Federated Learning in the Shuffle Model

Differentially private federated learning has been intensively studied. Current works are mainly based on the *curator model* or the *local model* of differential privacy. However, both of them have pros and cons. The curator model allows greater accuracy but requires a trusted analyzer. In the local model, where users randomize local data before sending them to the analyzer, a trusted analyzer is not required, but the accuracy is limited. In this work, by leveraging the *privacy amplification* effect in the recently proposed shuffle model of differential privacy, we achieve the best of both worlds, i.e., accuracy in the curator model and strong privacy without relying on any trusted party. We first propose an FL framework in the shuffle model and a simple protocol (SS-Simple) extended from existing work. We find that SS-Simple provides only an insufficient privacy amplification effect in FL since the dimension of the model parameter is quite large. To solve this challenge, we propose an enhanced protocol (SS-Double) to increase the privacy amplification effect by subsampling. Furthermore, for boosting the utility when the model size is greater than the user population, we propose an advanced protocol (SS-Topk) with gradient sparsification techniques. We also provide theoretical analysis and numerical evaluations of the privacy amplification of the proposed protocols. Experiments on real-world datasets validate that SS-Topk improves the testing accuracy by 60.7% over the local model based FL. We highlight the observation that SS-Topk can even improve accuracy by 33.94% over the curator model based FL without any trusted party. Compared with non-private FL, our protocol SS-Topk loses only 1.48% accuracy under $(4.696, 10^{-5})$-DP.

### Layer-stacked Attention for Heterogeneous Network Embedding

The heterogeneous network is a robust data abstraction that can model entities of different types interacting in various ways. Such heterogeneity brings rich semantic information but presents nontrivial challenges in aggregating the heterogeneous relationships between objects - especially those of higher-order indirect relations. Recent graph neural network approaches for representation learning on heterogeneous networks typically employ the attention mechanism, which is often only optimized for predictions based on direct links. Furthermore, even though most deep learning methods can aggregate higher-order information by building deeper models, such a scheme can diminish the degree of interpretability. To overcome these challenges, we explore an architecture - Layer-stacked ATTention Embedding (LATTE) - that automatically decomposes higher-order meta relations at each layer to extract the relevant heterogeneous neighborhood structures for each node. Additionally, by successively stacking layer representations, the learned node embedding offers a more interpretable aggregation scheme for nodes of different types at different neighborhood ranges. We conducted experiments on several benchmark heterogeneous network datasets. In both transductive and inductive node classification tasks, LATTE can achieve state-of-the-art performance compared to existing approaches, all while offering a lightweight model. With extensive experimental analyses and visualizations, the framework can demonstrate the ability to extract informative insights on heterogeneous networks.

### Stochastic Optimization using Polynomial Chaos Expansions

Polynomial chaos based methods enable the efficient computation of output variability in the presence of input uncertainty in complex models. Consequently, they have been used extensively for propagating uncertainty through a wide variety of physical systems. These methods have also been employed to build surrogate models for accelerating inverse uncertainty quantification (inferring model parameters from data) and to construct transport maps. In this work, we explore the use of polynomial chaos based approaches for optimizing functions in the presence of uncertainty. These methods enable the fast propagation of uncertainty through smooth systems. If the dimensionality of the random parameters is low, these methods provide orders of magnitude acceleration over Monte Carlo sampling. We construct a generalized polynomial chaos based methodology for optimizing smooth functions in the presence of random parameters that are drawn from *known* distributions. By expanding the optimization variables using orthogonal polynomials, the stochastic optimization problem reduces to a deterministic one that provides estimates for all moments of the output distribution. Thus, this approach enables one to avoid computationally expensive random sampling based approaches such as Monte Carlo and Quasi-Monte Carlo. In this work, we develop the overall framework, derive error bounds, construct the framework for the inclusion of constraints, analyze various properties of the approach, and demonstrate the proposed technique on illustrative examples.
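
As a small illustration of why an orthogonal-polynomial expansion yields moments without sampling, the sketch below reads the mean and variance directly off the expansion coefficients. It assumes a single standard normal input and the probabilists' Hermite basis, for which $E[He_j He_k] = k!\,\delta_{jk}$; the function name and the test function are chosen for illustration and are not from the paper.

```python
import math


def pce_mean_and_variance(coefficients):
    """Mean and variance of f(xi) = sum_k c_k * He_k(xi) with xi ~ N(0, 1).

    He_k are probabilists' Hermite polynomials, so E[He_j * He_k] is k! when
    j == k and zero otherwise. The mean is c_0 and the variance is the sum of
    c_k^2 * k! over k >= 1.
    """
    mean = coefficients[0]
    variance = sum(
        c * c * math.factorial(k) for k, c in enumerate(coefficients) if k >= 1
    )
    return mean, variance


# f(xi) = xi^2 + 2*xi + 3 = 4*He_0 + 2*He_1 + 1*He_2, because He_2(xi) = xi^2 - 1.
mean, variance = pce_mean_and_variance([4.0, 2.0, 1.0])
```

For this quadratic the result can be checked by hand: the mean is $E[\xi^2] + 3 = 4$ and the variance is $\mathrm{Var}(\xi^2) + 4\,\mathrm{Var}(\xi) = 2 + 4 = 6$.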

### Distributional Generalization: A New Kind of Generalization

We introduce a new notion of generalization, Distributional Generalization, which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain. This example is a specific instance of our much more general conjectures which apply even on distributions where the Bayes risk is zero. Our conjectures characterize the form of distributional generalization that can be expected, in terms of problem parameters (model architecture, training procedure, number of samples, data distribution). We verify the quantitative predictions of these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. These empirical observations are independently interesting, and form a more fine-grained characterization of interpolating classifiers beyond just their test error.

### An early prediction of covid-19 associated hospitalization surge using deep learning approach

The global pandemic caused by COVID-19 affects our lives in all aspects. As of September 11, more than 28 million people have tested positive for COVID-19 infection, and more than 911,000 people have lost their lives in this virus battle. Some patients cannot receive appropriate medical treatment due to limited hospital capacity and a shortage of ICU beds. An estimate of future hospitalization is critical so that medical resources can be allocated as needed. In this study, we propose to use four recurrent neural networks to infer the change in hospitalization for the following week compared with the current week. Results show that a sequence-to-sequence model with attention achieves a high accuracy of 0.938 and an AUC of 0.850 in hospitalization prediction. Our work has the potential to predict the hospitalization need and send a warning to medical providers and other stakeholders when a resurgence begins.

### An Extension of Fano's Inequality for Characterizing Model Susceptibility to Membership Inference Attacks

Deep neural networks have been shown to be vulnerable to membership inference attacks wherein the attacker aims to detect whether specific input data were used to train the model. These attacks can potentially leak private or proprietary data. We present a new extension of Fano's inequality and employ it to theoretically establish that the probability of success for a membership inference attack on a deep neural network can be bounded using the mutual information between its inputs and its activations. This enables the use of mutual information to measure the susceptibility of a DNN model to membership inference attacks. In our empirical evaluation, we show that the correlation between the mutual information and the susceptibility of the DNN model to membership inference attacks is 0.966, 0.996, and 0.955 for CIFAR-10, SVHN and GTSRB models, respectively.
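
For reference, the classical form of Fano's inequality is given below; the extension proposed in the paper is not stated in the abstract and is not reproduced here. The notation is an assumption made for this note: $X$ takes values in a finite set $\mathcal{X}$, $\hat{X}$ is any estimate of $X$ computed from $Y$, $P_e = \Pr(\hat{X} \neq X)$, and $H_b$ is the binary entropy function.

$$
H(X \mid Y) \le H_b(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr),
\qquad\text{hence}\qquad
P_e \ge \frac{H(X) - I(X;Y) - \log 2}{\log |\mathcal{X}|}.
$$

The second form follows from $H(X \mid Y) = H(X) - I(X;Y)$ together with $H_b(P_e) \le \log 2$ and $\log(|\mathcal{X}| - 1) \le \log |\mathcal{X}|$. It shows the general pattern by which a mutual information term controls an error probability.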

### Few-Shot Unsupervised Continual Learning through Meta-Examples

In real-world applications, data differ from those commonly used for neural network training: they are usually few, unbalanced, and unlabeled, and may arrive as a stream. Hence many existing deep learning solutions suffer from a limited range of applications, in particular in the case of online streaming data that evolve over time. To narrow this gap, in this work we introduce a novel and complex setting involving unsupervised meta-continual learning with unbalanced tasks. These tasks are built through a clustering procedure applied to a fitted embedding space. We exploit a meta-learning scheme that simultaneously alleviates catastrophic forgetting and favors generalization to new tasks, even out-of-distribution ones. Moreover, to encourage feature reuse during the meta-optimization, we exploit a single inner loop taking advantage of an aggregated representation achieved through the use of a self-attention mechanism. Experimental results on few-shot learning benchmarks show competitive performance even compared to the supervised case. Additionally, we empirically observe that in an unsupervised scenario, the small tasks and the variability in the cluster pooling play a crucial role in the generalization capability of the network. Further, on complex datasets, using more clusters than the true number of classes leads to higher results, even compared to the ones obtained with full supervision, suggesting that a predefined partitioning into classes can miss relevant structural information.

### Finding Effective Security Strategies through Reinforcement Learning and Self-Play

We present a method to automatically find security strategies for the use case of intrusion prevention. Following this method, we model the interaction between an attacker and a defender as a Markov game and let attack and defense strategies evolve through reinforcement learning and self-play without human intervention. Using a simple infrastructure configuration, we demonstrate that effective security strategies can emerge from self-play. This shows that self-play, which has been applied in other domains with great success, can be effective in the context of network security. Inspection of the converged policies shows that the emerged policies reflect common-sense knowledge and are similar to strategies of humans. Moreover, we address known challenges of reinforcement learning in this domain and present an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Through evaluations we show that our method is superior to two baseline methods but that policy convergence in self-play remains a challenge.

### Online Algorithms for Estimating Change Rates of Web Pages

To provide quick and accurate search results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler for tracking changes across various web pages. Ideally, the crawler would update the local snapshot as soon as a page changed on the web. However, finite bandwidth availability and server restrictions mean that there is a bound on how frequently the different pages can be crawled. This then brings forth the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency being within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All these schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, while the third is an extension of the second and involves an additional momentum term. For all of these schemes, we prove convergence and also provide their convergence rates. As far as we know, the results concerning the third estimator are quite novel. Specifically, this is the first convergence type result for a stochastic approximation algorithm with momentum. Finally, we provide some numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with the existing ones (e.g., MLE).
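
To show what estimating a change rate from binary "changed or not" observations can look like, here is a minimal sketch. It assumes page changes follow a Poisson process with rate $\lambda$ and that crawls happen at a fixed interval $\tau$, so each crawl sees a change with probability $1 - e^{-\lambda\tau}$. The plug-in estimator below inverts the observed fraction of changed crawls; it is in the spirit of a law-of-large-numbers scheme but is not claimed to be any of the paper's three estimators. All names are hypothetical.

```python
import math
import random


def estimate_change_rate(changed_flags, crawl_interval):
    """Plug-in estimate of a Poisson change rate from binary crawl outcomes."""
    if not changed_flags:
        raise ValueError("at least one crawl observation is required")
    fraction_changed = sum(changed_flags) / len(changed_flags)
    if fraction_changed >= 1.0:
        raise ValueError("every crawl saw a change; the rate is not identifiable")
    return -math.log(1.0 - fraction_changed) / crawl_interval


def simulate_flags(true_rate, crawl_interval, n_crawls, seed=0):
    """Simulate whether the page changed between consecutive crawls."""
    rng = random.Random(seed)
    p_change = 1.0 - math.exp(-true_rate * crawl_interval)
    return [rng.random() < p_change for _ in range(n_crawls)]


flags = simulate_flags(true_rate=0.5, crawl_interval=1.0, n_crawls=20000)
estimate = estimate_change_rate(flags, crawl_interval=1.0)
```

An online variant would update the running fraction after each crawl instead of storing all flags.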

### Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data

We propose a Byzantine-robust variance-reduced stochastic gradient descent (SGD) method to solve the distributed finite-sum minimization problem when the data on the workers are not independent and identically distributed (i.i.d.). During the learning process, an unknown number of Byzantine workers may send malicious messages to the master node, leading to considerable learning error. Most Byzantine-robust methods address this issue by using robust aggregation rules to aggregate the received messages, but rely on the assumption that all the regular workers have i.i.d. data, which is not the case in many federated learning applications. In light of the significance of reducing stochastic gradient noise for mitigating the effect of Byzantine attacks, we use a resampling strategy to reduce the impact of both inner variation (which describes the sample heterogeneity on every regular worker) and outer variation (which describes the sample heterogeneity among the regular workers), along with a stochastic average gradient algorithm (SAGA) to fully eliminate the inner variation. The variance-reduced messages are then aggregated with a robust geometric median operator. Under certain conditions, we prove that the proposed method reaches a neighborhood of the optimal solution with a linear convergence rate, and the learning error is much smaller than those given by the state-of-the-art methods in the non-i.i.d. setting. Numerical experiments corroborate the theoretical results and show satisfactory performance of the proposed method.
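
The geometric median aggregation mentioned above can be approximated with Weiszfeld's iteration. The sketch below is a generic implementation, not the paper's code; the iteration count, the small `eps` guard against division by zero, and the toy messages are assumptions.

```python
import math


def geometric_median(points, iterations=500, eps=1e-12):
    """Approximate the geometric median with Weiszfeld's algorithm.

    Each step re-weights the points by the inverse of their distance to the
    current estimate, starting from the coordinate-wise mean.
    """
    dim = len(points[0])
    estimate = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iterations):
        weights = [1.0 / max(math.dist(p, estimate), eps) for p in points]
        total = sum(weights)
        estimate = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        ]
    return estimate


# Four regular messages near the unit square and one Byzantine outlier.
messages = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
median = geometric_median(messages)
mean = [sum(m[i] for m in messages) / len(messages) for i in range(2)]
```

In this toy case the mean is dragged to roughly (20.4, 20.4) by the single outlier, while the geometric median stays inside the unit square.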

### Holistic Filter Pruning for Efficient Deep Neural Networks

Deep neural networks (DNNs) are usually over-parameterized to increase the likelihood of getting adequate initial weights by random initialization. Consequently, trained DNNs have many redundancies which can be pruned from the model to reduce complexity and improve the ability to generalize. Structural sparsity, as achieved by filter pruning, directly reduces the tensor sizes of weights and activations and is thus particularly effective for reducing complexity. We propose "Holistic Filter Pruning" (HFP), a novel approach for common DNN training that is easy to implement and makes it possible to specify accurate pruning rates for the number of both parameters and multiplications. After each forward pass, the current model complexity is calculated and compared to the desired target size. By gradient descent, a global solution can be found that allocates the pruning budget over the individual layers such that the desired target size is fulfilled. In various experiments, we give insights into the training and achieve state-of-the-art performance on CIFAR-10 and ImageNet (HFP prunes 60% of the multiplications of ResNet-50 on ImageNet with no significant loss in accuracy). We believe our simple and powerful pruning approach constitutes a valuable contribution for users of DNNs in low-cost applications.

### Algorithms and Complexity for Variants of Covariates Fine Balance

We study several variants of the covariates fine balance problem, generalizing some of these problems and introducing a number of others. We present a comprehensive complexity study of these problems, providing for each either a polynomial time algorithm or a proof of NP-hardness. The polynomial time algorithms described are mostly combinatorial and rely on network flow techniques. In addition, we present several fixed-parameter tractable results for problems where the number of covariates and the number of levels of each covariate are treated as parameters.

### Multi-objective dynamic programming with limited precision

This paper addresses the problem of approximating the set of all solutions for Multi-objective Markov Decision Processes. We show that in the vast majority of interesting cases, the number of solutions is exponential or even infinite. In order to overcome this difficulty we propose to approximate the set of all solutions by means of a limited precision approach based on White's multi-objective value-iteration dynamic programming algorithm. We prove that the number of calculated solutions is tractable and show experimentally that the solutions obtained are a good approximation of the true Pareto front.
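
The effect of limited precision on the size of a solution set can be illustrated with a small sketch. It covers only a pruning step (round each value vector, then keep the non-dominated ones, assuming maximization); it is not White's value-iteration algorithm itself, and the function name, rounding rule, and numbers are assumptions.

```python
def pareto_front_limited_precision(vectors, decimals):
    """Round value vectors to a fixed precision and keep non-dominated ones.

    Rounding merges vectors that differ only below the chosen precision, which
    bounds how many distinct vectors can survive. Maximization is assumed.
    """
    rounded = {tuple(round(x, decimals) for x in v) for v in vectors}

    def dominates(a, b):
        return a != b and all(x >= y for x, y in zip(a, b))

    return sorted(
        v for v in rounded if not any(dominates(u, v) for u in rounded)
    )


candidates = [(1.004, 2.001), (1.001, 2.003), (0.5, 3.0), (0.4, 2.9), (2.0, 0.1)]
front = pareto_front_limited_precision(candidates, decimals=2)
```

Here the first two candidates collapse to the same rounded vector and `(0.4, 2.9)` is dominated by `(0.5, 3.0)`, leaving three vectors on the approximate front.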

### Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection

We consider a contextual online learning (multi-armed bandit) problem with high-dimensional covariate $\mathbf{x}$ and decision $\mathbf{y}$. The reward function to learn, $f(\mathbf{x},\mathbf{y})$, does not have a particular parametric form. The literature has shown that the optimal regret is $\tilde{O}(T^{(d_x+d_y+1)/(d_x+d_y+2)})$, where $d_x$ and $d_y$ are the dimensions of $\mathbf x$ and $\mathbf y$, and thus it suffers from the curse of dimensionality. In many applications, only a small subset of variables in the covariate affect the value of $f$, which is referred to as *sparsity* in statistics. To take advantage of the sparsity structure of the covariate, we propose a variable selection algorithm called *BV-LASSO*, which incorporates novel ideas such as binning and voting to apply LASSO to nonparametric settings. Our algorithm achieves the regret $\tilde{O}(T^{(d_x^*+d_y+1)/(d_x^*+d_y+2)})$, where $d_x^*$ is the effective covariate dimension. The regret matches the optimal regret when the covariate is $d^*_x$-dimensional and thus cannot be improved. Our algorithm may serve as a general recipe to achieve dimension reduction via variable selection in nonparametric settings.

### When compressive learning fails: blame the decoder or the sketch?

In compressive learning, a mixture model (a set of centroids or a Gaussian mixture) is learned from a sketch vector that serves as a highly compressed representation of the dataset. This requires solving a non-convex optimization problem, hence in practice approximate heuristics (such as CLOMPR) are used. In this work we explore, by numerical simulations, properties of this non-convex optimization landscape and those heuristics.

### Neural CDEs for Long Time Series via the Log-ODE Method

Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN, just as Neural ODEs are analogous to ResNets. However just like RNNs, training Neural CDEs can be difficult for long time series. Here, we propose to apply a technique drawn from stochastic analysis, namely the log-ODE method. Instead of using the original input sequence, our procedure summarises the information over local time intervals via the log-signature map, and uses the resulting shorter stream of log-signatures as the new input. This represents a length/channel trade-off. In doing so we demonstrate efficacy on problems of length up to 17k observations and observe significant training speed-ups, improvements in model performance, and reduced memory requirements compared to the existing algorithm.
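
The length/channel trade-off can be sketched with a hand-written depth-2 log-signature, which for a piecewise-linear path consists of the total increment of each channel plus the Lévy area of each channel pair. This is only an illustration under that assumption: the truncation depth, the function names, and the window size are hypothetical, and a dedicated signature library would normally be used instead.

```python
from typing import List, Sequence

Point = Sequence[float]


def logsig_depth2(window: Sequence[Point]) -> List[float]:
    """Depth-2 log-signature of a piecewise-linear path: increments, then Levy areas."""
    d = len(window[0])
    start = window[0]
    increment = [window[-1][i] - start[i] for i in range(d)]
    areas = []
    for i in range(d):
        for j in range(i + 1, d):
            total = 0.0
            for k in range(len(window) - 1):
                di = window[k + 1][i] - window[k][i]
                dj = window[k + 1][j] - window[k][j]
                # Antisymmetric part of the second-order iterated integral.
                total += (window[k][i] - start[i]) * dj - (window[k][j] - start[j]) * di
            areas.append(0.5 * total)
    return increment + areas


def logsig_stream(path: Sequence[Point], step: int) -> List[List[float]]:
    """Summarise consecutive windows of `step` intervals; windows share endpoints."""
    return [logsig_depth2(path[s:s + step + 1]) for s in range(0, len(path) - 1, step)]


# A unit square traversed counter-clockwise twice: 9 observations, 2 channels.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
path = square + square[1:]
stream = logsig_stream(path, step=4)
print(len(path), "->", len(stream), "steps with", len(stream[0]), "channels each")
```

Each window returns to its starting point, so both increments are zero while the Lévy area records the enclosed signed area; the shorter stream therefore carries information that simple subsampling would lose.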

### Multimodal Safety-Critical Scenarios Generation for Decision-Making Algorithms Evaluation

Existing neural network-based autonomous systems are shown to be vulnerable to adversarial attacks, therefore sophisticated evaluation of their robustness is of great importance. However, evaluating the robustness only under the worst-case scenarios based on known attacks is not comprehensive, not to mention that some of them even rarely occur in the real world. In addition, the distribution of safety-critical data is usually multimodal, while most traditional attacks and evaluation methods focus on a single modality. To solve the above challenges, we propose a flow-based multimodal safety-critical scenario generator for evaluating decision-making algorithms. The proposed generative model is optimized with weighted likelihood maximization and a gradient-based sampling procedure is integrated to improve the sampling efficiency. The safety-critical scenarios are generated by querying the task algorithms and the log-likelihood of the generated scenarios is in proportion to the risk level. Experiments on a self-driving task demonstrate our advantages in terms of testing efficiency and multimodal modeling capability. We evaluate six Reinforcement Learning algorithms with our generated traffic scenarios and provide empirical conclusions about their robustness.

### Social network analytics for supervised fraud detection in insurance

Insurance fraud occurs when policyholders file claims that are exaggerated or based on intentional damages. This contribution develops a fraud detection strategy by extracting insightful information from the social network of a claim. First, we construct a network by linking claims with all their involved parties, including the policyholders, brokers, experts, and garages. Next, we establish fraud as a social phenomenon in the network and use the BiRank algorithm with a fraud-specific query vector to compute a fraud score for each claim. From the network, we extract features related to the fraud scores as well as the claims' neighborhood structure. Finally, we combine these network features with the claim-specific features and build a supervised model with fraud in motor insurance as the target variable. Although we build a model for only motor insurance, the network includes claims from all available lines of business. Our results show that models with features derived from the network perform well when detecting fraud and even outperform the models using only the classical claim-specific features. Combining network and claim-specific features further improves the performance of supervised learning models to detect fraud. The resulting model flags highly suspicious claims that need to be further investigated. Our approach provides a guided and intelligent selection of claims and contributes to a more effective fraud investigation process.
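
The scoring step can be sketched as a BiRank-style propagation on the bipartite claim-party graph. This is a simplified illustration, not the authors' implementation: the symmetric degree normalisation, the damping factor `alpha`, the fixed iteration count, and the function name are all assumptions, and the query vector simply marks claims already known to be fraudulent.

```python
import math
from typing import List, Tuple


def birank_scores(edges: List[Tuple[int, int]], n_claims: int, n_parties: int,
                  query: List[float], alpha: float = 0.85, iters: int = 100):
    """Propagate fraud evidence from known fraudulent claims through shared parties."""
    deg_c = [0] * n_claims
    deg_p = [0] * n_parties
    for c, p in edges:
        deg_c[c] += 1
        deg_p[p] += 1
    # Symmetrically normalised edge weights (assumed normalisation).
    weight = {(c, p): 1.0 / math.sqrt(deg_c[c] * deg_p[p]) for c, p in edges}
    total = sum(query)  # assumes at least one known fraudulent claim
    prior = [q / total for q in query]
    claim = prior[:]
    party = [0.0] * n_parties
    for _ in range(iters):
        # Parties collect scores from their claims.
        party = [0.0] * n_parties
        for (c, p), w in weight.items():
            party[p] += w * claim[c]
        # Claims mix propagated scores with the query vector.
        updated = [(1.0 - alpha) * q for q in prior]
        for (c, p), w in weight.items():
            updated[c] += alpha * w * party[p]
        claim = updated
    return claim, party


# Hypothetical toy network: claims 0 and 1 share party 0; claim 2 uses party 1.
edges = [(0, 0), (1, 0), (2, 1)]
claim_scores, party_scores = birank_scores(edges, n_claims=3, n_parties=2,
                                           query=[1.0, 0.0, 0.0])
print(claim_scores)
```

Claim 1 receives a positive score only because it shares a party with the known fraudulent claim 0, while the unconnected claim 2 stays at zero, which is the "fraud as a social phenomenon" intuition in miniature.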

### Decoupling Representation Learning from Reinforcement Learning

In an effort to overcome limitations of reward-driven feature learning in deep reinforcement learning (RL) from images, we propose decoupling representation learning from policy learning. To this end, we introduce a new unsupervised learning (UL) task, called Augmented Temporal Contrast (ATC), which trains a convolutional encoder to associate pairs of observations separated by a short time difference, under image augmentations and using a contrastive loss. In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multi-task encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC, and introduce a new data augmentation to enable replay of (compressed) latent images from pre-trained encoders when RL requires augmentation. Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at https://github.com/astooke/rlpyt/rlpyt/ul.
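
A contrastive loss of the kind described can be sketched as an InfoNCE objective over a batch, where each anchor embedding should score highest against its own positive (the observation a short time later). This sketch assumes plain dot-product similarity on precomputed embeddings; the encoder, the augmentations, and any additional machinery in ATC are not modelled, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def info_nce(anchors: Sequence[Sequence[float]],
             positives: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    losses: List[float] = []
    for i, anchor in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(anchor, pos)) for pos in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-D embeddings for a batch of two observation pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is lower when temporally paired embeddings are similar and other batch elements are dissimilar, which is the signal that trains the encoder without any reward.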

### LAAT: Locally Aligned Ant Technique for detecting manifolds of varying density

Dimensionality reduction and clustering are often used as preliminary steps for many complex machine learning tasks. The presence of noise and outliers can deteriorate the performance of such preprocessing and therefore impair the subsequent analysis tremendously. In manifold learning, several studies indicate solutions for removing background noise or noise close to the structure when the density is substantially higher than that exhibited by the noise. However, in many applications, including astronomical datasets, the density varies alongside manifolds that are buried in a noisy background. We propose a novel method to extract manifolds in the presence of noise based on the idea of Ant colony optimization. In contrast to the existing random walk solutions, our technique captures points which are locally aligned with major directions of the manifold. Moreover, we empirically show that the biologically inspired formulation of ant pheromone reinforces this behavior enabling it to recover multiple manifolds embedded in extremely noisy data clouds. The algorithm's performance is demonstrated in comparison to the state-of-the-art approaches, such as Markov Chain, LLPD, and Disperse, on several synthetic and real astronomical datasets stemming from an N-body simulation of a cosmological volume.

### Towards Stable Imbalanced Data Classification via Virtual Big Data Projection

Virtual Big Data (VBD) was very recently shown to be effective in alleviating mode collapse and vanishing generator gradients, two major problems of Generative Adversarial Neural Networks (GANs). In this paper, we investigate the capability of VBD to address two other major challenges in machine learning: deep autoencoder training and imbalanced data classification. First, we prove that VBD can significantly decrease the validation loss of autoencoders by providing them with a large, diversified training set, which is key to better generalization and reduced over-fitting. Second, we use VBD to propose the first projection-based method, called cross-concatenation, to balance skewed class distributions without over-sampling. We prove that cross-concatenation can solve the uncertainty problem of data-driven methods for imbalanced classification.

### United We Stand: Transfer Graph Neural Networks for Pandemic Forecasting

The recent outbreak of COVID-19 has affected millions of individuals around the world and has posed a significant challenge to global healthcare. From the early days of the pandemic, it became clear that it is highly contagious and that human mobility contributes significantly to its spread. In this paper, we study the impact of population movement on the spread of COVID-19, and we capitalize on recent advances in the field of representation learning on graphs to capture the underlying dynamics. Specifically, we create a graph where nodes correspond to a country's regions and the edge weights denote human mobility from one region to another. Then, we employ graph neural networks to predict the number of future cases, encoding the underlying diffusion patterns that govern the spread into our learning model. Furthermore, to account for the limited amount of training data, we capitalize on the pandemic's asynchronous outbreaks across countries and use a model-agnostic meta-learning based method to transfer knowledge from one country's model to another's. We compare the proposed approach against simple baselines and more traditional forecasting techniques in 3 European countries. Experimental results demonstrate the superiority of our method, highlighting the usefulness of GNNs in epidemiological prediction. Transfer learning provides the best model, highlighting its potential to improve the accuracy of the predictions in case of secondary waves, if data from past/parallel outbreaks is utilized.

### Detección de comunidades en redes: Algoritmos y aplicaciones

The objective of this master's thesis is to analyze methods for detecting communities in networks. First, I study the main features of graph theory and communities, as well as the measures commonly used in this problem. Subsequently, I review the main community detection methods and develop a classification that takes into account their characteristics and computational complexity, in order to identify the strengths and weaknesses of the methods, as well as of later works. Then, I study the problem of evaluating a clustering method, in order to assess the quality of the detected communities by analyzing different measures. Finally, conclusions are drawn and possible lines of future work are outlined.

### Utilizing remote sensing data in forest inventory sampling via Bayesian optimization

In large-area forest inventories a trade-off between the amount of data to be sampled and the costs of collecting the data is necessary. It is not always possible to have a very large data sample when dealing with sampling-based inventories. It is therefore necessary to optimize the sampling design in order to achieve optimal population parameter estimation. In contrast, the availability of remote sensing (RS) data correlated with the forest inventory variables is usually much higher. The combination of RS and the sampled field measurement data is often used for improving the forest inventory parameter estimation. In addition, it is also reasonable to study the utilization of RS data in inventory sampling, which can further improve the estimation of forest variables. In this study, we propose a data sampling method based on Bayesian optimization which uses RS data in forest inventory sample selection. The presented method applies the learned functional relationship between the RS and inventory data in new sampling decisions. We evaluate our method by conducting simulated sampling experiments with both synthetic data and measured data from the Aland region in Finland. The proposed method is benchmarked against two baseline methods: simple random sampling and the local pivotal method. In the simulated experiments, the proposed method achieves the best MSE values when the functional relationship between RS and inventory data is correctly learned from the available training data.

### Large Norms of CNN Layers Do Not Hurt Adversarial Robustness

Since the Lipschitz properties of convolutional neural networks (CNNs) are widely considered to be related to adversarial robustness, we theoretically characterize the $\ell_1$ norm and $\ell_\infty$ norm of 2D multi-channel convolutional layers and provide efficient methods to compute the exact $\ell_1$ norm and $\ell_\infty$ norm. Based on our theorem, we propose a novel regularization method termed norm decay, which can effectively reduce the norms of CNN layers. Experiments show that norm-regularization methods, including norm decay, weight decay, and singular value clipping, can improve generalization of CNNs. However, we are surprised to find that they can slightly hurt adversarial robustness. Furthermore, we compute the norms of layers in the CNNs trained with three different adversarial training frameworks and find that adversarially robust CNNs have comparable or even larger norms than their non-adversarially robust counterparts. Moreover, we prove that under a mild assumption, adversarially robust classifiers can be achieved with neural networks and an adversarially robust neural network can have an arbitrarily large Lipschitz constant. For these reasons, enforcing small norms of CNN layers may be neither effective nor necessary in achieving adversarial robustness. Our code is available at https://github.com/youweiliang/norm_robustness.

### 'Less Than One'-Shot Learning: Learning N Classes From M<N Samples

Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning where the model must learn a new class from a single example. We propose the 'less than one'-shot learning task where models must learn $N$ new classes given only $M<N$ examples and we show that this is achievable with the help of soft labels. We use a soft-label generalization of the k-Nearest Neighbors classifier to explore the intricate decision landscapes that can be created in the 'less than one'-shot learning setting. We analyze these decision landscapes to derive theoretical lower bounds for separating $N$ classes using $M<N$ soft-label samples and investigate the robustness of the resulting systems.
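
A small sketch shows how soft labels let $M=2$ samples separate $N=3$ classes on a line. It assumes a distance-weighted soft-label nearest-neighbour rule (each prototype's label vector is weighted by the inverse of its distance to the query); the prototype positions, the label values, and the function name are hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple

Prototype = Tuple[float, Sequence[float]]  # (position, soft label over classes)


def classify(x: float, prototypes: List[Prototype], k: int = 2) -> int:
    """Predict the class with the largest inverse-distance-weighted soft-label sum."""
    nearest = sorted(prototypes, key=lambda p: abs(x - p[0]))[:k]
    n_classes = len(nearest[0][1])
    scores = [0.0] * n_classes
    for position, label in nearest:
        dist = abs(x - position)
        if dist == 0.0:
            # Query coincides with a prototype: use its own dominant class.
            return max(range(n_classes), key=lambda c: label[c])
        for c in range(n_classes):
            scores[c] += label[c] / dist
    return max(range(n_classes), key=lambda c: scores[c])


# Two prototypes, three classes: the middle class has no prototype of its own.
prototypes = [(0.0, [0.6, 0.4, 0.0]), (1.0, [0.0, 0.4, 0.6])]
predictions = [classify(x, prototypes) for x in (0.1, 0.5, 0.9)]
print(predictions)
```

Near either prototype its dominant class wins, but midway between them the shared mass on the middle class adds up and overtakes both, creating a third decision region from only two samples.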

### MStream: Fast Streaming Multi-Aspect Group Anomaly Detection

Given a stream of entries in a multi-aspect data setting, i.e., entries having multiple dimensions, how can we detect anomalous activities? For example, in the intrusion detection setting, existing work seeks to detect anomalous events or edges in dynamic graph streams, but this does not allow us to take into account additional attributes of each entry. Our work aims to define a streaming multi-aspect data anomaly detection framework, termed MStream, which can detect unusual group anomalies as they occur, in a dynamic manner. MStream has the following properties: (a) it detects anomalies in multi-aspect data including both categorical and numeric attributes; (b) it is online, thus processing each record in constant time and constant memory; (c) it can capture the correlation between multiple aspects of the data. MStream is evaluated over the KDDCUP99, CICIDS-DoS, UNSW-NB 15 and CICIDS-DDoS datasets, and outperforms state-of-the-art baselines.

### Real-Time Streaming Anomaly Detection in Dynamic Graphs

Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? Existing approaches aim to detect individually surprising edges. In this work, we propose MIDAS, which focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, such as lockstep behavior, including denial of service attacks in network traffic data. We further propose MIDAS-F, to solve the problem by which anomalies are incorporated into the algorithm's internal states, creating a 'poisoning' effect which can allow future anomalies to slip through undetected. MIDAS-F introduces two modifications: 1) We modify the anomaly scoring function, aiming to reduce the 'poisoning' effect of newly arriving edges; 2) We introduce a conditional merge step, which updates the algorithm's data structures after each time tick, but only if the anomaly score is below a threshold value, also to reduce the 'poisoning' effect. Experiments show that MIDAS-F has significantly higher accuracy than MIDAS. MIDAS has the following properties: (a) it detects microcluster anomalies while providing theoretical guarantees about its false positive probability; (b) it is online, thus processing each edge in constant time and constant memory, and also processes the data 130 to 929 times faster than state-of-the-art approaches; (c) it provides 41% to 55% higher accuracy (in terms of ROC-AUC) than state-of-the-art approaches.
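
The idea of scoring a burst of identical edges can be sketched with a chi-squared style statistic that compares an edge's count in the current time tick, $a$, with its total count so far, $s$, at tick $t$: $\left(a - s/t\right)^2 t^2 / \left(s(t-1)\right)$. The sketch below makes two simplifying assumptions: exact dictionaries replace the constant-memory count sketches that an online implementation would need, and ticks are integers starting at 1 that arrive in order. The class name is hypothetical and the MIDAS-F modifications are not included.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Tuple

Edge = Tuple[Hashable, Hashable]


class EdgeBurstScorer:
    """Scores each arriving edge by how bursty it is relative to its history."""

    def __init__(self) -> None:
        self.total: Dict[Edge, int] = defaultdict(int)    # counts over all ticks
        self.current: Dict[Edge, int] = defaultdict(int)  # counts in this tick
        self.tick: Optional[int] = None

    def score(self, src: Hashable, dst: Hashable, t: int) -> float:
        if t != self.tick:
            # A new time tick starts: reset the per-tick counts.
            self.current.clear()
            self.tick = t
        edge = (src, dst)
        self.current[edge] += 1
        self.total[edge] += 1
        a, s = self.current[edge], self.total[edge]
        if t <= 1:
            return 0.0  # no history to compare against yet
        return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))


# Hypothetical stream: one edge per tick for four ticks, then a burst of ten.
scorer = EdgeBurstScorer()
stream = [("a", "b", t) for t in (1, 2, 3, 4)] + [("a", "b", 5)] * 10
scores = [scorer.score(u, v, t) for u, v, t in stream]
print(scores[:5], scores[-1])
```

An edge that arrives at its usual rate scores zero, while the score grows as the same edge repeats within a single tick, which is the microcluster behaviour the abstract describes.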

### ExGAN: Adversarial Generation of Extreme Samples

Mitigating the risk arising from extreme events is a fundamental goal with many applications, such as the modelling of natural disasters, financial crashes, epidemics, and many others. To manage this risk, a vital step is to be able to understand or generate a wide range of extreme scenarios. Existing approaches based on Generative Adversarial Networks (GANs) excel at generating realistic samples, but seek to generate typical samples, rather than extreme samples. Hence, in this work, we propose ExGAN, a GAN-based approach to generate realistic and extreme samples. To model the extremes of the training distribution in a principled way, our work draws from Extreme Value Theory (EVT), a probabilistic approach for modelling the extreme tails of distributions. For practical utility, our framework allows the user to specify both the desired extremeness measure, as well as the desired extremeness probability they wish to sample at. Experiments on real US Precipitation data show that our method generates realistic samples, based on visual inspection and quantitative measures, in an efficient manner. Moreover, generating increasingly extreme examples using ExGAN can be done in constant time (with respect to the extremeness probability), as opposed to the exponential time required by the baseline approach.