We develop a new model of insulin-glucose dynamics for forecasting blood glucose in type 1 diabetics. We augment an existing biomedical model by introducing time-varying dynamics driven by a machine learning sequence model. Our model maintains a physiologically plausible inductive bias and clinically interpretable parameters -- e.g., insulin sensitivity -- while inheriting the flexibility of modern pattern recognition algorithms. Critical to modeling success are the flexible, but structured representations of subject variability with a sequence model. In contrast, less constrained models like the LSTM fail to provide reliable or physiologically plausible forecasts. We conduct an extensive empirical study. We show that allowing biomedical model dynamics to vary in time improves forecasting at long time horizons, up to six hours, and produces forecasts consistent with the physiological effects of insulin and carbohydrates.

Modern machine learning often operates in the regime where the number of parameters is much higher than the number of data points, with zero training loss and yet good generalization, thereby contradicting the classical bias-variance trade-off. This \textit{benign overfitting} phenomenon has recently been characterized using so called \textit{double descent} curves where the risk undergoes another descent (in addition to the classical U-shaped learning curve when the number of parameters is small) as we increase the number of parameters beyond a certain threshold. In this paper, we examine the conditions under which \textit{Benign Overfitting} occurs in the random feature (RF) models, i.e. in a two-layer neural network with fixed first layer weights. We adopt a new view of random feature and show that \textit{benign overfitting} arises due to the noise which resides in such features (the noise may already be present in the data and propagate to the features or it may be added by the user to the features directly) and plays an important implicit regularization role in the phenomenon.

The Mat{\'e}rn family of covariance functions has played a central role in spatial statistics for decades, being a flexible parametric class with one parameter determining the smoothness of the paths of the underlying spatial field. This paper proposes a new family of spatial covariance functions, which stems from a reparameterization of the generalized Wendland family. As for the Mat{\'e}rn case, the new class allows for a continuous parameterization of the smoothness of the underlying Gaussian random field, being additionally compactly supported. More importantly, we show that the proposed covariance family generalizes the Mat{\'e}rn model which is attained as a special limit case. The practical implication of our theoretical results questions the effective flexibility of the Mat{\'e}rn covariance from modeling and computational viewpoints. Our numerical experiments elucidate the speed of convergence of the proposed model to the Mat{\'e}rn model. We also inspect the level of sparseness of the associated (inverse) covariance matrix and the asymptotic distribution of the maximum likelihood estimator under increasing and fixed domain asymptotics. The effectiveness of our proposal is illustrated by analyzing a georeferenced dataset on maximum temperatures over the southeastern United States, and performing a re-analysis of a large spatial point referenced dataset of yearly total precipitation anomalies

We study Markov chain Monte Carlo (MCMC) algorithms for target distributions defined on matrix spaces. Such an important sampling problem has yet to be analytically explored. We carry out a major step in covering this gap by developing the proper theoretical framework that allows for the identification of ergodicity properties of typical MCMC algorithms, relevant in such a context. Beyond the standard Random-Walk Metropolis (RWM) and preconditioned Crank--Nicolson (pCN), a contribution of this paper in the development of a novel algorithm, termed the `Mixed' pCN (MpCN). RWM and pCN are shown not to be geometrically ergodic for an important class of matrix distributions with heavy tails. In contrast, MpCN has very good empirical performance within this class. Geometric ergodicity for MpCN is not fully proven in this work, as some remaining drift conditions are quite challenging to obtain owing to the complexity of the state space. We do, however, make a lot of progress towards a proof, and show in detail the last steps left for future work. We illustrate the computational performance of the various algorithms through simulation studies, first for the trivial case of an Inverse-Wishart target, and then for a challenging model arising in financial statistics.

Ordinary differential equations (ODE) are widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given the noisy observations. We do not restrict the functional forms in ODE to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA too. We demonstrate the efficacy of our method through numerous ODE examples.

This article presents a Bayesian inferential method where the likelihood for a model is unknown but where data can easily be simulated from the model. We discretize simulated (continuous) data to estimate the implicit likelihood in a Bayesian analysis employing a Markov chain Monte Carlo algorithm. Three examples are presented as well as a small study on some of the method's properties.

The present paper, characterizes the invertibility and causality conditions of a periodic ARFIMA (PARFIMA) models. We first, discuss the conditions in the multivariate case, by considering the corresponding p-variate stationary ARFIMA models. Second, we construct the conditions using the univariate case and we deduce a new infinite autoregressive representation for the PARFIMA model, the results are investigated through a simulation study.

Optimal transport has been one of the most exciting subjects in mathematics, starting from the 18th century. As a powerful tool to transport between two probability measures, optimal transport methods have been reinvigorated nowadays in a remarkable proliferation of modern data science applications. To meet the big data challenges, various computational tools have been developed in the recent decade to accelerate the computation for optimal transport methods. In this review, we present some cutting-edge computational optimal transport methods with a focus on the regularization-based methods and the projection-based methods. We discuss their real-world applications in biomedical research.

Since the primary mode of respiratory virus transmission is person-to-person interaction, we are required to reconsider physical interaction patterns to mitigate the number of people infected with COVID-19. While non-pharmaceutical interventions (NPI) had an evident impact on national mobility patterns, only the relative regional mobility behaviour enables an unbiased perspective on the effect of human movement on the spread of COVID-19. In this paper we therefore investigate the impact of human mobility and social connectivity derived from Facebook activities on the weekly rate of new infections in Germany between March 3rd and June 22nd, 2020. Our results confirm that reduced social activity lowers the infection rate, accounting for regional and temporal patterns. The extent of social distancing, quantified by the percentage of people staying put within a federal administrative district, has an overall negative effect on the incidence of infections. Additionally, our results show spatial infection patterns based on geographic as well as social distances.

A probability forecast or probabilistic classifier is reliable or calibrated if the predicted probabilities are matched by ex post observed frequencies, as examined visually in reliability diagrams. The classical binning and counting approach to plotting reliability diagrams has been hampered by a lack of stability under unavoidable, ad hoc implementation decisions. Here we introduce the CORP approach, which generates provably statistically Consistent, Optimally binned, and Reproducible reliability diagrams in an automated way. CORP is based on non-parametric isotonic regression and implemented via the Pool-adjacent-violators (PAV) algorithm - essentially, the CORP reliability diagram shows the graph of the PAV- (re)calibrated forecast probabilities. The CORP approach allows for uncertainty quantification via either resampling techniques or asymptotic theory, furnishes a new numerical measure of miscalibration, and provides a CORP based Brier score decomposition that generalizes to any proper scoring rule. We anticipate that judicious uses of the PAV algorithm yield improved tools for diagnostics and inference for a very wide range of statistical and machine learning methods.

We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. Such fractal structures have been empirically observed in diverse applications. FGNs interpolate continuously between the popular purely random geometric graphs (a.k.a. the Poisson Boolean network), and random graphs with increasingly fractal behavior. In fact, they form a parametric family of sparse random geometric graphs that are parametrized by a fractality parameter $\nu$ which governs the strength of the fractal structure. FGNs are driven by the latent spatial geometry of Gaussian Multiplicative Chaos (GMC), a canonical model of fractality in its own right. We asymptotically characterize the expected number of edges and triangle in FGNs. We then examine the natural question of detecting the presence of fractality and the problem of parameter estimation based on observed network data, in addition to fundamental properties of the FGN as a random graph model. We also explore fractality in community structures by unveiling a natural stochastic block model in the setting of FGNs.

Robustness analysis is an emerging field in the domain of uncertainty quantification. It consists of analysing the response of a computer model with uncertain inputs to the perturbation of one or several of its input distributions. Thus, a practical robustness analysis methodology should rely on a coherent definition of a distribution perturbation. This paper addresses this issue by exposing a rigorous way of perturbing densities. The proposed methodology is based the Fisher distance on manifolds of probability distributions. A numerical method to calculate perturbed densities in practice is presented. This method comes from Lagrangian mechanics and consists of solving an ordinary differential equations system. This perturbation definition is then used to compute quantile-oriented robustness indices. The resulting Perturbed-Law based sensitivity Indices (PLI) are illustrated on several numerical models. This methodology is also applied to an industrial study (simulation of a loss of coolant accident in a nuclear reactor), where several tens of the model physical parameters are uncertain with limited knowledge concerning their distributions.

The power law is useful in describing count phenomena such as network degrees and word frequencies. With a single parameter, it captures the main feature that the frequencies are linear on the log-log scale. Nevertheless, there have been criticisms of the power law, and various approaches have been proposed to resolve issues such as selecting the required threshold and quantifying the uncertainty around it, and to test hypotheses on whether the data could have come from the power law. As extreme value theory generalises the (continuous) power law, it is natural to consider the former as a solution to these problems around the latter. In this paper, we propose two extreme value mixture distributions, in one of which the power law is incorporated, without the need of pre-specifying the threshold. The proposed distributions are shown to fit the data well, quantify the threshold uncertainty in a natural way, and satisfactorily answer whether the power law is useful enough.

Efficient sampling of many-dimensional and multimodal density functions is a task of great interest in many research fields. We describe an algorithm that allows parallelizing inherently serial Markov chain Monte Carlo (MCMC) sampling by partitioning the space of the function parameters into multiple subspaces and sampling each of them independently. The samples of the different subspaces are then reweighted by their integral values and stitched back together. This approach allows reducing sampling wall-clock time by parallel operation. It also improves sampling of multimodal target densities and results in less correlated samples. Finally, the approach yields an estimate of the integral of the target density function.

We describe the development of a multi-purpose software for Bayesian statistical inference, BAT.jl, written in the Julia language. The major design considerations and implemented algorithms are summarized here, together with a test suite that ensures the proper functioning of the algorithms. We also give an extended example from the realm of physics that demonstrates the functionalities of BAT.jl.

We propose a Monte-Carlo-based method for reconstructing sparse signals in the formulation of sparse linear regression in a high-dimensional setting. The basic idea of this algorithm is to explicitly select variables or covariates to represent a given data vector or responses and accept randomly generated updates of that selection if and only if the energy or cost function decreases. This algorithm is called the greedy Monte-Carlo (GMC) search algorithm. Its performance is examined via numerical experiments, which suggests that in the noiseless case, GMC can achieve perfect reconstruction in undersampling situations of a reasonable level: it can outperform the $\ell_1$ relaxation but does not reach the algorithmic limit of MC-based methods theoretically clarified by an earlier analysis. Additionally, an experiment on a real-world dataset supports the practicality of GMC.

We provide an algorithm to generate trajectories of sparse stochastic processes that are solutions of linear ordinary differential equations driven by L\'evy white noises. A recent paper showed that these processes are limits in law of generalized compound-Poisson processes. Based on this result, we derive an off-the-grid algorithm that generates arbitrarily close approximations of the target process. Our method relies on a B-spline representation of generalized compound-Poisson processes. We illustrate numerically the validity of our approach.

Missing value problem in spatiotemporal traffic data has long been a challenging topic, in particular for large-scale and high-dimensional data with complex missing mechanisms and diverse degrees of missingness. Recent studies based on tensor nuclear norm have demonstrated the superiority of tensor learning in imputation tasks by effectively characterizing the complex correlations/dependencies in spatiotemporal data. However, despite the promising results, these approaches do not scale well to large tensors. In this paper, we focus on addressing the missing data imputation problem for large-scale spatiotemporal traffic data. To achieve both high accuracy and efficiency, we develop a scalable autoregressive tensor learning model---Low-Tubal-Rank Autoregressive Tensor Completion (LATC-Tubal)---based on the existing framework of Low-Rank Autoregressive Tensor Completion (LATC), which is well-suited for spatiotemporal traffic data that characterized by multidimensional structure of location$\times$ time of day $\times$ day. In particular, the proposed LATC-Tubal model involves a scalable tensor nuclear norm minimization scheme by integrating linear unitary transformation. Therefore, the tensor nuclear norm minimization can be solved by singular value thresholding on the transformed matrix of each day while the day-to-day correlation can be effectively preserved by the unitary transform matrix. Before setting up the experiment, we consider two large-scale 5-minute traffic speed data sets collected by the California PeMS system with 11160 sensors. We compare LATC-Tubal with state-of-the-art baseline models, and find that LATC-Tubal can achieve competitively accuracy with a significantly lower computational cost. In addition, the LATC-Tubal will also benefit other tasks in modeling large-scale spatiotemporal traffic data, such as network-level traffic forecasting.

Data dimensionality informs us about data complexity and sets limit on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold-adaptive Farahmand-Szepesv\'ari-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates, if the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality, if i.i.d. holds. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected-median-FSA estimator with kNN estimators: maximum likelihood (ML, Levina-Bickel) and two implementations of DANCo (R and matlab). We show that corrected-median-FSA estimator beats the ML estimator and it is on equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics while resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.

Individual Treatment Effect (ITE) estimation is an extensively researched problem, with applications in various domains. We model the case where there is heterogeneous non-compliance to a randomly assigned treatment, a typical situation in health (because of non-compliance to prescription) or digital advertising (because of competition and ad blockers for instance). The lower the compliance, the more the effect of treatment prescription, or individual prescription effect (IPE), signal fades away and becomes hard to capture. We propose a new approach to estimate IPE that takes advantage of observed compliance information to prevent signal fading. Using the Structural Causal Model framework and do-calculus, we define a general mediated causal effect setting under which our proposed estimator soundly recovers the IPE, and study its asymptotic variance. Finally, we conduct extensive experiments on both synthetic and real-world datasets that highlight the benefit of the approach, which consistently improves state-of-the-art in low compliance settings.

We consider the following data perturbation model, where the covariates incur multiplicative errors. For two $n \times m$ random matrices $U, X$, we denote by $U \circ X$ the Hadamard or Schur product, which is defined as $(U \circ X)_{ij} = (U_{ij}) \cdot (X_{ij})$. In this paper, we study the subgaussian matrix variate model, where we observe the matrix variate data $X$ through a random mask $U$: \begin{equation*} {\mathcal X} = U \circ X \; \; \; \text{ where} \; \; \;X = B^{1/2} {\mathbb Z} A^{1/2}, \end{equation*} where ${\mathbb Z}$ is a random matrix with independent subgaussian entries, and $U$ is a mask matrix with either zero or positive entries, where ${\mathbb E} U_{ij} \in [0, 1]$ and all entries are mutually independent. Subsampling in rows, or columns, or random sampling of entries of $X$ are special cases of this model. Under the assumption of independence between $U$ and $X$, we introduce componentwise unbiased estimators for estimating covariance $A$ and $B$, and prove the concentration of measure bounds in the sense of guaranteeing the restricted eigenvalue conditions to hold on the estimator for $B$, when columns of data matrix $X$ are sampled with different rates. Our results provide insight for sparse recovery for relationships among people (samples, locations, items) when features (variables, time points, user ratings) are present in the observed data matrix ${\mathcal X}$ with heterogenous rates. Our proof techniques can certainly be extended to other scenarios.

This paper reflects on a number of trends towards a more open and reproducible approach to geographic and spatial data science over recent years. In particular it considers trends towards Big Data, and the impacts this is having on spatial data analysis and modelling. It identifies a turn in academia towards coding as a core analytic tool, and away from proprietary software tools offering 'black boxes' where the internal workings of the analysis are not revealed. It is argued that this closed form software is problematic, and considers a number of ways in which issues identified in spatial data analysis (such as the MAUP) could be overlooked when working with closed tools, leading to problems of interpretation and possibly inappropriate actions and policies based on these. In addition, this paper and considers the role that reproducible and open spatial science may play in such an approach, taking into account the issues raised. It highlights the dangers of failing to account for the geographical properties of data, now that all data are spatial (they are collected somewhere), the problems of a desire for n=all observations in data science and it identifies the need for a critical approach. This is one in which openness, transparency, sharing and reproducibility provide a mantra for defensible and robust spatial data science.

The literature for count modeling provides useful tools to conduct causal inference when outcomes take non-negative integer values. Applied to the potential outcomes framework, we link the Bayesian causal inference literature to statistical models for count data. We discuss the general architectural considerations for constructing the predictive posterior of the missing potential outcomes. Special considerations for estimating average treatment effects are discussed, some generalizing certain relationships and some not yet encountered in the causal inference literature.

Linear regression models are useful statistical tools to analyze data sets in several different fields. There are several methods to estimate the parameters of a linear regression model. These methods usually perform under normally distributed and uncorrelated errors with zero mean and constant variance. However, for some data sets error terms may not satisfy these or some of these assumptions. If error terms are correlated, such as the regression models with autoregressive (AR(p)) error terms, the Conditional Maximum Likelihood (CML) under normality assumption or the Least Square (LS) methods are often used to estimate the parameters of interest. For CML estimation a distributional assumption on error terms is needed to carry on estimation, but, in practice, such distributional assumptions on error terms may not be plausible. Therefore, in such cases some alternative distribution free methods are needed to conduct the parameter estimation. In this paper, we propose to estimate the parameters of a linear regression model with AR(p) error term using the Empirical Likelihood (EL) method, which is one of the distribution free estimation methods. A small simulation study and a numerical example are provided to evaluate the performance of the proposed estimation method over the CML method. The results of simulation study show that the proposed estimators based on EL method are remarkably better than the estimators obtained from the CML method in terms of mean squared errors (MSE) and bias in almost all the simulation configurations. These findings are also confirmed by the results of the numerical and real data examples.

This is the rejoinder to the discussion by Kennedy, Balakrishnan and Wasserman on the paper "On nearly assumption-free tests of nominal confidence interval coverage for causal parameters estimated by machine learning" published in Statistical Science.

We consider the variational reconstruction framework for inverse problems and propose to learn a data-adaptive input-convex neural network (ICNN) as the regularization functional. The ICNN-based convex regularizer is trained adversarially to discern ground-truth images from unregularized reconstructions. Convexity of the regularizer is attractive since (i) one can establish analytical convergence guarantees for the corresponding variational reconstruction problem and (ii) devise efficient and provable algorithms for reconstruction. In particular, we show that the optimal solution to the variational problem converges to the ground-truth if the penalty parameter decays sub-linearly with respect to the norm of the noise. Further, we prove the existence of a subgradient-based algorithm that leads to monotonically decreasing error in the parameter space with iterations. To demonstrate the performance of our approach for solving inverse problems, we consider the tasks of deblurring natural images and reconstructing images in computed tomography (CT), and show that the proposed convex regularizer is at least competitive with and sometimes superior to state-of-the-art data-driven techniques for inverse problems.

We aim to help users estimate the state of the world in tasks like robotic teleoperation and navigation with visual impairments, where users may have systematic biases that lead to suboptimal behavior: they might struggle to process observations from multiple sensors simultaneously, receive delayed observations, or overestimate distances to obstacles. While we cannot directly change the user's internal beliefs or their internal state estimation process, our insight is that we can still assist them by modifying the user's observations. Instead of showing the user their true observations, we synthesize new observations that lead to more accurate internal state estimates when processed by the user. We refer to this method as assistive state estimation (ASE): an automated assistant uses the true observations to infer the state of the world, then generates a modified observation for the user to consume (e.g., through an augmented reality interface), and optimizes the modification to induce the user's new beliefs to match the assistant's current beliefs. We evaluate ASE in a user study with 12 participants who each perform four tasks: two tasks with known user biases -- bandwidth-limited image classification and a driving video game with observation delay -- and two with unknown biases that our method has to learn -- guided 2D navigation and a lunar lander teleoperation video game. A different assistance strategy emerges in each domain, such as quickly revealing informative pixels to speed up image classification, using a dynamics model to undo observation delay in driving, identifying nearby landmarks for navigation, and exaggerating a visual indicator of tilt in the lander game. The results show that ASE substantially improves the task performance of users with bandwidth constraints, observation delay, and other unknown biases.

This paper considers the multi-agent linear least-squares problem in a server-agent network. In this problem, the system comprises multiple agents, each having a set of local data points, that are connected to a server. The goal for the agents is to compute a linear mathematical model that optimally fits the collective data points held by all the agents, without sharing their individual local data points. This goal can be achieved, in principle, using the server-agent variant of the traditional iterative gradient-descent method. The gradient-descent method converges linearly to a solution, and its rate of convergence is lower bounded by the conditioning of the agents' collective data points. If the data points are ill-conditioned, the gradient-descent method may require a large number of iterations to converge. We propose an iterative pre-conditioning technique that mitigates the deleterious effect of the conditioning of data points on the rate of convergence of the gradient-descent method. We rigorously show that the resulting pre-conditioned gradient-descent method, with the proposed iterative pre-conditioning, achieves superlinear convergence when the least-squares problem has a unique solution. In general, the convergence is linear with improved rate of convergence in comparison to the traditional gradient-descent method and the state-of-the-art accelerated gradient-descent methods. We further illustrate the improved rate of convergence of our proposed algorithm through experiments on different real-world least-squares problems in both noise-free and noisy computation environment.

Fatigue is one of the key factors in the loss of work efficiency and health-related quality of life, and most fatigue assessment methods were based on self-reporting, which may suffer from many factors such as recall bias. To address this issue, we developed an automated system using wearable sensing and machine learning techniques for objective fatigue assessment. ECG/Actigraphy data were collected from subjects in free-living environments. Preprocessing and feature engineering methods were applied, before interpretable solution and deep learning solution were introduced. Specifically, for interpretable solution, we proposed a feature selection approach which can select less correlated and high informative features for better understanding system's decision-making process. For deep learning solution, we used state-of-the-art self-attention model, based on which we further proposed a consistency self-attention (CSA) mechanism for fatigue assessment. Extensive experiments were conducted, and very promising results were achieved.

Deep models, while being extremely flexible and accurate, are surprisingly vulnerable to "small, imperceptible" perturbations known as adversarial attacks. While the majority of existing attacks focus on measuring perturbations under the $\ell_p$ metric, Wasserstein distance, which takes geometry in pixel space into account, has long been known to be a suitable metric for measuring image quality and has recently risen as a compelling alternative to the $\ell_p$ metric in adversarial attacks. However, constructing an effective attack under the Wasserstein metric is computationally much more challenging and calls for better optimization algorithms. We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show that the Frank-Wolfe method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. Our algorithms not only converge faster but also generate much stronger attacks. For instance, we decrease the accuracy of a residual network on CIFAR-10 to $3.4\%$ within a Wasserstein perturbation ball of radius $0.005$, in contrast to $65.6\%$ using the previous Wasserstein attack based on an \emph{approximate} projection operator. Furthermore, employing our stronger attacks in adversarial training significantly improves the robustness of adversarially trained models.

Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interests in developing automatic model compression techniques. Past research have shown that AutoML-based Low Rank Factorization (LRF) technique, when applied to an end-to-end Encoder-Attention-Decoder style ASR model, can achieve a speedup of up to 3.7x, outperforming laborious manual rank-selection approaches. However, we show that current AutoML-based search techniques only work up to a certain compression level, beyond which they fail to produce compressed models with acceptable word error rates (WER). In this work, we propose an iterative AutoML-based LRF approach that achieves over 5x compression without degrading the WER, thereby advancing the state-of-the-art in ASR compression.

Many recent advances in neural information retrieval models, which predict top-K items given a query, learn directly from a large training set of (query, item) pairs. However, they are often insufficient when there are many previously unseen (query, item) combinations, often referred to as the cold start problem. Furthermore, the search system can be biased towards items that are frequently shown to a query previously, also known as the 'rich get richer' (a.k.a. feedback loop) problem. In light of these problems, we observed that most online content platforms have both a search and a recommender system that, while having heterogeneous input spaces, can be connected through their common output item space and a shared semantic representation. In this paper, we propose a new Zero-Shot Heterogeneous Transfer Learning framework that transfers learned knowledge from the recommender system component to improve the search component of a content platform. First, it learns representations of items and their natural-language features by predicting (item, item) correlation graphs derived from the recommender system as an auxiliary task. Then, the learned representations are transferred to solve the target search retrieval task, performing query-to-item prediction without having seen any (query, item) pairs in training. We conduct online and offline experiments on one of the world's largest search and recommender systems from Google, and present the results and lessons learned. We demonstrate that the proposed approach can achieve high performance on offline search retrieval tasks, and more importantly, achieved significant improvements on relevance and user interactions over the highly-optimized production system in online experiments.

While various complexity measures for diverse model classes have been proposed, specifying an appropriate measure capable of predicting and explaining generalization in deep networks has proven to be challenging. We propose \textit{Neural Complexity} (NC), an alternative data-driven approach that meta-learns a scalar complexity measure through interactions with a large number of heterogeneous tasks. The trained NC model can be added to the standard training loss to regularize any task learner under standard learning frameworks. We contrast NC's approach against existing manually-designed complexity measures and also against other meta-learning models, and validate NC's performance on multiple regression and classification tasks.

Unlike in the traditional statistical modeling for which a user typically hand-specify a prior, Neural Processes (NPs) implicitly define a broad class of stochastic processes with neural networks. Given a data stream, NP learns a stochastic process that best describes the data. While this "data-driven" way of learning stochastic processes has proven to handle various types of data, NPs still relies on an assumption that uncertainty in stochastic processes is modeled by a single latent variable, which potentially limits the flexibility. To this end, we propose the Bootstrapping Neural Process (BNP), a novel extension of the NP family using the bootstrap. The bootstrap is a classical data-driven technique for estimating uncertainty, which allows BNP to learn the stochasticity in NPs without assuming a particular form. We demonstrate the efficacy of BNP on various types of data and its robustness in the presence of model-data mismatch.

Resting-state functional MRI (rsfMRI) yields functional connectomes that can serve as cognitive fingerprints of individuals. Connectomic fingerprints have proven useful in many machine learning tasks, such as predicting subject-specific behavioral traits or task-evoked activity. In this work, we propose a surface-based convolutional neural network (BrainSurfCNN) model to predict individual task contrasts from their resting-state fingerprints. We introduce a reconstructive-contrastive loss that enforces subject-specificity of model outputs while minimizing predictive error. The proposed approach significantly improves the accuracy of predicted contrasts over a well-established baseline. Furthermore, BrainSurfCNN's prediction also surpasses test-retest benchmark in a subject identification task.

Currently, open-domain generative dialog systems have attracted considerable attention in academia and industry. Despite the success of single-turn dialog generation, multi-turn dialog generation is still a big challenge. So far, there are two kinds of models for open-domain multi-turn dialog generation: hierarchical and non-hierarchical models. Recently, some works have shown that the hierarchical models are better than non-hierarchical models under their experimental settings; meanwhile, some works also demonstrate the opposite conclusion. Due to the lack of adequate comparisons, it's not clear which kind of models are better in open-domain multi-turn dialog generation. Thus, in this paper, we will measure systematically nearly all representative hierarchical and non-hierarchical models over the same experimental settings to check which kind is better. Through extensive experiments, we have the following three important conclusions: (1) Nearly all hierarchical models are worse than non-hierarchical models in open-domain multi-turn dialog generation, except for the HRAN model. Through further analysis, the excellent performance of HRAN mainly depends on its word-level attention mechanism; (2) The performance of other hierarchical models will also obtain a great improvement if integrating the word-level attention mechanism into these models. The modified hierarchical models even significantly outperform the non-hierarchical models; (3) The reason why the word-level attention mechanism is so powerful for hierarchical models is because it can leverage context information more effectively, especially the fine-grained information. Besides, we have implemented all of the models and already released the codes.

Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness.

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.

This papers explores a new outlier detection algorithm based on the spectrum of the Laplacian matrix of a graph. Taking advantage of boosting together with sparse-data based learners. The sparcity of the Laplacian matrix significantly decreases the computational burden, enabling a spectrum based outlier detection method to be applied to larger datasets compared to spectral clustering. The method is competitive on synthetic datasets with commonly used outlier detection algorithms like Isolation Forest and Local Outlier Factor.

Alchemical free energy calculations are a useful tool for predicting free energy differences associated with the transfer of molecules from one environment to another. The hallmark of these methods is the use of "bridging" potential energy functions representing \emph{alchemical} intermediate states that cannot exist as real chemical species. The data collected from these bridging alchemical thermodynamic states allows the efficient computation of transfer free energies (or differences in transfer free energies) with orders of magnitude less simulation time than simulating the transfer process directly. While these methods are highly flexible, care must be taken in avoiding common pitfalls to ensure that computed free energy differences can be robust and reproducible for the chosen force field, and that appropriate corrections are included to permit direct comparison with experimental data. In this paper, we review current best practices for several popular application domains of alchemical free energy calculations, including relative and absolute small molecule binding free energy calculations to biomolecular targets.

Spacecraft collision avoidance procedures have become an essential part of satellite operations. Complex and constantly updated estimates of the collision risk between orbiting objects inform the various operators who can then plan risk mitigation measures. Such measures could be aided by the development of suitable machine learning models predicting, for example, the evolution of the collision risk in time. In an attempt to study this opportunity, the European Space Agency released, in October 2019, a large curated dataset containing information about close approach events, in the form of Conjunction Data Messages (CDMs), collected from 2015 to 2019. This dataset was used in the Spacecraft Collision Avoidance Challenge, a machine learning competition where participants had to build models to predict the final collision risk between orbiting objects. This paper describes the design and results of the competition and discusses the challenges and lessons learned when applying machine learning methods to this problem domain.

Neural Networks (NNs) are vulnerable to adversarial examples. Such inputs differ only slightly from their benign counterparts yet provoke misclassifications of the attacked NNs. The required perturbations to craft the examples are often negligible and even human imperceptible. To protect deep learning based system from such attacks, several countermeasures have been proposed with adversarial training still being considered the most effective. Here, NNs are iteratively retrained using adversarial examples forming a computational expensive and time consuming process often leading to a performance decrease. To overcome the downsides of adversarial training while still providing a high level of security, we present a new training approach we call entropic retraining. Based on an information-theoretic analysis, entropic retraining mimics the effects of adversarial training without the need of the laborious generation of adversarial examples. We empirically show that entropic retraining leads to a significant increase in NNs' security and robustness while only relying on the given original data. With our prototype implementation we validate and show the effectiveness of our approach for various NN architectures and data sets.

Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement learning based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.

Process models generated through process mining depict the as-is state of a process. Through annotations with metrics such as the frequency or duration of activities, these models provide generic information to the process analyst. To improve business processes with respect to performance measures, process analysts require further guidance from the process model. In this study, we design Graph Relevance Miner (GRM), a technique based on graph neural networks, to determine the relevance scores for process activities with respect to performance measures. Annotating process models with such relevance scores facilitates a problem-focused analysis of the business process, placing these problems at the centre of the analysis. We quantitatively evaluate the predictive quality of our technique using four datasets from different domains, to demonstrate the faithfulness of the relevance scores. Furthermore, we present the results of a case study, which highlight the utility of the technique for organisations. Our work has important implications both for research and business applications, because process model-based analyses feature shortcomings that need to be urgently addressed to realise successful process mining at an enterprise level.

This article, written in Italian language, explores the contribution given by Bayes' rule and by subjective probability in the work at Bletchley Park towards cracking Enigma cyphered messages during WWII. -- In questo articolo, scritto in Italiano, esploriamo il contributo dato dal teorema di Bayes e dalle idee della probabilit\`a soggettiva nel lavoro compiuto a Bletchley Park che ha portato a decifrare i messaggi cifrati con macchine Enigma durante la Seconda Guerra Mondiale.

In this paper, we study the problem of learning continuous vector representations of knowledge graphs for predicting missing links. We present a new approach called ConEx, which infers missing links by leveraging the composition of a 2D convolution with a Hermitian inner product of complex-valued embedding vectors. We evaluate ConEx against state-of-the-art approaches on the WN18RR, FB15K-237, KINSHIP and UMLS benchmark datasets. Our experimental results show that ConEx achieves a performance superior to that of state-of-the-art approaches such as RotatE, QuatE and TuckER on the link prediction task on all datasets while requiring at least 8 times fewer parameters. We ensure the reproducibility of our results by providing an open-source implementation which includes the training, evaluation scripts along with pre-trained models at https://github.com/conex-kge/ConEx.

In this article, we give visualization and different machine learning technics for two weeks and 40 days ahead forecast based on public data. On July 15, 2020, Senegal reopened its airspace doors, while the number of confirmed cases is still increasing. The population no longer respects hygiene measures, social distancing as at the beginning of the contamination. Negligence or tiredness to always wear the masks? We make forecasting on the inflection point and possible ending time.

Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse; the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representation collapse; the pre-trained models maintain more generalizable representations every time they are fine-tuned.

Uncertainty quantification in neural networks gained a lot of attention in the past years. The most popular approaches, Bayesian neural networks (BNNs), Monte Carlo dropout, and deep ensembles have one thing in common: they are all based on some kind of mixture model. While the BNNs build infinite mixture models and are derived via variational inference, the latter two build finite mixtures trained with the maximum likelihood method. In this work we investigate the effect of training an infinite mixture distribution with the maximum likelihood method instead of variational inference. We find that the proposed objective leads to stochastic networks with an increased predictive variance, which improves uncertainty based identification of miss-classification and robustness against adversarial attacks in comparison to a standard BNN with equivalent network structure. The new model also displays higher entropy on out-of-distribution data.

The space of synthesizable molecules is greater than $10^{60}$, meaning only a vanishingly small fraction of these molecules have ever been realized in the lab. In order to prioritize which regions of this space to explore next, synthetic chemists need access to accurate molecular property predictions. While great advances in molecular machine learning have been made, there is a dearth of benchmarks featuring properties that are useful for the synthetic chemist. Focussing directly on the needs of the synthetic chemist, we introduce the Photoswitch Dataset, a new benchmark for molecular machine learning where improvements in model performance can be immediately observed in the throughput of promising molecules synthesized in the lab. Photoswitches are a versatile class of molecule for medical and renewable energy applications where a molecule's efficacy is governed by its electronic transition wavelengths. We demonstrate superior performance in predicting these wavelengths compared to both time-dependent density functional theory (TD-DFT), the incumbent first principles quantum mechanical approach, as well as a panel of human experts. Our baseline models are currently being deployed in the lab as part of the decision process for candidate synthesis. It is our hope that this benchmark can drive real discoveries in photoswitch chemistry and that future benchmarks can be introduced to pivot learning algorithm development to benefit more expansive areas of synthetic chemistry.

Extracting informative and meaningful temporal segments from high-dimensional wearable sensor data, smart devices, or IoT data is a vital preprocessing step in applications such as Human Activity Recognition (HAR), trajectory prediction, gesture recognition, and lifelogging. In this paper, we propose ESPRESSO (Entropy and ShaPe awaRe timE-Series SegmentatiOn), a hybrid segmentation model for multi-dimensional time-series that is formulated to exploit the entropy and temporal shape properties of time-series. ESPRESSO differs from existing methods that focus upon particular statistical or temporal properties of time-series exclusively. As part of model development, a novel temporal representation of time-series $WCAC$ was introduced along with a greedy search approach that estimate segments based upon the entropy metric. ESPRESSO was shown to offer superior performance to four state-of-the-art methods across seven public datasets of wearable and wear-free sensing. In addition, we undertake a deeper investigation of these datasets to understand how ESPRESSO and its constituent methods perform with respect to different dataset characteristics. Finally, we provide two interesting case-studies to show how applying ESPRESSO can assist in inferring daily activity routines and the emotional state of humans.

SafePILCO is a software tool for safe and data-efficient policy search with reinforcement learning. It extends the known PILCO algorithm, originally written in MATLAB, to support safe learning. We provide a Python implementation and leverage existing libraries that allow the codebase to remain short and modular, which is appropriate for wider use by the verification, reinforcement learning, and control communities.