In many observational studies in social science and medical applications, subjects or individuals are connected, and one unit's treatment and attributes may affect another unit's treatment and outcome, violating the stable unit treatment value assumption (SUTVA) and resulting in interference. To enable feasible inference, many previous works assume the ``exchangeability'' of interfering units, under which the effect of interference is captured by the number or ratio of treated neighbors. However, in many applications with distinctive units, interference is heterogeneous. In this paper, we focus on the partial interference setting, and restrict units to be exchangeable conditional on observable characteristics. Under this framework, we propose generalized augmented inverse propensity weighted (AIPW) estimators for general causal estimands that include direct treatment effects and spillover effects. We show that they are consistent, asymptotically normal, semiparametric efficient, and robust to heterogeneous interference as well as model misspecifications. We also apply our method to the Add Health dataset and find that smoking behavior exhibits interference on academic outcomes.

In this article, we propose the outcome-adjusted balance measure to perform model selection for the generalized propensity score (GPS), which serves as an essential component in estimation of the pairwise average treatment effects (ATEs) in observational studies with more than two treatment levels. The primary goal of the balance measure is to identify the GPS model specification such that the resulting ATE estimator is consistent and efficient. Following recent empirical and theoretical evidence, we establish that the optimal GPS model should only include covariates related to the outcomes. Given a collection of candidate GPS models, the outcome-adjusted balance measure imputes all baseline covariates by matching on each candidate model, and selects the model that minimizes a weighted sum of absolute mean differences between the imputed and original values of the covariates. The weights are defined to leverage the covariate-outcome relationship, so that GPS models without optimal variable selection are penalized. Under appropriate assumptions, we show that the outcome-adjusted balance measure consistently selects the optimal GPS model, so that the resulting GPS matching estimator is asymptotically normal and efficient. We compare its finite sample performance with existing measures in a simulation study. We illustrate an application of the proposed methodology in the analysis of the Tutoring data.

Given a dataset $\mathcal{D}$, we are interested in computing the mean of a subset of $\mathcal{D}$ which matches a predicate. \algname leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze \algname and show that the MSE of the estimate decays at rate $O(N_1^{-1} + N_2^{-1} + N_1^{1/2}N_2^{-3/2})$, where $N=K \cdot N_1+N_2$ for some integer constant $K$ and $K \cdot N_1$ and $N_2$ represent the number of samples used in Stage 1 and Stage 2 of \algname respectively. Hence, if a constant fraction of the total sample budget $N$ is allocated to each stage, we will achieve a mean squared error of $O(N^{-1})$ which matches the rate of mean squared error of the optimal stratified sampling algorithm given a priori knowledge of the predicate positive rate and standard deviation per stratum.

Employing a large dataset (at most, the order of n = 10^6), this study attempts enhance the literature on the comparison between regression and machine learning (ML)-based rent price prediction models by adding new empirical evidence and considering the spatial dependence of the observations. The regression-based approach incorporates the nearest neighbor Gaussian processes (NNGP) model, enabling the application of kriging to large datasets. In contrast, the ML-based approach utilizes typical models: extreme gradient boosting (XGBoost), random forest (RF), and deep neural network (DNN). The out-of-sample prediction accuracy of these models was compared using Japanese apartment rent data, with a varying order of sample sizes (i.e., n = 10^4, 10^5, 10^6). The results showed that, as the sample size increased, XGBoost and RF outperformed NNGP with higher out-of-sample prediction accuracy. XGBoost achieved the highest prediction accuracy for all sample sizes and error measures in both logarithmic and real scales and for all price bands (when n = 10^5 and 10^6). A comparison of several methods to account for the spatial dependence in RF showed that simply adding spatial coordinates to the explanatory variables may be sufficient.

For the nonparametric regression models with covariates contaminated with normal measurement errors, this paper proposes an extrapolation algorithm to estimate the nonparametric regression functions. By applying the conditional expectation directly to the kernel-weighted least squares of the deviations between the local linear approximation and the observed responses, the proposed algorithm successfully bypasses the simulation step needed in the classical simulation extrapolation method, thus significantly reducing the computational time. It is noted that the proposed method also provides an exact form of the extrapolation function, but the extrapolation estimate generally cannot be obtained by simply setting the extrapolation variable to negative one in the fitted extrapolation function if the bandwidth is less than the standard deviation of the measurement error. Large sample properties of the proposed estimation procedure are discussed, as well as simulation studies and a real data example being conducted to illustrate its applications.

We revisit the classical problem of nonparametric density estimation, but impose local differential privacy constraints. Under such constraints, the original data $X_1,\ldots,X_n$, taking values in $\mathbb{R}^d $, cannot be directly observed, and all estimators are functions of the randomised output of a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and in this work we propose to add Laplace distributed noise to a discretisation of the location of a vector $X_i$. Based on these randomised data, we design a novel estimator of the density function, which can be viewed as a privatised version of the well-studied histogram density estimator. Our theoretical results include universal pointwise consistency and strong universal $L_1$-consistency. In addition, a convergence rate over classes of Lipschitz functions is derived, which is complemented by a matching minimax lower bound.

Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE's efficacy through extensive simulations and several real data examples.

We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-{\L}ojasiewicz (PL) assumptions. In particular, we show oracle type bounds which reveal that the generalisation and excess risk of GD is controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the smallest relative norm). While this was known for kernelised interpolants, our proof applies directly to networks trained by GD without intermediate kernelisation. At the same time, by relaxing oracle inequalities developed here we recover existing NTK-based risk bounds in a straightforward way, which demonstrates that our analysis is tighter. Finally, unlike most of the NTK-based analyses we focus on regression with label noise and show that GD with early stopping is consistent.

This paper presents analytical and experimental results on the ranked nodes method (RNM) that is used to construct conditional probability tables for Bayesian networks by expert elicitation. The majority of the results are focused on a setting in which RNM is applied to a child node and parent nodes that all have the same amount discrete ordinal states. The results indicate on RNM properties that can be used to support its future elaboration and development.

While in recent years a number of new statistical approaches have been proposed to model group differences with a different assumption on the nature of the measurement invariance of the instruments, the tools for detecting local specifications of these models have not been fully developed yet. The main type of local misspecification concerning comparability is the non-invariance of indicators (called also differential item functioning; DIF). Such non-invariance could arise from poor translations or significant cultural differences. In this study, we present a novel approach to detect such misspecifications using a Deep Neural Network (DNN). We compared the proposed model with the most popular traditional methods: modification indices (MI) and expected parameters change (EPC) indicators from the confirmatory factor analysis (CFA) modelling, logistic DIF detection, and sequential procedure introduced with the CFA alignment approach. Simulation studies show that proposed method outperformed traditional methods in almost all scenarios, or it was at least as accurate as the best one. We also provide an empirical example utilizing European Social Survey (ESS) data including items known to be miss-translated, which are correctly identified by our approach and DIF detection based on logistic regression. This study provides a strong foundation for the future development of machine learning algorithms for detection of statistical model misspecifications.

A plug-in algorithm to estimate Bayes Optimal Classifiers for fairness-aware binary classification has been proposed in (Menon & Williamson, 2018). However, the statistical efficacy of their approach has not been established. We prove that the plug-in algorithm is statistically consistent. We also derive finite sample guarantees associated with learning the Bayes Optimal Classifiers via the plug-in algorithm. Finally, we propose a protocol that modifies the plug-in approach, so as to simultaneously guarantee fairness and differential privacy with respect to a binary feature deemed sensitive.

Gaussian processes (GPs) are a well-known nonparametric Bayesian inference technique, but they suffer from scalability problems for large sample sizes, and their performance can degrade for non-stationary or spatially heterogeneous data. In this work, we seek to overcome these issues through (i) employing variational free energy approximations of GPs operating in tandem with online expectation propagation steps; and (ii) introducing a local splitting step which instantiates a new GP whenever the posterior distribution changes significantly as quantified by the Wasserstein metric over posterior distributions. Over time, then, this yields an ensemble of sparse GPs which may be updated incrementally, and adapts to locality, heterogeneity, and non-stationarity in training data.

Estimating the contact angle of a virus infected saliva droplet is seen to be an important area of research as it presents an idea about the drying time of the respective droplet and in turn of the growth of the underlying pandemic. In this paper we extend the data presented by Balusamy, Banerjee and Sahu [Lifetime of sessile saliva droplets in the context of SARS-CoV-2, Int. J. Heat Mass Transf. 123, 105178 (2021)], where the contact angles are fitted using a newly proposed half-circular wrapped-exponential model, and a sequential confidence interval estimation approach is established which largely reduces both time and cost with regards to data collection.

In cancer trials, the analysis of longitudinal toxicity data is a difficult task due to the presence of multiple adverse events with different extents of toxic burden. Models able to summarise and quantify the overall toxic risk and its evolution during cancer therapy, which deals with both the longitudinal and the categorical aspects of toxicity levels progression, are necessary, but still not well developed. In this work, a novel approach based on Latent Markov (LM) models for longitudinal toxicity data is proposed. The latent status of interest is the Latent Overall Toxicity (LOTox) condition of each patient, which affects the distribution of the observed toxic levels over treatment. By assuming the existence of a LM chain for LOTox dynamics, the new approach aims at identifying different latent states of overall toxicity burden (LOTox states). This approach investigates how patients move between latent states during chemotherapy treatment allowing for the reconstruction of personalized longitudinal LOTox profiles. This methodology has never been applied to osteosarcoma treatment and provides new insights for medical decisions in childhood cancer therapy.

Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates--while accounting for this structured dependence--remains a challenge. We introduce a Bayesian decision analysis for subset selection with LMMs. Using a Mahalanobis loss function that incorporates the structured dependence, we derive optimal linear actions for any subset of covariates and under any Bayesian LMM. Crucially, these actions inherit shrinkage or regularization and uncertainty quantification from the underlying Bayesian LMM. Rather than selecting a single "best" subset, which is often unstable and limited in its information content, we collect the acceptable family of subsets that nearly match the predictive ability of the "best" subset. The acceptable family is summarized by its smallest member and key variable importance metrics. Customized subset search and out-of-sample approximation algorithms are provided for more scalable computing. These tools are applied to simulated data and a longitudinal physical activity dataset, and in both cases demonstrate excellent prediction, estimation, and selection ability.

Despite several national and local policies towards cleaner air in England, many schools in London breach the WHO-recommended concentrations of air pollutants such as NO2 and PM2.5. This is while, previous studies highlight significant adverse health effects of air pollutants on children's health. In this paper we adopted a Bayesian spatial hierarchical model to investigate factors that affect the odds of schools exceeding the WHO-recommended concentration of NO2 (i.e., 40 ug/m3 annual mean) in Greater London (UK). We considered a host of variables including schools' characteristics as well as their neighbourhoods' attributes from household, socioeconomic, transport-related, land use, built and natural environment characteristics perspectives. The results indicated that transport-related factors including the number of traffic lights and bus stops in the immediate vicinity of schools, and borough-level bus fuel consumption are determinant factors that increase the likelihood of non-compliance with the WHO guideline. In contrast, distance from roads, river transport, and underground stations, vehicle speed (an indicator of traffic congestion), the proportion of borough-level green space, and the area of green space at schools reduce the likelihood of exceeding the WHO recommended concentration of NO2. As a sensitivity analysis, we repeated our analysis under a hypothetical scenario in which the recommended concentration of NO2 is 35 ug/m3, instead of 40 ug/m3. Our results underscore the importance of adopting clean fuel technologies on buses, installing green barriers, and reducing motorised traffic around schools in reducing exposure to NO2 concentrations in proximity to schools. This study would be useful for local authority decision making with the aim of improving air quality for school-aged children in urban settings.

Partially linear additive models generalize the linear models since they model the relation between a response variable and covariates by assuming that some covariates are supposed to have a linear relation with the response but each of the others enter with unknown univariate smooth functions. The harmful effect of outliers either in the residuals or in the covariates involved in the linear component has been described in the situation of partially linear models, that is, when only one nonparametric component is involved in the model. When dealing with additive components, the problem of providing reliable estimators when atypical data arise, is of practical importance motivating the need of robust procedures. Hence, we propose a family of robust estimators for partially linear additive models by combining $B-$splines with robust linear regression estimators. We obtain consistency results, rates of convergence and asymptotic normality for the linear components, under mild assumptions. A Monte Carlo study is carried out to compare the performance of the robust proposal with its classical counterpart under different models and contamination schemes. The numerical experiments show the advantage of the proposed methodology for finite samples. We also illustrate the usefulness of the proposed approach on a real data set.

Motivated by the poor performance of cross-validation in settings where data are scarce, we propose a novel estimator of the out-of-sample performance of a policy in data-driven optimization.Our approach exploits the optimization problem's sensitivity analysis to estimate the gradient of the optimal objective value with respect to the amount of noise in the data and uses the estimated gradient to debias the policy's in-sample performance. Unlike cross-validation techniques, our approach avoids sacrificing data for a test set, utilizes all data when training and, hence, is well-suited to settings where data are scarce. We prove bounds on the bias and variance of our estimator for optimization problems with uncertain linear objectives but known, potentially non-convex, feasible regions. For more specialized optimization problems where the feasible region is ``weakly-coupled" in a certain sense, we prove stronger results. Specifically, we provide explicit high-probability bounds on the error of our estimator that hold uniformly over a policy class and depends on the problem's dimension and policy class's complexity. Our bounds show that under mild conditions, the error of our estimator vanishes as the dimension of the optimization problem grows, even if the amount of available data remains small and constant. Said differently, we prove our estimator performs well in the small-data, large-scale regime. Finally, we numerically compare our proposed method to state-of-the-art approaches through a case-study on dispatching emergency medical response services using real data. Our method provides more accurate estimates of out-of-sample performance and learns better-performing policies.

In this paper we perform robustness and sensitivity analysis of several continuous-time rough Volterra stochastic volatility models with respect to the process of market calibration. Robustness is understood in the sense of sensitivity to changes in the option data structure. The latter analyses then should validate the hypothesis on importance of the roughness in the volatility process dynamics. Empirical study is performed on a data set of Apple Inc. equity options traded in four different days in April and May 2015. In particular, the results for RFSV, rBergomi and aRFSV models are provided.

This paper makes the following econometric contributions. First, we develop a unifying framework for testing shape restrictions based on the Wald principle. Second, we examine the applicability and usefulness of some prominent shape enforcing operators in implementing our test, including rearrangement and the greatest convex minorization (or the least concave majorization). In particular, the influential rearrangement operator is inapplicable due to a lack of convexity, while the greatest convex minorization is shown to enjoy the analytic properties required to employ our framework. The importance of convexity in establishing size control has been noted elsewhere in the literature. Third, we show that, despite that the projection operator may not be well-defined/behaved in general non-Hilbert parameter spaces (e.g., ones defined by uniform norms), one may nonetheless devise a powerful distance-based test by applying our framework. The finite sample performance of our test is evaluated through Monte Carlo simulations, and its empirical relevance is showcased by investigating the relationship between weekly working hours and the annual wage growth in the high-end labor market.

This is a tutorial and survey paper on Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), and Deep Belief Network (DBN). We start with the required background on probabilistic graphical models, Markov random field, Gibbs sampling, statistical physics, Ising model, and the Hopfield network. Then, we introduce the structures of BM and RBM. The conditional distributions of visible and hidden variables, Gibbs sampling in RBM for generating variables, training BM and RBM by maximum likelihood estimation, and contrastive divergence are explained. Then, we discuss different possible discrete and continuous distributions for the variables. We introduce conditional RBM and how it is trained. Finally, we explain deep belief network as a stack of RBM models. This paper on Boltzmann machines can be useful in various fields including data science, statistics, neural computation, and statistical physics.

Bi-Directional Grid Constrained (BGC) stochastic processes (BGCSPs) constrain the random movement toward the origin steadily more and more, the further they deviate from the origin, rather than all at once imposing reflective barriers, as does the well-established theory of It^o diffusions with such reflective barriers. We identify that BGCSPs are a variant rather than a special case of the multi-skew Brownian motion (M-SBM). This is because they have their own complexities, such as the barriers being hidden (not known in advance) and not necessarily constant over time. We provide an M-SBM theoretical framework and also a simulation framework to elaborate deeper properties of BGCSPs. The simulation framework is then applied by generating numerous simulations of the constrained paths and the results are analysed. BGCSPs have applications in finance and indeed many other fields requiring graduated constraining, from both above and below the initial position.

The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data -- generalization. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization -- seeing highly similar training examples -- and how much are they capable of human-intelligence styled reasoning -- identifying abstract rules underlying the data? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allow the development of nuanced tests of generalization, by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.

Organizations such as government departments and financial institutions provide online service facilities accessible via an increasing number of internet connected devices which make their operational environment vulnerable to cyber attacks. Consequently, there is a need to have mechanisms in place to detect cyber security attacks in a timely manner. A variety of Network Intrusion Detection Systems (NIDS) have been proposed and can be categorized into signature-based NIDS and anomaly-based NIDS. The signature-based NIDS, which identify the misuse through scanning the activity signature against the list of known attack activities, are criticized for their inability to identify new attacks (never-before-seen attacks). Among anomaly-based NIDS, which declare a connection anomalous if it expresses deviation from a trained model, the unsupervised learning algorithms circumvent this issue since they have the ability to identify new attacks. In this study, we use an unsupervised learning algorithm based on principal component analysis to detect cyber attacks. In the training phase, our approach has the advantage of also identifying outliers in the training dataset. In the monitoring phase, our approach first identifies the affected dimensions and then calculates an anomaly score by aggregating across only those components that are affected by the anomalies. We explore the performance of the algorithm via simulations and through two applications, namely to the UNSW-NB15 dataset recently released by the Australian Centre for Cyber Security and to the well-known KDD'99 dataset. The algorithm is scalable to large datasets in both training and monitoring phases, and the results from both the simulated and real datasets show that the method has promise in detecting suspicious network activities.

Empirically it has been observed that the performance of deep neural networks steadily improves as we increase model size, contradicting the classical view on overfitting and generalization. Recently, the double descent phenomena has been proposed to reconcile this observation with theory, suggesting that the test error has a second descent when the model becomes sufficiently overparameterized, as the model size itself acts as an implicit regularizer. In this paper we add to the growing body of work in this space, providing a careful study of learning dynamics as a function of model size for the least squares scenario. We show an excess risk bound for the gradient descent solution of the least squares objective. The bound depends on the smallest non-zero eigenvalue of the covariance matrix of the input features, via a functional form that has the double descent behavior. This gives a new perspective on the double descent curves reported in the literature. Our analysis of the excess risk allows to decouple the effect of optimization and generalization error. In particular, we find that in case of noiseless regression, double descent is explained solely by optimization-related quantities, which was missed in studies focusing on the Moore-Penrose pseudoinverse solution. We believe that our derivation provides an alternative view compared to existing work, shedding some light on a possible cause of this phenomena, at least in the considered least squares setting. We empirically explore if our predictions hold for neural networks, in particular whether the covariance of intermediary hidden activations has a similar behavior as the one predicted by our derivations.

Stochastic approximation (SA) algorithms are widely used in system optimization problems when only noisy measurements of the system are available. This paper studies two types of SA algorithms in a multivariate Kiefer-Wolfowitz setting: random-direction SA (RDSA) and simultaneous-perturbation SA (SPSA), and then describes the bias term, convergence, and asymptotic normality of RDSA algorithms. The gradient estimations in RDSA and SPSA have different forms and, consequently, use different types of random perturbations. This paper looks at various valid distributions for perturbations in RDSA and SPSA and then compares the two algorithms using mean-square errors computed from asymptotic distribution. From both a theoretical and numerical point of view, we find that SPSA generally outperforms RDSA.

Survival analysis, or time-to-event modelling, is a classical statistical problem that has garnered a lot of interest for its practical use in epidemiology, demographics or actuarial sciences. Recent advances on the subject from the point of view of machine learning have been concerned with precise per-individual predictions instead of population studies, driven by the rise of individualized medicine. We introduce here a conditional normalizing flow based estimate of the time-to-event density as a way to model highly flexible and individualized conditional survival distributions. We use a novel hierarchical formulation of normalizing flows to enable efficient fitting of flexible conditional distributions without overfitting and show how the normalizing flow formulation can be efficiently adapted to the censored setting. We experimentally validate the proposed approach on a synthetic dataset as well as four open medical datasets and an example of a common financial problem.

Validating the safety of autonomous systems generally requires the use of high-fidelity simulators that adequately capture the variability of real-world scenarios. However, it is generally not feasible to exhaustively search the space of simulation scenarios for failures. Adaptive stress testing (AST) is a method that uses reinforcement learning to find the most likely failure of a system. AST with a deep reinforcement learning solver has been shown to be effective in finding failures across a range of different systems. This approach generally involves running many simulations, which can be very expensive when using a high-fidelity simulator. To improve efficiency, we present a method that first finds failures in a low-fidelity simulator. It then uses the backward algorithm, which trains a deep neural network policy using a single expert demonstration, to adapt the low-fidelity failures to high-fidelity. We have created a series of autonomous vehicle validation case studies that represent some of the ways low-fidelity and high-fidelity simulators can differ, such as time discretization. We demonstrate in a variety of case studies that this new AST approach is able to find failures with significantly fewer high-fidelity simulation steps than are needed when just running AST directly in high-fidelity. As a proof of concept, we also demonstrate AST on NVIDIA's DriveSim simulator, an industry state-of-the-art high-fidelity simulator for finding failures in autonomous vehicles.

State-of-the-art neural network architectures continue to scale in size and deliver impressive generalization results, although this comes at the expense of limited interpretability. In particular, a key challenge is to determine when to stop training the model, as this has a significant impact on generalization. Convolutional neural networks (ConvNets) comprise high-dimensional feature spaces formed by the aggregation of multiple channels, where analyzing intermediate data representations and the model's evolution can be challenging owing to the curse of dimensionality. We present channel-wise DeepNNK (CW-DeepNNK), a novel channel-wise generalization estimate based on non-negative kernel regression (NNK) graphs with which we perform local polytope interpolation on low-dimensional channels. This method leads to instance-based interpretability of both the learned data representations and the relationship between channels. Motivated by our observations, we use CW-DeepNNK to propose a novel early stopping criterion that (i) does not require a validation set, (ii) is based on a task performance metric, and (iii) allows stopping to be reached at different points for each channel. Our experiments demonstrate that our proposed method has advantages as compared to the standard criterion based on validation set performance.