New articles on Statistics

[1] 2306.03957

survextrap: a package for flexible and transparent survival extrapolation

Health policy decisions are often informed by estimates of long-term survival based primarily on short-term data. A range of methods are available to include longer-term information, but there has previously been no comprehensive and accessible tool for implementing these. This paper introduces a novel model and software package for parametric survival modelling of individual-level, right-censored data, optionally combined with summary survival data on one or more time periods. It could be used to estimate long-term survival based on short-term data from a clinical trial, combined with longer-term disease registry or population data, or elicited judgements. All data sources are represented jointly in a Bayesian model. The hazard is modelled as an M-spline function, which can represent potential changes in the hazard trajectory at any time. Through Bayesian estimation, the model automatically adapts to fit the available data, and acknowledges uncertainty where the data are weak. Hence long-term estimates are confident only where strong long-term data are available, and inferences do not rely on extrapolating parametric functions learned from short-term data. The effects of treatment or other explanatory variables can be estimated through proportional hazards or with a flexible non-proportional hazards model. Some commonly-used mechanisms for survival can also be assumed: cure models, additive hazards models with known background mortality, and models where the effect of a treatment wanes over time. All of these features are provided for the first time in an R package, $\texttt{survextrap}$, in which models can be fitted using standard R survival modelling syntax. This paper explains the model, and demonstrates the use of the package to fit a range of models to common forms of survival data used in health technology assessments.
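The survextrap package itself is in R, but the core construction — a hazard built from a non-negative spline basis, with survival obtained by integrating the hazard — can be sketched in Python. The knots and weights below are illustrative assumptions, not values from the paper; non-negative B-spline coefficients stand in for the M-spline parameterisation.

```python
import numpy as np
from scipy.interpolate import BSpline

# Hypothetical spline hazard on [0, 10]: clamped cubic knots and assumed
# non-negative weights, so the hazard h(t) >= 0 everywhere on the domain.
knots = np.array([0, 0, 0, 0, 2, 5, 10, 10, 10, 10], dtype=float)
coefs = np.array([0.20, 0.10, 0.05, 0.08, 0.15, 0.30])
hazard = BSpline(knots, coefs, k=3, extrapolate=False)

grid = np.linspace(0, 10, 2001)
h = np.nan_to_num(hazard(grid))     # hazard values (nan outside domain -> 0)

# Cumulative hazard by trapezoidal integration, then S(t) = exp(-H(t)).
H = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(grid))])
S = np.exp(-H)
```

Because the weights are non-negative the survival curve starts at 1 and is non-increasing, and moving the interior knots changes where the hazard trajectory is allowed to bend.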

[2] 2306.03968

Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels

Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations makes it possible to optimize such hyperparameters just like standard neural network parameters, using gradients computed on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow estimation accuracy to be traded off against computational complexity. We derive them using the function-space form of the linearized Laplace approximation, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.

[3] 2306.03981

Developing an Index of National Research Capacity

Public managers lack feedback on the effectiveness of public investments, policies, and programs instituted to build and use research capacity. Numerous reports rank countries on global performance on innovation and competitiveness, but these highly globalized data do not distinguish country contributions from global ones. We suggest improving upon global reports by removing globalized measures and combining a reliable set of national indicators into an index. We factor analyze 14 variables for 172 countries from 2013 to 2021. Two factors emerge, one for raw or core research capacity and the other indicating the wider context of governance. Analysis shows convergent validity within the two factors and divergent validity between them. Nations rank differently between capacity, governance context, and the product of the two. Ranks also vary as a function of the chosen aggregation method. Finally, as a test of the predictive validity of the capacity index, we implement a regression analysis predicting national citation strength. Policymakers and analysts may find stronger feedback from this approach to quantifying national research strength.

[4] 2306.04003

Enhancing detection of labor violations in the agricultural sector: A multilevel generalized linear regression model of H-2A violation counts

Agricultural workers are essential to the supply chain for our daily food, and yet many face harmful work conditions, including garnished wages and other labor violations. Workers on H-2A visas are particularly vulnerable due to the precarity of their immigration status being tied to their employer. Although worksite inspections are one mechanism to detect such violations, many labor violations affecting agricultural workers go undetected due to limited inspection resources. In this study, we identify multiple state- and industry-level factors that correlate with H-2A violations identified by the U.S. Department of Labor Wage and Hour Division, using a multilevel zero-inflated negative binomial model. We find that three state-level factors (average farm acreage size, the number of agricultural establishments with fewer than 20 employees, and higher poverty rates) are correlated with H-2A violations. These findings provide guidance for inspection agencies regarding how to prioritize their limited resources to more effectively inspect agricultural workplaces, thereby improving workplace conditions for H-2A workers.

[5] 2306.04027

Intervention Generalization: A View from Factor Graph Models

One of the goals of causal inference is to generalize from past experiments and observational data to novel conditions. While it is in principle possible to eventually learn a mapping from a novel experimental condition to an outcome of interest, provided a sufficient variety of experiments is available in the training data, coping with a large combinatorial space of possible interventions is hard. Under a typical sparse experimental design, this mapping is ill-posed without relying on heavy regularization or prior distributions. Such assumptions may or may not be reliable, and can be hard to defend or test. In this paper, we take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system, communicated in the well-understood language of factor graph models. A postulated $\textit{interventional factor model}$ (IFM) may not always be informative, but it conveniently abstracts away the need to explicitly model unmeasured confounding and feedback mechanisms, leading to directly testable claims. We derive necessary and sufficient conditions for causal effect identifiability with IFMs using data from a collection of experimental settings, and implement practical algorithms for generalizing expected outcomes to novel conditions never observed in the data.

[6] 2306.04084

Modelling the discretization error of initial value problems using the Wishart distribution

This paper presents a new discretization error quantification method for the numerical integration of ordinary differential equations. The error is modelled by using the Wishart distribution, which enables us to capture the correlation between variables. Error quantification is achieved by solving an optimization problem under the order constraints for the covariance matrices. An algorithm for the optimization problem is also established in a slightly broader context.

[7] 2306.04092

A note on the optimum allocation of resources to follow up unit nonrespondents in probability surveys

Common practice to address nonresponse in probability surveys in National Statistical Offices is to follow up every nonrespondent with a view to lifting response rates. As response rate is an insufficient indicator of data quality, it is argued that one should follow up nonrespondents with a view to reducing the mean squared error (MSE) of the estimator of the variable of interest. In this paper, we propose a method to allocate the nonresponse follow-up resources in such a way as to minimise the MSE under a quasi-randomisation framework. An example to illustrate the method using the 2018/19 Rural Environment and Agricultural Commodities Survey from the Australian Bureau of Statistics is provided.

[8] 2306.04093

Subnetwork Estimation for Spatial Autoregressive Models in Large-scale Networks

Large-scale networks are commonly encountered in practice (e.g., Facebook and Twitter) by researchers. In order to study the network interaction between different nodes of large-scale networks, the spatial autoregressive (SAR) model has been popularly employed. Despite its popularity, the estimation of a SAR model on large-scale networks remains very challenging. On the one hand, due to policy limitations or high collection costs, it is often impossible for independent researchers to observe or collect all network information. On the other hand, even if the entire network is accessible, estimating the SAR model using the quasi-maximum likelihood estimator (QMLE) could be computationally infeasible due to its high computational cost. To address these challenges, we propose here a subnetwork estimation method based on QMLE for the SAR model. By using appropriate sampling methods, a subnetwork, consisting of a much-reduced number of nodes, can be constructed. Subsequently, the standard QMLE can be computed by treating the sampled subnetwork as if it were the entire network. This leads to a significant reduction in information collection and model computation costs, which increases the practical feasibility of the effort. Theoretically, we show that the subnetwork-based QMLE is consistent and asymptotically normal under appropriate regularity conditions. Extensive simulation studies, based on both simulated and real network structures, are presented.

[9] 2306.04119

Improving Survey Inference in Two-phase Designs Using Bayesian Machine Learning

The two-phase sampling design is a cost-effective sampling strategy that has been widely used in public health research. The conventional approach in this design is to create subsample specific weights that adjust for probability of selection and response in the second phase. However, these weights can be highly variable which in turn results in unstable weighted analyses. Alternatively, we can use the rich data collected in the first phase of the study to improve the survey inference of the second phase sample. In this paper, we use a Bayesian tree-based multiple imputation (MI) approach for estimating population means using a two-phase survey design. We demonstrate how to incorporate complex survey design features, such as strata, clusters, and weights, into the imputation procedure. We use a simulation study to evaluate the performance of the tree-based MI approach in comparison to the alternative weighted analyses using the subsample weights. We find the tree-based MI method outperforms weighting methods with smaller bias, reduced root mean squared error, and narrower 95\% confidence intervals that have closer to the nominal level coverage rate. We illustrate the application of the proposed method by estimating the prevalence of diabetes among the United States non-institutionalized adult population using the fasting blood glucose data collected only on a subsample of participants in the 2017-2018 National Health and Nutrition Examination Survey.

[10] 2306.04126

Bootstrap Prediction Inference of Non-linear Autoregressive Models

The non-linear autoregressive (NLAR) model plays an important role in modeling and predicting time series. One-step ahead prediction is straightforward using the NLAR model, but the multi-step ahead prediction is cumbersome. For instance, iterating the one-step ahead predictor is a convenient strategy for linear autoregressive (LAR) models, but it is suboptimal under NLAR. In this paper, we first propose a simulation and/or bootstrap algorithm to construct optimal point predictors under an $L_1$ or $L_2$ loss criterion. In addition, we construct bootstrap prediction intervals in the multi-step ahead prediction problem; in particular, we develop an asymptotically valid quantile prediction interval as well as a pertinent prediction interval for future values. In order to correct the undercoverage of prediction intervals with finite samples, we further employ predictive -- as opposed to fitted -- residuals in the bootstrap process. Simulation studies are also given to substantiate the finite sample performance of our methods.
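The gap between the iterated one-step predictor and the simulation/bootstrap predictor is easy to see in a toy NLAR(1) model. Everything below — the autoregression function f, the residual distribution, the horizon — is an illustrative assumption rather than the paper's full algorithm; in practice the residuals would come from a fitted (or predictive-residual) model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed NLAR(1) model: x_{t+1} = f(x_t) + e, with f nonlinear.
f = lambda x: np.exp(-x ** 2)
residuals = rng.normal(0.0, 0.5, size=500)   # stand-in for fitted residuals

x_last = 0.8                                 # last observed value

# Naive two-step predictor: iterate the one-step predictor
# (this would be optimal only if f were linear).
naive = f(f(x_last))

# Bootstrap two-step predictors: resample residuals, propagate them through f,
# and take the mean (L2-optimal) or median (L1-optimal) of the simulated paths.
e_star = rng.choice(residuals, size=5000, replace=True)
paths = f(f(x_last) + e_star)
l2_pred = paths.mean()
l1_pred = np.median(paths)
```

For this f the naive and L2-optimal predictors differ noticeably, which is exactly why iterating the one-step predictor is suboptimal under NLAR.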

[11] 2306.04138

Weighted trajectory analysis and application to clinical outcome assessment

The Kaplan-Meier estimator (KM) is widely used in medical research to estimate the survival function from lifetime data. KM is a powerful tool to evaluate clinical trials due to simple computational requirements, a logrank hypothesis test, and the ability to censor patients. However, KM has several constraints and fails to generalize to ordinal variables of interest such as toxicity and ECOG performance. We devised Weighted Trajectory Analysis (WTA) to combine the advantages of KM with the ability to compare treatment groups for ordinal variables and fluctuating outcomes. To assess statistical significance, we developed a new hypothesis test analogous to the logrank test. We demonstrate the functionality of WTA through 1000-fold clinical trial simulations of unique stochastic models of chemotherapy toxicity and schizophrenia progression. At several increments of sample size and hazard ratio, we compare the performance of WTA to both KM and Generalized Estimating Equations (GEE). WTA generally required half the sample size to achieve comparable power to KM; advantages over GEE include its robust non-parametric approach and summary plot. We also apply WTA to real clinical data: the toxicity outcomes of melanoma patients receiving immunotherapy and the disease progression of patients with metastatic breast cancer receiving ramucirumab. The application of WTA demonstrates that using traditional methods such as percent incidence and KM can lead to both Type I and II errors by failing to model illness trajectory. This article outlines a novel method for clinical outcome assessment that extends the advantages of Kaplan-Meier estimates to ordinal outcome variables.

[12] 2306.04168

Goodness of fit tests for the pseudo-Poisson distribution

Bivariate count models in which one marginal and the other set of conditionals are of Poisson form are called pseudo-Poisson distributions. Such models have simple, flexible dependence structures, admit fast computation algorithms and generate a sufficiently large number of parametric families. It has been strongly argued that the pseudo-Poisson model should be the first choice when modelling bivariate over-dispersed data with positive correlation in which one of the marginals is equi-dispersed. Yet, before fitting, it is necessary to test whether the given data are compatible with the assumed pseudo-Poisson model. Hence, in the present note we derive and propose goodness-of-fit tests for the bivariate pseudo-Poisson distribution. We emphasize two tests: a lesser-known test based on the supremum of the absolute difference between the estimated probability generating function and its empirical counterpart, and a new test based on the difference between the estimated bivariate Fisher dispersion index and its empirical counterpart. We also consider applying bivariate tests that depend on the generating function (such as the Kocherlakota and Kocherlakota and the Muñoz and Gamero tests) and univariate goodness-of-fit tests (such as the chi-square test) to pseudo-Poisson data. For each of the tests considered we analyse finite-sample, large-sample and asymptotic properties. We compare the power of the suggested tests (with the bivariate classical Poisson and the bivariate Com-Max Poisson as alternatives) and include examples of application to real-life data. Finally, we are developing an R package which includes tests for the compatibility of data with the bivariate pseudo-Poisson model.

[13] 2306.04182

Transfer Learning for General M-estimators with Decomposable Regularizers in High-dimensions

To incorporate useful information from related statistical tasks into the target task, we propose a two-step transfer learning algorithm in the general M-estimation framework with decomposable regularizers in high dimensions. When the informative sources are known in the oracle sense, in the first step, we acquire knowledge of the target parameter by pooling the useful source datasets and the target one. In the second step, the primal estimator is fine-tuned using the target dataset. In contrast to the existing literature, which imposes homogeneity conditions on the Hessian matrices of the various population loss functions, our theoretical analysis shows that even if the Hessian matrices are heterogeneous, the pooled estimators still provide adequate information by slightly enlarging the regularization, and numerical studies further validate this assertion. Sparse regression and low-rank trace regression for both linear and generalized linear cases are discussed as two specific examples under the M-estimation framework. When the informative source datasets are unknown, a novel truncated-penalized algorithm is proposed to acquire the primal estimator, and its oracle property is proved. Extensive numerical experiments are conducted to support the theoretical arguments.
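A minimal numerical sketch of the two-step idea, using ridge regression (an M-estimator with a decomposable $\ell_2$ regularizer) and an assumed informative source that shares the target coefficients. The sample sizes, penalties, and data-generating process are all illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_src, n_tgt = 50, 500, 20
w_true = np.zeros(p)
w_true[:5] = 1.0                      # assumed sparse target parameter

def ridge(X, y, lam):
    # Closed-form ridge estimator: a stand-in for a penalized M-estimator.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Informative source shares the target coefficients (assumed, for illustration).
X_src = rng.normal(size=(n_src, p)); y_src = X_src @ w_true + rng.normal(0, 0.5, n_src)
X_tgt = rng.normal(size=(n_tgt, p)); y_tgt = X_tgt @ w_true + rng.normal(0, 0.5, n_tgt)

# Step 1: pool source and target to obtain the primal estimator.
X_pool = np.vstack([X_src, X_tgt]); y_pool = np.concatenate([y_src, y_tgt])
w_pool = ridge(X_pool, y_pool, lam=1.0)

# Step 2: fine-tune on the target by fitting a penalized correction.
delta = ridge(X_tgt, y_tgt - X_tgt @ w_pool, lam=10.0)
w_transfer = w_pool + delta

w_tgt_only = ridge(X_tgt, y_tgt, lam=1.0)
err_transfer = np.linalg.norm(w_transfer - w_true)
err_tgt_only = np.linalg.norm(w_tgt_only - w_true)
```

With only 20 target samples in 50 dimensions, the target-only fit is poor, while the pooled-then-fine-tuned estimator stays close to the truth.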

[14] 2306.04198

Visualizing the Wavenumber Content of a Point Pattern

Spatial point patterns are a commonly recorded form of data in ecology, medicine, astronomy, criminology, epidemiology and many other application fields. One way to understand their second order dependence structure is via their spectral density function. However, unlike time series analysis, for point processes such approaches are currently underutilized. In part, this is because the interpretation of the spectral representation of point patterns is challenging. In this paper, we demonstrate how to band-pass filter point patterns, thus enabling us to explore the spectral representation of point patterns in space by isolating the signal corresponding to certain sets of wavenumbers.
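Band-pass filtering a point pattern can be sketched by binning the points onto a grid, taking a 2-D FFT, and zeroing all Fourier coefficients whose wavenumber modulus lies outside a chosen band. The grid size and pass band below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# A hypothetical planar point pattern on the unit square, binned to a grid.
pts = rng.uniform(0, 1, size=(400, 2))
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=64, range=[[0, 1], [0, 1]])

F = np.fft.fft2(counts - counts.mean())           # remove DC so only fluctuations remain
kx = np.fft.fftfreq(64); ky = np.fft.fftfreq(64)
K = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)  # wavenumber modulus on the grid

band = (K >= 0.05) & (K <= 0.20)                  # assumed pass band
filtered = np.fft.ifft2(np.where(band, F, 0.0)).real
```

The filtered field isolates the spatial fluctuations carried by the selected set of wavenumbers; sweeping the band over different ranges is one way to explore the spectral content of the pattern.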

[15] 2306.04223

Causally Learning an Optimal Rework Policy

In manufacturing, rework refers to an optional step of a production process which aims to eliminate errors or remedy products that do not meet the desired quality standards. Reworking a production lot involves repeating a previous production stage with adjustments to ensure that the final product meets the required specifications. While offering the chance to improve the yield and thus increase the revenue of a production lot, a rework step also incurs additional costs. Additionally, the rework of parts that already meet the target specifications may damage them and decrease the yield. In this paper, we apply double/debiased machine learning (DML) to estimate the conditional treatment effect of a rework step during the color conversion process in opto-electronic semiconductor manufacturing on the final product yield. We utilize the implementation DoubleML to develop policies for the rework of components and estimate their value empirically. From our causal machine learning analysis we derive implications for the coating of monochromatic LEDs with conversion layers.

[16] 2306.04232

Non-minimaxity of debiased shrinkage estimators

We consider the estimation of the $p$-variate normal mean of $X\sim N_p(\theta,I)$ under the quadratic loss function. We investigate the decision-theoretic properties of the debiased shrinkage estimator, the estimator which shrinks towards the origin for smaller $\|x\|^2$ and which is exactly equal to the unbiased estimator $X$ for larger $\|x\|^2$. Such a debiased shrinkage estimator appears superior to the unbiased estimator $X$, which would imply minimaxity. However, we show that it is not minimax under mild conditions.
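One concrete (hypothetical) estimator of this form shrinks linearly towards the origin for small $\|x\|^2$ and coincides with $X$ once $\|x\|^2$ exceeds a threshold; the abstract's point is that such estimators, despite appearing superior, fail to be minimax. The thresholds a and b below are arbitrary choices for illustration.

```python
import numpy as np

def debiased_shrinkage(x, a=2.0, b=6.0):
    """Hypothetical debiased shrinkage estimator: full shrinkage for
    ||x||^2 <= a, the unbiased estimator x for ||x||^2 >= b, and a linear
    interpolation in between."""
    s = float(np.dot(x, x))
    w = min(max((s - a) / (b - a), 0.0), 1.0)   # 0 = shrink to origin, 1 = unbiased
    return w * x

x_small = np.array([0.5, 0.5, 0.5])   # ||x||^2 = 0.75 <= a: estimate is the origin
x_large = np.array([2.0, 2.0, 2.0])   # ||x||^2 = 12 >= b: estimate equals x
```

The debiasing region (||x||^2 >= b) is what distinguishes this from classical James-Stein-type estimators, which shrink everywhere.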

[17] 2306.04254

funBIalign: a hierarchical algorithm for functional motif discovery based on mean squared residue scores

Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical "shapes" or "patterns" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies; one on food price inflation and one on temperature changes.

[18] 2306.04255

Accounting For Informative Sampling When Learning to Forecast Treatment Outcomes Over Time

Machine learning (ML) holds great potential for accurately forecasting treatment outcomes over time, which could ultimately enable the adoption of more individualized treatment strategies in many practical applications. However, a significant challenge that has been largely overlooked by the ML literature on this topic is the presence of informative sampling in observational data. When instances are observed irregularly over time, sampling times are typically not random, but rather informative -- depending on the instance's characteristics, past outcomes, and administered treatments. In this work, we formalize informative sampling as a covariate shift problem and show that it can prohibit accurate estimation of treatment outcomes if not properly accounted for. To overcome this challenge, we present a general framework for learning treatment outcomes in the presence of informative sampling using inverse intensity-weighting, and propose a novel method, TESAR-CDE, that instantiates this framework using Neural CDEs. Using a simulation environment based on a clinical use case, we demonstrate the effectiveness of our approach in learning under informative sampling.

[19] 2306.04315

Inferring unknown unknowns: Regularized bias-aware ensemble Kalman filter

Because of physical assumptions and numerical approximations, reduced-order models are affected by uncertainties in the state and parameters, and by model biases. Model biases, also known as model errors or systematic errors, are difficult to infer because they are 'unknown unknowns', i.e., we do not necessarily know their functional form a priori. With biased models, data assimilation methods may be ill-posed because they are either (i) 'bias-unaware', i.e. the estimators are assumed unbiased, or (ii) they rely on an a priori parametric model for the bias, or (iii) they can infer model biases that are not unique for the same model and data. First, we design a data assimilation framework to perform combined state, parameter, and bias estimation. Second, we propose a mathematical solution with a sequential method, i.e., the regularized bias-aware Kalman Filter (r-EnKF). The method requires a model of the bias and its gradient (i.e., the Jacobian). Third, we propose an echo state network as the model bias estimator. We derive the Jacobian of the network, and design a robust training strategy with data augmentation to accurately infer the bias in different scenarios. Fourth, we apply the r-EnKF to nonlinearly coupled oscillators (with and without time-delay) affected by different forms of bias. The r-EnKF infers in real-time parameters and states, and a unique bias. The applications that we showcase are relevant to acoustics, thermoacoustics, and vibrations; however, the r-EnKF opens new opportunities for combined state, parameter and bias estimation for real-time prediction and control in nonlinear systems.

[20] 2306.04338

Changing Data Sources in the Age of Machine Learning for Official Statistics

Data science has become increasingly essential for the production of official statistics, as it enables the automated collection, processing, and analysis of large amounts of data. With such data science practices in place, reporting becomes more timely, more insightful and more flexible. However, the quality and integrity of data-science-driven statistics rely on the accuracy and reliability of the data sources and the machine learning techniques that support them. In particular, changes in data sources inevitably occur and pose significant risks that are crucial to address in the context of machine learning for official statistics. This paper gives an overview of the main risks, liabilities, and uncertainties associated with changing data sources in the context of machine learning for official statistics. We provide a checklist of the most prevalent origins and causes of changing data sources, not only on a technical level but also regarding ownership, ethics, regulation, and public perception. Next, we highlight the repercussions of changing data sources on statistical reporting. These include technical effects such as concept drift, bias, availability, validity, accuracy and completeness, but also the neutrality and potential discontinuation of the statistical offering. We offer a few important precautionary measures, such as enhancing robustness in both data sourcing and statistical techniques, and thorough monitoring. In doing so, machine learning-based official statistics can maintain integrity, reliability, consistency, and relevance in policy-making, decision-making, and public discourse.

[21] 2306.04375

Learning via Wasserstein-Based High Probability Generalisation Bounds

Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM), and it is in particular at the core of PAC-Bayesian learning. Despite its successes and the continued surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem, hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. In contrast to prior work, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result, we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments.

[22] 2306.04376

Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching

Quantification learning deals with the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon previous bounds derived in particular cases. We then extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using the Random Fourier Feature principle.
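With the simplest feature map $\phi(x) = x$, distribution feature matching reduces to matching the target feature mean to a mixture of the class-conditional source means. The two-class Gaussian setup below is an illustrative assumption, not an estimator from the paper; a sum-to-one row is appended so the least-squares fit is well determined.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed class-conditional distributions (label shift: they are shared
# between source and target; only the class proportions change).
mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
src = {c: rng.normal(mu[c], 1.0, size=(1000, 2)) for c in (0, 1)}

pi_true = 0.3                                   # target P(Y = 1), to be recovered
n1 = int(2000 * pi_true)
tgt = np.vstack([rng.normal(mu[0], 1.0, size=(2000 - n1, 2)),
                 rng.normal(mu[1], 1.0, size=(n1, 2))])

# Match the target mean to a convex combination of class means:
# columns of M are the estimated class-conditional feature means.
M = np.column_stack([src[0].mean(axis=0), src[1].mean(axis=0)])
A = np.vstack([M, np.ones(2)])                  # append sum-to-one constraint
b = np.append(tgt.mean(axis=0), 1.0)
pi_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
pi_hat = np.clip(pi_hat, 0.0, None)
pi_hat /= pi_hat.sum()                          # project onto the simplex
```

Kernel-based DFM replaces the raw mean with a kernel mean embedding, which is what makes the approach identifiable for more than a handful of classes.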

[23] 2306.04430

Evaluating the impact of outcome delay on the efficiency of two-arm group-sequential trials

Adaptive designs (ADs) are a broad class of trial designs that allow preplanned modifications based on accrued patient data, providing improved efficiency and flexibility. However, a delay in observing the primary outcome variable can erode this added efficiency. In this paper, we aim to ascertain the size of outcome delay at which the realised efficiency gains of ADs become negligible compared to classical fixed-sample randomised controlled trials (RCTs). We measure the impact of delay by developing formulae for the number of overruns in two-arm group-sequential designs (GSDs) with normal data, assuming different recruitment models. The efficiency of a GSD is usually measured in terms of the expected sample size (ESS), with GSDs generally reducing the ESS compared to a standard RCT. Our formulae measure the efficiency gain from a GSD, in terms of ESS reduction, that is lost due to delay. We assess whether careful choice of design (e.g., altering the spacing of the interim analyses (IAs)) can help recover the benefits of GSDs in the presence of delay. We also analyse the efficiency of GSDs with respect to the time taken to complete the trial. Comparing the expected efficiency gains with and without consideration of delay, it is evident that GSDs suffer considerable losses due to delay: even a small delay can have a significant impact on the trial's efficiency. In contrast, even in the presence of substantial delay, a GSD will have a smaller expected time to trial completion than a simple RCT. Although the number of stages has little influence on the efficiency losses, the timing of the IAs can impact the efficiency of a GSD with delay. In particular, for unequally spaced IAs, pushing the IAs towards the latter end of the trial can be harmful for a design with delay.
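The way delay erodes the expected-sample-size benefit can be illustrated with back-of-the-envelope arithmetic for a two-stage design. The numbers below (stopping probability, recruitment rate, delay) are assumptions for illustration, not the paper's formulae, which cover general recruitment models.

```python
# Hedged arithmetic sketch: delay-induced overruns in a two-stage
# group-sequential design versus a fixed-sample comparator.
n_fixed = 200            # sample size of the comparable fixed design
n1, n2 = 100, 100        # recruitment before and after the single interim analysis
p_stop = 0.4             # assumed probability of stopping at the interim
rate, delay = 4.0, 10.0  # recruits per time unit; outcome delay in time units

# Patients recruited while waiting for interim outcomes to accrue
# (capped at the remaining second-stage recruitment).
overrun = min(rate * delay, n2)

ess_no_delay = n1 + (1 - p_stop) * n2
ess_delay = n1 + p_stop * overrun + (1 - p_stop) * n2

saving_no_delay = n_fixed - ess_no_delay   # 40.0 patients saved without delay
saving_delay = n_fixed - ess_delay         # 24.0 patients saved with delay
```

Under these assumed numbers, a delay of 10 time units wipes out 40% of the expected-sample-size saving, even though the design itself is unchanged.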

[24] 2306.04456

Tree models for assessing covariate-dependent method agreement

Method comparison studies explore the agreement of measurements made by two or more methods. Commonly, agreement is evaluated by the well-established Bland-Altman analysis. However, the underlying assumption is that differences between measurements are identically distributed for all observational units and in all application settings. We introduce the concept of conditional method agreement and propose a corresponding modeling approach to alleviate this constraint. The Bland-Altman analysis is embedded in the framework of recursive partitioning to explicitly define subgroups with heterogeneous agreement in dependence of covariates in an exploratory analysis. Three different modeling approaches are considered: conditional inference trees with an appropriate transformation of the modeled differences (CTreeTrafo), distributional regression trees (DistTree), and model-based trees (MOB). The performance of these models is evaluated in terms of type-I error probability and power in several simulation studies. Further, the adjusted Rand index (ARI) is used to quantify the models' ability to uncover given subgroups. An application to real accelerometer device measurement data is used to demonstrate applicability. Additionally, a two-sample Bland-Altman test is proposed for exploratory or confirmatory hypothesis testing of differences in agreement between subgroups. Results indicate that all models were able to detect given subgroups with high accuracy as the sample size increased. Relevant covariates that may affect agreement could be detected in the application to accelerometer data. We conclude that conditional method agreement trees (COAT) enable the exploratory analysis of method agreement in dependence of covariates, as well as the corresponding exploratory or confirmatory hypothesis testing of group differences. The method is made publicly available through the R package coat.

[25] 2306.04483

Versatile Parametric Classes of Covariance Functions that Interlace Anisotropies and Hole Effects

Covariance functions are a fundamental tool for modeling the dependence structure of spatial processes. This work investigates novel constructions for covariance functions that enable the integration of anisotropies and hole effects in complex and versatile ways, having the potential to provide more accurate representations of dependence structures arising with real-world data. We show that these constructions extend widely used covariance models, including the Mat\'ern, Cauchy, compactly-supported hypergeometric and cardinal sine models. We apply our results to a geophysical data set from a rock-carbonate aquifer and demonstrate that the proposed models yield more accurate predictions at unsampled locations compared to basic covariance models.

[26] 2306.04520

Estimating Koopman operators with sketching to provably learn large scale dynamical systems

The theory of Koopman operators makes it possible to deploy non-parametric machine learning algorithms to predict and analyze complex dynamical systems. Estimators such as principal component regression (PCR) or reduced rank regression (RRR) in kernel spaces can be shown to provably learn Koopman operators from finite empirical observations of the system's time evolution. Scaling these approaches to very long trajectories is a challenge and requires introducing suitable approximations to make computations feasible. In this paper, we boost the efficiency of different kernel-based Koopman operator estimators using random projections (sketching). We derive, implement and test the new "sketched" estimators with extensive experiments on synthetic and large-scale molecular dynamics datasets. Further, we establish non-asymptotic error bounds giving a sharp characterization of the trade-offs between statistical learning rates and computational efficiency. Our empirical and theoretical analysis shows that the proposed estimators provide a sound and efficient way to learn large-scale dynamical systems. In particular, our experiments indicate that the proposed estimators retain the same accuracy as PCR or RRR while being much faster.
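
As a minimal illustration of the class of estimators being sketched (not the paper's sketched variants), a plain least-squares Koopman approximation on a feature dictionary (extended dynamic mode decomposition) looks like this; for linear dynamics with identity features it recovers the system matrix exactly.

```python
import numpy as np

def edmd(X, Y, features):
    """Least-squares Koopman matrix on a feature dictionary:
    features(Y) ~= features(X) @ K, where X and Y hold consecutive
    states (n_samples, dim) and `features` maps states to dictionary
    evaluations.  A minimal sketch of the estimator family."""
    PX, PY = features(X), features(Y)
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K

# Linear dynamics x_{t+1} = A x_t with identity features recover A^T.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
X = rng.standard_normal((200, 2))
Y = X @ A.T
K = edmd(X, Y, lambda Z: Z)
```

Kernelized PCR/RRR estimators replace the explicit dictionary with a kernel and regularize the regression; sketching then compresses the resulting kernel matrices.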

[27] 2306.04550

From dense to sparse design: Optimal rates under the supremum norm for estimating the mean function in functional data analysis

In the setting of functional data analysis, we derive optimal rates of convergence in the supremum norm for estimating the H\"older-smooth mean function of a stochastic process which is repeatedly and discretely observed at fixed, multivariate, synchronous design points and with additional errors. Similarly to the rates in $L_2$ obtained in Cai and Yuan (2011), for sparse design a discretization term dominates, while in the dense case the $\sqrt n$ rate can be achieved as if the $n$ processes were continuously observed without errors. However, our analysis differs in several respects from Cai and Yuan (2011). First, we do not assume that the paths of the processes are as smooth as the mean, but still obtain the $\sqrt n$ rate of convergence without additional logarithmic factors in the dense setting. Second, we show that in the supremum norm, there is an intermediate regime between the sparse and dense cases dominated by the contribution of the observation errors. Third, and in contrast to the analysis in $L_2$, interpolation estimators turn out to be sub-optimal in $L_\infty$ in the dense setting, which explains their poor empirical performance. We also obtain a central limit theorem in the supremum norm and discuss the selection of the bandwidth. Simulations and real data applications illustrate the results.

[28] 2306.03949

Partial Inference in Structured Prediction

In this paper, we examine the problem of partial inference in the context of structured prediction. Using a generative model approach, we consider the task of maximizing a score function with unary and pairwise potentials in the space of labels on graphs. Employing a two-stage convex optimization algorithm for label recovery, we analyze the conditions under which a majority of the labels can be recovered. We introduce a novel perspective on the Karush-Kuhn-Tucker (KKT) conditions and primal and dual construction, and provide statistical and topological requirements for partial recovery with provable guarantees.

[29] 2306.03955

Kernel Quadrature with Randomly Pivoted Cholesky

This paper presents new quadrature rules for functions in a reproducing kernel Hilbert space using nodes drawn by a sampling algorithm known as randomly pivoted Cholesky. The resulting computational procedure compares favorably to previous kernel quadrature methods, which either achieve low accuracy or require solving a computationally challenging sampling problem. Theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves comparable quadrature error rates to more computationally expensive quadrature schemes based on continuous volume sampling, thinning, and recombination. Randomly pivoted Cholesky is easily adapted to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature.
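
The node-sampling routine at the heart of the method is compact. The sketch below follows the published description of randomly pivoted Cholesky (variable names are ours): pivots are drawn with probability proportional to the diagonal of the residual kernel matrix, and a Cholesky-style factor is updated so that `F @ F.T` approximates `K`.

```python
import numpy as np

def rpcholesky(K, k, seed=None):
    """Randomly pivoted Cholesky: select k pivots of the PSD kernel
    matrix K with probability proportional to the residual diagonal,
    maintaining a factor F with F @ F.T ~= K.  Sketch implementation."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    F = np.zeros((n, k))
    d = np.diag(K).astype(float).copy()
    pivots = []
    for i in range(k):
        p = rng.choice(n, p=d / d.sum())   # sample pivot ~ residual diag
        pivots.append(p)
        g = K[:, p] - F[:, :i] @ F[p, :i]  # residual column at the pivot
        F[:, i] = g / np.sqrt(g[p])
        d -= F[:, i] ** 2                  # update residual diagonal
        d = np.clip(d, 0.0, None)
    return np.array(pivots), F

# Gaussian kernel on random points; F @ F.T is a rank-k approximation.
rng = np.random.default_rng(1)
pts = rng.standard_normal((50, 2))
K = np.exp(-((pts[:, None] - pts[None]) ** 2).sum(-1))
pivots, F = rpcholesky(K, 10)
```

The quadrature rule is then built on the sampled pivot nodes; the factor update costs only one kernel column per pivot, which is what makes the procedure fast.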

[30] 2306.03962

PILLAR: How to make semi-private learning more effective

In Semi-Supervised Semi-Private (SP) learning, the learner has access to both public unlabelled and private labelled data. We propose a computationally efficient algorithm that, under mild assumptions on the data, provably achieves significantly lower private labelled sample complexity and can be efficiently run on real-world datasets. For this purpose, we leverage the features extracted by networks pre-trained on public (labelled or unlabelled) data, whose distribution can significantly differ from the one on which SP learning is performed. To validate its empirical effectiveness, we propose a wide variety of experiments under tight privacy constraints (\(\epsilon=0.1\)) and with a focus on low-data regimes. In all of these settings, our algorithm exhibits significantly improved performance over available baselines that use similar amounts of public data.

[31] 2306.03982

Globally injective and bijective neural operators

Recently there has been great interest in operator learning, where networks learn operators between function spaces from an essentially infinite-dimensional perspective. In this work we present results for when the operators learned by these networks are injective and surjective. As a warmup, we combine prior work in both the finite-dimensional ReLU and operator learning settings by giving sharp conditions under which ReLU layers with linear neural operators are injective. We then consider the case when the activation function is pointwise bijective and obtain sufficient conditions for the layer to be injective. We remark that this question, while trivial in the finite-rank case, is subtler in the infinite-rank case and is proved using tools from Fredholm theory. Next, we prove that our supplied injective neural operators are universal approximators and that their implementation, with finite-rank neural networks, is still injective. This ensures that injectivity is not `lost' in the transcription from analytical operators to their finite-rank implementation with networks. Finally, we conclude with an increase in abstraction and consider general conditions under which subnetworks, which may be many layers deep, are injective and surjective, and provide an exact inversion from a `linearization.' This section uses general arguments from Fredholm theory and Leray-Schauder degree theory for non-linear integral equations to analyze the mapping properties of neural operators in function spaces. These results apply to subnetworks formed from the layers considered in this work, under natural conditions. We believe that our work has applications in Bayesian UQ, where injectivity enables likelihood estimation, and in inverse problems, where surjectivity and injectivity correspond to existence and uniqueness, respectively.

[32] 2306.04049

One-sided Matrix Completion from Two Observations Per Row

Given only a few observed entries from a low-rank matrix $X$, matrix completion is the problem of imputing the missing entries, and it formalizes a wide range of real-world settings that involve estimating missing data. However, when there are too few observed entries to complete the matrix, what other aspects of the underlying matrix can be reliably recovered? We study one such problem setting, that of "one-sided" matrix completion, where our goal is to recover the right singular vectors of $X$, even in the regime where recovering the left singular vectors is impossible, which arises when there are more rows than columns and very few observations. We propose a natural algorithm that involves imputing the missing values of the matrix $X^TX$ and show that even with only two observations per row in $X$, we can provably recover $X^TX$ as long as we have at least $\Omega(r^2 d \log d)$ rows, where $r$ is the rank and $d$ is the number of columns. We evaluate our algorithm on one-sided recovery of synthetic data and low-coverage genome sequencing. In these settings, our algorithm substantially outperforms standard matrix completion and a variety of direct factorization methods.
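
The core imputation step admits a simple sketch: with column pairs observed uniformly at random, the product of a row's two observed entries is (after rescaling) an unbiased estimate of the corresponding off-diagonal entry of $X^TX$, and averaging over rows fills in the matrix. The rescaling constants below are our simplification for illustration, not the paper's exact estimator.

```python
import numpy as np

def estimate_gram(cols, vals, d):
    """Estimate X^T X from two observed entries per row.  cols[i] gives
    the two observed column indices of row i, vals[i] the values.
    Off-diagonal entries average the cross-products over rows observing
    both columns, rescaled by the number of rows; diagonal entries use
    inverse-probability-weighted squares (each column is observed with
    probability 2/d).  Simplified sketch, not the paper's estimator."""
    n = len(cols)
    M = np.zeros((d, d))
    cnt = np.zeros((d, d))
    for (j, k), (a, b) in zip(cols, vals):
        M[j, k] += a * b
        M[k, j] += a * b
        cnt[j, k] += 1
        cnt[k, j] += 1
        M[j, j] += (d / 2) * a * a   # inverse-probability weighting
        M[k, k] += (d / 2) * b * b
    off = cnt > 0                    # diagonal counts stay zero
    M[off] *= n / cnt[off]
    return M

# Low-rank X; compare the estimate of X^T X against the truth.
rng = np.random.default_rng(0)
n, d, r = 5000, 8, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
cols = [tuple(rng.choice(d, size=2, replace=False)) for _ in range(n)]
vals = [(X[i, j], X[i, k]) for i, (j, k) in enumerate(cols)]
M = estimate_gram(cols, vals, d)
```

Once $X^TX$ is recovered, its eigenvectors give the right singular vectors of $X$, which is exactly the one-sided quantity of interest.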

[33] 2306.04066

Intelligent sampling for surrogate modeling, hyperparameter optimization, and data analysis

Sampling techniques are used in many fields, including design of experiments, image processing, and graphics. The techniques in each field are designed to meet the constraints specific to that field such as uniform coverage of the range of each dimension or random samples that are at least a certain distance apart from each other. When an application imposes new constraints, for example, by requiring samples in a non-rectangular domain or the addition of new samples to an existing set, a common solution is to modify the algorithm currently in use, often with less than satisfactory results. As an alternative, we propose the concept of intelligent sampling, where we devise algorithms specifically tailored to meet our sampling needs, either by creating new algorithms or by modifying suitable algorithms from other fields. Surprisingly, both qualitative and quantitative comparisons indicate that some relatively simple algorithms can be easily modified to meet the many sampling requirements of surrogate modeling, hyperparameter optimization, and data analysis; these algorithms outperform their more sophisticated counterparts currently in use, resulting in better use of time and computer resources.

[34] 2306.04111

Quasi-Newton Updating for Large-Scale Distributed Learning

Distributed computing is critically important for modern statistical analysis. Herein, we develop a distributed quasi-Newton (DQN) framework with excellent statistical, computational, and communication efficiency. In the DQN method, no Hessian matrix inversion or communication is needed. This considerably reduces the computation and communication complexity of the proposed method. Notably, related existing methods only analyze numerical convergence and require a diverging number of iterations to converge. In contrast, we investigate the statistical properties of the DQN method and theoretically demonstrate that the resulting estimator is statistically efficient over a small number of iterations under mild conditions. Extensive numerical analyses demonstrate the finite sample performance.

[35] 2306.04120

MESSY Estimation: Maximum-Entropy based Stochastic and Symbolic densitY Estimation

We introduce MESSY estimation, a Maximum-Entropy based Stochastic and Symbolic densitY estimation method. The proposed approach recovers probability density functions symbolically from samples using moments of a gradient flow in which the ansatz serves as the driving force. In particular, we construct a gradient-based drift-diffusion process that connects samples of the unknown distribution function to a guess symbolic expression. We then show that when the guess distribution has the maximum entropy form, the parameters of this distribution can be found efficiently by solving a linear system of equations constructed using the moments of the provided samples. Furthermore, we use symbolic regression to explore the space of smooth functions and find optimal basis functions for the exponent of the maximum entropy functional, leading to good conditioning. The cost of the proposed method in each iteration of the random search is linear in the number of samples and quadratic in the number of basis functions. We validate the proposed MESSY estimation method against other benchmark methods for the case of a bi-modal and a discontinuous density, as well as a density at the limit of physical realizability. We find that the addition of a symbolic search for basis functions improves the accuracy of the estimation at a reasonable additional computational cost. Our results suggest that the proposed method outperforms existing density recovery methods in the limit of a small to moderate number of samples by providing a low-bias and tractable symbolic description of the unknown density at a reasonable computational cost.

[36] 2306.04174

End-to-End Learning for Stochastic Optimization: A Bayesian Perspective

We develop a principled approach to end-to-end learning in stochastic optimization. First, we show that the standard end-to-end learning algorithm admits a Bayesian interpretation and trains a posterior Bayes action map. Building on the insights of this analysis, we then propose new end-to-end learning algorithms for training decision maps that output solutions of empirical risk minimization and distributionally robust optimization problems, two dominant modeling paradigms in optimization under uncertainty. Numerical results for a synthetic newsvendor problem illustrate the key differences between alternative training schemes. We also investigate an economic dispatch problem based on real data to showcase the impact of the neural network architecture of the decision maps on their test performance.

[37] 2306.04177

Semiparametric Efficiency Gains from Parametric Restrictions on the Generalized Propensity Score

Knowledge of the propensity score weakly improves efficiency when estimating causal parameters, but what kind of knowledge is more useful? To examine this, we first derive the semiparametric efficiency bound of multivalued treatment effects when the propensity score is correctly specified by a parametric model. We then reveal which parametric structure on the propensity score enhances the efficiency even when the model is large. Finally, we apply the general theory we develop to a stratified experiment setup and find that knowing the strata improves the efficiency, especially when the size of each stratum component is small.

[38] 2306.04201

Improving Hyperparameter Learning under Approximate Inference in Gaussian Process Models

Approximate inference in Gaussian process (GP) models with non-conjugate likelihoods gets entangled with the learning of the model hyperparameters. We improve hyperparameter learning in GP models and focus on the interplay between variational inference (VI) and the learning target. While VI's lower bound to the marginal likelihood is a suitable objective for inferring the approximate posterior, we show that a direct approximation of the marginal likelihood as in Expectation Propagation (EP) is a better learning objective for hyperparameter optimization. We design a hybrid training procedure to bring the best of both worlds: it leverages conjugate-computation VI for inference and uses an EP-like marginal likelihood approximation for hyperparameter learning. We compare VI, EP, Laplace approximation, and our proposed training procedure and empirically demonstrate the effectiveness of our proposal across a wide range of data sets.

[39] 2306.04251

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

[40] 2306.04292

Dear XAI Community, We Need to Talk! Fundamental Misconceptions in Current XAI Research

Despite progress in the field, significant parts of current XAI research are still not on solid conceptual, ethical, or methodological grounds. Unfortunately, these unfounded parts are not on the decline but continue to grow. Many explanation techniques are still proposed without clarifying their purpose. Instead, they are advertised with ever more fancy-looking heatmaps or only seemingly relevant benchmarks. Moreover, explanation techniques are motivated with questionable goals, such as building trust, or rely on strong assumptions about the 'concepts' that deep learning algorithms learn. In this paper, we highlight and discuss these and other misconceptions in current XAI research. We also suggest steps to make XAI a more substantive area of research.

[41] 2306.04396

Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance

Diffusion models have shown significant progress in image translation tasks recently. However, due to their stochastic nature, there's often a trade-off between style transformation and content preservation. Current strategies aim to disentangle style and content, preserving the source image's structure while successfully transitioning from a source to a target domain under text or one-shot image conditions. Yet, these methods often require computationally intense fine-tuning of diffusion models or additional neural networks. To address these challenges, here we present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance. This results in quicker and more stable image manipulation for both text-guided and image-guided image translation. Our model's adaptability allows it to be implemented with both image- and latent-diffusion models. Experiments show that our method outperforms various state-of-the-art models in image translation tasks.

[42] 2306.04444

Fast Optimal Locally Private Mean Estimation via Random Projections

We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur sub-optimal error or have high communication and/or run-time complexity. We propose a new algorithmic framework, ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a $1+o(1)$-factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an optimal algorithm such as PrivUnitG in the lower-dimensional space. In addition, we show that, by appropriately correlating the random projection matrices across devices, we can achieve fast server run-time. We mathematically analyze the error of the algorithm in terms of properties of the random projections, and study two instantiations. Lastly, our experiments for private mean estimation and private federated learning demonstrate that our algorithms empirically obtain nearly the same utility as optimal ones while having significantly lower communication and computational cost.
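
The projection step of the framework can be sketched as follows; the subsequent call to an optimal low-dimensional randomizer such as PrivUnitG is elided, and the dense Gaussian projection here is our illustrative choice (the paper also analyzes structured projections and correlates the projection seeds across devices for fast server aggregation).

```python
import numpy as np

def project_and_normalize(x, k, seed):
    """Client-side front end of a ProjUnit-style randomizer: project a
    unit vector x onto a random k-dimensional subspace, then renormalize
    so a low-dimensional private mean-estimation routine can be applied
    to the result.  Sketch under a dense JL-style Gaussian projection."""
    d = x.shape[0]
    rng = np.random.default_rng(seed)       # shared seed enables the
    W = rng.standard_normal((k, d)) / np.sqrt(k)  # server to re-derive W
    z = W @ x
    return z / np.linalg.norm(z), W

x = np.ones(1000) / np.sqrt(1000.0)   # a unit vector in high dimension
z, W = project_and_normalize(x, 32, seed=0)
```

Because the randomizer now operates in dimension `k` instead of `d`, both the communication cost and the client run-time shrink, while the projection preserves the mean up to a $1+o(1)$ error factor.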

[43] 2306.04539

Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: the emergence of new task-relevant information during learning from both modalities that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurring multimodal data (e.g., unlabeled images and captions, video and corresponding audio) whose labeling is time-consuming. Using a precise information-theoretic definition of interactions, our key contributions are the derivations of lower and upper bounds to quantify the amount of multimodal interactions in this semi-supervised setting. We propose two lower bounds based on the amount of shared information between modalities and the disagreement between separately trained unimodal classifiers, and derive an upper bound through connections to approximate algorithms for min-entropy couplings. We validate these estimated bounds and show how they accurately track true interactions. Finally, two semi-supervised multimodal applications are explored based on these theoretical results: (1) analyzing the relationship between multimodal performance and estimated interactions, and (2) self-supervised learning that embraces disagreement between modalities beyond agreement as is typically done.

[44] 2306.04606

Network-based Representations and Dynamic Discrete Choice Models for Multiple Discrete Choice Analysis

In many choice modeling applications, demand is frequently characterized as multiple discrete, meaning that people choose multiple items simultaneously. The analysis and prediction of behavior in multiple discrete choice situations pose several challenges. To address these, we propose a random utility maximization (RUM) based model that considers each subset of choice alternatives as a composite alternative, where individuals choose a subset according to the RUM framework. While this offers a natural and intuitive modeling approach for multiple-choice analysis, the large number of subsets of choices in the formulation makes its estimation and application intractable. To overcome this challenge, we introduce directed acyclic graph (DAG) based representations of choices, where each node of the DAG is associated with an elemental alternative and additional information such as the number of selected elemental alternatives. Our innovation is to show that the multiple-choice model is equivalent to a recursive route choice model on the DAG, leading to the development of new efficient estimation algorithms based on dynamic programming. In addition, the DAG representations enable us to bring in advanced route choice models to capture the correlation between subset choice alternatives. Numerical experiments based on synthetic and real datasets show many advantages of our modeling approach and the proposed estimation algorithms.

[45] 2306.04637

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life -- A \emph{single} transformer can adaptively select different base ICL algorithms -- or even perform qualitatively different tasks -- on different input sequences, without any explicit prompting of the right algorithm or task. We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. In theory, we construct two general mechanisms for algorithm selection with concrete examples: pre-ICL testing, and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task -- noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.