A kernel density estimator for data on the polysphere $\mathbb{S}^{d_1}\times\cdots\times\mathbb{S}^{d_r}$, with $r,d_1,\ldots,d_r\geq 1$, is presented in this paper. We derive the main asymptotic properties of the estimator, including mean square error, normality, and optimal bandwidths. We address the kernel theory of the estimator beyond the von Mises-Fisher kernel, introducing new kernels that are more efficient and investigating their normalizing constants, moments, and sampling methods. Plug-in and cross-validated bandwidth selectors are also obtained. As a spin-off of the kernel density estimator, we propose a nonparametric $k$-sample test based on the Jensen-Shannon divergence. Numerical experiments illuminate the asymptotic theory of the kernel density estimator and demonstrate the superior performance of the $k$-sample test with respect to parametric alternatives in certain scenarios. Our smoothing methodology is applied to the analysis of the morphology of a sample of hippocampi of infants embedded in the high-dimensional polysphere $(\mathbb{S}^2)^{168}$ via skeletal representations ($s$-reps).
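To make the construction concrete, the following is a minimal sketch of a product von Mises-Fisher kernel density estimator on a polysphere; it is not the paper's estimator (the more efficient kernels and the plug-in/cross-validated bandwidth selectors are omitted), and the concentration parameters `kappas`, which play the role of inverse bandwidths, are chosen by the user here.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_nu(x) * exp(-x)

def vmf_log_const(d, kappa):
    # log normalizing constant of the von Mises-Fisher density on S^d embedded in R^(d+1)
    nu = (d - 1) / 2
    return (nu * np.log(kappa) - ((d + 1) / 2) * np.log(2 * np.pi)
            - np.log(ive(nu, kappa)) - kappa)

def polysphere_vmf_kde(x, X, dims, kappas):
    """Product von Mises-Fisher KDE on S^{d_1} x ... x S^{d_r}.

    x      : (D,) evaluation point with D = sum_j (d_j + 1), each sphere block unit-norm
    X      : (n, D) sample with the same block structure
    dims   : sphere dimensions [d_1, ..., d_r]
    kappas : concentration parameters, one per sphere (larger kappa = smaller bandwidth)
    """
    n = X.shape[0]
    log_k = np.zeros(n)
    start = 0
    for d, kappa in zip(dims, kappas):
        stop = start + d + 1
        log_k += vmf_log_const(d, kappa) + kappa * (X[:, start:stop] @ x[start:stop])
        start = stop
    return np.exp(log_k).mean()
```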
Linked micromaps were originally developed to display geographically indexed statistics in an intuitive way by connecting them to a sequence of small maps. The approach integrates several visualization design principles, such as small multiples, discrete color indexing, and ordering. Linked micromaps allow for other types of data displays that are connected to and conditional on geographic areas. Initial applications of micromaps used data from the National Cancer Institute and the Environmental Protection Agency. In this paper, we will show how linked micromaps can be used to better understand and explore relationships and distributions of statistics linked to US states and Washington, DC. We will compare linked micromaps with other popular data displays of geographic data, such as bubble maps, choropleth maps, and bar charts. We will illustrate how linked micromaps can be used for evidence-based decision-making using data from the Bureau of Labor Statistics, the Census Bureau, and the Economic Research Service. The presentations, R scripts, and the data sets used in this article are available here: https://github.com/wlmcensus/Joint-Statistical-Meetings-Presentation-2024. The work discussed in this article was presented at the Joint Statistical Meetings (JSM) 2024 and the American Association for Public Opinion Research (AAPOR) 2024 Annual Conference.
While synthetic data hold great promise for privacy protection, their statistical analysis poses significant challenges that necessitate innovative solutions. The use of deep generative models (DGMs) for synthetic data generation is known to induce considerable bias and imprecision into synthetic data analyses, compromising their inferential utility relative to original data analyses. This bias and uncertainty can be substantial enough to impede statistical convergence rates, even in seemingly straightforward analyses like mean calculation. The standard errors of such estimators then shrink with sample size more slowly than the typical $1/\sqrt{n}$ rate. This complicates fundamental calculations like p-values and confidence intervals, with no straightforward remedy currently available. In response to these challenges, we propose a new strategy that targets synthetic data created by DGMs for specific data analyses. Drawing insights from debiased and targeted machine learning, our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances. We exemplify our proposal through a simulation study on toy data and two case studies on real-world data, highlighting the importance of tailoring DGMs for targeted data analysis. This debiasing strategy contributes to advancing the reliability and applicability of synthetic data in statistical inference.
The growing power of data science can play a crucial role in addressing social discrimination, necessitating a nuanced understanding of potential biases and effective strategies for mitigating them. Data Science Looks At Discrimination (dsld) is an R and Python package designed to provide users with a comprehensive toolkit of statistical and graphical methods for assessing possible discrimination related to protected groups, such as race, gender, and age. Our software offers techniques for discrimination analysis by identifying and mitigating confounding variables, along with methods for reducing bias in predictive models. In educational settings, dsld offers instructors powerful tools to teach important statistical principles through motivating real-world examples of discrimination analysis. The inclusion of an 80-page Quarto book further supports users, from statistics educators to legal professionals, in effectively applying these analytical tools to real-world scenarios.
A key question in brain sciences is how to identify time-evolving functional connectivity, such as that obtained from recordings of neuronal activity over time. We wish to explain the observed phenomena in terms of latent states which, in the case of neuronal activity, might correspond to subnetworks of neurons within a brain or organoid. Many existing approaches assume that only one latent state can be active at a time, in contrast to our domain knowledge. We propose a switching dynamical system based on the factorial hidden Markov model. Unlike existing approaches, our model acknowledges that neuronal activity can be caused by multiple subnetworks, which may be activated either jointly or independently. A change in one part of the network does not mean that the entire connectivity pattern will change. We pair our model with a scalable variational inference algorithm, using a concrete relaxation of the underlying factorial hidden Markov model, to effectively infer the latent states and model parameters. We show that our algorithm can recover ground-truth structure and yield insights about the maturation of neuronal activity in microelectrode array recordings from in vitro neuronal cultures.
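The concrete relaxation referred to above replaces the discrete on/off state of each subnetwork with a differentiable surrogate, which is what makes gradient-based variational inference feasible. Below is a minimal PyTorch sketch of a binary-concrete (relaxed Bernoulli) sample; the logits, temperature, and the independent-across-time toy structure are placeholders, not the paper's factorial-HMM inference scheme.

```python
import torch

def relaxed_bernoulli_sample(logits, temperature):
    # Binary concrete relaxation: a differentiable surrogate for Bernoulli(sigmoid(logits))
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic_noise) / temperature)

# toy: M parallel binary chains ("subnetworks"), each approximately on/off at each time step
M, T, temperature = 4, 100, 0.5
logits = torch.zeros(M, requires_grad=True)   # variational parameters (placeholder)
states = torch.stack([relaxed_bernoulli_sample(logits, temperature) for _ in range(T)])
# `states` is (T, M), nearly binary for small temperature, and differentiable w.r.t. logits
```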
In many causal learning problems, variables of interest are often not all measured over the same observations, but are instead distributed across multiple datasets with overlapping variables. Tillman et al. (2008) presented the first algorithm for this setting, called ION, which enumerates the minimal equivalence class of ground-truth DAGs consistent with all input graphs by exploiting local independence relations. In this paper, this problem is formulated as a more computationally efficient answer set programming (ASP) problem, which we call ION-C, and solved with the ASP system clingo. The ION-C algorithm was run on random synthetic graphs with varying sizes, densities, and degrees of overlap between subgraphs, with overlap having the largest impact on runtime, number of solution graphs, and agreement within the output set. To validate ION-C on real-world data, we ran the algorithm on overlapping graphs learned from data from two successive iterations of the European Social Survey (ESS), using a procedure for conducting joint independence tests to prevent inconsistencies in the input.
Today, cheap numerical hardware offers huge amounts of parallel computing power, much of which is used for the task of fitting neural networks to data. Adoption of this hardware to accelerate statistical Markov chain Monte Carlo (MCMC) applications has been much slower. In this chapter, we suggest some patterns for speeding up MCMC workloads using the hardware (e.g., GPUs, TPUs) and software (e.g., PyTorch, JAX) that have driven progress in deep learning over the last fifteen years or so. We offer some intuitions for why these new systems are so well suited to MCMC, and show some examples (with code) where we use them to achieve dramatic speedups over a CPU-based workflow. Finally, we discuss some potential pitfalls to watch out for.
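As a taste of the pattern the chapter describes, here is a minimal JAX sketch that runs many independent random-walk Metropolis chains in parallel by composing `jit`, `vmap`, and `lax.scan`; the Gaussian target, step size, and chain count are placeholders rather than anything taken from the chapter.

```python
import jax
import jax.numpy as jnp

def log_target(x):
    # standard Gaussian target; swap in your own log density
    return -0.5 * jnp.sum(x ** 2)

def rw_metropolis_step(carry, key):
    x, logp = carry
    key_prop, key_acc = jax.random.split(key)
    prop = x + 0.5 * jax.random.normal(key_prop, x.shape)
    logp_prop = log_target(prop)
    accept = jnp.log(jax.random.uniform(key_acc)) < logp_prop - logp
    x_new = jnp.where(accept, prop, x)
    logp_new = jnp.where(accept, logp_prop, logp)
    return (x_new, logp_new), x_new

def run_chain(key, x0, n_steps=1000):
    keys = jax.random.split(key, n_steps)
    _, samples = jax.lax.scan(rw_metropolis_step, (x0, log_target(x0)), keys)
    return samples

# run thousands of independent chains in parallel with vmap + jit
n_chains, dim = 4096, 5
keys = jax.random.split(jax.random.PRNGKey(0), n_chains)
x0 = jnp.zeros((n_chains, dim))
samples = jax.jit(jax.vmap(run_chain))(keys, x0)   # shape (n_chains, n_steps, dim)
```

On a GPU or TPU the `vmap` over chains executes as one batched kernel, which is where the speedups over a CPU-based loop come from.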
Graph neural networks (GNNs) provide state-of-the-art results in a wide variety of tasks which typically involve predicting features at the vertices of a graph. They are built from layers of graph convolutions which serve as a powerful inductive bias for describing the flow of information among the vertices. Often, more than one data modality is available. This work considers a setting in which several graphs have the same vertex set and a common vertex-level learning task. This generalizes standard GNN models to GNNs with several graph operators that do not commute. We call this model a graph-tuple neural network (GtNN). In this work, we develop the mathematical theory to address the stability and transferability of GtNNs using properties of non-commuting non-expansive operators. We develop a limit theory of graphon-tuple neural networks and use it to prove a universal transferability theorem that guarantees that all graph-tuple neural networks are transferable on convergent graph-tuple sequences. In particular, there is no non-transferable energy under the convergence we consider here. Our theoretical results extend well-known transferability theorems for GNNs to the case of several simultaneous graphs (GtNNs) and provide a strict improvement on what is currently known even in the GNN case. We illustrate our theoretical results with simple experiments on synthetic and real-world data. To this end, we derive a training procedure that provably enforces the stability of the resulting model.
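A plausible form of a graph-tuple layer, sketched below for concreteness, mixes the shared vertex features through each operator with its own weights; the exact layer used in the paper may differ, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class GraphTupleLayer(nn.Module):
    """One layer acting on a tuple of graph operators (S_1, ..., S_m) sharing a vertex set.

    Assumed form (not necessarily the paper's exact layer):
        H_out = relu( H W_0 + sum_k S_k H W_k )
    where the operators S_k generally do not commute.
    """
    def __init__(self, in_dim, out_dim, num_operators):
        super().__init__()
        self.self_weight = nn.Linear(in_dim, out_dim, bias=True)
        self.op_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_operators)]
        )

    def forward(self, operators, features):
        out = self.self_weight(features)
        for S, W in zip(operators, self.op_weights):
            out = out + S @ W(features)   # propagate features through each operator
        return torch.relu(out)

# toy usage: two operators on the same 50 vertices
n, f = 50, 16
S1, S2 = torch.rand(n, n), torch.rand(n, n)
layer = GraphTupleLayer(f, 32, num_operators=2)
H = layer([S1, S2], torch.randn(n, f))   # (50, 32)
```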
Congestive heart failure (CHF) is a leading cause of morbidity, mortality and healthcare costs, impacting more than 23 million individuals worldwide. Large electronic health records data provide an opportunity to improve clinical management of diseases, but statistical inference on large amounts of relevant personal data is still a challenge. Thus, accurately identifying influential risk factors is pivotal to reducing the dimensionality of information. Bayesian variable selection in survival regression is a common approach towards solving this problem. In this paper, we propose placing a beta prior directly on the model coefficient of determination (Bayesian $R^2$), which induces a prior on the global variance of the predictors and provides shrinkage. Through reparameterization using an auxiliary variable, we are able to update a majority of the parameters with Gibbs sampling, simplifying computation and speeding convergence. Performance gains over competing variable selection methods are showcased through an extensive simulation study. Finally, the method is applied in a mediation analysis to identify community context attributes impacting time to first congestive heart failure diagnosis of patients enrolled in the University of North Carolina Cardiovascular Device Surveillance Registry. The model has high predictive performance with a C-index of over 0.7, and we find that factors associated with higher socioeconomic inequality and air pollution increase risk of heart failure.
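The following sketch illustrates, under simplifying assumptions, how a beta prior on $R^2$ induces a prior on the global variance of the coefficients; standardized predictors and an equal split of the global variance across coefficients are assumptions made here for illustration, and the paper's auxiliary-variable Gibbs sampler is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_coefficients(p, a=1.0, b=1.0, sigma2=1.0, n_draws=10_000):
    """Prior draws of regression coefficients induced by a Beta(a, b) prior on R^2.

    Sketch only: with standardized predictors, R^2 ~ Beta(a, b) maps to a global
    variance W = sigma2 * R^2 / (1 - R^2), split equally here across p coefficients,
    so beta_j ~ N(0, W / p).
    """
    r2 = rng.beta(a, b, size=n_draws)
    w = sigma2 * r2 / (1.0 - r2)            # global variance; shrinks when R^2 is small
    return rng.normal(0.0, np.sqrt(w[:, None] / p), size=(n_draws, p))

draws = sample_coefficients(p=20, a=1.0, b=10.0)   # b >> a pushes R^2, hence W, toward 0
```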
We observe an unknown regression function of $d$ variables $f(\boldsymbol{t})$, $\boldsymbol{t} \in[0,1]^d$, in the Gaussian white noise model of intensity $\varepsilon>0$. We assume that the function $f$ is regular and that it is a sum of $k$-variate functions, where $k$ varies from $1$ to $s$ ($1\leq s\leq d$). These functions are unknown to us and only a few of them are nonzero. In this article, we address the problem of identifying the nonzero components of $f$ in the case when $d=d_\varepsilon\to \infty$ as $\varepsilon\to 0$ and $s$ is either fixed or $s=s_\varepsilon\to \infty$, $s=o(d)$, as $\varepsilon\to 0$. This may be viewed as a variable selection problem. We derive the conditions under which exact variable selection in the model at hand is possible and provide a selection procedure that achieves this type of selection. The procedure is adaptive to the degree of model sparsity described by the sparsity parameter $\beta\in(0,1)$. We also derive conditions that make exact variable selection impossible. Our results augment previous work in this area.
Models based on recursive partitioning such as decision trees and their ensembles are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over $d$ binary features, showing that when the true regression function $f^*$ does not satisfy the so-called Merged Staircase Property (MSP), greedy training requires $\exp(\Omega(d))$ samples to achieve low estimation error. Conversely, when $f^*$ does satisfy MSP, greedy training can attain small estimation error with only $O(\log d)$ samples. This performance mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with $O(\log d)$ samples irrespective of whether $f^*$ satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.
Multivariate random effects with unstructured variance-covariance matrices of large dimension, $q$, can be a major challenge to estimate. In this paper, we introduce a new implementation of a reduced-rank approach to fit large-dimensional multivariate random effects by writing them as a linear combination of $d < q$ latent variables. By adding reduced-rank functionality to the package glmmTMB, we extend the range of mixed models that can be fitted to include random effects of dimensions that were previously out of reach. We apply the reduced-rank random effect to two examples, estimating a generalized latent variable model for multivariate abundance data and a random-slopes model.
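The reduced-rank idea can be sketched in a few lines: a $q$-dimensional random effect is written as a loading matrix of rank $d < q$ times a $d$-dimensional latent variable, so its covariance has rank $d$ and far fewer free parameters than an unstructured $q \times q$ covariance. The numbers below are arbitrary placeholders, and the sketch says nothing about how glmmTMB parameterizes or estimates the loadings.

```python
import numpy as np

rng = np.random.default_rng(1)

q, d = 30, 3                        # q-dimensional random effect, rank-d factorization
Lambda = rng.normal(size=(q, d))    # factor loadings (estimated in practice)

# reduced-rank random effect: b = Lambda @ u with u ~ N(0, I_d),
# so Cov(b) = Lambda @ Lambda.T has rank d, far fewer parameters than the
# q * (q + 1) / 2 of an unstructured covariance
u = rng.normal(size=d)
b = Lambda @ u
cov_b = Lambda @ Lambda.T
print(np.linalg.matrix_rank(cov_b))   # -> 3
```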
We consider the problem of estimating a high-dimensional covariance matrix from a small number of observations when covariates on pairs of variables are available and the variables can have spatial structure. This is motivated by the problem arising in demography of estimating the covariance matrix of the total fertility rate (TFR) of 195 different countries when only 11 observations are available. We construct an estimator for high-dimensional covariance matrices by exploiting information about pairwise covariates, such as whether pairs of variables belong to the same cluster, or spatial structure of the variables, and interactions between the covariates. We reformulate the problem in terms of a mixed effects model. This requires the estimation of only a small number of parameters, which are easy to interpret and which can be selected using standard procedures. The estimator is consistent under general conditions, and asymptotically normal. It works if the mean and variance structure of the data is already specified or if some of the data are missing. We assess its performance under our model assumptions, as well as under model misspecification, using simulations. We find that it outperforms several popular alternatives. We apply it to the TFR dataset and draw some conclusions.
We consider linear models with scalar responses and covariates from a separable Hilbert space. The aim is to detect change points in the error distribution, based on sequential residual empirical distribution functions. Expansions for those estimated functions are more challenging in models with infinite-dimensional covariates than in regression models with scalar or vector-valued covariates due to a slower rate of convergence of the parameter estimators. Yet the suggested change point test is asymptotically distribution-free and consistent against one-change-point alternatives. In the latter case we also show consistency of a change point estimator.
Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high-dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multipartite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision matrix is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle.
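A minimal sketch of the computational pattern, assuming a toy crossed-random-effects design and a Gaussian working model: the random-effects precision matrix is sparse, and a conjugate gradient solve replaces the (dense) Cholesky factorization.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)

# toy crossed-random-effects design: two factors with many levels each (assumed structure)
n, n_a, n_b = 5000, 300, 300
Za = sp.csr_matrix((np.ones(n), (np.arange(n), rng.integers(0, n_a, n))), shape=(n, n_a))
Zb = sp.csr_matrix((np.ones(n), (np.arange(n), rng.integers(0, n_b, n))), shape=(n, n_b))
Z = sp.hstack([Za, Zb]).tocsr()

# sparse precision matrix of the random effects given the data (Gaussian working model)
Q = (Z.T @ Z + sp.identity(n_a + n_b)).tocsr()

# one CG solve replaces a Cholesky factorization of Q, whose factor would be dense
rhs = Z.T @ rng.normal(size=n)
u_hat, info = cg(Q, rhs)
print(info)   # 0 means CG converged to the default tolerance
```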
An open question in \emph{Imprecise Probabilistic Machine Learning} is how to empirically derive a credal region (i.e., a closed and convex family of probabilities on the output space) from the available data, without any prior knowledge or assumption. In classification problems, credal regions are a tool that provides provable guarantees under realistic assumptions by characterizing the uncertainty about the distribution of the labels. Building on previous work, we show that credal regions can be directly constructed using conformal methods. This allows us to provide a novel extension of classical conformal prediction to problems with ambiguous ground truth, that is, when the true labels for given inputs are not known exactly. The resulting construction enjoys desirable practical and theoretical properties: (i) conformal coverage guarantees, (ii) smaller prediction sets (compared to classical conformal prediction regions), and (iii) disentanglement of uncertainty sources (epistemic, aleatoric). We empirically verify our findings on both synthetic and real datasets.
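For orientation, here is a minimal sketch of classical split conformal prediction for classification, the starting point that the credal-region construction extends; the nonconformity score, the unambiguous-label setting, and a moderately large calibration set are assumptions of this sketch, not the paper's method.

```python
import numpy as np

def conformal_prediction_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split conformal prediction for classification.

    probs_cal  : (n, K) predicted class probabilities on a held-out calibration set
    y_cal      : (n,) true calibration labels
    probs_test : (m, K) predicted probabilities on test points
    Returns a boolean (m, K) matrix: True where a label is in the prediction set.
    """
    n = len(y_cal)
    # nonconformity score: one minus the probability assigned to the true label
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # finite-sample-corrected (1 - alpha) quantile of the calibration scores
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return (1.0 - probs_test) <= q
```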
This paper extends doubly robust censoring unbiased transformations to a broad class of censored data structures under the assumptions of coarsening at random and positivity. This includes the classic survival and competing risks setting, but also encompasses multiple events. A doubly robust representation for the conditional bias of the transformed data is derived. This leads to rate double robustness and oracle efficiency properties for estimating conditional expectations when combined with cross-fitting and linear smoothers. Simulation studies demonstrate favourable performance of the proposed method relative to existing approaches. An application of the methods to a regression discontinuity design with censored data illustrates their practical utility.
The problem of identifying the best answer among a collection of items having real-valued distributions is well understood. Despite its practical relevance for many applications, fewer works have studied its extension when multiple and potentially conflicting metrics are available to assess an item's quality. Pareto set identification (PSI) aims to identify the set of answers whose means are not uniformly worse than those of another answer. This paper studies PSI in the transductive linear setting with potentially correlated objectives. Building on posterior sampling in both the stopping and the sampling rules, we propose the PSIPS algorithm that deals simultaneously with structure and correlation without paying the computational cost of existing oracle-based algorithms. Both from a frequentist and Bayesian perspective, PSIPS is asymptotically optimal. We demonstrate its good empirical performance in real-world and synthetic instances.
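The Pareto set itself is easy to state in code: an answer is kept if no other answer is at least as good on every objective and strictly better on at least one. The sketch below applies that definition to known (or estimated) mean vectors; it says nothing about PSIPS's posterior-sampling stopping and sampling rules.

```python
import numpy as np

def pareto_set(means):
    """Indices of non-dominated items; means is (n_items, n_objectives), larger is better.

    Item i is dominated if some item j is at least as good on every objective
    and strictly better on at least one.
    """
    n = means.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(means[j] >= means[i]) and np.any(means[j] > means[i]):
                keep[i] = False
                break
    return np.where(keep)[0]

means = np.array([[1.0, 3.0], [2.0, 2.0], [0.5, 0.5], [3.0, 1.0]])
print(pareto_set(means))   # -> [0 1 3]; item 2 is dominated
```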
In general, it is challenging to release differentially private versions of survey-weighted statistics with low error for acceptable privacy loss. This is because weighted statistics from complex sample survey data can be more sensitive to individual survey response and weight values than unweighted statistics, resulting in differentially private mechanisms that can add substantial noise to the unbiased estimate of the finite population quantity. On the other hand, simply disregarding the survey weights adds noise to a biased estimator, which also can result in an inaccurate estimate. Thus, the problem of releasing an accurate survey-weighted estimate essentially involves a trade-off among bias, precision, and privacy. We leverage this trade-off to develop a differentially private method for estimating finite population quantities. The key step is to privately estimate a hyperparameter that determines how much to regularize or shrink survey weights as a function of privacy loss. We illustrate the differentially private finite population estimation using the Panel Study of Income Dynamics. We show that optimal strategies for releasing DP survey-weighted mean income estimates require orders-of-magnitude less noise than naively using the original survey weights without modification.
Wind energy's role in the global electric grid is set to expand significantly. New York State alone anticipates offshore wind farms (WFs) contributing 9 GW by 2035. Integration of energy storage emerges as crucial for this advancement. In this study, we focus on a WF paired with a captive battery energy storage system (BESS). We aim to ascertain the capacity credit for a BESS with specified energy and power ratings. Unlike prior methods rooted in reliability theory, we define a power alignment function, which leads to a straightforward definition of capacity and incremental capacity for the BESS. We develop a solution method based on a linear programming formulation. Our analysis utilizes wind data collected by NYSERDA off Long Island's coast and load demand data from NYISO. Additionally, we present theoretical insights into BESS sizing and a key time-series property influencing BESS capacity, aiding in simulating wind and demand for estimating BESS energy requirements.
The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) is a natural Bayesian nonparametric extension of the classical Hidden Markov Model for learning from (spatio-)temporal data. The sticky HDP-HMM was proposed to strengthen the self-persistence probability in the HDP-HMM, and the disentangled sticky HDP-HMM was subsequently proposed to decouple the strength of the self-persistence prior from the transition prior. However, the sticky HDP-HMM assumes that the self-persistence probability is stationary, limiting its expressiveness. Here, we build on previous work on the sticky HDP-HMM and disentangled sticky HDP-HMM to develop a more general model: the recurrent sticky HDP-HMM (RS-HDP-HMM). We develop a novel Gibbs sampling strategy for efficient inference in this model. We show that RS-HDP-HMM outperforms the disentangled sticky HDP-HMM, sticky HDP-HMM, and HDP-HMM in both synthetic and real data segmentation.
In this paper, we propose a novel model called Recurrent Explicit Duration Switching Linear Dynamical Systems (REDSLDS) that incorporates recurrent explicit duration variables into the rSLDS model. We also propose an inference and learning scheme that involves the use of P\'olya-gamma augmentation. We demonstrate the improved segmentation capabilities of our model on three benchmark datasets, including two quantitative datasets and one qualitative dataset.
We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on the open-source EHR datasets MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. Forty-two studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, ``SynthEHRella'', is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We find that method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing the fidelity of the synthetic data while controlling privacy exposure, and comprehensive benchmarking of longitudinal or conditional generation methods.
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at \url{https://github.com/SalesforceAIResearch/LaTRO}.
This study explores the influence of FOMC sentiment on market expectations, focusing on cognitive differences between experts and non-experts. Using sentiment analysis of FOMC minutes, we integrate these insights into a bounded rationality model to examine the impact on inflation expectations. Results show that experts form more conservative expectations, anticipating FOMC stabilization actions, while non-experts react more directly to inflation concerns. A lead-lag analysis indicates that institutions adjust faster, though the gap with individual investors narrows in the short term. These findings highlight the need for tailored communication strategies to better align public expectations with policy goals.
Drought has been perceived as a persistent threat globally, and the complex mechanism of various factors contributing to its emergence makes it all the more difficult to understand. Droughts and their severity trends have been a point of concern in the USA as well, since the economic impact of droughts has been substantial, especially in regions that contribute heavily to US agriculture. California is the biggest agricultural contributor to the United States, with its share amounting to approximately 12% of all US agricultural produce, and, according to a 20-year average, it ranks fifth on the list of states with the highest average percentage of drought-hit regions. Therefore, drought analysis and drought prediction are of crucial importance for California in order to mitigate the associated risks. However, the design of a consistent drought prediction model based on the dynamic relationship of the drought index remains a challenging task. In the present study, we trained a Voting Ensemble classifier, utilizing a soft voting system and three different Random Forest models, to predict both the presence of drought and its intensity. In this paper, we first discuss the trends of droughts and their intensities in various California counties, review the correlation of meteorological indicators with drought intensities, and then use these meteorological indicators for drought prediction in order to evaluate their effectiveness and significance.
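A soft-voting ensemble of three random forests can be assembled with scikit-learn as sketched below; the synthetic data and the forest hyperparameters are placeholders, since the study's actual features (the meteorological indicators) and settings are not specified here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

# placeholder data standing in for the meteorological indicators and drought-intensity labels
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# three random forests with different settings, combined by soft (probability) voting
ensemble = VotingClassifier(
    estimators=[
        ("rf_shallow", RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)),
        ("rf_medium", RandomForestClassifier(n_estimators=200, max_depth=15, random_state=1)),
        ("rf_deep", RandomForestClassifier(n_estimators=200, random_state=2)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```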
Amazon is the world's number one online retailer, offering nearly every product a person could need along with a treasure trove of product reviews to help consumers make educated purchases. Companies want to find a way to increase their sales in a very crowded market, and using these data is key. A very good indicator of how a product is selling is its sales rank, which is calculated from the all-time sales of a product, with recent sales weighted more heavily than older sales. Using the data from Amazon products and reviews, we determined that the most influential factors in determining the sales rank of a product were the number of products Amazon showed that other customers also bought, the number of products Amazon showed that customers also viewed, and the price of the product. These results were consistent for the Digital Music category, the Office Products category, and the subcategory Holsters under Cell Phones and Accessories.
Communities on the web rely on open conversation forums for a number of tasks, including governance, information sharing, and decision making. However, these forms of collective deliberation can often result in biased outcomes. A prime example is Articles for Deletion (AfD) discussions on Wikipedia, which allow editors to gauge the notability of existing articles and which, as prior work has suggested, may play a role in perpetuating the notorious gender gap of Wikipedia. Prior attempts to address this question have been hampered by narrow observation windows, reliance on limited subsets of both biographies and editorial outcomes, and potential confounding factors. To address these limitations, here we adopt a competing risk survival framework to fully situate biographical AfD discussions within the full editorial cycle of Wikipedia content. We find that biographies of women are nominated for deletion faster than those of men, despite editors taking longer to reach a consensus for deletion of women's biographies, even after controlling for the size of the discussion. Furthermore, we find that AfDs about historical figures show a strong tendency to result in the redirecting or merging of the biography under discussion into other encyclopedic entries, and that there is a striking gender asymmetry: biographies of women are redirected or merged into biographies of men more often than the other way round. Our study provides a more complete picture of the role of AfD in the gender gap of Wikipedia, with implications for the governance of the open knowledge infrastructure of the web.
Recent literature proposes combining short-term experimental and long-term observational data to provide credible alternatives to conventional observational studies for identification of long-term average treatment effects (LTEs). I show that experimental data have an auxiliary role in this context. They bring no identifying power without additional modeling assumptions. When modeling assumptions are imposed, experimental data serve to amplify their identifying power. If the assumptions fail, adding experimental data may only yield results that are farther from the truth. Motivated by this, I introduce two assumptions on treatment response that may be defensible based on economic theory or intuition. To utilize them, I develop a novel two-step identification approach that centers on bounding temporal link functions -- the relationship between short-term and mean long-term potential outcomes. The approach provides sharp bounds on LTEs for a general class of assumptions, and allows for imperfect experimental compliance -- extending existing results.
We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models. The code is available at https://github.com/team-approx-bayes/ivon-lora.
The wider application of end-to-end learning methods to embodied decision-making domains remains bottlenecked by their reliance on a superabundance of training data representative of the target domain. Meta-reinforcement learning (meta-RL) approaches abandon the aim of zero-shot generalization--the goal of standard reinforcement learning (RL)--in favor of few-shot adaptation, and thus hold promise for bridging larger generalization gaps. While learning this meta-level adaptive behavior still requires substantial data, efficient environment simulators approaching real-world complexity are growing in prevalence. Even so, hand-designing sufficiently diverse and numerous simulated training tasks for these complex domains is prohibitively labor-intensive. Domain randomization (DR) and procedural generation (PG), offered as solutions to this problem, require simulators to possess carefully-defined parameters which directly translate to meaningful task diversity--a similarly prohibitive assumption. In this work, we present DIVA, an evolutionary approach for generating diverse training tasks in such complex, open-ended simulators. Like unsupervised environment design (UED) methods, DIVA can be applied to arbitrary parameterizations, but can additionally incorporate realistically-available domain knowledge--thus inheriting the flexibility and generality of UED, and the supervised structure embedded in well-designed simulators exploited by DR and PG. Our empirical results showcase DIVA's unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These findings highlight the potential of such semi-supervised environment design (SSED) approaches, of which DIVA is the first humble constituent, to enable training in realistic simulated domains, and produce more robust and capable adaptive agents.
Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples. These are generated by slightly altering an image of a certain class in a way that is imperceptible to humans but causes the model to classify it wrongly as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers, or (ii) directly detecting attacked images. Despite the good performance of these detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, the attacker can overcome the detector by running many examples on a local copy and sending only those that were not detected to the actual model. This problem is common in security applications, where even a very good model is not sufficient to ensure safety. In this paper we propose to overcome this inherent limitation of any static defence with randomization. To do so, one must generate a very large family of detectors with consistent performance, and select one or more of them randomly for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network, and if their average activation is sufficiently different between clean and attacked images of the focal class, they are considered a fingerprint and added to the detector bank. During test time, we sample fingerprints from the bank associated with the label predicted by the model, and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.
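A bare-bones sketch of the fingerprint idea described above, with the thresholds, the layer choice, and the Gaussian fits to the subset means all chosen here for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_fingerprints(acts_clean, acts_attacked, n_candidates=500, subset_size=20, gap=1.0):
    """Pick random neuron subsets whose mean activation separates clean from attacked images.

    acts_clean, acts_attacked : (n_images, n_neurons) activations of one layer for one class.
    Returns a list of (indices, clean_stats, attacked_stats) fingerprints.
    """
    fingerprints = []
    n_neurons = acts_clean.shape[1]
    for _ in range(n_candidates):
        idx = rng.choice(n_neurons, size=subset_size, replace=False)
        c = acts_clean[:, idx].mean(axis=1)
        a = acts_attacked[:, idx].mean(axis=1)
        pooled = np.sqrt((c.var() + a.var()) / 2) + 1e-8
        if abs(c.mean() - a.mean()) / pooled > gap:          # sufficiently separated subset
            fingerprints.append((idx, (c.mean(), c.std() + 1e-8),
                                      (a.mean(), a.std() + 1e-8)))
    return fingerprints

def log_likelihood_ratio(acts, fingerprints):
    """acts: (n_neurons,) activations of one test image; large output suggests an attack."""
    llr = 0.0
    for idx, (mc, sc), (ma, sa) in fingerprints:
        x = acts[idx].mean()
        llr += (-0.5 * ((x - ma) / sa) ** 2 - np.log(sa)) \
             - (-0.5 * ((x - mc) / sc) ** 2 - np.log(sc))
    return llr
```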
Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map implemented as a specific interacting particle system on the unit sphere: the input is the empirical measure of tokens in a prompt and its evolution is governed by the continuity equation. In fact, Transformers are not limited to empirical measures and can in principle process any input measure. As the nature of data processed by Transformers is expanding rapidly, it is important to investigate their expressive power as maps from an arbitrary measure to another arbitrary measure. To that end, we provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures, under the minimal assumption that every pair of input-target measures can be matched by some transport map.
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same $\mathcal{O}(1 / \epsilon^2)$ sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with and without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $\mathcal{O}(1 / \epsilon)$ sample complexity when $\epsilon$ is sufficiently small. We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, $k$-core, or count of fixed-length walks. We study the spectral properties of the CGSOs, allowing us to understand their action on graph signals. We confirm this understanding by defining and running the spectral clustering algorithm based on different CGSOs on several synthetic and real-world datasets. We furthermore outline how our CGSOs can act as the message passing operator in any Graph Neural Network, and in particular demonstrate strong performance of a variant of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world benchmark datasets.
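As an illustration of the idea, the sketch below builds a PageRank-normalized shift operator $C^{-1/2} A C^{-1/2}$ with $C = \mathrm{diag}(\text{PageRank})$, by analogy with symmetric degree normalization; whether this matches the exact normalization studied in the paper is an assumption of the sketch.

```python
import numpy as np
import networkx as nx

def pagerank_gso(G, alpha=0.85):
    """Centrality GSO sketch: normalize the adjacency matrix by PageRank scores
    instead of degrees, i.e. C^{-1/2} A C^{-1/2} with C = diag(pagerank)."""
    A = nx.to_numpy_array(G)
    pr_dict = nx.pagerank(G, alpha=alpha)
    pr = np.array([pr_dict[v] for v in G.nodes()])   # same node order as A
    C_inv_sqrt = np.diag(1.0 / np.sqrt(pr))
    return C_inv_sqrt @ A @ C_inv_sqrt

G = nx.karate_club_graph()
S = pagerank_gso(G)
eigvals = np.linalg.eigvalsh(S)   # symmetric, so a real spectrum usable for spectral clustering
```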
In the 1970s, the United States Environmental Protection Agency sponsored Documerica, a large-scale photography initiative to document environmental subjects nation-wide. While over 15,000 digitized public-domain photographs from the collection are available online, most of the images were scanned from damaged copies of the original prints. We present and evaluate a modified histogram matching technique based on the underlying chemistry of the prints for correcting the damaged images by using training data collected from a small set of undamaged prints. The entire set of color-adjusted Documerica images is made available in an open repository.
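Plain histogram matching, on which the chemistry-aware modification described above builds, is a one-liner with scikit-image; the file names below are hypothetical and the chemistry-based adjustment itself is not shown.

```python
import numpy as np
from skimage import exposure, io

# hypothetical file names: a damaged scan and an undamaged reference print
damaged = io.imread("damaged_print.jpg")
reference = io.imread("undamaged_print.jpg")

# match the per-channel histograms of the damaged scan to the reference
corrected = exposure.match_histograms(damaged, reference, channel_axis=-1)
io.imsave("corrected_print.jpg", corrected.astype(np.uint8))
```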
The Koopman operator plays a crucial role in analyzing the global behavior of dynamical systems. Existing data-driven methods for approximating the Koopman operator or discovering the governing equations of the underlying system typically require a fixed set of basis functions, also called a dictionary. The optimal choice of basis functions is highly problem-dependent and often requires domain knowledge. We present a novel gradient descent-based optimization framework for learning suitable and interpretable basis functions from data and show how it can be used in combination with EDMD, SINDy, and PDE-FIND. We illustrate the efficacy of the proposed approach with the aid of various benchmark problems such as the Ornstein-Uhlenbeck process, Chua's circuit, a nonlinear heat equation, as well as protein-folding data.
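For context, a minimal EDMD sketch with a fixed monomial dictionary is given below; the paper's contribution is to replace such a hand-chosen dictionary with basis functions learned by gradient descent, which this sketch does not attempt. The toy linear system and the dictionary are assumptions for illustration.

```python
import numpy as np

def edmd(X, Y, dictionary):
    """EDMD with a fixed dictionary: solve Psi(Y) ~= Psi(X) @ K in least squares.

    X, Y : (m, n) snapshot pairs with Y[i] the successor state of X[i]
    dictionary : callable mapping (m, n) states to (m, k) dictionary evaluations
    """
    Psi_X, Psi_Y = dictionary(X), dictionary(Y)
    K, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)
    return K   # (k, k) approximation of the Koopman operator on the dictionary span

def monomials(X):
    # fixed baseline dictionary: [1, x1, x2, x1^2, x1*x2, x2^2] for 2-D states
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# toy data: a noisy linear contraction standing in for a real system
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
Y = 0.9 * X + 0.05 * rng.normal(size=X.shape)
K = edmd(X, Y, monomials)
print(np.sort(np.abs(np.linalg.eigvals(K)))[::-1][:3])   # leading Koopman eigenvalue moduli
```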