New articles on Statistics

[1] 2405.17573

Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlight the importance of two terms: a kinetic energy which favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high dimensional inputs to a low-dimensional representation, move slowly inside the space of low-dimensional representations, before jumping back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.

[2] 2405.17591

Individualized Dynamic Mediation Analysis Using Latent Factor Models

Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analysis assumes that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity in many real-world applications. Additionally, the presence of unobserved confounding variables imposes a significant challenge to inferring both causal effect and mediation effect. To address these issues, we propose an individualized dynamic mediation analysis method. Our approach can identify the significant mediators of the population level while capturing the time-varying and heterogeneous mediation effects via latent factor modeling on coefficients of structural equation models. Another advantage of our method is that we can infer individualized mediation effects in the presence of unmeasured time-varying confounders. We provide estimation consistency for our proposed causal estimand and selection consistency for significant mediators. Extensive simulation studies and an application to a DNA methylation study demonstrate the effectiveness and advantages of our method.

[3] 2405.17666

Structured Partial Stochasticity in Bayesian Neural Networks

Bayesian neural network posterior distributions have a great number of modes that correspond to the same network function. The abundance of such modes can make it difficult for approximate inference methods to do their job. Recent work has demonstrated the benefits of partial stochasticity for approximate inference in Bayesian neural networks; inference can be less costly and performance can sometimes be improved. I propose a structured way to select the deterministic subset of weights that removes neuron permutation symmetries, and therefore the corresponding redundant posterior modes. With a drastically simplified posterior distribution, the performance of existing approximate inference schemes is found to be greatly improved.

[4] 2405.17668

Towards the use of multiple ROIs for radiomics-based survival modelling: finding a strategy of aggregating lesions

The main objective of this work is to explore the possibility of incorporating radiomic information from multiple lesions into survival models. We hypothesise that when more lesions are present, their inclusion can improve model performance, and we aim to find an optimal strategy for using multiple distinct regions in modelling. The idea of using multiple regions of interest (ROIs) to extract radiomic features for predictive models has been implemented in many recent works. However, in almost all studies, analogous regions were segmented according to particular criteria for all patients -- for example, the primary tumour and peritumoral area, or subregions of the primary tumour. They can be included in a model in a straightforward way as additional features. A more interesting scenario occurs when multiple distinct ROIs are present, such as multiple lesions in a regionally disseminated cancer. Since the number of such regions may differ between patients, their inclusion in a model is non-trivial and requires additional processing steps. We proposed several methods of handling multiple ROIs representing either ROI or risk aggregation strategy, compared them to a published one, and evaluated their performance in different classes of survival models in a Monte Carlo Cross-Validation scheme. We demonstrated the effectiveness of the methods using a cohort of 115 non-small cell lung cancer patients, for whom we predicted the metastasis risk based on features extracted from PET images in original resolution or interpolated to CT image resolution. For both feature sets, incorporating all available lesions, as opposed to a singular ROI representing the primary tumour, allowed for considerable improvement of predictive ability regardless of the model.

[5] 2405.17669

Bayesian Nonparametrics for Principal Stratification with Continuous Post-Treatment Variables

Principal stratification provides a causal inference framework that allows adjustment for confounded post-treatment variables when comparing treatments. Although the literature has focused mainly on binary post-treatment variables, there is a growing interest in principal stratification involving continuous post-treatment variables. However, characterizing the latent principal strata with a continuous post-treatment presents a significant challenge, which is further complicated in observational studies where the treatment is not randomized. In this paper, we introduce the Confounders-Aware SHared atoms BAyesian mixture (CASBAH), a novel approach for principal stratification with continuous post-treatment variables that can be directly applied to observational studies. CASBAH leverages a dependent Dirichlet process, utilizing shared atoms across treatment levels, to effectively control for measured confounders and facilitate information sharing between treatment groups in the identification of principal strata membership. CASBAH also offers a comprehensive quantification of uncertainty surrounding the membership of the principal strata. Through Monte Carlo simulations, we show that the proposed methodology has excellent performance in characterizing the latent principal strata and estimating the effects of treatment on post-treatment variables and outcomes. Finally, CASBAH is applied to a case study in which we estimate the causal effects of US national air quality regulations on pollution levels and health outcomes.

[6] 2405.17684

ZIKQ: An innovative centile chart method for utilizing natural history data in rare disease clinical development

Utilizing natural history data as external control plays an important role in the clinical development of rare diseases, since placebo groups in double-blind randomization trials may not be available due to ethical reasons and low disease prevalence. This article proposed an innovative approach for utilizing natural history data to support rare disease clinical development by constructing reference centile charts. Due to the deterioration nature of certain rare diseases, the distributions of clinical endpoints can be age-dependent and have an absorbing state of zero, which can result in censored natural history data. Existing methods of reference centile charts can not be directly used in the censored natural history data. Therefore, we propose a new calibrated zero-inflated kernel quantile (ZIKQ) estimation to construct reference centile charts from censored natural history data. Using the application to Duchenne Muscular Dystrophy drug development, we demonstrate that the reference centile charts using the ZIKQ method can be implemented to evaluate treatment efficacy and facilitate a more targeted patient enrollment in rare disease clinical development.

[7] 2405.17693

Tamed Langevin sampling under weaker conditions

Motivated by applications to deep learning which often fail standard Lipschitz smoothness requirements, we examine the problem of sampling from distributions that are not log-concave and are only weakly dissipative, with log-gradients allowed to grow superlinearly at infinity. In terms of structure, we only assume that the target distribution satisfies either a log-Sobolev or a Poincar\'e inequality and a local Lipschitz smoothness assumption with modulus growing possibly polynomially at infinity. This set of assumptions greatly exceeds the operational limits of the "vanilla" unadjusted Langevin algorithm (ULA), making sampling from such distributions a highly involved affair. To account for this, we introduce a taming scheme which is tailored to the growth and decay properties of the target distribution, and we provide explicit non-asymptotic guarantees for the proposed sampler in terms of the Kullback-Leibler (KL) divergence, total variation, and Wasserstein distance to the target distribution.

[8] 2405.17707

The Multiplex $p_2$ Model: Mixed-Effects Modeling for Multiplex Social Networks

Social actors are often embedded in multiple social networks, and there is a growing interest in studying social systems from a multiplex network perspective. In this paper, we propose a mixed-effects model for cross-sectional multiplex network data that assumes dyads to be conditionally independent. Building on the uniplex $p_2$ model, we incorporate dependencies between different network layers via cross-layer dyadic effects and actor random effects. These cross-layer effects model the tendencies for ties between two actors and the ties to and from the same actor to be dependent across different relational dimensions. The model can also study the effect of actor and dyad covariates. As simulation-based goodness-of-fit analyses are common practice in applied network studies, we here propose goodness-of-fit measures for multiplex network analyses. We evaluate our choice of priors and the computational faithfulness and inferential properties of the proposed method through simulation. We illustrate the utility of the multiplex $p_2$ model in a replication study of a toxic chemical policy network. An original study that reflects on gossip as perceived by gossip senders and gossip targets, and their differences in perspectives, based on data from 34 Hungarian elementary school classes, highlights the applicability of the proposed method.

[9] 2405.17744

Factor Augmented Matrix Regression

We introduce \underline{F}actor-\underline{A}ugmented \underline{Ma}trix \underline{R}egression (FAMAR) to address the growing applications of matrix-variate data and their associated challenges, particularly with high-dimensionality and covariate correlations. FAMAR encompasses two key algorithms. The first is a novel non-iterative approach that efficiently estimates the factors and loadings of the matrix factor model, utilizing techniques of pre-training, diverse projection, and block-wise averaging. The second algorithm offers an accelerated solution for penalized matrix factor regression. Both algorithms are supported by established statistical and numerical convergence properties. Empirical evaluations, conducted on synthetic and real economics datasets, demonstrate FAMAR's superiority in terms of accuracy, interpretability, and computational speed. Our application to economic data showcases how matrix factors can be incorporated to predict the GDPs of the countries of interest, and the influence of these factors on the GDPs.

[10] 2405.17806

Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

Topic modeling is a widely utilized tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a scenario with $n$ documents, a vocabulary of size $p$, and document lengths at the order $N$. When $N\geq c\cdot p$, referred to as the long-document case, the optimal rate is established in the literature at $\sqrt{p/(Nn)}$. However, when $N=o(p)$, referred to as the short-document case, the optimal rate remains unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still $\sqrt{p/(Nn)}$ in the short-document case.

[11] 2405.17823

Spectral Truncation Kernels: Noncommutativity in $C^*$-algebraic Kernel Machines

In this paper, we propose a new class of positive definite kernels based on the spectral truncation, which has been discussed in the fields of noncommutative geometry and $C^*$-algebra. We focus on kernels whose inputs and outputs are functions and generalize existing kernels, such as polynomial, product, and separable kernels, by introducing a truncation parameter $n$ that describes the noncommutativity of the products appearing in the kernels. When $n$ goes to infinity, the proposed kernels tend to the existing commutative kernels. If $n$ is finite, they exhibit different behavior, and the noncommutativity induces interactions along the data function domain. We show that the truncation parameter $n$ is a governing factor leading to performance enhancement: by setting an appropriate $n$, we can balance the representation power and the complexity of the representation space. The flexibility of the proposed class of kernels allows us to go beyond previous commutative kernels.

[12] 2405.17828

On Robust Clustering of Temporal Point Process

Clustering of event stream data is of great importance in many application scenarios, including but not limited to, e-commerce, electronic health, online testing, mobile music service, etc. Existing clustering algorithms fail to take outlier data into consideration and are implemented without theoretical guarantees. In this paper, we propose a robust temporal point processes clustering framework which works under mild assumptions and meanwhile addresses several important issues in the event stream clustering problem.Specifically, we introduce a computationally efficient model-free distance function to quantify the dissimilarity between different event streams so that the outliers can be detected and the good initial clusters could be obtained. We further consider an expectation-maximization-type algorithm incorporated with a Catoni's influence function for robust estimation and fine-tuning of clusters. We also establish the theoretical results including algorithmic convergence, estimation error bound, outlier detection, etc. Simulation results corroborate our theoretical findings and real data applications show the effectiveness of our proposed methodology.

[13] 2405.17834

Revisiting Step-Size Assumptions in Stochastic Approximation

Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $\alpha_n = \alpha_0 n^{-\rho}$ at iteration $n$, with $\rho \in [0,1]$ and $\alpha_0>0$ design parameters. It is most common in practice to take $\rho=0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $\rho \in (1/2, 1)$ it is known that on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0<\rho<1$: $\bullet$ Parameter estimates converge with probability one, and also in $L_p$ for any $p\ge 1$. $\bullet$ The MSE may converge very slowly for small $\rho$, of order $O(\alpha_n^2)$ even with averaging. $\bullet$ For linear stochastic approximation the source of slow convergence is identified: for any $\rho\in (0,1)$, averaging results in estimates for which the error $\textit{covariance}$ vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the $\textit{bias}$ converges to zero at rate $O(\alpha_n)$. This is the first paper to obtain such strong conclusions while allowing for $\rho \le 1/2$. A major conclusion is that the choice of $\rho =0$ or even $\rho<1/2$ is justified only in select settings -- In general, bias may preclude fast convergence.

[14] 2405.17919

Fisher's Legacy of Directional Statistics, and Beyond to Statistics on Manifolds

It will not be an exaggeration to say that R A Fisher is the Albert Einstein of Statistics. He pioneered almost all the main branches of statistics, but it is not as well known that he opened the area of Directional Statistics with his 1953 paper introducing a distribution on the sphere which is now known as the Fisher distribution. He stressed that for spherical data one should take into account that the data is on a manifold. We will describe this Fisher distribution and reanalyse his geological data. We also comment on the two goals he set himself in that paper, and how he reinvented the von Mises distribution on the circle. Since then, many extensions of this distribution have appeared bearing Fisher's name such as the von Mises Fisher distribution and the matrix Fisher distribution. In fact, the subject of Directional Statistics has grown tremendously in the last two decades with new applications emerging in Life Sciences, Image Analysis, Machine Learning and so on. We give a recent new method of constructing the Fisher type distribution which has been motivated by some problems in Machine Learning. The subject related to his distribution has evolved since then more broadly as Statistics on Manifolds which also includes the new field of Shape Analysis. We end with a historical note pointing out some correspondence between D'Arcy Thompson and R A Fisher related to Shape Analysis.

[15] 2405.17954

Comparison of predictive values with paired samples

Positive predictive value and negative predictive value are two widely used parameters to assess the clinical usefulness of a medical diagnostic test. When there are two diagnostic tests, it is recommendable to make a comparative assessment of the values of these two parameters after applying the two tests to the same subjects (paired samples). The objective is then to make individual or global inferences about the difference or the ratio of the predictive value of the two diagnostic tests. These inferences are usually based on complex and not very intuitive expressions, some of which have subsequently been reformulated. We define the two properties of symmetry which any inference method must verify - symmetry in diagnoses and symmetry in the tests -, we propose new inference methods, and we define them with simple expressions. All of the methods are compared with each other, selecting the optimal method: (a) to obtain a confidence interval for the difference or ratio; (b) to perform an individual homogeneity test of the two predictive values; and (c) to carry out a global homogeneity test of the two predictive values.

[16] 2405.17955

Efficient Prior Calibration From Indirect Data

Bayesian inversion is central to the quantification of uncertainty within problems arising from numerous applications in science and engineering. To formulate the approach, four ingredients are required: a forward model mapping the unknown parameter to an element of a solution space, often the solution space for a differential equation; an observation operator mapping an element of the solution space to the data space; a noise model describing how noise pollutes the observations; and a prior model describing knowledge about the unknown parameter before the data is acquired. This paper is concerned with learning the prior model from data; in particular, learning the prior from multiple realizations of indirect data obtained through the noisy observation process. The prior is represented, using a generative model, as the pushforward of a Gaussian in a latent space; the pushforward map is learned by minimizing an appropriate loss function. A metric that is well-defined under empirical approximation is used to define the loss function for the pushforward map to make an implementable methodology. Furthermore, an efficient residual-based neural operator approximation of the forward model is proposed and it is shown that this may be learned concurrently with the pushforward map, using a bilevel optimization formulation of the problem; this use of neural operator approximation has the potential to make prior learning from indirect data more computationally efficient, especially when the observation process is expensive, non-smooth or not known. The ideas are illustrated with the Darcy flow inverse problem of finding permeability from piezometric head measurements.

[17] 2405.17972

Inference for the stochastic FitzHugh-Nagumo model from real action potential data via approximate Bayesian computation

The stochastic FitzHugh-Nagumo (FHN) model considered here is a two-dimensional nonlinear stochastic differential equation with additive degenerate noise, whose first component, the only one observed, describes the membrane voltage evolution of a single neuron. Due to its low dimensionality, its analytical and numerical tractability, and its neuronal interpretation, it has been used as a case study to test the performance of different statistical methods in estimating the underlying model parameters. Existing methods, however, often require complete observations, non-degeneracy of the noise or a complex architecture (e.g., to estimate the transition density of the process, "recovering" the unobserved second component), and they may not (satisfactorily) estimate all model parameters simultaneously. Moreover, these studies lack real data applications for the stochastic FHN model. Here, we tackle all challenges (non-globally Lipschitz drift, non-explicit solution, lack of available transition density, degeneracy of the noise, and partial observations) via an intuitive and easy-to-implement sequential Monte Carlo approximate Bayesian computation algorithm. The proposed method relies on a recent computationally efficient and structure-preserving numerical splitting scheme for synthetic data generation, and on summary statistics exploiting the structural properties of the process. We succeed in estimating all model parameters from simulated data and, more remarkably, real action potential data of rats. The presented novel real-data fit may broaden the scope and credibility of this classic and widely used neuronal model.

[18] 2405.18005

Persistence Diagram Estimation : Beyond Plug-in Approaches

Persistent homology is a tool from Topological Data Analysis (TDA) used to summarize the topology underlying data. It can be conveniently represented through persistence diagrams. Observing a noisy signal, common strategies to infer its persistence diagram involve plug-in estimators, and convergence properties are then derived from sup-norm stability. This dependence on the sup-norm convergence of the preliminary estimator is restrictive, as it essentially imposes to consider regular classes of signals. Departing from these approaches, we design an estimator based on image persistence. In the context of the Gaussian white noise model, and for large classes of piecewise-H\"older signals, we prove that the proposed estimator is consistent and achieves minimax rates. Notably, these rates coincide with the well known minimax rates for H\"older continuous signals.

[19] 2405.18020

The association between environmental variables and short-term mortality: evidence from Europe

Using fine-grained, publicly available data, this paper studies the association between environmental factors, i.e., variables capturing weather and air pollution characteristics, and weekly mortality rates in small geographical regions in Europe. Hereto, we develop a mortality modelling framework where a baseline captures a region-specific, seasonal historical trend observed within the weekly mortality rates. Using a machine learning algorithm, we then explain deviations from this baseline using anomalies and extreme indices constructed from the environmental data. We illustrate our proposed modelling framework through a case study on more than 550 NUTS 3 regions (Nomenclature of Territorial Units for Statistics, level 3) located in 20 different European countries. Through interpretation tools, we unravel insights into which environmental features are most important when estimating excess or deficit mortality with respect to the baseline and explore how these features interact. Moreover, we investigate harvesting effects of the environmental features through our constructed weekly mortality modelling framework. Our findings show that temperature-related features exert the most significant influence in explaining deviations in mortality from the baseline. Furthermore, we find that environmental features prove particularly beneficial in southern regions for explaining elevated levels of mortality over short time periods.

[20] 2405.18051

Predicting Progression Events in Multiple Myeloma from Routine Blood Work

The ability to accurately predict disease progression is paramount for optimizing multiple myeloma patient care. This study introduces a hybrid neural network architecture, combining Long Short-Term Memory networks with a Conditional Restricted Boltzmann Machine, to predict future blood work of affected patients from a series of historical laboratory results. We demonstrate that our model can replicate the statistical moments of the time series ($0.95~\pm~0.01~\geq~R^2~\geq~0.83~\pm~0.03$) and forecast future blood work features with high correlation to actual patient data ($0.92\pm0.02~\geq~r~\geq~0.52~\pm~0.09$). Subsequently, a second Long Short-Term Memory network is employed to detect and annotate disease progression events within the forecasted blood work time series. We show that these annotations enable the prediction of progression events with significant reliability (AUROC$~=~0.88~\pm~0.01$), up to 12 months in advance (AUROC($t+12~$mos)$~=0.65~\pm~0.01$). Our system is designed in a modular fashion, featuring separate entities for forecasting and progression event annotation. This structure not only enhances interpretability but also facilitates the integration of additional modules to perform subsequent operations on the generated outputs. Our approach utilizes a minimal set of routine blood work measurements, which avoids the need for expensive or resource-intensive tests and ensures accessibility of the system in clinical routine. This capability allows for individualized risk assessment and making informed treatment decisions tailored to a patient's unique disease kinetics. The represented approach contributes to the development of a scalable and cost-effective virtual human twin system for optimized healthcare resource utilization and improved patient outcomes in multiple myeloma care.

[21] 2405.18055

Dimension-free uniform concentration bound for logistic regression

We provide a novel dimension-free uniform concentration bound for the empirical risk function of constrained logistic regression. Our bound yields a milder sufficient condition for a uniform law of large numbers than conditions derived by the Rademacher complexity argument and McDiarmid's inequality. The derivation is based on the PAC-Bayes approach with second-order expansion and Rademacher-complexity-based bounds for the residual term of the expansion.

[22] 2405.18076

Prediction of energy consumption in hotels using ANN

The increase in travelers and stays in tourist destinations is leading hotels to be aware of their ecological management and the need for efficient energy consumption. To achieve this, hotels are increasingly using digitalized systems and more frequent measurements are made of the variables that affect their management. Electricity can play a significant role, predicting electricity usage in hotels, which in turn can enhance their circularity - an approach aimed at sustainable and efficient resource use. In this study, neural networks are trained to predict electricity usage patterns in two hotels based on historical data. The results indicate that the predictions have a good accuracy level of around 2.5% in MAPE, showing the potential of using these techniques for electricity forecasting in hotels. Additionally, neural network models can use climatological data to improve predictions. By accurately forecasting energy demand, hotels can optimize their energy procurement and usage, moving energy-intensive activities to off-peak hours to reduce costs and strain on the grid, assisting in the better integration of renewable energy sources, or identifying patterns and anomalies in energy consumption, suggesting areas for efficiency improvements, among other. Hence, by optimizing the allocation of resources, reducing waste and improving efficiency these models can improve hotel's circularity.

[23] 2405.18081

Optimality of Approximate Message Passing Algorithms for Spiked Matrix Models with Rotationally Invariant Noise

We study the problem of estimating a rank one signal matrix from an observed matrix generated by corrupting the signal with additive rotationally invariant noise. We develop a new class of approximate message-passing algorithms for this problem and provide a simple and concise characterization of their dynamics in the high-dimensional limit. At each iteration, these algorithms exploit prior knowledge about the noise structure by applying a non-linear matrix denoiser to the eigenvalues of the observed matrix and prior information regarding the signal structure by applying a non-linear iterate denoiser to the previous iterates generated by the algorithm. We exploit our result on the dynamics of these algorithms to derive the optimal choices for the matrix and iterate denoisers. We show that the resulting algorithm achieves the smallest possible asymptotic estimation error among a broad class of iterative algorithms under a fixed iteration budget.

[24] 2405.18091

An adaptive transfer learning perspective on classification in non-stationary environments

We consider a semi-supervised classification problem with non-stationary label-shift in which we observe a labelled data set followed by a sequence of unlabelled covariate vectors in which the marginal probabilities of the class labels may change over time. Our objective is to predict the corresponding class-label for each covariate vector, without ever observing the ground-truth labels, beyond the initial labelled data set. Previous work has demonstrated the potential of sophisticated variants of online gradient descent to perform competitively with the optimal dynamic strategy (Bai et al. 2022). In this work we explore an alternative approach grounded in statistical methods for adaptive transfer learning. We demonstrate the merits of this alternative methodology by establishing a high-probability regret bound on the test error at any given individual test-time, which adapt automatically to the unknown dynamics of the marginal label probabilities. Further more, we give bounds on the average dynamic regret which match the average guarantees of the online learning perspective for any given time interval.

[25] 2405.18095

Is machine learning good or bad for the natural sciences?

Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here, we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they introduce strong confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

[26] 2405.18176

SEMF: Supervised Expectation-Maximization Framework for Predicting Intervals

This work introduces the Supervised Expectation-Maximization Framework (SEMF), a versatile and model-agnostic framework that generates prediction intervals for datasets with complete or missing data. SEMF extends the Expectation-Maximization (EM) algorithm, traditionally used in unsupervised learning, to a supervised context, enabling it to extract latent representations for uncertainty estimation. The framework demonstrates robustness through extensive empirical evaluation across 11 tabular datasets, achieving$\unicode{x2013}$in some cases$\unicode{x2013}$narrower normalized prediction intervals and higher coverage than traditional quantile regression methods. Furthermore, SEMF integrates seamlessly with existing machine learning algorithms, such as gradient-boosted trees and neural networks, exemplifying its usefulness for real-world applications. The experimental results highlight SEMF's potential to advance state-of-the-art techniques in uncertainty quantification.

[27] 2405.18220

Non-negative Tensor Mixture Learning for Discrete Density Estimation

We present an expectation-maximization (EM) based unified framework for non-negative tensor decomposition that optimizes the Kullback-Leibler divergence. To avoid iterations in each M-step and learning rate tuning, we establish a general relationship between low-rank decomposition and many-body approximation. Using this connection, we exploit that the closed-form solution of the many-body approximation can be used to update all parameters simultaneously in the M-step. Our framework not only offers a unified methodology for a variety of low-rank structures, including CP, Tucker, and Train decompositions, but also their combinations forming mixtures of tensors as well as robust adaptive noise modeling. Empirically, we demonstrate that our framework provides superior generalization for discrete density estimation compared to conventional tensor-based approaches.

[28] 2405.18232

Guidelines and Best Practices to Share Deidentified Data and Code

In 2022, the Journal of Statistics and Data Science Education (JSDSE) instituted augmented requirements for authors to post deidentified data and code underlying their papers. These changes were prompted by an increased focus on reproducibility and open science (NASEM 2019). A recent review of data availability practices noted that "such policies help increase the reproducibility of the published literature, as well as make a larger body of data available for reuse and re-analysis" (PLOS ONE, 2024). JSDSE values accessibility as it endeavors to share knowledge that can improve educational approaches to teaching statistics and data science. Because institution, environment, and students differ across readers of the journal, it is especially important to facilitate the transfer of a journal article's findings to new contexts. This process may require digging into more of the details, including the deidentified data and code. Our goal is to provide our readers and authors with a review of why the requirements for code and data sharing were instituted, summarize ongoing trends and developments in open science, discuss options for data and code sharing, and share advice for authors.

[29] 2405.18271

Unraveling Factors Influencing Shooting Incidents: Preliminary Analysis and Insights

The following is a write up of the progress of modeling data from the K12 organization \cite{Riedman_2023}. Data was characterized and investigated for statistically significant factors. The incident data was spilt into three sets: the entire set of incidents, incidents from 1966 - 2017, and incidents from 2018 - 2023. This was done in an attempt to discern key factors for the acceleration of incidents over the last several years. The data set was cleaned and processed primarily through RStudio. The individual factors were studied and subjected to statistical analysis where appropriate. As it turns out, there are differences between media portrayals of shooters and actual shooters. Then, multiple regression techniques were performed then followed by ANOVA of the models to determine statistically significant independent variables and their influence on casualties. Thus far, linear regression and negative binomial regression have been attempted. Further refining of the methods will be necessary for Poisson regression and logistic regression to be viably attempted. At this point in time a common theme among each of the models is the presence of targeted attacks affecting casualties. Further study can lead to improved safe guarding strategies to eliminate or minimize casualties. Further, increased understanding of shooter demographics can also lead to outreach and prevention programs.

[30] 2405.18274

Signal-Plus-Noise Decomposition of Nonlinear Spiked Random Matrix Models

In this paper, we study a nonlinear spiked random matrix model where a nonlinear function is applied element-wise to a noise matrix perturbed by a rank-one signal. We establish a signal-plus-noise decomposition for this model and identify precise phase transitions in the structure of the signal components at critical thresholds of signal strength. To demonstrate the applicability of this decomposition, we then utilize it to study new phenomena in the problems of signed signal recovery in nonlinear models and community detection in transformed stochastic block models. Finally, we validate our results through a series of numerical simulations.

[31] 2405.18279

Identifiability, Observability, Uncertainty and Bayesian System Identification of Epidemiological Models

In this project, identifiability, observability and uncertainty properties of the deterministic and Chain Binomial stochastic SIR, SEIR and SEIAR epidemiological models are studied. Techniques for modeling overdispersion are investigated and used to compare simulated trajectories for moderately sized, homogenous populations. With the chosen model parameters overdispersion was found to have small impact, but larger impact on smaller populations and simulations closer to the initial outbreak of an epidemic. Using a software tool for model identifiability and observability (DAISY[Bellu et al. 2007]), the deterministic SIR and SEIR models was found to be structurally identifiable and observable under mild conditions, while SEIAR in general remains structurally unidentifiable and unobservable. Sequential Monte Carlo and Markov Chain Monte Carlo methods were implemented in a custom C++ library and applied to stochastic SIR, SEIR and SEIAR models in order to generate parameter distributions. With the chosen model parameters overdispersion was found to have a small impact on parameter distributions for SIR and SEIR models. For SEIAR, the algorithm did not converge around the true parameters of the deterministic model. The custom C++ library was found to be computationally efficient, and is very likely to be used in future projects.

[32] 2405.18284

Adaptive debiased SGD in high-dimensional GLMs with steaming data

Online statistical inference facilitates real-time analysis of sequentially collected data, making it different from traditional methods that rely on static datasets. This paper introduces a novel approach to online inference in high-dimensional generalized linear models, where we update regression coefficient estimates and their standard errors upon each new data arrival. In contrast to existing methods that either require full dataset access or large-dimensional summary statistics storage, our method operates in a single-pass mode, significantly reducing both time and space complexity. The core of our methodological innovation lies in an adaptive stochastic gradient descent algorithm tailored for dynamic objective functions, coupled with a novel online debiasing procedure. This allows us to maintain low-dimensional summary statistics while effectively controlling optimization errors introduced by the dynamically changing loss functions. We demonstrate that our method, termed the Approximated Debiased Lasso (ADL), not only mitigates the need for the bounded individual probability condition but also significantly improves numerical performance. Numerical experiments demonstrate that the proposed ADL method consistently exhibits robust performance across various covariance matrix structures.

[33] 2405.18288

Stagewise Boosting Distributional Regression

Forward stagewise regression is a simple algorithm that can be used to estimate regularized models. The updating rule adds a small constant to a regression coefficient in each iteration, such that the underlying optimization problem is solved slowly with small improvements. This is similar to gradient boosting, with the essential difference that the step size is determined by the product of the gradient and a step length parameter in the latter algorithm. One often overlooked challenge in gradient boosting for distributional regression is the issue of a vanishing small gradient, which practically halts the algorithm's progress. We show that gradient boosting in this case oftentimes results in suboptimal models, especially for complex problems certain distributional parameters are never updated due to the vanishing gradient. Therefore, we propose a stagewise boosting-type algorithm for distributional regression, combining stagewise regression ideas with gradient boosting. Additionally, we extend it with a novel regularization method, correlation filtering, to provide additional stability when the problem involves a large number of covariates. Furthermore, the algorithm includes best-subset selection for parameters and can be applied to big data problems by leveraging stochastic approximations of the updating steps. Besides the advantage of processing large datasets, the stochastic nature of the approximations can lead to better results, especially for complex distributions, by reducing the risk of being trapped in a local optimum. The performance of our proposed stagewise boosting distributional regression approach is investigated in an extensive simulation study and by estimating a full probabilistic model for lightning counts with data of more than 9.1 million observations and 672 covariates.

[34] 2405.18298

Context-Specific Refinements of Bayesian Network Classifiers

Supervised classification is one of the most ubiquitous tasks in machine learning. Generative classifiers based on Bayesian networks are often used because of their interpretability and competitive accuracy. The widely used naive and TAN classifiers are specific instances of Bayesian network classifiers with a constrained underlying graph. This paper introduces novel classes of generative classifiers extending TAN and other famous types of Bayesian network classifiers. Our approach is based on staged tree models, which extend Bayesian networks by allowing for complex, context-specific patterns of dependence. We formally study the relationship between our novel classes of classifiers and Bayesian networks. We introduce and implement data-driven learning routines for our models and investigate their accuracy in an extensive computational study. The study demonstrates that models embedding asymmetric information can enhance classification accuracy.

[35] 2405.18306

Learning Staged Trees from Incomplete Data

Staged trees are probabilistic graphical models capable of representing any class of non-symmetric independence via a coloring of its vertices. Several structural learning routines have been defined and implemented to learn staged trees from data, under the frequentist or Bayesian paradigm. They assume a data set has been observed fully and, in practice, observations with missing entries are either dropped or imputed before learning the model. Here, we introduce the first algorithms for staged trees that handle missingness within the learning of the model. To this end, we characterize the likelihood of staged tree models in the presence of missing data and discuss pseudo-likelihoods that approximate it. A structural expectation-maximization algorithm estimating the model directly from the full likelihood is also implemented and evaluated. A computational experiment showcases the performance of the novel learning algorithms, demonstrating that it is feasible to account for different missingness patterns when learning staged trees.

[36] 2405.18323

Optimal Design in Repeated Testing for Count Data

In this paper, we develop optimal designs for growth curve models with count data based on the Rasch Poisson-Gamma counts (RPGCM) model. This model is often used in educational and psychological testing when test results yield count data. In the RPGCM, the test scores are determined by respondents ability and item difficulty. Locally D-optimal designs are derived for maximum quasi-likelihood estimation to efficiently estimate the mean abilities of the respondents over time. Using the log link, both unstructured, linear and nonlinear growth curves of log mean abilities are taken into account. Finally, the sensitivity of the derived optimal designs due to an imprecise choice of parameter values is analyzed using D-efficiency.

[37] 2405.18373

A Hessian-Aware Stochastic Differential Equation for Modelling SGD

Continuous-time approximation of Stochastic Gradient Descent (SGD) is a crucial tool to study its escaping behaviors from stationary points. However, existing stochastic differential equation (SDE) models fail to fully capture these behaviors, even for simple quadratic objectives. Built on a novel stochastic backward error analysis framework, we derive the Hessian-Aware Stochastic Modified Equation (HA-SME), an SDE that incorporates Hessian information of the objective function into both its drift and diffusion terms. Our analysis shows that HA-SME matches the order-best approximation error guarantee among existing SDE models in the literature, while achieving a significantly reduced dependence on the smoothness parameter of the objective. Further, for quadratic objectives, under mild conditions, HA-SME is proved to be the first SDE model that recovers exactly the SGD dynamics in the distributional sense. Consequently, when the local landscape near a stationary point can be approximated by quadratics, HA-SME is expected to accurately predict the local escaping behaviors of SGD.

[38] 2405.18379

A Note on the Prediction-Powered Bootstrap

We introduce PPBoot: a bootstrap-based method for prediction-powered inference. PPBoot is applicable to arbitrary estimation problems and is very simple to implement, essentially only requiring one application of the bootstrap. Through a series of examples, we demonstrate that PPBoot often performs nearly identically to (and sometimes better than) the earlier PPI(++) method based on asymptotic normality$\unicode{x2013}$when the latter is applicable$\unicode{x2013}$without requiring any asymptotic characterizations. Given its versatility, PPBoot could simplify and expand the scope of application of prediction-powered inference to problems where central limit theorems are hard to prove.

[39] 2405.18412

Tensor Methods in High Dimensional Data Analysis: Opportunities and Challenges

Large amount of multidimensional data represented by multiway arrays or tensors are prevalent in modern applications across various fields such as chemometrics, genomics, physics, psychology, and signal processing. The structural complexity of such data provides vast new opportunities for modeling and analysis, but efficiently extracting information content from them, both statistically and computationally, presents unique and fundamental challenges. Addressing these challenges requires an interdisciplinary approach that brings together tools and insights from statistics, optimization and numerical linear algebra among other fields. Despite these hurdles, significant progress has been made in the last decade. This review seeks to examine some of the key advancements and identify common threads among them, under eight different statistical settings.

[40] 2405.18413

Homophily-adjusted social influence estimation

Homophily and social influence are two key concepts of social network analysis. Distinguishing between these phenomena is difficult, and approaches to disambiguate the two have been primarily limited to longitudinal data analyses. In this study, we provide sufficient conditions for valid estimation of social influence through cross-sectional data, leading to a novel homophily-adjusted social influence model which addresses the backdoor pathway of latent homophilic features. The oft-used network autocorrelation model (NAM) is the special case of our proposed model with no latent homophily, suggesting that the NAM is only valid when all homophilic attributes are observed. We conducted an extensive simulation study to evaluate the performance of our proposed homophily-adjusted model, comparing its results with those from the conventional NAM. Our findings shed light on the nuanced dynamics of social networks, presenting a valuable tool for researchers seeking to estimate the effects of social influence while accounting for homophily. Code to implement our approach is available at

[41] 2405.18427

Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets

We derive closed-form expressions for the Bayes optimal decision boundaries in binary classification of high dimensional overlapping Gaussian mixture model (GMM) data, and show how they depend on the eigenstructure of the class covariances, for particularly interesting structured data. We empirically demonstrate, through experiments on synthetic GMMs inspired by real-world data, that deep neural networks trained for classification, learn predictors which approximate the derived optimal classifiers. We further extend our study to networks trained on authentic data, observing that decision thresholds correlate with the covariance eigenvectors rather than the eigenvalues, mirroring our GMM analysis. This provides theoretical insights regarding neural networks' ability to perform probabilistic inference and distill statistical patterns from intricate distributions.

[42] 2405.14741

Bagging Improves Generalization Exponentially

Bagging is a popular ensemble technique to improve the accuracy of machine learning models. It hinges on the well-established rationale that, by repeatedly retraining on resampled data, the aggregated model exhibits lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on bagging: By suitably aggregating the base learners at the parametrization instead of the output level, bagging improves generalization performances exponentially, a strength that is significantly more powerful than variance reduction. More precisely, we show that for general stochastic optimization problems that suffer from slowly (i.e., polynomially) decaying generalization errors, bagging can effectively reduce these errors to an exponential decay. Moreover, this power of bagging is agnostic to the solution schemes, including common empirical risk minimization, distributionally robust optimization, and various regularizations. We demonstrate how bagging can substantially improve generalization performances in a range of examples involving heavy-tailed data that suffer from intrinsically slow rates.

[43] 2405.16413

Augmented Risk Prediction for the Onset of Alzheimer's Disease from Electronic Health Records with Large Language Models

Alzheimer's disease (AD) is the fifth-leading cause of death among Americans aged 65 and older. Screening and early detection of AD and related dementias (ADRD) are critical for timely intervention and for identifying clinical trial participants. The widespread adoption of electronic health records (EHRs) offers an important resource for developing ADRD screening tools such as machine learning based predictive models. Recent advancements in large language models (LLMs) demonstrate their unprecedented capability of encoding knowledge and performing reasoning, which offers them strong potential for enhancing risk prediction. This paper proposes a novel pipeline that augments risk prediction by leveraging the few-shot inference power of LLMs to make predictions on cases where traditional supervised learning methods (SLs) may not excel. Specifically, we develop a collaborative pipeline that combines SLs and LLMs via a confidence-driven decision-making mechanism, leveraging the strengths of SLs in clear-cut cases and LLMs in more complex scenarios. We evaluate this pipeline using a real-world EHR data warehouse from Oregon Health \& Science University (OHSU) Hospital, encompassing EHRs from over 2.5 million patients and more than 20 million patient encounters. Our results show that our proposed approach effectively combines the power of SLs and LLMs, offering significant improvements in predictive performance. This advancement holds promise for revolutionizing ADRD screening and early detection practices, with potential implications for better strategies of patient management and thus improving healthcare.

[44] 2405.17395

CrEIMBO: Cross Ensemble Interactions in Multi-view Brain Observations

Modern recordings of neural activity provide diverse observations of neurons across brain areas, behavioral conditions, and subjects -- thus presenting an exciting opportunity to reveal the fundamentals of brain-wide dynamics underlying cognitive function. Current methods, however, often fail to fully harness the richness of such data as they either provide an uninterpretable representation (e.g., via "black box" deep networks) or over-simplify the model (e.g., assume stationary dynamics or analyze each session independently). Here, instead of regarding asynchronous recordings that lack alignment in neural identity or brain areas as a limitation, we exploit these diverse views of the same brain system to learn a unified model of brain dynamics. We assume that brain observations stem from the joint activity of a set of functional neural ensembles (groups of co-active neurons) that are similar in functionality across recordings, and propose to discover the ensemble and their non-stationary dynamical interactions in a new model we term CrEIMBO (Cross-Ensemble Interactions in Multi-view Brain Observations). CrEIMBO identifies the composition of the per-session neural ensembles through graph-driven dictionary learning and models the ensemble dynamics as a latent sparse time-varying decomposition of global sub-circuits, thereby capturing non-stationary dynamics. CrEIMBO identifies multiple co-active sub-circuits while maintaining representation interpretability due to sharing sub-circuits across sessions. CrEIMBO distinguishes session-specific from global (session-invariant) computations by exploring when distinct sub-circuits are active. We demonstrate CrEIMBO's ability to recover ground truth components in synthetic data and uncover meaningful brain dynamics, capturing cross-subject and inter- and intra-area variability, in high-density electrode recordings of humans performing a memory task.

[45] 2405.17455

WeatherFormer: A Pretrained Encoder Model for Learning Robust Weather Representations from Small Datasets

This paper introduces WeatherFormer, a transformer encoder-based model designed to learn robust weather features from minimal observations. It addresses the challenge of modeling complex weather dynamics from small datasets, a bottleneck for many prediction tasks in agriculture, epidemiology, and climate science. WeatherFormer was pretrained on a large pretraining dataset comprised of 39 years of satellite measurements across the Americas. With a novel pretraining task and fine-tuning, WeatherFormer achieves state-of-the-art performance in county-level soybean yield prediction and influenza forecasting. Technical innovations include a unique spatiotemporal encoding that captures geographical, annual, and seasonal variations, adapting the transformer architecture to continuous weather data, and a pretraining strategy to learn representations that are robust to missing weather features. This paper for the first time demonstrates the effectiveness of pretraining large transformer encoder models for weather-dependent applications across multiple domains.

[46] 2405.17464

Data Valuation by Leveraging Global and Local Statistical Information

Data valuation has garnered increasing attention in recent years, given the critical role of high-quality data in various applications, particularly in machine learning tasks. There are diverse technical avenues to quantify the value of data within a corpus. While Shapley value-based methods are among the most widely used techniques in the literature due to their solid theoretical foundation, the accurate calculation of Shapley values is often intractable, leading to the proposal of numerous approximated calculation methods. Despite significant progress, nearly all existing methods overlook the utilization of distribution information of values within a data corpus. In this paper, we demonstrate that both global and local statistical information of value distributions hold significant potential for data valuation within the context of machine learning. Firstly, we explore the characteristics of both global and local value distributions across several simulated and real data corpora. Useful observations and clues are obtained. Secondly, we propose a new data valuation method that estimates Shapley values by incorporating the explored distribution characteristics into an existing method, AME. Thirdly, we present a new path to address the dynamic data valuation problem by formulating an optimization problem that integrates information of both global and local value distributions. Extensive experiments are conducted on Shapley value estimation, value-based data removal/adding, mislabeled data detection, and incremental/decremental data valuation. The results showcase the effectiveness and efficiency of our proposed methodologies, affirming the significant potential of global and local value distributions in data valuation.

[47] 2405.17478

ROSE: Register Assisted General Time Series Forecasting with Decomposed Frequency Learning

With the increasing collection of time series data from various domains, there arises a strong demand for general time series forecasting models pre-trained on a large number of time-series datasets to support a variety of downstream prediction tasks. Enabling general time series forecasting faces two challenges: how to obtain unified representations from multi-domian time series data, and how to capture domain-specific features from time series data across various domains for adaptive transfer in downstream tasks. To address these challenges, we propose a Register Assisted General Time Series Forecasting Model with Decomposed Frequency Learning (ROSE), a novel pre-trained model for time series forecasting. ROSE employs Decomposed Frequency Learning for the pre-training task, which decomposes coupled semantic and periodic information in time series with frequency-based masking and reconstruction to obtain unified representations across domains. We also equip ROSE with a Time Series Register, which learns to generate a register codebook to capture domain-specific representations during pre-training and enhances domain-adaptive transfer by selecting related register tokens on downstream tasks. After pre-training on large-scale time series data, ROSE achieves state-of-the-art forecasting performance on 8 real-world benchmarks. Remarkably, even in few-shot scenarios, it demonstrates competitive or superior performance compared to existing methods trained with full data.

[48] 2405.17479

A rationale from frequency perspective for grokking in training neural network

Grokking is the phenomenon where neural networks NNs initially fit the training data and later generalize to the test data during training. In this paper, we empirically provide a frequency perspective to explain the emergence of this phenomenon in NNs. The core insight is that the networks initially learn the less salient frequency components present in the test data. We observe this phenomenon across both synthetic and real datasets, offering a novel viewpoint for elucidating the grokking phenomenon by characterizing it through the lens of frequency dynamics during the training process. Our empirical frequency-based analysis sheds new light on understanding the grokking phenomenon and its underlying mechanisms.

[49] 2405.17490

Revisit, Extend, and Enhance Hessian-Free Influence Functions

Influence functions serve as crucial tools for assessing sample influence in model interpretation, subset training set selection, noisy label detection, and more. By employing the first-order Taylor extension, influence functions can estimate sample influence without the need for expensive model retraining. However, applying influence functions directly to deep models presents challenges, primarily due to the non-convex nature of the loss function and the large size of model parameters. This difficulty not only makes computing the inverse of the Hessian matrix costly but also renders it non-existent in some cases. Various approaches, including matrix decomposition, have been explored to expedite and approximate the inversion of the Hessian matrix, with the aim of making influence functions applicable to deep models. In this paper, we revisit a specific, albeit naive, yet effective approximation method known as TracIn. This method substitutes the inverse of the Hessian matrix with an identity matrix. We provide deeper insights into why this simple approximation method performs well. Furthermore, we extend its applications beyond measuring model utility to include considerations of fairness and robustness. Finally, we enhance TracIn through an ensemble strategy. To validate its effectiveness, we conduct experiments on synthetic data and extensive evaluations on noisy label detection, sample selection for large language model fine-tuning, and defense against adversarial attacks.

[50] 2405.17508

Unveiling the Secrets: How Masking Strategies Shape Time Series Imputation

In this study, we explore the impact of different masking strategies on time series imputation models. We evaluate the effects of pre-masking versus in-mini-batch masking, normalization timing, and the choice between augmenting and overlaying artificial missingness. Using three diverse datasets, we benchmark eleven imputation models with different missing rates. Our results demonstrate that masking strategies significantly influence imputation accuracy, revealing that more sophisticated and data-driven masking designs are essential for robust model evaluation. We advocate for refined experimental designs and comprehensive disclosureto better simulate real-world patterns, enhancing the practical applicability of imputation models.

[51] 2405.17517

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

[52] 2405.17535

Calibrated Dataset Condensation for Faster Hyperparameter Search

Dataset condensation can be used to reduce the computational cost of training multiple models on a large dataset by condensing the training dataset into a small synthetic set. State-of-the-art approaches rely on matching the model gradients between the real and synthetic data. However, there is no theoretical guarantee of the generalizability of the condensed data: data condensation often generalizes poorly across hyperparameters/architectures in practice. This paper considers a different condensation objective specifically geared toward hyperparameter search. We aim to generate a synthetic validation dataset so that the validation-performance rankings of the models, with different hyperparameters, on the condensed and original datasets are comparable. We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which obtains the synthetic validation dataset by matching the hyperparameter gradients computed via implicit differentiation and efficient inverse Hessian approximation. Experiments demonstrate that the proposed framework effectively maintains the validation-performance rankings of models and speeds up hyperparameter/architecture search for tasks on both images and graphs.

[53] 2405.17575

Interpretable Prognostics with Concept Bottleneck Models

Deep learning approaches have recently been extensively explored for the prognostics of industrial assets. However, they still suffer from a lack of interpretability, which hinders their adoption in safety-critical applications. To improve their trustworthiness, explainable AI (XAI) techniques have been applied in prognostics, primarily to quantify the importance of input variables for predicting the remaining useful life (RUL) using post-hoc attribution methods. In this work, we propose the application of Concept Bottleneck Models (CBMs), a family of inherently interpretable neural network architectures based on concept explanations, to the task of RUL prediction. Unlike attribution methods, which explain decisions in terms of low-level input features, concepts represent high-level information that is easily understandable by users. Moreover, once verified in actual applications, CBMs enable domain experts to intervene on the concept activations at test-time. We propose using the different degradation modes of an asset as intermediate concepts. Our case studies on the New Commercial Modular AeroPropulsion System Simulation (N-CMAPSS) aircraft engine dataset for RUL prediction demonstrate that the performance of CBMs can be on par or superior to black-box models, while being more interpretable, even when the available labeled concepts are limited. Code available at \href{}{\url{}}.

[54] 2405.17580

Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes

The training dynamics of linear networks are well studied in two distinct setups: the lazy regime and balanced/active regime, depending on the initialization and width of the network. We provide a surprisingly simple unyfing formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes but also a mixed regime in between the two. In the mixed regime, a part of the network is lazy while the other is balanced. More precisely the network is lazy along singular values that are below a certain threshold and balanced along those that are above the same threshold. At initialization, all singular values are lazy, allowing for the network to align itself with the task, so that later in time, when some of the singular value cross the threshold and become active they will converge rapidly (convergence in the balanced regime is notoriously difficult in the absence of alignment). The mixed regime is the `best of both worlds': it converges from any random initialization (in contrast to balanced dynamics which require special initialization), and has a low rank bias (absent in the lazy dynamics). This allows us to prove an almost complete phase diagram of training behavior as a function of the variance at initialization and the width, for a MSE training task.

[55] 2405.17595

Element-Free Probability Distributions and Random Partitions

An "element-free" probability distribution is what remains of a probability distribution after we forget the elements to which the probabilities were assigned. These objects naturally arise in Bayesian statistics, in situations where elements are used as labels and their specific identity is not important. This paper develops the structural theory of element-free distributions, using multisets and category theory. We give operations for moving between element-free and ordinary distributions, and we show that these operations commute with multinomial sampling. We then exploit this theory to prove two representation theorems. These theorems show that element-free distributions provide a natural representation for key random structures in Bayesian nonparametric clustering: exchangeable random partitions, and random distributions parametrized by a base measure.

[56] 2405.17640

Probabilistically Plausible Counterfactual Explanations with Normalizing Flows

We present PPCEF, a novel method for generating probabilistically plausible counterfactual explanations (CFs). PPCEF advances beyond existing methods by combining a probabilistic formulation that leverages the data distribution with the optimization of plausibility within a unified framework. Compared to reference approaches, our method enforces plausibility by directly optimizing the explicit density function without assuming a particular family of parametrized distributions. This ensures CFs are not only valid (i.e., achieve class change) but also align with the underlying data's probability density. For that purpose, our approach leverages normalizing flows as powerful density estimators to capture the complex high-dimensional data distribution. Furthermore, we introduce a novel loss that balances the trade-off between achieving class change and maintaining closeness to the original instance while also incorporating a probabilistic plausibility term. PPCEF's unconstrained formulation allows for efficient gradient-based optimization with batch processing, leading to orders of magnitude faster computation compared to prior methods. Moreover, the unconstrained formulation of PPCEF allows for the seamless integration of future constraints tailored to specific counterfactual properties. Finally, extensive evaluations demonstrate PPCEF's superiority in generating high-quality, probabilistically plausible counterfactual explanations in high-dimensional tabular settings. This makes PPCEF a powerful tool for not only interpreting complex machine learning models but also for improving fairness, accountability, and trust in AI systems.

[57] 2405.17642

Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels

Growing regulatory and societal pressures demand increased transparency in AI, particularly in understanding the decisions made by complex machine learning models. Counterfactual Explanations (CFs) have emerged as a promising technique within Explainable AI (xAI), offering insights into individual model predictions. However, to understand the systemic biases and disparate impacts of AI models, it is crucial to move beyond local CFs and embrace global explanations, which offer a~holistic view across diverse scenarios and populations. Unfortunately, generating Global Counterfactual Explanations (GCEs) faces challenges in computational complexity, defining the scope of "global," and ensuring the explanations are both globally representative and locally plausible. We introduce a novel unified approach for generating Local, Group-wise, and Global Counterfactual Explanations for differentiable classification models via gradient-based optimization to address these challenges. This framework aims to bridge the gap between individual and systemic insights, enabling a deeper understanding of model decisions and their potential impact on diverse populations. Our approach further innovates by incorporating a probabilistic plausibility criterion, enhancing actionability and trustworthiness. By offering a cohesive solution to the optimization and plausibility challenges in GCEs, our work significantly advances the interpretability and accountability of AI models, marking a step forward in the pursuit of transparent AI.

[58] 2405.17672

Exploring Loss Design Techniques For Decision Tree Robustness To Label Noise

In the real world, data is often noisy, affecting not only the quality of features but also the accuracy of labels. Current research on mitigating label errors stems primarily from advances in deep learning, and a gap exists in exploring interpretable models, particularly those rooted in decision trees. In this study, we investigate whether ideas from deep learning loss design can be applied to improve the robustness of decision trees. In particular, we show that loss correction and symmetric losses, both standard approaches, are not effective. We argue that other directions need to be explored to improve the robustness of decision trees to label noise.

[59] 2405.17673

Fast Samplers for Inverse Problems in Iterative Refinement Models

Constructing fast samplers for unconditional diffusion and flow-matching models has received much attention recently; however, existing methods for solving inverse problems, such as super-resolution, inpainting, or deblurring, still require hundreds to thousands of iterative steps to obtain high-quality results. We propose a plug-and-play framework for constructing efficient samplers for inverse problems, requiring only pre-trained diffusion or flow-matching models. We present Conditional Conjugate Integrators, which leverage the specific form of the inverse problem to project the respective conditional diffusion/flow dynamics into a more amenable space for sampling. Our method complements popular posterior approximation methods for solving inverse problems using diffusion/flow models. We evaluate the proposed method's performance on various linear image restoration tasks across multiple datasets, employing diffusion and flow-matching models. Notably, on challenging inverse problems like 4$\times$ super-resolution on the ImageNet dataset, our method can generate high-quality samples in as few as 5 conditional sampling steps and outperforms competing baselines requiring 20-1000 steps. Our code and models will be publicly available at

[60] 2405.17708

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, choosing the best OPE algorithm for each task and domain is still unclear. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.

[61] 2405.17734

Towards Efficient Disaster Response via Cost-effective Unbiased Class Rate Estimation through Neyman Allocation Stratified Sampling Active Learning

With the rapid development of earth observation technology, we have entered an era of massively available satellite remote-sensing data. However, a large amount of satellite remote sensing data lacks a label or the label cost is too high to hinder the potential of AI technology mining satellite data. Especially in such an emergency response scenario that uses satellite data to evaluate the degree of disaster damage. Disaster damage assessment encountered bottlenecks due to excessive focus on the damage of a certain building in a specific geographical space or a certain area on a larger scale. In fact, in the early days of disaster emergency response, government departments were more concerned about the overall damage rate of the disaster area instead of single-building damage, because this helps the government decide the level of emergency response. We present an innovative algorithm that constructs Neyman stratified random sampling trees for binary classification and extends this approach to multiclass problems. Through extensive experimentation on various datasets and model structures, our findings demonstrate that our method surpasses both passive and conventional active learning techniques in terms of class rate estimation and model enhancement with only 30\%-60\% of the annotation cost of simple sampling. It effectively addresses the 'sampling bias' challenge in traditional active learning strategies and mitigates the 'cold start' dilemma. The efficacy of our approach is further substantiated through application to disaster evaluation tasks using Xview2 Satellite imagery, showcasing its practical utility in real-world contexts.

[62] 2405.17764

On the Sequence Evaluation based on Stochastic Processes

Modeling and analyzing long sequences of text is an essential task for Natural Language Processing. Success in capturing long text dynamics using neural language models will facilitate many downstream tasks such as coherence evaluation, text generation, machine translation and so on. This paper presents a novel approach to model sequences through a stochastic process. We introduce a likelihood-based training objective for the text encoder and design a more thorough measurement (score) for long text evaluation compared to the previous approach. The proposed training objective effectively preserves the sequence coherence, while the new score comprehensively captures both temporal and spatial dependencies. Theoretical properties of our new score show its advantages in sequence evaluation. Experimental results show superior performance in various sequence evaluation tasks, including global and local discrimination within and between documents of different lengths. We also demonstrate the encoder achieves competitive results on discriminating human and AI written text.

[63] 2405.17767

Linguistic Collapse: Neural Collapse in (Large) Language Models

Neural collapse ($\mathcal{NC}$) is a phenomenon observed in classification tasks where top-layer representations collapse into their class means, which become equinorm, equiangular and aligned with the classifiers. These behaviors -- associated with generalization and robustness -- would manifest under specific conditions: models are trained towards zero loss, with noise-free labels belonging to balanced classes, which do not outnumber the model's hidden dimension. Recent studies have explored $\mathcal{NC}$ in the absence of one or more of these conditions to extend and capitalize on the associated benefits of ideal geometries. Language modeling presents a curious frontier, as \textit{training by token prediction} constitutes a classification task where none of the conditions exist: the vocabulary is imbalanced and exceeds the embedding dimension; different tokens might correspond to similar contextual embeddings; and large language models (LLMs) in particular are typically only trained for a few epochs. This paper empirically investigates the impact of scaling the architectures and training of causal language models (CLMs) on their progression towards $\mathcal{NC}$. We find that $\mathcal{NC}$ properties that develop with scaling are linked to generalization. Moreover, there is evidence of some relationship between $\mathcal{NC}$ and generalization independent of scale. Our work therefore underscores the generality of $\mathcal{NC}$ as it extends to the novel and more challenging setting of language modeling. Downstream, we seek to inspire further research on the phenomenon to deepen our understanding of LLMs -- and neural networks at large -- and improve existing architectures based on $\mathcal{NC}$-related properties.

[64] 2405.17796

Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon H (as known as CMDP with H layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class M containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only O(HlogT) calls to an offline density estimation algorithm (or oracle) across all T rounds of interaction. This number can be further reduced to O(HloglogT) if T is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.

[65] 2405.17836

An Innovative Networks in Federated Learning

This paper presents the development and application of Wavelet Kolmogorov-Arnold Networks (Wav-KAN) in federated learning. We implemented Wav-KAN \cite{wav-kan} in the clients. Indeed, we have considered both continuous wavelet transform (CWT) and also discrete wavelet transform (DWT) to enable multiresolution capabaility which helps in heteregeneous data distribution across clients. Extensive experiments were conducted on different datasets, demonstrating Wav-KAN's superior performance in terms of interpretability, computational speed, training and test accuracy. Our federated learning algorithm integrates wavelet-based activation functions, parameterized by weight, scale, and translation, to enhance local and global model performance. Results show significant improvements in computational efficiency, robustness, and accuracy, highlighting the effectiveness of wavelet selection in scalable neural network design.

[66] 2405.17862

Towards robust prediction of material properties for nuclear reactor design under scarce data -- a study in creep rupture property

Advances in Deep Learning bring further investigation into credibility and robustness, especially for safety-critical engineering applications such as the nuclear industry. The key challenges include the availability of data set (often scarce and sparse) and insufficient consideration of the uncertainty in the data, model, and prediction. This paper therefore presents a meta-learning based approach that is both uncertainty- and prior knowledge-informed, aiming at trustful predictions of material properties for the nuclear reactor design. It is suited for robust learning under limited data. Uncertainty has been accounted for where a distribution of predictor functions are produced for extrapolation. Results suggest it achieves superior performance than existing empirical methods in rupture life prediction, a case which is typically under a small data regime. While demonstrated herein with rupture properties, this learning approach is transferable to solve similar problems of data scarcity across the nuclear industry. It is of great importance to boosting the AI analytics in the nuclear industry by proving the applicability and robustness while providing tools that can be trusted.

[67] 2405.18034

Convergence rates of particle approximation of forward-backward splitting algorithm for granular medium equations

We study the spatially homogeneous granular medium equation \[\partial_t\mu=\rm{div}(\mu\nabla V)+\rm{div}(\mu(\nabla W \ast \mu))+\Delta\mu\,,\] within a large and natural class of the confinement potentials $V$ and interaction potentials $W$. The considered problem do not need to assume that $\nabla V$ or $\nabla W$ are globally Lipschitz. With the aim of providing particle approximation of solutions, we design efficient forward-backward splitting algorithms. Sharp convergence rates in terms of the Wasserstein distance are provided.

[68] 2405.18075

Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient

Across scientific domains, generating new models or optimizing existing ones while meeting specific criteria is crucial. Traditional machine learning frameworks for guided design use a generative model and a surrogate model (discriminator), requiring large datasets. However, real-world scientific applications often have limited data and complex landscapes, making data-hungry models inefficient or impractical. We propose a new framework, PropEn, inspired by ``matching'', which enables implicit guidance without training a discriminator. By matching each sample with a similar one that has a better property value, we create a larger training dataset that inherently indicates the direction of improvement. Matching, combined with an encoder-decoder architecture, forms a domain-agnostic generative framework for property enhancement. We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution, allowing efficient design optimization. Extensive evaluations in toy problems and scientific applications, such as therapeutic protein design and airfoil optimization, demonstrate PropEn's advantages over common baselines. Notably, the protein design results are validated with wet lab experiments, confirming the competitiveness and effectiveness of our approach.

[69] 2405.18108

Simulation of Single-Phase Natural Circulation within the BEPU Framework: Sketching Scaling Uncertainty Principle by Multi-Scale CFD Approaches

In order to enhance safety, nuclear reactors in the design phase consider natural circulation as a mean to remove residual power. The simulation of this passive mechanism must be qualified between the validation range and the scope of utilization (reactor case), introducing potential physical and numerical distortion effects. In this study, we simulate the flow of liquid sodium using the TrioCFD code, employing both higher-fidelity (HF) LES and lower-fidelity (LF) URANS models. We tackle respectively numerical uncertainties through the Grid Convergence Index method, and physical modelling uncertainties through the Polynomial Chaos Expansion method available on the URANIE platform. HF simulations are shown to exhibit a strong resilience to physical distortion effects, with numerical uncertainties being intricately correlated. Conversely, the LF approach, the only one applicable at the reactor scale, is likely to present a reduced predictability. If so, the HF approach should be effective in pinpointing the LF weaknesses: the concept of scaling uncertainty is inline introduced as the growth of the LF simulation uncertainty associated with distortion effects. Thus, the paper outlines that a specific methodology within the BEPU framework - leveraging both HF and LF approaches - could pragmatically enable correlating distortion effects with scaling uncertainty, thereby providing a metric principle.

[70] 2405.18127

Graph Coarsening with Message-Passing Guarantees

Graph coarsening aims to reduce the size of a large graph while preserving some of its key properties, which has been used in many applications to reduce computational load and memory footprint. For instance, in graph machine learning, training Graph Neural Networks (GNNs) on coarsened graphs leads to drastic savings in time and memory. However, GNNs rely on the Message-Passing (MP) paradigm, and classical spectral preservation guarantees for graph coarsening do not directly lead to theoretical guarantees when performing naive message-passing on the coarsened graph. In this work, we propose a new message-passing operation specific to coarsened graphs, which exhibit theoretical guarantees on the preservation of the propagated signal. Interestingly, and in a sharp departure from previous proposals, this operation on coarsened graphs is oriented, even when the original graph is undirected. We conduct node classification tasks on synthetic and real data and observe improved results compared to performing naive message-passing on the coarsened graph.

[71] 2405.18206

Multi-CATE: Multi-Accurate Conditional Average Treatment Effect Estimation Robust to Unknown Covariate Shifts

Estimating heterogeneous treatment effects is important to tailor treatments to those individuals who would most likely benefit. However, conditional average treatment effect predictors may often be trained on one population but possibly deployed on different, possibly unknown populations. We use methodology for learning multi-accurate predictors to post-process CATE T-learners (differenced regressions) to become robust to unknown covariate shifts at the time of deployment. The method works in general for pseudo-outcome regression, such as the DR-learner. We show how this approach can combine (large) confounded observational and (smaller) randomized datasets by learning a confounded predictor from the observational dataset, and auditing for multi-accuracy on the randomized controlled trial. We show improvements in bias and mean squared error in simulations with increasingly larger covariate shift, and on a semi-synthetic case study of a parallel large observational study and smaller randomized controlled experiment. Overall, we establish a connection between methods developed for multi-distribution learning and achieve appealing desiderata (e.g. external validity) in causal inference and machine learning.

[72] 2405.18221

Recurrent Natural Policy Gradient for POMDPs

In this paper, we study a natural policy gradient method based on recurrent neural networks (RNNs) for partially-observable Markov decision processes, whereby RNNs are used for policy parameterization and policy evaluation to address curse of dimensionality in non-Markovian reinforcement learning. We present finite-time and finite-width analyses for both the critic (recurrent temporal difference learning), and correspondingly-operated recurrent natural policy gradient method in the near-initialization regime. Our analysis demonstrates the efficiency of RNNs for problems with short-term memory with explicit bounds on the required network widths and sample complexity, and points out the challenges in the case of long-term dependencies.

[73] 2405.18237

Unveiling the Cycloid Trajectory of EM Iterations in Mixed Linear Regression

We study the trajectory of iterations and the convergence rates of the Expectation-Maximization (EM) algorithm for two-component Mixed Linear Regression (2MLR). The fundamental goal of MLR is to learn the regression models from unlabeled observations. The EM algorithm finds extensive applications in solving the mixture of linear regressions. Recent results have established the super-linear convergence of EM for 2MLR in the noiseless and high SNR settings under some assumptions and its global convergence rate with random initialization has been affirmed. However, the exponent of convergence has not been theoretically estimated and the geometric properties of the trajectory of EM iterations are not well-understood. In this paper, first, using Bessel functions we provide explicit closed-form expressions for the EM updates under all SNR regimes. Then, in the noiseless setting, we completely characterize the behavior of EM iterations by deriving a recurrence relation at the population level and notably show that all the iterations lie on a certain cycloid. Based on this new trajectory-based analysis, we exhibit the theoretical estimate for the exponent of super-linear convergence and further improve the statistical error bound at the finite-sample level. Our analysis provides a new framework for studying the behavior of EM for Mixed Linear Regression.

[74] 2405.18296

Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training

Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.

[75] 2405.18328

Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between compute time and accuracy of a solution. We introduce a three-level hierarchy of marginal likelihood optimisation for iterative Gaussian processes, and identify that the computational costs are dominated by solving sequential batches of large positive-definite systems of linear equations. We then propose to amortise computations by reusing solutions of linear system solvers as initialisations in the next step, providing a $\textit{warm start}$. Finally, we discuss the necessary conditions and quantify the consequences of warm starts and demonstrate their effectiveness on regression tasks, where warm starts achieve the same results as the conventional procedure while providing up to a $16 \times$ average speed-up among datasets.

[76] 2405.18353

Simulating infinite-dimensional nonlinear diffusion bridges

The diffusion bridge is a type of diffusion process that conditions on hitting a specific state within a finite time period. It has broad applications in fields such as Bayesian inference, financial mathematics, control theory, and shape analysis. However, simulating the diffusion bridge for natural data can be challenging due to both the intractability of the drift term and continuous representations of the data. Although several methods are available to simulate finite-dimensional diffusion bridges, infinite-dimensional cases remain unresolved. In the paper, we present a solution to this problem by merging score-matching techniques with operator learning, enabling a direct approach to score-matching for the infinite-dimensional bridge. We construct the score to be discretization invariant, which is natural given the underlying spatially continuous process. We conduct a series of experiments, ranging from synthetic examples with closed-form solutions to the stochastic nonlinear evolution of real-world biological shape data, and our method demonstrates high efficacy, particularly due to its ability to adapt to any resolution without extra training.

[77] 2405.18395

MC-GTA: Metric-Constrained Model-Based Clustering using Goodness-of-fit Tests with Autocorrelations

A wide range of (multivariate) temporal (1D) and spatial (2D) data analysis tasks, such as grouping vehicle sensor trajectories, can be formulated as clustering with given metric constraints. Existing metric-constrained clustering algorithms overlook the rich correlation between feature similarity and metric distance, i.e., metric autocorrelation. The model-based variations of these clustering algorithms (e.g. TICC and STICC) achieve SOTA performance, yet suffer from computational instability and complexity by using a metric-constrained Expectation-Maximization procedure. In order to address these two problems, we propose a novel clustering algorithm, MC-GTA (Model-based Clustering via Goodness-of-fit Tests with Autocorrelations). Its objective is only composed of pairwise weighted sums of feature similarity terms (square Wasserstein-2 distance) and metric autocorrelation terms (a novel multivariate generalization of classic semivariogram). We show that MC-GTA is effectively minimizing the total hinge loss for intra-cluster observation pairs not passing goodness-of-fit tests, i.e., statistically not originating from the same distribution. Experiments on 1D/2D synthetic and real-world datasets demonstrate that MC-GTA successfully incorporates metric autocorrelation. It outperforms strong baselines by large margins (up to 14.3% in ARI and 32.1% in NMI) with faster and stabler optimization (>10x speedup).

[78] 2405.18401

Explicit Formulae to Interchangeably use Hyperplanes and Hyperballs using Inversive Geometry

Many algorithms require discriminative boundaries, such as separating hyperplanes or hyperballs, or are specifically designed to work on spherical data. By applying inversive geometry, we show that the two discriminative boundaries can be used interchangeably, and that general Euclidean data can be transformed into spherical data, whenever a change in point distances is acceptable. We provide explicit formulae to embed general Euclidean data into spherical data and to unembed it back. We further show a duality between hyperspherical caps, i.e., the volume created by a separating hyperplane on spherical data, and hyperballs and provide explicit formulae to map between the two. We further provide equations to translate inner products and Euclidean distances between the two spaces, to avoid explicit embedding and unembedding. We also provide a method to enforce projections of the general Euclidean space onto hemi-hyperspheres and propose an intrinsic dimensionality based method to obtain "all-purpose" parameters. To show the usefulness of the cap-ball-duality, we discuss example applications in machine learning and vector similarity search.