Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations. In this paper, we study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints. We first establish matching lower and upper bounds, up to a logarithmic factor, on the minimax separation rate. This optimal rate serves as a benchmark for the difficulty of the testing problem, factoring in model characteristics such as the number of observations, noise level, and regularity of the signal class, along with the strictness of the $(\epsilon,\delta)$-DP requirement. The results demonstrate interesting and novel phase transition phenomena. Furthermore, the results reveal an interesting phenomenon that distributed one-shot protocols with access to shared randomness outperform those without access to shared randomness. We also construct a data-driven testing procedure that possesses the ability to adapt to an unknown regularity parameter over a large collection of function classes with minimal additional cost, all while maintaining adherence to the same set of DP constraints.

This paper studies federated learning for nonparametric regression in the context of distributed samples across different servers, each adhering to distinct differential privacy constraints. The setting we consider is heterogeneous, encompassing both varying sample sizes and differential privacy constraints across servers. Within this framework, both global and pointwise estimation are considered, and optimal rates of convergence over the Besov spaces are established. Distributed privacy-preserving estimators are proposed and their risk properties are investigated. Matching minimax lower bounds, up to a logarithmic factor, are established for both global and pointwise estimation. Together, these findings shed light on the tradeoff between statistical accuracy and privacy preservation. In particular, we characterize the compromise not only in terms of the privacy budget but also concerning the loss incurred by distributing data within the privacy framework as a whole. This insight captures the folklore wisdom that it is easier to retain privacy in larger samples, and explores the differences between pointwise and global estimation under distributed privacy constraints.

Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions. These breakthroughs have enabled us to achieve increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all the issues mentioned above simultaneously. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity.

We study the design and analysis of switchback experiments conducted on a single aggregate unit. The design problem is to partition the continuous time space into intervals and switch treatments between intervals, in order to minimize the estimation error of the treatment effect. We show that the estimation error depends on four factors: carryover effects, periodicity, serially correlated outcomes, and impacts from simultaneous experiments. We derive a rigorous bias-variance decomposition and show the tradeoffs of the estimation error from these factors. The decomposition provides three new insights in choosing a design: First, balancing the periodicity between treated and control intervals reduces the variance; second, switching less frequently reduces the bias from carryover effects while increasing the variance from correlated outcomes, and vice versa; third, randomizing interval start and end points reduces both bias and variance from simultaneous experiments. Combining these insights, we propose a new empirical Bayes design approach. This approach uses prior data and experiments for designing future experiments. We illustrate this approach using real data from a ride-sharing platform, yielding a design that reduces MSE by 33% compared to the status quo design used on the platform.

Motivated by the concept of satisficing in decision-making, we consider the problem of satisficing exploration in bandit optimization. In this setting, the learner aims at selecting satisficing arms (arms with mean reward exceeding a certain threshold value) as frequently as possible. The performance is measured by satisficing regret, which is the cumulative deficit of the chosen arm's mean reward compared to the threshold. We propose SELECT, a general algorithmic template for Satisficing Exploration via LowEr Confidence bound Testing, that attains constant satisficing regret for a wide variety of bandit optimization problems in the realizable case (i.e., a satisficing arm exists). Specifically, given a class of bandit optimization problems and a corresponding learning oracle with sub-linear (standard) regret upper bound, SELECT iteratively makes use of the oracle to identify a potential satisficing arm with low regret. Then, it collects data samples from this arm, and continuously compares the LCB of the identified arm's mean reward against the threshold value to determine if it is a satisficing arm. As a complement, SELECT also enjoys the same (standard) regret guarantee as the oracle in the non-realizable case. Finally, we conduct numerical experiments to validate the performance of SELECT for several popular bandit optimization settings.

In this paper, we propose a local squared Wasserstein-2 (W_2) method to solve the inverse problem of reconstructing models with uncertain latent variables or parameters. A key advantage of our approach is that it does not require prior information on the distribution of the latent variables or parameters in the underlying models. Instead, our method can efficiently reconstruct the distributions of the output associated with different inputs based on empirical distributions of observation data. We demonstrate the effectiveness of our proposed method across several uncertainty quantification (UQ) tasks, including linear regression with coefficient uncertainty, training neural networks with weight uncertainty, and reconstructing ordinary differential equations (ODEs) with a latent random variable.

We describe how to calculate standard errors for A/B tests that include clustered data, ratio metrics, and/or covariate adjustment. We may do this for power analysis/sample size calculations prior to running an experiment using historical data, or after an experiment for hypothesis testing and confidence intervals. The different applications have a common framework, using the sample variance of certain residuals. The framework is compatible with modular software, can be plugged into standard tools, doesn't require computing covariance matrices, and is numerically stable. Using this approach we estimate that covariate adjustment gives a median 66% variance reduction for a key metric, reducing experiment run time by 66%.

Many modern spatio-temporal data sets, in sociology, epidemiology or seismology, for example, exhibit self-exciting characteristics, triggering and clustering behaviors both at the same time, that a suitable Hawkes space-time process can accurately capture. This paper aims to develop a fast and flexible parametric inference technique to recover the parameters of the kernel functions involved in the intensity function of a space-time Hawkes process based on such data. Our statistical approach combines three key ingredients: 1) kernels with finite support are considered, 2) the space-time domain is appropriately discretized, and 3) (approximate) precomputations are used. The inference technique we propose then consists of a $\ell_2$ gradient-based solver that is fast and statistically accurate. In addition to describing the algorithmic aspects, numerical experiments have been carried out on synthetic and real spatio-temporal data, providing solid empirical evidence of the relevance of the proposed methodology.

This document presents methods to remove the initialization or burn-in bias from Markov chain Monte Carlo (MCMC) estimates, with consequences on parallel computing, convergence diagnostics and performance assessment. The document is written as an introduction to these methods for MCMC users. Some theoretical results are mentioned, but the focus is on the methodology.

Causal inference in longitudinal studies is often hampered by treatment-confounder feedback. Existing methods typically assume discrete time steps or step-like data changes, which we term ``regular and irregular functional studies,'' limiting their applicability to studies with continuous monitoring data, like intensive care units or continuous glucose monitoring. These studies, which we formally term ``functional longitudinal studies,'' require new approaches. Moreover, existing methods tailored for ``functional longitudinal studies'' can only investigate static treatment regimes, which are independent of historical covariates or treatments, leading to either stringent parametric assumptions or strong positivity assumptions. This restriction has limited the range of causal questions these methods can answer and their practicality. We address these limitations by developing a nonparametric framework for functional longitudinal data, accommodating dynamic treatment regimes that depend on historical covariates or treatments, and may or may not depend on the actual treatment administered. To build intuition and explain our approach, we provide a comprehensive review of existing methods for regular and irregular longitudinal studies. We then formally define the potential outcomes and causal effects of interest, develop identification assumptions, and derive g-computation and inverse probability weighting formulas through novel applications of stochastic process and measure theory. Additionally, we compute the efficient influence curve using semiparametric theory. Our framework generalizes existing literature, and achieves double robustness under specific conditions. Finally, to aid interpretation, we provide sufficient and intuitive conditions for our identification assumptions, enhancing the applicability of our methodology to real-world scenarios.

The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation of transformers over FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-of-distribution length generalization. We provide empirical simulations to justify our theoretical findings.

A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.

The rapid spread of West Nile Virus (WNV) is a growing concern. With no vaccines or specific medications available, prevention through mosquito control is the only solution to curb the spread. Mosquito traps, used to detect viral presence in mosquito populations, are essential tools for WNV surveillance. But how do we decide where to place a mosquito trap? And what makes a good trap location, anyway? We present a robust statistical approach to determine a mosquito trap's ability to predict human WNV cases in the Chicago metropolitan area and its suburbs. We then use this value to detect the landscape, demographic, and socioeconomic factors associated with a mosquito trap's predictive ability. This approach enables resource-limited mosquito control programs to identify better trap locations while reducing trap numbers to increase trap-based surveillance efficiency. The approach can also be applied to a wide range of different environmental surveillance programs.

Estimating the correlation coefficient has been a daunting work with the increasing complexity of dataset's pattern. One of the problems in manufacturing applications consists of the estimation of a critical process variable during a machining operation from directly measurable process variables. For example, the prediction of surface roughness of a workpiece during finish turning processes. In this paper, we did exhaustive study on the existing popular correlation coefficients: Pearson correlation coefficient, Spearman's rank correlation coefficient, Kendall's Tau correlation coefficient, Fechner correlation coefficient, and Nonlinear correlation coefficient. However, no one of them can capture all the nonlinear and linear correlations. So, we represent a universal non-linear non-parametric correlation measurement, g-correlation coefficient. Unlike other correlation measurements, g-correlation doesn't require assumptions and pick the dominating patterns of the dataset after examining all the major patterns no matter it is linear or nonlinear. Results of testing on both linearly correlated and non-linearly correlated dataset and comparison with the introduced correlation coefficients in literature show that g-correlation is robust on all the linearly correlated dataset and outperforms for some non-linearly correlated dataset. Results of the application of different correlation concepts to surface roughness assessment show that g-correlation has a central role among all standard concepts of correlation.

A researcher collecting data from a randomized controlled trial (RCT) often has access to an auxiliary observational dataset that may be confounded or otherwise biased for estimating causal effects. Common modeling assumptions impose restrictions on the outcome mean function - the conditional expectation of the outcome of interest given observed covariates - in the two datasets. Running examples from the literature include settings where the observational dataset is subject to outcome-mediated selection bias or to confounding bias taking an assumed parametric form. We propose a succinct framework to derive the efficient influence function for any identifiable pathwise differentiable estimand under a general class of restrictions on the outcome mean function. This uncovers surprising results that with homoskedastic outcomes and a constant propensity score in the RCT, even strong parametric assumptions cannot improve the semiparametric lower bound for estimating various average treatment effects. We then leverage double machine learning to construct a one-step estimator that achieves the semiparametric efficiency bound even in cases when the outcome mean function and other nuisance parameters are estimated nonparametrically. The goal is to empower a researcher with custom, previously unstudied modeling restrictions on the outcome mean function to systematically construct causal estimators that maximially leverage their assumptions for variance reduction. We demonstrate the finite sample precision gains of our estimator over existing approaches in extensions of various numerical studies and data examples from the literature.

The test-negative design has become popular for evaluating the effectiveness of post-licensure vaccines using observational data. In addition to its logistical convenience on data collection, the design is also believed to control for the differential health-care-seeking behavior between vaccinated and unvaccinated individuals, which is an important while often unmeasured confounder between the vaccination and infection. Hence, the design has been employed routinely to monitor seasonal flu vaccines and more recently to measure the COVID-19 vaccine effectiveness. Despite its popularity, the design has been questioned, in particular about its ability to fully control for the unmeasured confounding. In this paper, we explore deviations from a perfect test-negative design, and propose various sensitivity analysis methods for estimating the effect of vaccination measured by the causal odds ratio on the subpopulation of individuals with good health-care-seeking behavior. We start with point identification of the causal odds ratio under a test-negative design, considering two forms of assumptions on the unmeasured confounder. These assumptions then lead to two approaches for conducting sensitivity analysis, addressing the influence of the unmeasured confounding in different ways. Specifically, one approach investigates partial control for unmeasured confounder in the test-negative design, while the other examines the impact of unmeasured confounder on both vaccination and infection. Furthermore, these approaches can be combined to provide narrower bounds on the true causal odds ratio, and can be further extended to sharpen the bounds by restricting the treatment effect heterogeneity. Finally, we apply the proposed methods to evaluate the effectiveness of COVID-19 vaccines using observational data from test-negative designs.

Causal inference on time series data is a challenging problem, especially in the presence of unobserved confounders. This work focuses on estimating the causal effect between two time series, which are confounded by a third, unobserved time series. Assuming spectral sparsity of the confounder, we show how in the frequency domain this problem can be framed as an adversarial outlier problem. We introduce Deconfounding by Robust regression (DecoR), a novel approach that estimates the causal effect using robust linear regression in the frequency domain. Considering two different robust regression techniques, we first improve existing bounds on the estimation error for such techniques. Crucially, our results do not require distributional assumptions on the covariates. We can therefore use them in time series settings. Applying these results to DecoR, we prove, under suitable assumptions, upper bounds for the estimation error of DecoR that imply consistency. We show DecoR's effectiveness through experiments on synthetic data. Our experiments furthermore suggest that our method is robust with respect to model misspecification.

We consider a system of binary interacting chains describing the dynamics of a group of $N$ components that, at each time unit, either send some signal to the others or remain silent otherwise. The interactions among the chains are encoded by a directed Erd\"os-R\'enyi random graph with unknown parameter $ p \in (0, 1) .$ Moreover, the system is structured within two populations (excitatory chains versus inhibitory ones) which are coupled via a mean field interaction on the underlying Erd\"os-R\'enyi graph. In this paper, we address the question of inferring the connectivity parameter $p$ based only on the observation of the interacting chains over $T$ time units. In our main result, we show that the connectivity parameter $p$ can be estimated with rate $N^{-1/2}+N^{1/2}/T+(\log(T)/T)^{1/2}$ through an easy-to-compute estimator. Our analysis relies on a precise study of the spatio-temporal decay of correlations of the interacting chains. This is done through the study of coalescing random walks defining a backward regeneration representation of the system. Interestingly, we also show that this backward regeneration representation allows us to perfectly sample the system of interacting chains (conditionally on each realization of the underlying Erd\"os-R\'enyi graph) from its stationary distribution. These probabilistic results have an interest in its own.

The Coordinate Ascent Variational Inference scheme is a popular algorithm used to compute the mean-field approximation of a probability distribution of interest. We analyze its random scan version, under log-concavity assumptions on the target density. Our approach builds on the recent work of M. Arnese and D. Lacker, \emph{Convergence of coordinate ascent variational inference for log-concave measures via optimal transport} [arXiv:2404.08792] which studies the deterministic scan version of the algorithm, phrasing it as a block-coordinate descent algorithm in the space of probability distributions endowed with the geometry of optimal transport. We obtain tight rates for the random scan version, which imply that the total number of factor updates required to converge scales linearly with the condition number and the number of blocks of the target distribution. By contrast, available bounds for the deterministic scan case scale quadratically in the same quantities, which is analogue to what happens for optimization of convex functions in Euclidean spaces.

This paper studies the robust Hankel recovery problem, which simultaneously removes the sparse outliers and fulfills missing entries from the partial observation. We propose a novel non-convex algorithm, coined Hankel Structured Newton-Like Descent (HSNLD), to tackle the robust Hankel recovery problem. HSNLD is highly efficient with linear convergence, and its convergence rate is independent of the condition number of the underlying Hankel matrix. The recovery guarantee has been established under some mild conditions. Numerical experiments on both synthetic and real datasets show the superior performance of HSNLD against state-of-the-art algorithms.

This paper introduces a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length. We employ machine learning techniques, notably gradient boosting, to systematically improve upon a predefined conformity score function. This process is guided by carefully constructed loss functions that measure the deviation of prediction intervals from the targeted properties. The procedure operates post-training, relying solely on model predictions and without modifying the trained model (e.g., the deep network). Systematic experiments demonstrate that starting from conventional conformal methods, our boosted procedure achieves substantial improvements in reducing interval length and decreasing deviation from target conditional coverage.

Real-world applications of machine learning models are often subject to legal or policy-based regulations. Some of these regulations require ensuring the validity of the model, i.e., the approximation error being smaller than a threshold. A global metric is generally too insensitive to determine the validity of a specific prediction, whereas evaluating local validity is costly since it requires gathering additional data.We propose learning the model error to acquire a local validity estimate while reducing the amount of required data through active learning. Using model validation benchmarks, we provide empirical evidence that the proposed method can lead to an error model with sufficient discriminative properties using a relatively small amount of data. Furthermore, an increased sensitivity to local changes of the validity bounds compared to alternative approaches is demonstrated.

The surging demand for new energy vehicles is driven by the imperative to conserve energy, reduce emissions, and enhance the ecological ambiance. By conducting behavioral analysis and mining usage patterns of new energy vehicles, particular patterns can be identified. For instance, overloading the battery, operating with low battery power, and driving at excessive speeds can all detrimentally affect the battery's performance. To assess the impact of such driving behavior on the urban ecology, an environmental computational modeling method has been proposed to simulate the interaction between new energy vehicles and the environment. To extend the time series data of the vehicle's entire life cycle and the ecological environment within the model sequence data, the LSTM model with Bayesian optimizer is utilized for simulation. The analysis revealed the detrimental effects of poor driving behavior on the environment.

Efficiently post-training large language models remains a challenging task due to the vast computational resources required. We present Spectrum, a method that accelerates LLM training by selectively targeting layer modules based on their signal-to-noise ratio (SNR), and freezing the remaining modules. Our approach, which utilizes an algorithm to compute module SNRs prior to training, has shown to effectively match the performance of full fine-tuning while reducing GPU memory usage. Experiments comparing Spectrum to existing methods such as QLoRA demonstrate its effectiveness in terms of model quality and VRAM efficiency in distributed environments.

Accurate time series forecasts are crucial for various applications, such as traffic management, electricity consumption, and healthcare. However, limitations in models and data quality can significantly impact forecasts accuracy. One common issue with data quality is the absence of data points, referred to as missing data. It is often caused by sensor malfunctions, equipment failures, or human errors. This paper proposes Hinge-FM2I, a novel method for handling missing data values in univariate time series data. Hinge-FM2I builds upon the strengths of the Forecasting Method by Image Inpainting (FM2I). FM2I has proven effective, but selecting the most accurate forecasts remain a challenge. To overcome this issue, we proposed a selection algorithm. Inspired by door hinges, Hinge-FM2I drops a data point either before or after the gap (left/right-hinge), then use FM2I for imputation, and then select the imputed gap based on the lowest error of the dropped data point. Hinge-FM2I was evaluated on a comprehensive sample composed of 1356 time series, extracted from the M3 competition benchmark dataset, with missing value rates ranging from 3.57\% to 28.57\%. Experimental results demonstrate that Hinge-FM2I significantly outperforms established methods such as, linear/spline interpolation, K-Nearest Neighbors (K-NN), and ARIMA. Notably, Hinge-FM2I achieves an average Symmetric Mean Absolute Percentage Error (sMAPE) score of 5.6\% for small gaps, and up to 10\% for larger ones. These findings highlight the effectiveness of Hinge-FM2I as a promising new method for addressing missing values in univariate time series data.

Current clinical decision support systems (DSS) are trained and validated on observational data from the target clinic. This is problematic for treatments validated in a randomized clinical trial (RCT), but not yet introduced in any clinic. In this work, we report on a method for training and validating the DSS using the RCT data. The key challenges we address are of missingness -- missing rationale for treatment assignment (the assignment is at random), and missing verification evidence, since the effectiveness of a treatment for a patient can only be verified (ground truth) for treatments what were actually assigned to a patient. We use data from a multi-armed RCT that investigated the effectiveness of single- and combination- treatments for 240+ tinnitus patients recruited and treated in 5 clinical centers. To deal with the 'missing rationale' challenge, we re-model the target variable (outcome) in order to suppress the effect of the randomly-assigned treatment, and control on the effect of treatment in general. Our methods are also robust to missing values in features and with a small number of patients per RCT arm. We deal with 'missing verification evidence' by using counterfactual treatment verification, which compares the effectiveness of the DSS recommendations to the effectiveness of the RCT assignments when they are aligned v/s not aligned. We demonstrate that our approach leverages the RCT data for learning and verification, by showing that the DSS suggests treatments that improve the outcome. The results are limited through the small number of patients per treatment; while our ensemble is designed to mitigate this effect, the predictive performance of the methods is affected by the smallness of the data. We provide a basis for the establishment of decision supporting routines on treatments that have been tested in RCTs but have not yet been deployed clinically.

Decision support systems based on prediction sets help humans solve multiclass classification tasks by narrowing down the set of potential label values to a subset of them, namely a prediction set, and asking them to always predict label values from the prediction sets. While this type of systems have been proven to be effective at improving the average accuracy of the predictions made by humans, by restricting human agency, they may cause harm$\unicode{x2014}$a human who has succeeded at predicting the ground-truth label of an instance on their own may have failed had they used these systems. In this paper, our goal is to control how frequently a decision support system based on prediction sets may cause harm, by design. To this end, we start by characterizing the above notion of harm using the theoretical framework of structural causal models. Then, we show that, under a natural, albeit unverifiable, monotonicity assumption, we can estimate how frequently a system may cause harm using only predictions made by humans on their own. Further, we also show that, under a weaker monotonicity assumption, which can be verified experimentally, we can bound how frequently a system may cause harm again using only predictions made by humans on their own. Building upon these assumptions, we introduce a computational framework to design decision support systems based on prediction sets that are guaranteed to cause harm less frequently than a user-specified value using conformal risk control. We validate our framework using real human predictions from two different human subject studies and show that, in decision support systems based on prediction sets, there is a trade-off between accuracy and counterfactual harm.

\begin{abstract} In this paper, we integrated the statistical arbitrage strategy, pairs trading, into the Black-Litterman model and constructed efficient mean-variance portfolios. Typically, pairs trading underperforms under volatile or distressed market condition because the selected asset pairs fail to revert to equilibrium within the investment horizon. By enhancing this strategy with the Black-Litterman portfolio optimization, we achieved superior performance compared to the S\&P 500 market index under both normal and extreme market conditions. Furthermore, this research presents an innovative idea of incorporating traditional pairs trading strategies into the portfolio optimization framework in a scalable and systematic manner.

Mathematical modeling offers the opportunity to test hypothesis concerning Myeloproliferative emergence and development. We tested different mathematical models based on a training cohort (n=264 patients) (Registre de la c\^ote d'Or) to determine the emergence and evolution times before JAK2V617F classical Myeloproliferative disorders (respectively Polycythemia Vera and Essential Thrombocytemia) are diagnosed. We dissected the time before diagnosis as two main periods: the time from embryonic development for the JAK2V617F mutation to occur, not disappear and enter in proliferation, and a second time corresponding to the expansion of the clonal population until diagnosis. We demonstrate using progressively complexified models that the rate of active mutation occurrence is not constant and doesn't just rely on individual variability, but rather increases with age and takes a median time of 63.1+/-13 years. A contrario, the expansion time can be considered as constant: 8.8 years once the mutation has emerged. Results were validated in an external cohort (national FIMBANK Cohort, n=1248 patients). Analyzing JAK2V617F Essential Thrombocytema versus Polycythemia Vera, we noticed that the first period of time (rate of active homozygous mutation occurrence) for PV takes approximatively 1.5 years more than for ET to develop when the expansion time was quasi-similar. In conclusion, our multi-step approach and the ultimate time-dependent model of MPN emergence and development demonstrates that the emergence of a JAK2V617F mutation should be linked to an aging mechanism, and indicates a 8-9 years period of time to develop a full MPN.

The causal dependence in data is often characterized by Directed Acyclic Graphical (DAG) models, widely used in many areas. Causal discovery aims to recover the DAG structure using observational data. This paper focuses on causal discovery with multi-variate count data. We are motivated by real-world web visit data, recording individual user visits to multiple websites. Building a causal diagram can help understand user behavior in transitioning between websites, inspiring operational strategy. A challenge in modeling is user heterogeneity, as users with different backgrounds exhibit varied behaviors. Additionally, social network connections can result in similar behaviors among friends. We introduce personalized Binomial DAG models to address heterogeneity and network dependency between observations, which are common in real-world applications. To learn the proposed DAG model, we develop an algorithm that embeds the network structure into a dimension-reduced covariate, learns each node's neighborhood to reduce the DAG search space, and explores the variance-mean relation to determine the ordering. Simulations show our algorithm outperforms state-of-the-art competitors in heterogeneous data. We demonstrate its practical usefulness on a real-world web visit dataset.

Building on the theoretical insights of Part I, this paper, as the second part of the tutorial, dives deeper into data-driven power flow linearization (DPFL), focusing on comprehensive numerical testing. The necessity of these simulations stems from the theoretical analysis's inherent limitations, particularly the challenge of identifying the differences in real-world performance among DPFL methods with overlapping theoretical capabilities and/or limitations. The absence of a comprehensive numerical comparison of DPFL approaches in the literature also motivates this paper, especially given the fact that over 95% of existing DPFL studies have not provided any open-source codes. To bridge the gap, this paper first reviews existing DPFL experiments, examining the adopted test scenarios, load fluctuation settings, data sources, considerations for data noise/outliers, and the comparison made so far. Subsequently, this paper evaluates a total of 44 methods, containing over 30 existing DPFL approaches, some innovative DPFL techniques, and several classic physics-driven power flow linearization methods for benchmarking. The evaluation spans various dimensions, including generalizability, applicability, accuracy, and computational efficiency, using various different test cases scaling from 9-bus to 1354-bus systems. The numerical analysis identifies and examines significant trends and consistent findings across all methods under various test cases. It also offers theoretical insights into phenomena like under-performance, failure, excessive computation times, etc. Overall, this paper identifies the differences in the performances of the wide range of DPFL methods, reveals gaps not evident from theoretical discussions, assists in method selection for real-world applications, and provides thorough discussions on open questions within DPFL research, indicating ten potential future directions.

A Bayesian data assimilation scheme is formulated for advection-dominated advective and diffusive evolutionary problems, based upon the Dynamic Likelihood (DLF) approach to filtering. The DLF was developed specifically for hyperbolic problems -waves-, and in this paper, it is extended via a split step formulation, to handle advection-diffusion problems. In the dynamic likelihood approach, observations and their statistics are used to propagate probabilities along characteristics, evolving the likelihood in time. The estimate posterior thus inherits phase information. For advection-diffusion the advective part of the time evolution is handled on the basis of observations alone, while the diffusive part is informed through the model as well as observations. We expect, and indeed show here, that in advection-dominated problems, the DLF approach produces better estimates than other assimilation approaches, particularly when the observations are sparse and have low uncertainty. The added computational expense of the method is cubic in the total number of observations over time, which is on the same order of magnitude as a standard Kalman filter and can be mitigated by bounding the number of forward propagated observations, discarding the least informative data.

We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (\emph{e.g.} NTK) are provably sub-optimal and benign overfitting does not happen, thus disqualifying existing theory for interpolating (0-loss, global optimal) solutions. We present a new theory of generalization for local minima that gradient descent with a constant learning rate can \emph{stably} converge to. We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions with a certain weighted \emph{first order total variation} bounded by $1/\eta - 1/2 + \widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$ where $\sigma$ is the label noise level, $\mathrm{MSE}$ is short for mean squared error against the ground truth, and $\widetilde{O}(\cdot)$ hides a logarithmic factor. Under mild assumptions, we also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points. Our theoretical results are validated by extensive simulation that demonstrates large learning rate training induces sparse linear spline fits. To the best of our knowledge, we are the first to obtain generalization bound via minima stability in the non-interpolation case and the first to show ReLU NNs without regularization can achieve near-optimal rates in nonparametric regression.

In the wild, we often encounter collections of sequential data such as electrocardiograms, motion capture, genomes, and natural language, and sequences may be multichannel or symbolic with nonlinear dynamics. We introduce a new method to learn low-dimensional representations of nonlinear time series without supervision and can have provable recovery guarantees. The learned representation can be used for downstream machine-learning tasks such as clustering and classification. The method is based on the assumption that the observed sequences arise from a common domain, but each sequence obeys its own autoregressive models that are related to each other through low-rank regularization. We cast the problem as a computationally efficient convex matrix parameter recovery problem using monotone Variational Inequality and encode the common domain assumption via low-rank constraint across the learned representations, which can learn the geometry for the entire domain as well as faithful representations for the dynamics of each individual sequence using the domain information in totality. We show the competitive performance of our method on real-world time-series data with the baselines and demonstrate its effectiveness for symbolic text modeling and RNA sequence clustering.

This letter presents a high-dimensional analysis of the training dynamics for a single-layer nonlinear contrastive learning model. The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE). Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs), reflecting the evolution of the model performance during the training process. We analyze the fixed point locations and their stability of the ODEs unveiling several interesting findings. First, only the hidden variable's second moment affects feature learnability at the state with uninformative initialization. Second, higher moments influence the probability of feature selection by controlling the attraction region, rather than affecting local stability. Finally, independent noises added in the data argumentation degrade performance but negatively correlated noise can reduces the variance of gradient estimation yielding better performance. Despite of the simplicity of the analyzed model, it exhibits a rich phenomena of training dynamics, paving a way to understand more complex mechanism behind practical large models.

The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMS) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLMs prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.

The quest for successful variational quantum machine learning (QML) relies on the design of suitable parametrized quantum circuits (PQCs), as analogues to neural networks in classical machine learning. Successful QML models must fulfill the properties of trainability and non-dequantization, among others. Recent works have highlighted an intricate interplay between trainability and dequantization of such models, which is still unresolved. In this work we contribute to this debate from the perspective of machine learning, proving a number of results identifying, among others when trainability and non-dequantization are not mutually exclusive. We begin by providing a number of new somewhat broader definitions of the relevant concepts, compared to what is found in other literature, which are operationally motivated, and consistent with prior art. With these precise definitions given and motivated, we then study the relation between trainability and dequantization of variational QML. Next, we also discuss the degrees of "variationalness" of QML models, where we distinguish between models like the hardware efficient ansatz and quantum kernel methods. Finally, we introduce recipes for building PQC-based QML models which are both trainable and nondequantizable, and corresponding to different degrees of variationalness. We do not address the practical utility for such models. Our work however does point toward a way forward for finding more general constructions, for which finding applications may become feasible.

Deriving exact density functions for Gibbs point processes has been challenging due to their general intractability, stemming from the intractability of their normalising constants/partition functions. This paper offers a solution to this open problem by exploiting a recent alternative representation of point process densities. Here, for a finite point process, the density is expressed as the void probability multiplied by a higher-order Papangelou conditional intensity function. By leveraging recent results on dependent thinnings, exact expressions for generating functionals and void probabilities of locally stable point processes are derived. Consequently, exact expressions for density/likelihood functions, partition functions and posterior densities are also obtained. The paper finally extends the results to locally stable Gibbsian random fields on lattices by representing them as point processes.

Mixture variational distributions in black box variational inference (BBVI) have demonstrated impressive results in challenging density estimation tasks. However, currently scaling the number of mixture components can lead to a linear increase in the number of learnable parameters and a quadratic increase in inference time due to the evaluation of the evidence lower bound (ELBO). Our two key contributions address these limitations. First, we introduce the novel Multiple Importance Sampling Variational Autoencoder (MISVAE), which amortizes the mapping from input to mixture-parameter space using one-hot encodings. Fortunately, with MISVAE, each additional mixture component incurs a negligible increase in network parameters. Second, we construct two new estimators of the ELBO for mixtures in BBVI, enabling a tremendous reduction in inference time with marginal or even improved impact on performance. Collectively, our contributions enable scalability to hundreds of mixture components and provide superior estimation performance in shorter time, with fewer network parameters compared to previous Mixture VAEs. Experimenting with MISVAE, we achieve astonishing, SOTA results on MNIST. Furthermore, we empirically validate our estimators in other BBVI settings, including Bayesian phylogenetic inference, where we improve inference times for the SOTA mixture model on eight data sets.

Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines by comparing the rankings of documents they retrieve for the same queries. In these situations, though, it is very frequent to find tied documents that have the same score. Unfortunately, the treatment of ties in RBO remains superficial and incomplete, in the sense that it is not clear how to calculate it from the ranking prefixes only. In addition, the existing way of dealing with ties is very different from the one traditionally followed in the field of Statistics, most notably found in rank correlation coefficients such as Kendall's and Spearman's. In this paper we propose a generalized formulation for RBO to handle ties, thanks to which we complete the original definitions by showing how to perform prefix evaluation. We also use it to fully develop two variants that align with the ones found in the Statistics literature: one when there is a reference ranking to compare to, and one when there is not. Overall, these three variants provide researchers with flexibility when comparing rankings with RBO, by clearly determining what ties mean, and how they should be treated. Finally, using both synthetic and TREC data, we demonstrate the use of these new tie-aware RBO measures. We show that the scores may differ substantially from the original tie-unaware RBO measure, where ties had to be broken at random or by arbitrary criteria such as by document ID. Overall, these results evidence the need for a proper account of ties in rank similarity measures such as RBO.

Green hydrogen is critical for decarbonising hard-to-electrify sectors, but faces high costs and investment risks. Here we define and quantify the green hydrogen ambition and implementation gap, showing that meeting hydrogen expectations will remain challenging despite surging announcements of projects and subsidies. Tracking 137 projects over three years, we identify a wide 2022 implementation gap with only 2% of global capacity announcements finished on schedule. In contrast, the 2030 ambition gap towards 1.5{\deg}C scenarios is gradually closing as the announced project pipeline has nearly tripled to 441 GW within three years. However, we estimate that, without carbon pricing, realising all these projects would require global subsidies of \$1.6 trillion (\$1.2 - 2.6 trillion range), far exceeding announced subsidies. Given past and future implementation gaps, policymakers must prepare for prolonged green hydrogen scarcity. Policy support needs to secure hydrogen investments, but should focus on applications where hydrogen is indispensable.

The primary objective of most lead optimization campaigns is to enhance the binding affinity of ligands. For large molecules such as antibodies, identifying mutations that enhance antibody affinity is particularly challenging due to the combinatorial explosion of potential mutations. When the structure of the antibody-antigen complex is available, relative binding free energy (RBFE) methods can offer valuable insights into how different mutations will impact the potency and selectivity of a drug candidate, thereby reducing the reliance on costly and time-consuming wet-lab experiments. However, accurately simulating the physics of large molecules is computationally intensive. We present an active learning framework that iteratively proposes promising sequences for simulators to evaluate, thereby accelerating the search for improved binders. We explore different modeling approaches to identify the most effective surrogate model for this task, and evaluate our framework both using pre-computed pools of data and in a realistic full-loop setting.

Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy using a one-time random selection of the data. However, by employing tailored sampling and estimation strategies, one can obtain more precise estimates and reduce annotation costs. In this paper, we propose a statistical framework for model evaluation that includes stratification, sampling, and estimation components. We examine the statistical properties of each component and evaluate their efficiency (precision). One key result of our work is that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators. Our experiments on computer vision datasets show that this method consistently provides more precise accuracy estimates than the traditional simple random sampling, even with substantial efficiency gains of 10x. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than the traditional estimates based solely on the labeled data.

The scope of this manuscript is to review some recent developments in statistics for discretely observed semimartingales which are motivated by applications for financial markets. Our journey through this area stops to take closer looks at a few selected topics discussing recent literature. We moreover highlight and explain the important role played by some classical concepts of probability and statistics. We focus on three main aspects: Testing for jumps; rough fractional stochastic volatility; and limit order microstructure noise. We review jump tests based on extreme value theory and complement the literature proposing new statistical methods. They are based on asymptotic theory of order statistics and the R\'{e}nyi representation. The second stage of our journey visits a recent strand of research showing that volatility is rough. We further investigate this and establish a minimax lower bound exploring frontiers to what extent the regularity of latent volatility can be recovered in a more general framework. Finally, we discuss a stochastic boundary model with one-sided microstructure noise for high-frequency limit order prices and its probabilistic and statistical foundation.

Monte Carlo methods, Variational Inference, and their combinations play a pivotal role in sampling from intractable probability distributions. However, current studies lack a unified evaluation framework, relying on disparate performance measures and limited method comparisons across diverse tasks, complicating the assessment of progress and hindering the decision-making of practitioners. In response to these challenges, our work introduces a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. Moreover, we study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose. Our findings provide insights into strengths and weaknesses of existing sampling methods, serving as a valuable reference for future developments. The code is publicly available here.

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We developed a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in the contemporary RLHF paradigms for training large language models (LLM). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts a reward-free exploration and best-arm-identification-like adaptive stopping criteria to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show: (i) sample-complexity-wise, RLHF is not significantly harder than classic RL and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls in reward inferring such as overfit and distribution shift.

This work is about estimating the hallucination rate for in-context learning (ICL) with Generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and asked to make a prediction based on that dataset. The Bayesian interpretation of ICL assumes that the CGM is calculating a posterior predictive distribution over an unknown Bayesian model of a latent parameter and data. With this perspective, we define a \textit{hallucination} as a generated prediction that has low-probability under the true latent parameter. We develop a new method that takes an ICL problem -- that is, a CGM, a dataset, and a prediction question -- and estimates the probability that a CGM will generate a hallucination. Our method only requires generating queries and responses from the model and evaluating its response log probability. We empirically evaluate our method on synthetic regression and natural language ICL tasks using large language models.

Trajectory inference seeks to recover the temporal dynamics of a population from snapshots of its (uncoupled) temporal marginals, i.e. where observed particles are not tracked over time. Lavenant et al. arXiv:2102.09204 addressed this challenging problem under a stochastic differential equation (SDE) model with a gradient-driven drift in the observed space, introducing a minimum entropy estimator relative to the Wiener measure. Chizat et al. arXiv:2205.07146 then provided a practical grid-free mean-field Langevin (MFL) algorithm using Schr\"odinger bridges. Motivated by the overwhelming success of observable state space models in the traditional paired trajectory inference problem (e.g. target tracking), we extend the above framework to a class of latent SDEs in the form of observable state space models. In this setting, we use partial observations to infer trajectories in the latent space under a specified dynamics model (e.g. the constant velocity/acceleration models from target tracking). We introduce PO-MFL to solve this latent trajectory inference problem and provide theoretical guarantees by extending the results of arXiv:2102.09204 to the partially observed setting. We leverage the MFL framework of arXiv:2205.07146, yielding an algorithm based on entropic OT between dynamics-adjusted adjacent time marginals. Experiments validate the robustness of our method and the exponential convergence of the MFL dynamics, and demonstrate significant outperformance over the latent-free method of arXiv:2205.07146 in key scenarios.

Synthesized data from generative models is increasingly considered as an alternative to human-annotated data for fine-tuning Large Language Models. This raises concerns about model collapse: a drop in performance of models fine-tuned on generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investigate the use of feedback on synthesized data to prevent model collapse. We derive theoretical conditions under which a Gaussian mixture classification model can achieve asymptotically optimal performance when trained on feedback-augmented synthesized data, and provide supporting simulations for finite regimes. We illustrate our theoretical predictions on two practical problems: computing matrix eigenvalues with transformers and news summarization with large language models, which both undergo model collapse when trained on model-generated data. We show that training from feedback-augmented synthesized data, either by pruning incorrect predictions or by selecting the best of several guesses, can prevent model collapse, validating popular approaches like RLHF.

The COVID-19 pandemic has compelled multinational corporations to diversify their global supply chain risk and to relocate their factories to Southeast Asian countries beyond China. Such recent phenomena provide a good opportunity to understand the factors that influenced offshore decisions in the last two decades. We propose a new conceptual framework based on econometric approaches to examine the relationships between these factors. Firstly, the Vector Auto Regression (VAR) for multi-way cointegration analysis by a Johansen test as well as the embedding Granger causality analysis to examine offshore decisions--innovation, technology readiness, infrastructure, foreign direct investment (FDI), and intermediate imports. Secondly, a Quantile Vector Autoregressive (QVAR) model is used to assess the dynamic connectedness among Southeast Asian countries based on the offshore factors. This study explores a system-wide experiment to evaluate the spillover effects of offshore decisions. It reports a comprehensive analysis using time-series data collected from the World Bank. The results of the cointegration, causality, and dynamic connectedness analyses show that a subset of Southeast Asian countries have spillover effects on each other. These countries present a multi-way cointegration and dynamic connectedness relationship. The study contributes to policymaking by providing a data-driven innovative approach through a new conceptual framework.

The advancement of deep learning technologies is bringing new models every day, motivating the study of scalable model selection. An ideal model selection scheme should minimally support two operations efficiently over a large pool of candidate models: update, which involves either adding a new candidate model or removing an existing candidate model, and selection, which involves locating highly performing models for a given task. However, previous solutions to model selection require high computational complexity for at least one of these two operations. In this work, we target fundamentally (more) scalable model selection that supports asymptotically fast update and asymptotically fast selection at the same time. Firstly, we define isolated model embedding, a family of model selection schemes supporting asymptotically fast update and selection: With respect to the number of candidate models $m$, the update complexity is O(1) and the selection consists of a single sweep over $m$ vectors in addition to O(1) model operations. Isolated model embedding also implies several desirable properties for applications. Secondly, we present Standardized Embedder, an empirical realization of isolated model embedding. We assess its effectiveness by using it to select representations from a pool of 100 pre-trained vision models for classification tasks and measuring the performance gaps between the selected models and the best candidates with a linear probing protocol. Experiments suggest our realization is effective in selecting models with competitive performances and highlight isolated model embedding as a promising direction towards model selection that is fundamentally (more) scalable.