The management of chronic Heart Failure (HF) presents significant challenges in modern healthcare, requiring continuous monitoring, early detection of exacerbations, and personalized treatment strategies. In this paper, we present a predictive model founded on Machine Learning (ML) techniques to identify patients at risk of HF. This model is an ensemble learning approach, a modified stacking technique, that uses two specialized models leveraging clinical and echocardiographic features, respectively, and a meta-model that combines their predictions. We initially assess the model on a real dataset, and the obtained results suggest that it performs well in stratifying patients at risk of HF. Specifically, we obtained high sensitivity (95\%), ensuring that nearly all high-risk patients are identified. As for accuracy, we obtained 84\%, which can be considered moderate in some ML contexts. However, it is acceptable given our priority of identifying patients at risk of HF, since these patients will be asked to participate in the telemonitoring program of the PrediHealth research project on which some of the authors of this paper are working. The initial findings also suggest that ML-based risk stratification models can serve as valuable decision-support tools, not only in the PrediHealth project but also for healthcare professionals, aiding in early intervention and personalized patient management. To better understand the value and potential of our predictive model, we also contrasted its results with those obtained by three baseline models. The preliminary results indicate that our predictive model outperforms these baselines, which treat features flatly, i.e., without grouping them into clinical and echocardiographic features.
Quantifying the causal influence of input features within neural networks (NNs) has become a topic of increasing interest. Existing approaches typically assess direct, indirect, and total causal effects. This work treats NNs as structural causal models (SCMs) and extends the focus to include intrinsic causal contributions (ICC). We propose an identifiable generative post-hoc framework for quantifying ICC. We also draw a relationship between ICC and Sobol' indices. Our experiments on synthetic and real-world datasets demonstrate that ICC generates more intuitive and reliable explanations compared to existing global explanation techniques.
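For context, the first-order Sobol' index of an input $X_i$ for an output $Y=f(X_1,\dots,X_d)$ with independent inputs is the classical variance-based sensitivity measure; the precise relationship to ICC established in the paper is not reproduced here:
\[
S_i \;=\; \frac{\operatorname{Var}\bigl(\mathbb{E}[\,Y \mid X_i\,]\bigr)}{\operatorname{Var}(Y)}, \qquad \sum_{i=1}^{d} S_i \le 1,
\]
with equality when $f$ is additive.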
This paper introduces the Difference-in-Differences Bayesian Causal Forest (DiD-BCF), a novel non-parametric model addressing key challenges in DiD estimation, such as staggered adoption and heterogeneous treatment effects. DiD-BCF provides a unified framework for estimating Average (ATE), Group-Average (GATE), and Conditional Average Treatment Effects (CATE). A core innovation, its Parallel Trends Assumption (PTA)-based reparameterization, enhances estimation accuracy and stability in complex panel data settings. Extensive simulations demonstrate DiD-BCF's superior performance over established benchmarks, particularly under non-linearity, selection biases, and effect heterogeneity. Applied to U.S. minimum wage policy, the model uncovers significant conditional treatment effect heterogeneity related to county population, insights obscured by traditional methods. DiD-BCF offers a robust and versatile tool for more nuanced causal inference in modern DiD applications.
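As background, the canonical two-period, two-group DiD contrast that the parallel trends assumption (PTA) justifies is the textbook estimand below (not the DiD-BCF reparameterization itself):
\[
\widehat{\mathrm{ATT}} \;=\; \bigl(\mathbb{E}[Y_{\mathrm{post}} \mid D=1]-\mathbb{E}[Y_{\mathrm{pre}} \mid D=1]\bigr)\;-\;\bigl(\mathbb{E}[Y_{\mathrm{post}} \mid D=0]-\mathbb{E}[Y_{\mathrm{pre}} \mid D=0]\bigr),
\]
which identifies the treatment effect on the treated when, absent treatment, the expected outcome trends of the two groups would have been parallel.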
The paper is devoted to the spreading of a message within a random graph evolving according to the Norros-Reittu preferential attachment model. In this model, the number of edges between a newly added node and each existing node is Poisson distributed. For a pre-fixed time $T^*$, we derive the probability mass functions of the number of nodes that have received the message and of the total number of nodes in the graph, as well as the distribution function of their ratio. To this end, the success probability of disseminating the message from a node with the message to a node without it is derived. The exposition is illustrated by a simulation study.
We consider the problem of estimating the difference between two multi-attribute Gaussian graphical models (GGMs) which are known to have similar structure, using a penalized D-trace loss function with non-convex penalties. The GGM structure is encoded in its precision (inverse covariance) matrix. Existing methods for multi-attribute differential graph estimation are based on a group-lasso-penalized loss function. In this paper, we consider a penalized D-trace loss function with non-convex (log-sum and smoothly clipped absolute deviation (SCAD)) penalties. Two proximal gradient descent methods are presented to optimize the objective function. Theoretical analysis establishing sufficient conditions for consistency in support recovery, convexity, and estimation in high-dimensional settings is provided. We illustrate our approaches with numerical examples based on synthetic and real data.
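As an illustration of the optimization template only, the following is a minimal sketch of an ISTA-style proximal gradient loop for the convex, lasso-penalized D-trace differential-graph loss (the standard formulation from the differential-graph literature); the paper's non-convex log-sum and SCAD penalties would replace the soft-thresholding step with the corresponding non-convex thresholding operators, and the variable names, step size, and toy data are illustrative assumptions.
\begin{verbatim}
import numpy as np

def soft_threshold(A, tau):
    """Elementwise soft-thresholding (prox of the lasso penalty)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def dtrace_differential_graph(S_x, S_y, lam=0.1, step=None, n_iter=500):
    """Proximal gradient (ISTA) for the lasso-penalized D-trace loss
        L(Delta) = 0.5*tr(Delta S_x Delta S_y) - tr(Delta (S_x - S_y)) + lam*||Delta||_1.
    At the population level with lam = 0, the minimizer is the difference
    of the two precision matrices."""
    p = S_x.shape[0]
    Delta = np.zeros((p, p))
    if step is None:
        # The gradient is Lipschitz with constant ||S_x||_2 * ||S_y||_2.
        step = 1.0 / (np.linalg.norm(S_x, 2) * np.linalg.norm(S_y, 2))
    for _ in range(n_iter):
        grad = 0.5 * (S_x @ Delta @ S_y + S_y @ Delta @ S_x) - (S_x - S_y)
        Delta = soft_threshold(Delta - step * grad, step * lam)
        Delta = 0.5 * (Delta + Delta.T)  # keep the iterate symmetric
    return Delta

# Toy usage: two covariances whose precision matrices differ in one edge.
Theta_x = np.eye(5)
Theta_y = np.eye(5); Theta_y[0, 1] = Theta_y[1, 0] = 0.4
S_x, S_y = np.linalg.inv(Theta_x), np.linalg.inv(Theta_y)
print(np.round(dtrace_differential_graph(S_x, S_y, lam=0.05), 2))
\end{verbatim}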
Accurate prediction of pure component physicochemical properties is crucial for process integration, multiscale modeling, and optimization. In this work, an enhanced framework for pure component property prediction using explainable machine learning methods is proposed. In this framework, a molecular representation method based on the connectivity matrix effectively considers atomic bonding relationships to automatically generate features. The supervised machine learning model random forest is applied for feature ranking and pooling. The adjusted $R^2$ is introduced to penalize the inclusion of additional features, providing an assessment of the true contribution of features. The prediction results for normal boiling point (Tb), liquid molar volume, critical temperature (Tc), and critical pressure (Pc) obtained using Artificial Neural Network and Gaussian Process Regression models confirm the accuracy of the molecular representation method. Comparison with group contribution (GC) based models shows that the root-mean-square error on the test set can be reduced by up to 83.8%. To enhance the interpretability of the model, a feature analysis method based on Shapley values is employed to determine the contribution of each feature to the property predictions. The results indicate that the feature pooling method reduces the number of features from 13316 to 100 without compromising model accuracy. The feature analysis results for Tb, Tc, and Pc confirm that different molecular properties are influenced by different structural features, aligning with mechanistic interpretations. In conclusion, the proposed framework is demonstrated to be feasible and provides a solid foundation for mixture component reconstruction and process integration modelling.
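A minimal sketch of the feature-ranking and adjusted-$R^2$ ideas described above, using scikit-learn on synthetic data; the framework's connectivity-matrix features, pooling procedure, and actual property datasets are not reproduced here, and all names below are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 penalizes the inclusion of additional features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

X, y = make_regression(n_samples=500, n_features=50, n_informative=8,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank features by random-forest importance, then grow the pool greedily.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
ranking = np.argsort(rf.feature_importances_)[::-1]

for k in (2, 5, 10, 20):
    cols = ranking[:k]
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    r2 = r2_score(y_te, model.predict(X_te[:, cols]))
    print(f"top-{k:2d} features: R2={r2:.3f}  "
          f"adjusted R2={adjusted_r2(r2, len(y_te), k):.3f}")
\end{verbatim}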
There are few inference methods available to accommodate covariate-dependent anisotropy in point process models. To address this, we propose an extended Bayesian MCMC approach for Neyman-Scott cluster processes. We focus on anisotropy and inhomogeneity in the offspring distribution. Our approach provides parameter estimates as well as significance tests for the covariates and anisotropy through credible intervals, which are determined by the posterior distributions. Additionally, it is possible to test the hypothesis of constant orientation of clusters or constant elongation of clusters. We demonstrate the applicability of this approach through a simulation study for a Thomas-type cluster process.
In many scientific and industrial applications, we are given a handful of instances (a 'small ensemble') of a spatially distributed quantity (a 'field') but would like to acquire many more. For example, a large ensemble of global temperature sensitivity fields from a climate model can help farmers, insurers, and governments plan appropriately. When acquiring more data is prohibitively expensive -- as is the case with climate models -- statistical emulation offers an efficient alternative for simulating synthetic yet realistic fields. However, parameter inference using maximum likelihood estimation (MLE) is computationally prohibitive, especially for large, non-stationary fields. Thus, many recent works train neural networks to estimate parameters given spatial fields as input, sidestepping MLE completely. In this work we focus on a popular class of parametric, spatially autoregressive (SAR) models. We make a simple yet impactful observation: because the SAR parameters can be arranged on a regular grid, both inputs (spatial fields) and outputs (model parameters) can be viewed as images. Using this insight, we demonstrate that image-to-image (I2I) networks enable faster and more accurate parameter estimation for a class of non-stationary SAR models with unprecedented complexity.
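To make the "parameters on a grid" observation concrete, the toy sketch below simulates a simple non-stationary SAR field on a regular grid in which the autoregressive parameter varies per cell, so that both the parameter map and the simulated field can be treated as images; the exact SAR parameterization and network architecture of the paper are not reproduced, and this particular model form is an assumption for illustration.
\begin{verbatim}
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def grid_adjacency(n):
    """Row-normalized 4-neighbour adjacency matrix of an n x n grid."""
    idx = np.arange(n * n).reshape(n, n)
    rows, cols = [], []
    for di, dj in ((0, 1), (1, 0)):
        a = idx[: n - di, : n - dj].ravel()
        b = idx[di:, dj:].ravel()
        rows += [*a, *b]; cols += [*b, *a]
    W = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n * n, n * n))
    return sp.diags(1.0 / np.asarray(W.sum(axis=1)).ravel()) @ W

def simulate_sar(rho_image, seed=0):
    """Draw one field from (I - diag(rho) W) x = eps, rho varying over the grid."""
    n = rho_image.shape[0]
    W = grid_adjacency(n)
    A = sp.eye(n * n) - sp.diags(rho_image.ravel()) @ W
    eps = np.random.default_rng(seed).standard_normal(n * n)
    return spsolve(sp.csr_matrix(A), eps).reshape(n, n)

n = 64
# The SAR parameter itself is an image: autocorrelation increasing left to right.
rho = np.tile(np.linspace(0.1, 0.9, n), (n, 1))
field = simulate_sar(rho)      # input "image" for an I2I parameter-estimation network
print(rho.shape, field.shape)  # both (64, 64): an image-to-image training pair
\end{verbatim}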
Multi-fidelity methods that use an ensemble of models to compute a Monte Carlo estimator of the expectation of a high-fidelity model can significantly reduce computational costs compared to single-model approaches. These methods use oracle statistics, specifically the covariance between models, to optimally allocate samples to each model in the ensemble. However, in practice, the oracle statistics are estimated using additional model evaluations, whose computational cost and induced error are typically ignored. To address this issue, this paper proposes an adaptive algorithm to optimally balance the resources between oracle statistics estimation and final multi-fidelity estimator construction, leveraging ideas from multilevel best linear unbiased estimators in Schaden and Ullmann (2020) and a bandit-learning procedure in Xu et al. (2022). Under mild assumptions, we demonstrate that the multi-fidelity estimator produced by the proposed algorithm exhibits mean-squared error commensurate with that of the best linear unbiased estimator under the optimal allocation computed with oracle statistics. Our theoretical findings are supported by detailed numerical experiments, including a parametric elliptic PDE and an ice-sheet mass-change modeling problem.
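For intuition on why the covariance between models (the "oracle statistics") matters, here is a minimal sketch of a classic two-model multifidelity control-variate estimator in which the covariance is itself estimated from a small pilot sample; the adaptive resource balancing, MLBLUE construction, and bandit-learning machinery of the paper are not reproduced, and the toy models below are assumptions for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def f_hi(x):   # expensive high-fidelity model (toy stand-in)
    return np.sin(3 * x) + 0.1 * x**2

def f_lo(x):   # cheap low-fidelity surrogate, correlated with f_hi
    return np.sin(3 * x)

# Pilot sample: estimate the "oracle" statistics (variances and covariance).
x_pilot = rng.uniform(-1, 1, 50)
hi, lo = f_hi(x_pilot), f_lo(x_pilot)
cov = np.cov(hi, lo)
alpha = cov[0, 1] / cov[1, 1]          # control-variate coefficient

# Multifidelity estimator: few high-fidelity, many low-fidelity evaluations.
x_small = rng.uniform(-1, 1, 100)      # shared high/low evaluations
x_large = rng.uniform(-1, 1, 10_000)   # extra cheap low-fidelity evaluations
mf_estimate = (f_hi(x_small).mean()
               + alpha * (f_lo(x_large).mean() - f_lo(x_small).mean()))

print("plain MC (100 hi samples):", f_hi(x_small).mean())
print("multifidelity estimate   :", mf_estimate)
\end{verbatim}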
There are many open questions pertaining to the statistical analysis of random objects, which are increasingly encountered. A major challenge is the absence of linear operations in such spaces. A basic statistical task is to quantify statistical dispersion or spread. For two measures of dispersion for data objects in geodesic metric spaces, Fr\'echet variance and metric variance, we derive a central limit theorem (CLT) for their joint distribution. This analysis reveals that the Alexandrov curvature of the geodesic space determines the relationship between these two dispersion measures. This suggests a novel test for inferring the curvature of a space based on the asymptotic distribution of the dispersion measures. We demonstrate how this test can be employed to detect the intrinsic curvature of an unknown underlying space, which emerges as a joint property of the space and the underlying probability measure that generates the random objects. We investigate the asymptotic properties of the test and its finite-sample behavior for various data types, including distributional data and point cloud data. We illustrate the proposed inference for intrinsic curvature of random objects using gait synchronization data represented as symmetric positive definite matrices and energy compositional data on the sphere.
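For readers unfamiliar with the two dispersion measures, under a common convention they are defined, for a random object $X$ in a metric space $(\Omega,d)$ with $X'$ an independent copy of $X$, as
\[
V_F \;=\; \min_{\omega\in\Omega} \mathbb{E}\,d^2(X,\omega), \qquad V_M \;=\; \tfrac{1}{2}\,\mathbb{E}\,d^2(X,X'),
\]
both of which reduce to the trace of the covariance matrix in Euclidean spaces; it is how the two measures diverge in curved spaces that the proposed curvature test exploits.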
This paper proposes a robust and computationally efficient estimation framework for fitting parametric distributions based on trimmed L-moments. Trimmed L-moments extend classical L-moment theory by downweighting or excluding extreme order statistics, resulting in estimators that are less sensitive to outliers and heavy tails. We construct estimators for both location-scale and shape parameters using asymmetric trimming schemes tailored to different moments, and establish their asymptotic properties for inferential justification using the general structural theory of L-statistics, deriving simplified single-integration expressions to ensure numerical stability. State-of-the-art algorithms are developed to resolve the sign ambiguity in estimating the scale parameter for location-scale models and the tail index for the Fr\'echet model. The proposed estimators offer improved efficiency over traditional robust alternatives for selected asymmetric trimming configurations, while retaining closed-form expressions for a wide range of common distributions, facilitating fast and stable computation. Simulation studies demonstrate strong finite-sample performance. An application to financial claim severity modeling highlights the practical relevance and flexibility of the approach.
In this paper, we derive closed-form estimators for the parameters of certain exponential family distributions through the maximum a posteriori (MAP) equations. A Monte Carlo simulation is conducted to assess the performance of the proposed estimators. The results show that, as expected, their accuracy improves with increasing sample size, with both bias and mean squared error approaching zero. Moreover, the proposed estimators exhibit performance comparable to that of traditional MAP and maximum likelihood (ML) estimators. A notable advantage of the proposed method lies in its computational simplicity, as it eliminates the need for numerical optimization required by MAP and ML estimation.
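As a concrete textbook instance of the kind of closed-form MAP estimator considered, take an exponential likelihood with a conjugate Gamma$(\alpha,\beta)$ prior on the rate: the posterior is Gamma$(\alpha+n,\ \beta+\sum_i x_i)$, so
\[
\hat\lambda_{\mathrm{MAP}} = \frac{n+\alpha-1}{\sum_i x_i + \beta}, \qquad
\hat\lambda_{\mathrm{ML}} = \frac{n}{\sum_i x_i}.
\]
The small Monte Carlo check below compares the two; whether this particular distribution is among those treated in the paper is not assumed.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
lam_true, alpha, beta = 2.0, 2.0, 1.0

for n in (20, 200, 2000):
    sums = np.array([rng.exponential(1.0 / lam_true, size=n).sum()
                     for _ in range(5000)])
    map_est = (n + alpha - 1) / (sums + beta)   # closed-form MAP
    ml_est = n / sums                           # maximum likelihood
    print(f"n={n:5d}  MAP bias={map_est.mean()-lam_true:+.4f}  "
          f"ML bias={ml_est.mean()-lam_true:+.4f}")
\end{verbatim}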
Mover-stayer models are used in social sciences and economics to model heterogeneous population dynamics in which some individuals never experience the event of interest ("stayers"), while others transition between states over time ("movers"). Conventionally, the mover-stayer status is determined at baseline and time-dependent covariates are only incorporated in the movers' transition probabilities. In this paper, we present a novel dynamic version of the mover-stayer model, allowing potential movers to become stayers over time based on time-varying circumstances. Using a multinomial logistic framework, our model incorporates both time-fixed and exogenous time-varying covariates to estimate transition probabilities among the states of potential movers, movers, and stayers. Both the initial state and transitions to the stayer state are treated as latent. The introduction of this new model is motivated by the study of student mobility. Specifically focusing on panel data on the inter-university mobility of Italian students, factors such as the students' change of course and university size are considered as time-varying covariates in modelling their probability of moving or becoming stayers; sex and age at enrolment as time-fixed covariates. We propose a maximum likelihood estimation approach and investigate its finite-sample performance through simulations, comparing it to established models in the literature.
Portfolio optimization involves selecting asset weights to minimize a risk-reward objective, such as the portfolio variance in the classical minimum-variance framework. Sparse portfolio selection extends this by imposing a cardinality constraint: only $k$ assets from a universe of $p$ may be included. The standard approach models this problem as a mixed-integer quadratic program and relies on commercial solvers to find the optimal solution. However, the computational costs of such methods increase exponentially with $k$ and $p$, making them too slow for problems of even moderate size. We propose a fast and scalable gradient-based approach that transforms the combinatorial sparse selection problem into a constrained continuous optimization task via Boolean relaxation, while preserving equivalence with the original problem on the set of binary points. Our algorithm employs a tunable parameter that transmutes the auxiliary objective from a convex to a concave function. This allows a stable convex starting point, followed by a controlled path toward a sparse binary solution as the tuning parameter increases and the objective moves toward concavity. In practice, our method matches commercial solvers in asset selection for most instances and, in rare instances, the solution differs by a few assets whilst showing a negligible error in portfolio variance.
Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories. We investigate the benefits of using path gradients to fine-tune CNFs initially trained by Flow Matching, in the setting where a target energy is known. Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems, all while using the same model, a similar computational budget and without the need for additional sampling. Furthermore, by measuring the length of the flow trajectories during fine-tuning, we show that path gradients largely preserve the learned structure of the flow.
We introduce the first one-stage Top-$k$ Learning-to-Defer framework, which unifies prediction and deferral by learning a shared score-based model that selects the $k$ most cost-effective entities (labels or experts) per input. While existing one-stage L2D methods are limited to deferring to a single expert, our approach jointly optimizes prediction and deferral across multiple entities through a single end-to-end objective. We define a cost-sensitive loss and derive a novel convex surrogate that is independent of the cardinality parameter $k$, enabling generalization across Top-$k$ regimes without retraining. Our formulation recovers the Top-1 deferral policy of prior score-based methods as a special case, and we prove that our surrogate is both Bayes-consistent and $\mathcal{H}$-consistent under mild assumptions. We further introduce an adaptive variant, Top-$k(x)$, which dynamically selects the number of consulted entities per input to balance predictive accuracy and consultation cost. Experiments on CIFAR-10 and SVHN confirm that our one-stage Top-$k$ method strictly outperforms Top-1 deferral, while Top-$k(x)$ achieves superior accuracy-cost trade-offs by tailoring allocations to input complexity.
TV customers today face many choices from many live channels and on-demand services. Providing a personalised experience that saves customers time when discovering content is essential for TV providers. However, a reliable understanding of their behaviour and preferences is key. When creating personalised recommendations for TV, the biggest challenge is understanding viewing behaviour within households when multiple people are watching. The objective is to detect and combine individual profiles to make better-personalised recommendations for group viewing. Our challenge is that we have little explicit information about who is watching the devices at any time (individuals or groups). Also, we do not have a way to combine more than one individual profile to make better recommendations for group viewing. We propose a novel framework using a Gaussian mixture model averaging to obtain point estimates for the number of household TV profiles and a Bayesian random walk model to introduce uncertainty. We applied our approach using data from real customers whose TV-watching data totalled approximately half a million observations. Our results indicate that combining our framework with the selected features provides a means to estimate the number of household TV profiles and their characteristics, including shifts over time and quantification of uncertainty.
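A minimal sketch of the Gaussian-mixture model-averaging step for estimating the number of household profiles, using scikit-learn with BIC-implied weights on synthetic features; the Bayesian random-walk component, the real viewing features, and the exact averaging scheme of the framework are not reproduced and are assumptions for illustration.
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy "viewing behaviour" features for one household with three latent profiles.
X = np.vstack([rng.normal(loc, 0.5, size=(150, 2))
               for loc in ((0, 0), (3, 0), (0, 3))])

ks = np.arange(1, 8)
bics = np.array([GaussianMixture(n_components=k, n_init=5,
                                 random_state=0).fit(X).bic(X) for k in ks])

# Model averaging: BIC-implied weights over candidate numbers of profiles.
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()
print("weights over k:", np.round(w, 3))
print("point estimate:", ks[np.argmax(w)], " averaged:", np.round(w @ ks, 2))
\end{verbatim}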
Basket trials have gained increasing attention for their efficiency, as multiple patient subgroups are evaluated simultaneously. Basket trials conducted to date have focused primarily on establishing the early efficacy of a treatment, yet continued monitoring of toxicity is essential. In this paper, we propose two Bayesian hierarchical models that enable bivariate analyses of toxicity and efficacy, while accounting for the heterogeneity present in treatment effects across patient subgroups. Specifically, one model assumes that the subgroup-specific toxicity and efficacy treatment effects, taken as a parameter vector, can be exchangeable or non-exchangeable; the other allows either the toxicity or the efficacy parameters specific to the subgroups to be exchangeable or non-exchangeable. The bivariate exchangeability and non-exchangeability distributions introduce a correlation parameter between treatment effects, while we stipulate zero correlation when only the toxicity or the efficacy parameters are exchangeable. Simulation results show that our models perform robustly under different scenarios compared to the standard Bayesian hierarchical model and stand-alone analyses, especially in producing higher power when the subgroup-specific effects are exchangeable in toxicity or efficacy only. When considerable correlation between the toxicity and efficacy effects exists, our methodology gives smaller error rates and greater power than alternatives that analyse toxicity and efficacy separately.
The European Union Emissions Trading System (EU ETS) is a key policy tool for reducing greenhouse gas emissions and advancing toward a net-zero economy. Under this scheme, tradeable carbon credits, European Union Allowances (EUAs), are issued to large emitters, who can buy and sell them on regulated markets. We investigate the influence of financial, economic, and energy-related factors on EUA futures prices using discrete and dynamic Bayesian networks to model both contemporaneous and time-lagged dependencies. The analysis is based on daily data spanning the third and fourth ETS trading phases (2013-2025), incorporating a wide range of indicators including energy commodities, equity indices, exchange rates, and bond markets. Results reveal that EUA pricing is most influenced by energy commodities, especially coal and oil futures, and by the performance of the European energy sector. Broader market sentiment, captured through stock indices and volatility measures, affects EUA prices indirectly via changes in energy demand. The dynamic model confirms a modest next-day predictive influence from oil markets, while most other effects remain contemporaneous. These insights offer regulators, institutional investors, and firms subject to ETS compliance a clearer understanding of the interconnected forces shaping the carbon market, supporting more effective hedging, investment strategies, and policy design.
Bayesian inference with Markov Chain Monte Carlo (MCMC) is challenging when the likelihood function is irregular and expensive to compute. We explore several sampling algorithms that make use of subset evaluations to reduce computational overhead. We adapt the subset samplers for this setting where gradient information is not available or is unreliable. To achieve this, we introduce data-driven proxies in place of Taylor expansions and define a novel computation-cost aware adaptive controller. We undertake an extensive evaluation for a challenging disease modelling task and a configurable task with similar irregularity in the likelihood surface. We find our improved version of Hierarchical Importance with Nested Training Samples (HINTS), with adaptive proposals and a data-driven proxy, obtains the best sampling error in a fixed computational budget. We conclude that subset evaluations can provide cheap and naturally-tempered exploration, while a data-driven proxy can pre-screen proposals successfully in explored regions of the state space. These two elements combine through hierarchical delayed acceptance to achieve efficient, exact sampling.
Multi-modal and high-dimensional posteriors present significant challenges for variational inference, causing mode-seeking behavior and collapse despite the theoretical expressiveness of normalizing flows. Traditional annealing methods require temperature schedules and hyperparameter tuning, falling short of the goal of truly black-box variational inference. We introduce FlowVAT, a conditional tempering approach for normalizing flow variational inference that addresses these limitations. Our method tempers both the base and target distributions simultaneously, maintaining affine-invariance under tempering. By conditioning the normalizing flow on temperature, we leverage overparameterized neural networks' generalization capabilities to train a single flow representing the posterior across a range of temperatures. This preserves modes identified at higher temperatures when sampling from the variational posterior at $T = 1$, mitigating standard variational methods' mode-seeking behavior. In experiments with 2, 10, and 20 dimensional multi-modal distributions, FlowVAT outperforms traditional and adaptive annealing methods, finding more modes and achieving better ELBO values, particularly in higher dimensions where existing approaches fail. Our method requires minimal hyperparameter tuning and does not require an annealing schedule, advancing toward fully-automatic black-box variational inference for complicated posteriors.
Causal discovery (CD) aims to discover the causal graph underlying the data generation mechanism of observed variables. In many real-world applications, the observed variables are vector-valued, such as in climate science where variables are defined over a spatial grid and the task is called spatio-temporal causal discovery. We motivate CD in the vector-valued variable setting while considering different possibilities for the underlying model, and highlight the pitfalls of commonly-used approaches when compared to a fully vectorized approach. Furthermore, often the vector-valued variables are high-dimensional, and aggregations of the variables, such as averages, are considered in the interest of efficiency and robustness. In the absence of interventional data, testing for the soundness of aggregate variables as consistent abstractions that map a low-level to a high-level structural causal model (SCM) is hard, and recent works have illustrated the stringency of conditions required for testing consistency. In this work, we take a careful look at the task of vector-valued CD via constraint-based methods, focusing on the problem of consistency of aggregation for this task. We derive three aggregation consistency scores, based on compatibility of independence models and (partial) aggregation, that quantify different aspects of the soundness of an aggregation map for the CD problem. We present the argument that the consistency of causal abstractions must be separated from the task-dependent consistency of aggregation maps. As an actionable conclusion of our findings, we propose a wrapper, Adag, to optimize a chosen aggregation consistency score for aggregate-CD, to make the output of CD over aggregate variables more reliable. We supplement all our findings with experimental evaluations on synthetic non-time series and spatio-temporal data.
We study sequential decision-making in batched nonparametric contextual bandits, where actions are selected over a finite horizon divided into a small number of batches. Motivated by constraints in domains such as medicine and marketing -- where online feedback is limited -- we propose a nonparametric algorithm that combines adaptive k-nearest neighbor (k-NN) regression with the upper confidence bound (UCB) principle. Our method, BaNk-UCB, is fully nonparametric, adapts to the context dimension, and is simple to implement. Unlike prior work relying on parametric or binning-based estimators, BaNk-UCB uses local geometry to estimate rewards and adaptively balances exploration and exploitation. We provide near-optimal regret guarantees under standard Lipschitz smoothness and margin assumptions, using a theoretically motivated batch schedule that balances regret across batches and achieves minimax-optimal rates. Empirical evaluations on synthetic and real-world datasets demonstrate that BaNk-UCB consistently outperforms binning-based baselines.
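A simplified, fully online (non-batched) sketch of the k-NN-plus-UCB idea; the batch schedule, the adaptive choice of k, and the theoretical constants of BaNk-UCB are not reproduced, and the environment, bonus term, and parameter values below are illustrative assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_arms, k, horizon, d = 3, 10, 3000, 2
history = {a: {"X": [], "r": []} for a in range(n_arms)}

def mean_reward(a, x):  # unknown environment (toy): arm quality varies with context
    return 0.5 + 0.4 * np.cos(2 * np.pi * (x[0] - a / n_arms))

for t in range(1, horizon + 1):
    x = rng.uniform(size=d)
    ucb = np.full(n_arms, np.inf)   # force initial exploration of each arm
    for a in range(n_arms):
        X, r = history[a]["X"], history[a]["r"]
        if len(r) >= k:
            dists = np.linalg.norm(np.asarray(X) - x, axis=1)
            nn = np.argsort(dists)[:k]
            # k-NN mean + exploration bonus + bias term (neighbourhood radius)
            ucb[a] = (np.mean(np.asarray(r)[nn])
                      + np.sqrt(2 * np.log(t) / k)
                      + dists[nn].max())
    a = int(np.argmax(ucb))
    reward = mean_reward(a, x) + 0.1 * rng.standard_normal()
    history[a]["X"].append(x); history[a]["r"].append(reward)

print({a: len(history[a]["r"]) for a in range(n_arms)})
\end{verbatim}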
Bayesian inference provides a principled framework for probabilistic reasoning. If inference is performed in two steps, uncertainty propagation plays a crucial role in accounting for all sources of uncertainty and variability. This becomes particularly important when both aleatoric uncertainty, caused by data variability, and epistemic uncertainty, arising from incomplete knowledge or missing data, are present. Examples include surrogate models and missing data problems. In surrogate modeling, the surrogate is used as a simplified approximation of a resource-heavy and costly simulation. The uncertainty from the surrogate-fitting process can be propagated using a two-step procedure. For modeling with missing data, methods like Multivariate Imputation by Chained Equations (MICE) generate multiple datasets to account for imputation uncertainty. These approaches, however, are computationally expensive, as multiple models must be fitted separately to the surrogate parameter draws or to the imputed datasets, respectively. To address these challenges, we propose an efficient two-step approach that reduces computational overhead while maintaining accuracy. By selecting a representative subset of draws or imputations, we construct a mixture distribution to approximate the desired posteriors using Pareto smoothed importance sampling. For more complex scenarios, this is further refined with importance weighted moment matching and an iterative procedure that broadens the mixture distribution to better capture diverse posterior distributions.
The design-based paradigm may be adopted in causal queries and survey sampling when we assume Rubin's stable unit treatment value assumption (SUTVA) or impose similar frameworks. While often taken for granted, such assumptions entail strong claims about the data generating process. We develop an alternative design-based approach: we first invoke a generalized, non-parametric model that allows for unrestricted forms of interference, such as spillover. We define a new set of inferential targets and discuss their interpretation under SUTVA and a weaker assumption that we call the No Unmodeled Revealable Variation Assumption (NURVA). We then reconstruct the standard paradigm, reconsidering SUTVA at the end rather than assuming it at the beginning. Despite its similarity to SUTVA, we demonstrate the practical insufficiency of NURVA in identifying substantively interesting quantities. In so doing, we provide clarity on the nature and importance of SUTVA for applied research.
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm, which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
The anomaly detection literature is abundant with offline methods, which require repeated access to data in memory and impose impractical assumptions when applied to a streaming context. Existing online anomaly detection methods also generally fail to address these constraints, resorting to periodic retraining to adapt to the online context. We propose Online-iForest, a novel method explicitly designed for streaming conditions that seamlessly tracks the data generating process as it evolves over time. Experimental validation on real-world datasets demonstrated that Online-iForest is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, Online-iForest consistently outperforms all competitors in terms of efficiency, making it a promising solution in applications where fast identification of anomalies is of primary importance, such as cybersecurity, fraud, and fault detection.
Climate policy and legislation have a significant influence on both domestic and global responses to the pressing environmental challenges of our time. The effectiveness of such climate legislation is closely tied to the complex dynamics among elected officials, a dynamic significantly shaped by the relentless efforts of lobbying. This project aims to develop a novel compartmental model to forecast the trajectory of climate legislation within the United States. By understanding the dynamics surrounding floor votes, the ramifications of lobbying, and the flow of campaign donations within the chambers of the U.S. Congress, we aim to validate our model through a comprehensive case study of the American Clean Energy and Security Act (ACESA). Our model captures the nonlinear dynamics among diverse legislative factions, including centrists, ardent supporters, and vocal opponents of the bill, culminating in rich dynamics of final voting outcomes. We conduct a stability analysis of the model, estimating parameters from public lobbying records and a robust body of existing literature. The numerical verification against the pivotal 2009 ACESA vote, alongside contemporary research, underscores the model's promising potential as a tool to understand the dynamics of climate lobbying. We also analyse pathways of the model with the aim of guiding future legislative endeavors in the pursuit of effective climate action.
We describe an algorithm for sampling a low-rank random matrix $Q$ that best approximates a fixed target matrix $P\in\mathbb{C}^{n\times m}$ in the following sense: $Q$ is unbiased, i.e., $\mathbb{E}[Q] = P$; $\mathsf{rank}(Q)\leq r$; and $Q$ minimizes the expected Frobenius norm error $\mathbb{E}\|P-Q\|_F^2$. Our algorithm mirrors the solution to the efficient unbiased sparsification problem for vectors, except applied to the singular components of the matrix $P$. Optimality is proven by showing that our algorithm matches the error from an existing lower bound.
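To illustrate the flavour of the problem, the sketch below draws an unbiased rank-at-most-$r$ approximation by importance sampling of the singular components; this simple scheme satisfies $\mathbb{E}[Q] = P$ and $\mathsf{rank}(Q)\leq r$, but it is not claimed to be the variance-optimal algorithm of the paper.
\begin{verbatim}
import numpy as np

def unbiased_low_rank_sample(P, r, rng):
    """Unbiased random Q with rank(Q) <= r: sample r singular components
    (with replacement) proportionally to the singular values and reweight."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    probs = s / s.sum()
    idx = rng.choice(len(s), size=r, p=probs)
    Q = np.zeros_like(P)
    for i in idx:
        Q += (s[i] / (r * probs[i])) * np.outer(U[:, i], Vt[i, :])
    return Q

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 6)) + 1j * rng.standard_normal((8, 6))
# Empirical unbiasedness check: the average of many draws approaches P.
Q_mean = sum(unbiased_low_rank_sample(P, r=2, rng=rng) for _ in range(20000)) / 20000
print("relative error of E[Q]:", np.linalg.norm(Q_mean - P) / np.linalg.norm(P))
\end{verbatim}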
We propose a new framework for multi-agent reinforcement learning (MARL), where the agents cooperate in a time-evolving network with latent community structures and mixed memberships. Unlike traditional neighbor-based or fixed interaction graphs, our community-based framework captures flexible and abstract coordination patterns by allowing each agent to belong to multiple overlapping communities. Each community maintains shared policy and value functions, which are aggregated by individual agents according to personalized membership weights. We also design actor-critic algorithms that exploit this structure: agents inherit community-level estimates for policy updates and value learning, enabling structured information sharing without requiring access to other agents' policies. Importantly, our approach supports both transfer learning by adapting to new agents or tasks via membership estimation, and active learning by prioritizing uncertain communities during exploration. Theoretically, we establish convergence guarantees under linear function approximation for both actor and critic updates. To our knowledge, this is the first MARL framework that integrates community structure, transferability, and active learning with provable guarantees.
Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
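A minimal sketch of the clustering-and-evaluation portion of such a pipeline; embed_records is a hypothetical placeholder for whichever embedding model is used (e.g., a quantized LLM or a Stella encoder), and the serialization of real patient records into text is not reproduced here.
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def embed_records(texts):
    """Hypothetical placeholder: return one embedding vector per serialized
    record. In practice this would call an LLM/embedding model."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 384))  # stand-in embeddings

records = [f"age: {a}; weight-for-age z: {z:.1f}; ..."
           for a, z in zip(range(100), np.linspace(-3, 1, 100))]
E = embed_records(records)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(E)
    print(f"k={k}: silhouette={silhouette_score(E, labels):.3f}")
\end{verbatim}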
Enterprise networks are growing ever larger with a rapidly expanding attack surface, increasing the volume of security alerts generated from security controls. Security Operations Centre (SOC) analysts triage these alerts to identify malicious activity, but they struggle with alert fatigue due to the overwhelming number of benign alerts. Organisations are turning to managed SOC providers, where the problem is amplified by context switching and limited visibility into business processes. A novel system, named AACT, is introduced that automates SOC workflows by learning from analysts' triage actions on cybersecurity alerts. It accurately predicts triage decisions in real time, allowing benign alerts to be closed automatically and critical ones prioritised. This reduces the SOC queue, allowing analysts to focus on the most severe, relevant, or ambiguous threats. The system has been trained and evaluated on both real SOC data and an open dataset, obtaining high performance in distinguishing malicious from benign alerts. Additionally, the system has demonstrated high accuracy in a real SOC environment, reducing alerts shown to analysts by 61% over six months, with a low false negative rate of 1.36% over millions of alerts.
The sales process involves sales functions converting leads or opportunities to customers and selling more products to existing customers. The optimization of the sales process is thus key to the success of any B2B business. In this work, we introduce a principled approach to sales optimization and business AI, namely Causal Predictive Optimization and Generation, which includes three layers: 1) a prediction layer with causal ML, 2) an optimization layer with constraint optimization and contextual bandits, and 3) a serving layer with Generative AI and a feedback loop for system enhancement. We detail the implementation and deployment of the system at LinkedIn, showcasing significant wins over legacy systems and sharing learnings and insights broadly applicable to this field.
Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing business and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and the integration of external macro factors. We detail the large-scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learnings and insights that are broadly applicable to the marketing and ad tech fields.
Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages a large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, as reflected in questionnaire results showing 88.89% agreement. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.
This note establishes that the opposite Gaussian product inequality (GPI) of the type proved by Russell & Sun (2022a) in two dimensions, and partially extended to higher dimensions by Zhou et al. (2024), continues to hold for an arbitrary mix of positive and negative exponents. A general quantitative lower bound is also obtained conditionally on the GPI conjecture being true.
Many multi-variate time series obtained in the natural sciences and engineering possess a repetitive behavior, such as state-space trajectories of industrial machines in discrete automation. Recovering the times of recurrence from such a multi-variate time series is of fundamental importance for many monitoring and control tasks. For a periodic time series this is equivalent to determining its period length. In this work we present a persistent homology framework to estimate recurrence times in multi-variate time series with different generalizations of cyclic behavior (periodic, repetitive, and recurring). To this end, we provide three specialized methods within our framework that are provably stable and validate them using real-world data, including a new benchmark dataset from an injection molding machine.
Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
Scrambling quantum systems have been demonstrated as effective substrates for temporal information processing. While their role in providing rich feature maps has been widely studied, a theoretical understanding of their performance in temporal tasks is still lacking. Here we consider a general quantum reservoir processing framework that captures a broad range of physical computing models with quantum systems. We examine the scalability and memory retention of the model with scrambling reservoirs modelled by high-order unitary designs in both noiseless and noisy settings. In the former regime, we show that measurement readouts become exponentially concentrated with increasing reservoir size, yet strikingly do not worsen with the reservoir iterations. Thus, while repeatedly reusing a small scrambling reservoir with quantum data might be viable, scaling up the problem size deteriorates generalization unless one can afford an exponential shot overhead. In contrast, the memory of early inputs and initial states decays exponentially in both reservoir size and reservoir iterations. In the noisy regime, we also prove exponential memory decays with iterations for local noisy channels. Proving these results required us to introduce new proof techniques for bounding concentration in temporal quantum learning models.
With the widespread adoption of Large Language Models (LLMs), there is a growing need to establish best practices for leveraging their capabilities beyond traditional natural language tasks. In this paper, a novel cross-domain knowledge transfer framework is proposed to enhance the performance of LLMs in time series forecasting -- a task of increasing relevance in fields such as energy systems, finance, and healthcare. The approach systematically infuses LLMs with structured temporal information to improve their forecasting accuracy. This study evaluates the proposed method on a real-world time series dataset and compares it to a naive baseline where the LLM receives no auxiliary information. Results show that knowledge-informed forecasting significantly outperforms the uninformed baseline in terms of predictive accuracy and generalization. These findings highlight the potential of knowledge transfer strategies to bridge the gap between LLMs and domain-specific forecasting tasks.
Hebbian learning is a key principle underlying learning in biological neural networks. It postulates that synaptic changes occur locally, depending on the activities of pre- and postsynaptic neurons. While Hebbian learning based on neuronal firing rates is well explored, much less is known about learning rules that account for precise spike-timing. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a natural loss function on the probability simplex. This connection allows us to prove that the learning rule eventually identifies the presynaptic neuron with the highest activity. We also discover an intrinsic connection to noisy mirror descent.
Two maximum likelihood-based algorithms for unfolding or deconvolution are considered: the Richardson-Lucy method and the Data Unfolding method with Mean Integrated Square Error (MISE) optimization [10]. Unfolding is viewed as a procedure for estimating an unknown probability density function. Both external and internal quality assessment methods can be applied for this purpose. In some cases, external criteria exist to evaluate deconvolution quality; a typical example is the deconvolution of a blurred image, where the sharpness of the restored image serves as an indicator of quality. However, defining such external criteria can be challenging, particularly when a measurement has not been performed previously. In such instances, internal criteria are necessary to assess the quality of the result independently of external information. The article discusses two internal criteria: the MISE of the unfolded distribution and the condition number of the correlation matrix of the unfolded distribution. These internal quality criteria are applied to a comparative analysis of the two methods using identical numerical data. The results of the analysis demonstrate the superiority of the Data Unfolding method with MISE optimization over the Richardson-Lucy method.
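For reference, a minimal discrete Richardson-Lucy unfolding iteration (the first of the two methods compared) is sketched below; R is the detector response matrix with columns summing to one, d the measured spectrum, and the toy smearing example is an illustrative assumption. The MISE-optimized Data Unfolding method is not reproduced here.
\begin{verbatim}
import numpy as np

def richardson_lucy(d, R, n_iter=200):
    """Richardson-Lucy unfolding: d is the measured spectrum, R the response
    matrix with R[i, j] = P(observed bin i | true bin j), columns sum to one."""
    u = np.full(R.shape[1], d.sum() / R.shape[1])   # flat starting estimate
    for _ in range(n_iter):
        folded = R @ u                              # prediction in measured space
        u = u * (R.T @ (d / np.maximum(folded, 1e-12)))
    return u

# Toy example: Gaussian smearing of a two-peak true spectrum.
nbins = 40
x = np.arange(nbins)
true = 100 * (np.exp(-0.5 * ((x - 12) / 3) ** 2)
              + 0.6 * np.exp(-0.5 * ((x - 28) / 4) ** 2))
R = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 2.5) ** 2)
R /= R.sum(axis=0, keepdims=True)                   # normalize columns
d = np.random.default_rng(0).poisson(R @ true)
print(np.round(richardson_lucy(d, R)[:10], 1))
\end{verbatim}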
Conventional score-based diffusion models (DMs) may struggle with anisotropic Gaussian diffusion processes due to the required inversion of covariance matrices in the denoising score matching training objective \cite{vincent_connection_2011}. We propose Whitened Score (WS) diffusion models, a novel SDE-based framework that learns the Whitened Score function instead of the standard score. This approach circumvents covariance inversion, extending score-based DMs by enabling stable training of DMs on arbitrary Gaussian forward noising processes. WS DMs establish equivalence with flow matching (FM) for arbitrary Gaussian noise, allow for tailored spectral inductive biases, and provide strong Bayesian priors for imaging inverse problems with structured noise. We experiment with a variety of computational imaging tasks using the CIFAR and CelebA ($64\times64$) datasets and demonstrate that WS diffusion priors trained on anisotropic Gaussian noising processes consistently outperform conventional diffusion priors based on isotropic Gaussian noise.
There are now a multitude of AI-scribing solutions for healthcare promising the utilization of large language models for ambient documentation. However, these AI scribes still rely on one-shot or few-shot prompts for generating notes after the consultation has ended, employing little to no reasoning. This risks long notes with an increase in hallucinations, misrepresentation of the clinician's intent, and reliance on the clinician's proofreading to catch errors, a dangerous combination for patient safety if vigilance is compromised by workload and fatigue. In this paper, we introduce a method for extracting salient clinical information in real time alongside the healthcare consultation, denoted Facts, and use that information recursively to generate the final note. The FactsR method results in more accurate and concise notes by placing the clinician in the loop of note generation, while opening up new use cases within real-time decision support.
For many economic questions, the empirical results are not interesting unless they are strong. For these questions, theorizing before the results are known is not always optimal. Instead, the optimal sequencing of theory and empirics trades off a ``Darwinian Learning'' effect from theorizing first with a ``Statistical Learning'' effect from examining the data first. This short paper formalizes the tradeoff in a Bayesian model. In the modern era of mature economic theory and enormous datasets, I argue that post hoc theorizing is typically optimal.
We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF, which combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high-dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF compares favorably with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolating points in the preference space.
Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.