The project Agriculture 4.0 without chemical-synthetic plant protection (NOcsPS) tests a number of cropping systems that avoid the use of chemical-synthetic pesticides while at the same time using mineral fertilizers. The experiment started in 2020 (sowing fall 2019). In 2024 (sowing fall 2023), some of the cropping systems were modified. Analysis of this experiment may be done using linear mixed models. In order to include the data from 2020-2023 in joint analyses with the data collected for the modified systems from 2024 onwards, the mixed modelling approach needs to be reconsidered. In this paper, we develop models for this purpose. A key feature is the use of network meta-analytic concepts that allow a combination of direct and indirect comparisons among systems from the different years. The approach is first illustrated using a toy example. This is followed by detailed analyses of data from the two trial sites Dahnsdorf and Hohenheim.
We introduce a modified Benamou-Brenier type approach leading to a Wasserstein type distance that allows for global invariance, specifically invariance under isometries, and we show that the problem can be reduced to orthogonal transformations. This distance is defined by penalizing the action with a costless movement of the particle that does not change the direction and speed of its trajectory. We show that for Gaussian distributions the distance reduces to measuring the Euclidean distance between their ordered vectors of eigenvalues, and we show a direct application in recovering latent Gaussian distributions.
We consider estimating the proportion of random variables for two types of composite null hypotheses: (i) the means or medians of the random variables belonging to a non-empty, bounded interval; (ii) the means or medians of the random variables belonging to an unbounded interval that is not the whole real line. For each type of composite null hypothesis, uniformly consistent estimators of the proportion of false null hypotheses are constructed for random variables whose distributions are members of a Type I location-shift family. Further, uniformly consistent estimators of certain functions of a bounded null on the means or medians are provided for the random variables mentioned earlier; these functions are continuous and of bounded variation. The estimators are constructed via solutions to Lebesgue-Stieltjes integral equations and harmonic analysis, do not rely on a concept of p-value, and have various applications.
Predicting the risk of clinical progression from cognitively normal (CN) status to mild cognitive impairment (MCI) or Alzheimer's disease (AD) is critical for early intervention. Traditional survival models often fail to capture complex longitudinal biomarker patterns associated with disease progression. We propose an ensemble survival analysis framework integrating multiple survival models to improve early prediction of clinical progression in initially cognitively normal individuals. We analyzed longitudinal biomarker data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort, including 721 participants, limiting analysis to up to three visits (baseline, 6-month follow-up, 12-month follow-up). Of these, 142 (19.7%) experienced clinical progression to MCI or AD. Our approach combined penalized Cox regression (LASSO, Elastic Net) with advanced survival models (Random Survival Forest, DeepSurv, XGBoost). Model predictions were aggregated using ensemble averaging and Bayesian Model Averaging (BMA). Predictive performance was assessed using Harrell's concordance index (C-index) and time-dependent area under the curve (AUC). The ensemble model achieved a peak C-index of 0.907 and an integrated time-dependent AUC of 0.904, outperforming baseline-only models (C-index 0.608). One follow-up visit after baseline significantly improved prediction accuracy (48.1% C-index, 48.2% AUC gains), while adding a second follow-up provided only marginal gains (2.1% C-index, 2.7% AUC). Our ensemble survival framework effectively integrates diverse survival models and aggregation techniques to enhance early prediction of preclinical AD progression. These findings highlight the importance of leveraging longitudinal biomarker data, particularly one follow-up visit, for accurate risk stratification and personalized intervention strategies.
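As a schematic of the aggregation step only (not the paper's exact pipeline), the following sketch rank-normalises risk scores from several stand-in survival models, averages them, and scores the ensemble with a from-scratch Harrell's C-index; all data and model outputs here are simulated placeholders.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's concordance index for right-censored data. A pair (i, j)
    is comparable if the subject with the shorter observed time had the
    event; it is concordant if that subject also has the higher risk."""
    n, num, den = len(time), 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:  # comparable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# Toy follow-up data: observed time and event indicator (1 = progression).
rng = np.random.default_rng(0)
time = rng.exponential(24, size=50)
event = rng.integers(0, 2, size=50)

# Stand-ins for risk scores from Cox-LASSO, RSF, DeepSurv, XGBoost, ...
model_risks = [rng.normal(size=50) for _ in range(4)]

# Ensemble averaging: rank-normalise each model's scores before averaging,
# so models with different score scales contribute comparably.
ranks = [np.argsort(np.argsort(r)) / len(r) for r in model_risks]
ensemble_risk = np.mean(ranks, axis=0)

print("ensemble C-index:", round(harrell_c_index(time, event, ensemble_risk), 3))
```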
Recent workshops brought together several developers, educators and users of software packages extending popular languages for spatial data handling, with a primary focus on R, Python and Julia. Common challenges discussed included handling of spatial or spatio-temporal support, geodetic coordinates, in-memory vector data formats, data cubes, inter-package dependencies, packaging upstream libraries, differences in habits or conventions between the GIS and physical modelling communities, and statistical models. The following insights were formulated: (i) considering software problems across data science language silos helps to understand and standardise analysis approaches, also outside the domain of formal standardisation bodies; (ii) whether attribute variables have block or point support, and whether they are spatially intensive or extensive, has consequences for permitted operations, and hence for software implementing those; (iii) handling geometries on the sphere rather than on the flat plane requires modifications to the logic of {\em simple features}; (iv) managing communities and fostering diversity is a necessary, on-going effort; and (v) tools for cross-language development need more attention and support.
We introduce Binacox+, an advanced extension of the Binacox method for prognostic analysis of high-dimensional survival data, enabling the detection of multiple cut-points per feature. The original Binacox method leverages the Cox proportional hazards model, combining one-hot encoding with the binarsity penalty to simultaneously perform feature selection and cut-point detection. In this work, we enhance Binacox by incorporating a novel penalty term based on the L1 norm of coefficients for cumulative binarization, defined over a set of pre-specified, context-dependent cut-point candidates. This new penalty not only improves interpretability but also significantly reduces computational time and enhances prediction performance compared to the original method. We conducted extensive simulation studies to evaluate the statistical and computational properties of Binacox+ in comparison to Binacox. Our simulation results demonstrate that Binacox+ achieves superior performance in important cut-point detection, particularly in high-dimensional settings, while drastically reducing computation time. As a case study, we applied both methods to three real-world genomic cancer datasets from The Cancer Genome Atlas (TCGA). The empirical results confirm that Binacox+ outperforms Binacox in risk prediction accuracy and computational efficiency, making it a powerful tool for survival analysis in high-dimensional biomedical data.
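A minimal sketch of the cumulative binarization idea behind such a penalty, using simulated data and scikit-learn's Lasso as a stand-in for the penalized Cox partial-likelihood fit; the cut-point candidates, data, and tuning value are illustrative assumptions, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in for a penalized Cox fit

def cumulative_binarize(x, cutpoints):
    """Encode a continuous feature as indicators 1{x >= c_k} over a
    pre-specified grid c_1 < ... < c_K. A nonzero coefficient on column
    k corresponds to a jump in the effect of x at c_k, so an L1 penalty
    on these coefficients selects cut-points."""
    return (x[:, None] >= np.asarray(cutpoints)[None, :]).astype(float)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
cuts = np.arange(1, 10)           # candidate cut-points 1, ..., 9

Z = cumulative_binarize(x, cuts)  # 200 x 9 design block for this feature
# True effect jumps at x = 3 and x = 7; L1 should zero out other columns.
y = 1.5 * (x >= 3) - 2.0 * (x >= 7) + rng.normal(scale=0.3, size=200)

coef = Lasso(alpha=0.05).fit(Z, y).coef_
print(np.round(coef, 2))  # mass should concentrate near the true cut-points
```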
The USDA Forest Inventory and Analysis (FIA) program conducts a national forest inventory for the United States through a network of permanent field plots. FIA produces estimates of area averages/totals for plot-measured forest variables through design-based inference, assuming a fixed population and a probability sample of field plot locations. The fixed-population assumption and characteristics of the FIA sampling scheme make it difficult to estimate change in forest variables over time using design-based inference. We propose spatial-temporal models based on Gaussian processes as a flexible tool for forest inventory data, capable of inferring forest variables and change thereof over arbitrary spatial and temporal domains. It is shown to be beneficial for the covariance function governing the latent Gaussian process to account for variation at multiple scales, separating spatially local variation from ecosystem-scale variation. We demonstrate a model for forest biomass density, inferring 20 years of biomass change within two US National Forests.
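A hedged illustration of a multi-scale covariance of the kind described: summing a short-range and a long-range kernel separates spatially local variation from ecosystem-scale variation. The kernel forms, ranges, and variances below are illustrative choices, not the paper's fitted model.

```python
import numpy as np

def two_scale_cov(coords, sigma2_local=1.0, range_local=5.0,
                  sigma2_eco=2.0, range_eco=150.0, nugget=0.1):
    """Sum of two exponential covariances: a short-range component for
    spatially local variation plus a long-range component for
    ecosystem-scale variation (distances in km, say)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    K = (sigma2_local * np.exp(-d / range_local)
         + sigma2_eco * np.exp(-d / range_eco))
    return K + nugget * np.eye(len(coords))

# Simulate a latent biomass-density surface at random plot locations.
rng = np.random.default_rng(2)
coords = rng.uniform(0, 300, size=(100, 2))
field = rng.multivariate_normal(np.zeros(100), two_scale_cov(coords))
print(field[:5])
```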
Massive network datasets are becoming increasingly common in scientific applications. Existing community detection methods encounter significant computational challenges for such massive networks due to two reasons. First, the full network needs to be stored and analyzed on a single server, leading to high memory costs. Second, existing methods typically use matrix factorization or iterative optimization using the full network, resulting in high runtimes. We propose a strategy called \textit{predictive assignment} to enable computationally efficient community detection while ensuring statistical accuracy. The core idea is to avoid large-scale matrix computations by breaking up the task into a smaller matrix computation plus a large number of vector computations that can be carried out in parallel. Under the proposed method, community detection is carried out on a small subgraph to estimate the relevant model parameters. Next, each remaining node is assigned to a community based on these estimates. We prove that predictive assignment achieves strong consistency under the stochastic blockmodel and its degree-corrected version. We also demonstrate the empirical performance of predictive assignment on simulated networks and two large real-world datasets: DBLP (Digital Bibliography \& Library Project), a computer science bibliographical database, and the Twitch Gamers Social Network.
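The following sketch illustrates the predictive-assignment idea under a stochastic blockmodel, with illustrative choices of spectral clustering, subgraph, and toy parameters (the paper's estimators may differ in detail):

```python
import numpy as np
from sklearn.cluster import KMeans

def predictive_assignment(A, sub, k):
    """Sketch of predictive assignment under a stochastic blockmodel.
    1. Run spectral clustering on the small subgraph A[sub][:, sub].
    2. Estimate block connection probabilities B from the subgraph.
    3. Assign each remaining node independently (parallelisable): pick
       the community maximising the Bernoulli log-likelihood of its
       edges into the subgraph communities."""
    A_sub = A[np.ix_(sub, sub)]
    vals, vecs = np.linalg.eigh(A_sub)
    z_sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs[:, -k:])

    B = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            B[a, b] = A_sub[np.ix_(z_sub == a, z_sub == b)].mean()
    B = np.clip(B, 1e-6, 1 - 1e-6)

    sizes = np.bincount(z_sub, minlength=k)
    z = np.empty(A.shape[0], dtype=int)
    z[sub] = z_sub
    for v in np.setdiff1d(np.arange(A.shape[0]), sub):  # cheap vector step
        e = np.array([A[v, sub][z_sub == a].sum() for a in range(k)])
        ll = np.log(B) @ e + np.log(1 - B) @ (sizes - e)
        z[v] = np.argmax(ll)
    return z

# Toy SBM: two communities of 150 nodes, subgraph of 60 nodes.
rng = np.random.default_rng(3)
z_true = np.repeat([0, 1], 150)
P = np.where(z_true[:, None] == z_true[None, :], 0.15, 0.03)
A = (rng.uniform(size=(300, 300)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T
labels = predictive_assignment(A, sub=np.arange(0, 300, 5), k=2)
acc = (labels == z_true).mean()
print("agreement up to label switching:", round(max(acc, 1 - acc), 2))
```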
Accurate survival prediction models are essential for improving targeted cancer therapies and clinical care among cancer patients. In this article, we investigate and develop a method to improve predictions of survival in cancer by leveraging two-phase data with expert knowledge and a prognostic index. Our work is motivated by two-phase data in nasopharyngeal cancer (NPC), where traditional covariates are readily available for all subjects, but the primary viral factor, Human Papillomavirus (HPV), is substantially missing. To address this challenge, we propose an expert-guided method that incorporates a prognostic index based on the observed covariates and the clinical importance of key factors. The proposed method makes efficient use of available data, not simply discarding patients with unknown HPV status. We apply the proposed method and evaluate it against other existing approaches through a series of simulation studies and a real-data example of NPC patients. Under various settings, the proposed method consistently outperforms competing methods in terms of C-index, calibration slope, and integrated Brier score. By efficiently leveraging two-phase data, the model provides a more accurate and reliable predictive ability of survival models.
We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices and subsequently observe their respective demand, which is unobservable to competitors. The demand function for each seller depends on all sellers' prices through a private, unknown, and nonlinear relationship. To address this challenge, we propose a semi-parametric least-squares estimation of the nonlinear mean function, which does not require sellers to communicate demand information. We show that when all sellers employ our policy, their prices converge at a rate of $O(T^{-1/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Each seller incurs a regret of $O(T^{5/7})$ relative to a dynamic benchmark policy. A theoretical contribution of our work is proving the existence of equilibrium under shape-constrained demand functions via the concept of $s$-concavity and establishing regret bounds of our proposed policy. Technically, we also establish new concentration results for the least squares estimator under shape constraints. Our findings offer significant insights into dynamic competition-aware pricing and contribute to the broader study of non-parametric learning in strategic decision-making.
This paper presents several forecasting methods to model and forecast the subnational age distribution of death counts. The age distribution of death counts has many similarities to probability density functions, which are nonnegative and have a constrained integral, and thus live in a constrained nonlinear space. To address the nonlinear nature of these objects, we implement a cumulative distribution function transformation with an additional monotonicity constraint. Using the Japanese subnational life-table death counts obtained from the Japanese Mortality Database (2025), we evaluate the forecast accuracy of the transformation and forecasting methods. The improved forecast accuracy of life-table death counts implemented here will be of great interest to demographers in estimating regional age-specific survival probabilities and life expectancy.
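A minimal sketch of one such transformation (probit of the within-age CDF; the paper's exact transformation may differ), showing the round trip from death counts to an unconstrained representation, where forecasting can proceed freely, and back:

```python
import numpy as np
from scipy.stats import norm

def to_unconstrained(dx):
    """Life-table death counts -> unconstrained reals. Normalise to a
    density, cumulate to a monotone CDF in (0, 1), then apply the
    probit so forecasts live in an unrestricted space."""
    p = dx / dx.sum()
    F = np.cumsum(p)[:-1]           # drop the final 1.0; it is fixed
    return norm.ppf(np.clip(F, 1e-12, 1 - 1e-12))

def from_unconstrained(z):
    """Invert: inverse probit, append 1, difference back to a density.
    Monotonicity is enforced by sorting, keeping counts nonnegative."""
    F = np.append(norm.cdf(np.sort(z)), 1.0)
    return np.diff(np.concatenate([[0.0], F]))

dx = np.array([50, 10, 5, 8, 20, 60, 120, 300, 350, 77.0])  # toy d_x
z = to_unconstrained(dx)            # forecast z with any time-series model
dx_back = from_unconstrained(z) * dx.sum()
print(np.allclose(dx, dx_back))     # the round trip recovers the input
```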
Iterative learning procedures are ubiquitous in machine learning and modern statistics. Regularisation is typically required to prevent inflating the expected loss of a procedure in later iterations via the propagation of noise inherent in the data. Significant emphasis has been placed on achieving this regularisation implicitly by stopping procedures early. The EarlyStopping-package provides a toolbox of (in-sample) sequential early stopping rules for several well-known iterative estimation procedures, such as truncated SVD, Landweber (gradient descent), conjugate gradient descent, L2-boosting and regression trees. One of the central features of the package is that the algorithms allow the specification of the true data-generating process and keep track of relevant theoretical quantities. In this paper, we detail the principles governing the implementation of the EarlyStopping-package and provide a survey of recent foundational advances in the theoretical literature. We demonstrate how to use the EarlyStopping-package to explore core features of implicit regularisation and replicate results from the literature.
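The package's own API is not reproduced here; instead, a self-contained sketch of the core idea it implements: sequential early stopping of the Landweber (gradient descent) iteration via the discrepancy principle, on a toy problem with a known data-generating process.

```python
import numpy as np

def landweber_early_stop(A, y, delta, step=None, max_iter=5000):
    """Landweber iteration x_{m+1} = x_m + step * A^T (y - A x_m),
    stopped at the first m with residual ||y - A x_m|| <= delta
    (discrepancy principle; delta = noise level)."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # ensures convergence
    x = np.zeros(A.shape[1])
    for m in range(max_iter):
        r = y - A @ x
        if np.linalg.norm(r) <= delta:           # sequential stopping rule
            return x, m
        x = x + step * (A.T @ r)
    return x, max_iter

# Mildly ill-posed toy problem: singular values decaying like j^{-1/2}.
rng = np.random.default_rng(4)
n = 100
U = np.linalg.svd(rng.normal(size=(n, n)))[0]
A = U * (np.arange(1, n + 1) ** -0.5)
x_true = np.ones(n)
noise = 0.01 * rng.normal(size=n)
y = A @ x_true + noise
x_hat, tau = landweber_early_stop(A, y, delta=np.linalg.norm(noise))
print("stopped at iteration", tau)
```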
Multi-omics integration offers novel insights into complex biological mechanisms by utilizing the fused information from various omics datasets. However, the inherent within- and inter-modality correlations in multi-omics data present significant challenges for traditional variable selection methods, such as Lasso regression. These correlations can lead to multicollinearity, compromising the stability and interpretability of selected variables. To address these problems, we introduce the Multi-View Orthogonal Projection Regression (MVOPR), a novel approach for variable selection in multi-omics analysis. MVOPR leverages the unidirectional associations among omics layers, inspired by the Central Dogma of Molecular Biology, to transform predictors into an uncorrelated feature space. This orthogonal projection framework effectively mitigates the correlations, allowing penalized regression models to operate on independent components. Through simulations under both well-specified and misspecified scenarios, MVOPR demonstrates superior performance in variable selection, outperforming traditional Lasso-based methods and factor-based models. In real-data analysis on the CAARS dataset, MVOPR consistently identifies biologically relevant features, including the Bacteroidaceae family and key metabolites, which align well with known asthma biomarkers. These findings illustrate MVOPR's ability to enhance variable selection while offering biologically interpretable insights, offering a robust tool for integrative multi-omics research.
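A schematic of the orthogonal-projection idea (illustrative data and tuning; the actual MVOPR construction may differ): the downstream omics layer is projected onto the orthogonal complement of the upstream layer's column space before penalized regression, so the two blocks no longer compete through multicollinearity.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p1, p2 = 200, 10, 30

# Upstream layer (e.g., microbiome) and a downstream layer (e.g.,
# metabolome) that partly inherits its variation, inducing
# inter-modality correlation.
X1 = rng.normal(size=(n, p1))
X2 = X1 @ rng.normal(size=(p1, p2)) + rng.normal(size=(n, p2))

# Project the downstream layer onto the orthogonal complement of the
# upstream layer's column space: the residual is uncorrelated with X1.
Q, _ = np.linalg.qr(X1)
X2_perp = X2 - Q @ (Q.T @ X2)

y = 2 * X1[:, 0] + 1.5 * X2_perp[:, 0] + rng.normal(size=n)

# Penalized regression now operates on (nearly) orthogonal blocks,
# avoiding the instability of plain Lasso on the raw correlated design.
design = np.hstack([X1, X2_perp])
coef = Lasso(alpha=0.1).fit(design, y).coef_
print(np.nonzero(np.abs(coef) > 0.1)[0])  # should recover features 0 and 10
```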
In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining tradeoffs between valid methods.
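For reference, the baseline split-conformal computation that any calibration-selection strategy builds on; this is the standard construction (the paper's contribution concerns which calibration data may validly be used after online selection):

```python
import numpy as np

def split_conformal_interval(cal_resid, pred, alpha=0.1):
    """Baseline split-conformal interval: given calibration residuals
    |y_i - yhat_i| exchangeable with the test residual, the interval
    pred +/- q covers with probability >= 1 - alpha, where q is the
    ceil((n+1)(1-alpha))-th smallest calibration residual."""
    n = len(cal_resid)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(cal_resid)[min(k, n) - 1]
    return pred - q, pred + q

rng = np.random.default_rng(6)
cal_resid = np.abs(rng.normal(size=99))
print(split_conformal_interval(cal_resid, pred=2.0, alpha=0.1))
```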
Personalized services are central to today's digital landscape, where online decision-making is commonly formulated as contextual bandit problems. Two key challenges emerge in modern applications: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We address these challenges by developing a contextual bandit algorithm based on sparse additive reward models in reproducing kernel Hilbert spaces. We establish statistical properties of the doubly penalized method applied to random regions, introducing novel analyses under bandit feedback. Our algorithm achieves sublinear cumulative regret over the time horizon $T$ while scaling logarithmically with covariate dimensionality $d$. Notably, we provide the first regret upper bound with logarithmic growth in $d$ for nonparametric contextual bandits with high-dimensional covariates. We also establish a lower bound, with the gap to the upper bound vanishing as smoothness increases. Extensive numerical experiments demonstrate our algorithm's superior performance in high-dimensional settings compared to existing approaches.
In regression analysis, associations between continuous predictors and the outcome are often assumed to be linear. However, modeling the associations as non-linear can improve model fit. Many flexible modeling techniques, like (fractional) polynomials and spline-based approaches, are available. Such methods can be systematically compared in simulation studies, which require suitable performance measures to evaluate the accuracy of the estimated curves against the true data-generating functions. Although various measures have been proposed in the literature, no systematic overview exists so far. To fill this gap, we introduce a categorization of performance measures for evaluating estimated non-linear associations between an outcome and continuous predictors. This categorization includes many commonly used measures. The measures can not only be used in simulation studies, but also in application studies to compare different estimates to each other. We further illustrate and compare the behavior of different performance measures through some examples and a Shiny app.
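A small sketch of one common family of such measures, pointwise-error summaries of an estimated curve against the data-generating function over an evaluation grid (the paper's categorization covers more than these; names and the toy functions are illustrative):

```python
import numpy as np

def curve_performance(f_true, f_hat, grid):
    """Pointwise-error summaries comparing an estimated association
    f_hat to the data-generating function f_true over an evaluation
    grid: one simple family of performance measures for simulations."""
    err = f_hat(grid) - f_true(grid)
    return {"RMSE": np.sqrt(np.mean(err ** 2)),
            "MAE": np.mean(np.abs(err)),
            "MaxAE": np.max(np.abs(err))}

grid = np.linspace(0, 1, 201)
f_true = lambda x: np.sin(2 * np.pi * x)
f_hat = lambda x: np.sin(2 * np.pi * x) + 0.05 * x   # a slightly biased fit
print(curve_performance(f_true, f_hat, grid))
```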
Exposure assessment in occupational epidemiology may involve multiple unknown quantities that are measured or reconstructed simultaneously for groups of workers and over several years. Additionally, exposures may be collected using different assessment strategies, depending on the period of exposure. As a consequence, researchers who are analyzing occupational cohort studies are commonly faced with challenging structures of exposure measurement error, involving complex dependence structures and multiple measurement error models, depending on the period of exposure. However, previous work has often made many simplifying assumptions concerning these errors. In this work, we propose a Bayesian hierarchical approach to account for a broad range of error structures arising in occupational epidemiology. The considered error structures may involve several unknown quantities that can be subject to mixtures of Berkson and classical measurement error. It is possible to account for different error structures, depending on the exposure period and the location of a worker. Moreover, errors can present complex dependence structures over time and between workers. We illustrate the proposed hierarchical approach on a subgroup of the German cohort of uranium miners to account for potential exposure uncertainties in the association between radon exposure and lung cancer mortality. The performance of the proposed approach and its sensitivity to model misspecification are evaluated in a simulation study. The results show that biases in estimates arising from very complex measurement errors can be corrected through the proposed Bayesian hierarchical approach.
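A toy contrast of the two error types involved (illustrative parameters; the paper's hierarchical model handles mixtures and complex dependence far beyond this): classical error attenuates a naive exposure-response slope, while Berkson error leaves it approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sd = 10_000, 0.5

# Berkson error: a shared value w (e.g., a job-/site-level estimate) is
# assigned, and the true exposure scatters around it: X = W + U.
# Regressing the outcome on W leaves the slope unbiased.
w = rng.normal(2.0, 1.0, size=n)
x_true_b = w + rng.normal(0, sd, size=n)
y_b = 1.0 + 0.8 * x_true_b + rng.normal(0, 1, size=n)

# Classical error: the measurement scatters around the truth: W = X + U.
# Regressing on W attenuates the slope toward zero.
x_true = rng.normal(2.0, 1.0, size=n)
w_classical = x_true + rng.normal(0, sd, size=n)
y_c = 1.0 + 0.8 * x_true + rng.normal(0, 1, size=n)

slope = lambda x, y: np.polyfit(x, y, 1)[0]
print("Berkson  :", round(slope(w, y_b), 3))            # ~ 0.80
print("classical:", round(slope(w_classical, y_c), 3))  # ~ 0.8/(1+sd^2) = 0.64
```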
This paper presents a computational method for generating virtual 3D morphologies of functional materials using low-parametric stochastic geometry models, i.e., digital twins, calibrated with 2D microscopy images. These digital twins allow systematic parameter variations to simulate various morphologies that can be deployed for virtual materials testing by means of spatially resolved numerical simulations of macroscopic properties. Generative adversarial networks (GANs) have gained popularity for calibrating models to generate realistic 3D morphologies. However, GANs often comprise numerous uninterpretable parameters, which makes systematic variation of morphologies for virtual materials testing challenging. In contrast, low-parametric stochastic geometry models (e.g., based on Gaussian random fields) enable targeted variation but may struggle to mimic complex morphologies. Combining GANs with advanced stochastic geometry models (e.g., excursion sets of more general random fields) addresses these limitations, allowing model calibration solely from 2D image data. This approach is demonstrated by generating a digital twin of all-solid-state battery (ASSB) cathodes. Since the digital twins are parametric, they support systematic exploration of structural scenarios and their macroscopic properties. The proposed method facilitates simulation studies for optimizing 3D morphologies, benefiting not only ASSB cathodes but also other materials with similar structures.
The Friedman test has been extensively applied as a nonparametric alternative to the conventional F procedure for comparing treatment effects in randomized complete block designs. A chi-square distribution provides a convenient approximation to determining the critical values for the Friedman procedure in hypothesis testing. However, the chi-square approximation is generally conservative and the accuracy declines with increasing number of treatments. This paper describes an alternative transformation of the Friedman statistic along with an approximate F distribution that has the same numerator degrees of freedom as the ANOVA F test. Moreover, two approximate noncentral F distributions are presented for the proposed F-transformation under the alternative hypothesis of heterogeneous location shifts. Explicit power functions are derived when the underlying populations have the uniform, normal, Laplace, and exponential distributions. Theoretical examination and empirical assessment are presented to validate the advantages of the proposed approaches over the existing methods of the Friedman test. The developed test and power procedures are recommended due to their consistently acceptable Type I error rates and accurate power calculations for the location shift structures and population distributions considered here.
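For context, with $n$ blocks, $k$ treatments, and mean within-block ranks $\bar R_j$, the Friedman statistic and the classical Iman-Davenport F-transformation, against which alternatives of this kind are developed, are
\[
\chi_F^2 \;=\; \frac{12n}{k(k+1)} \sum_{j=1}^{k} \Bigl( \bar R_j - \frac{k+1}{2} \Bigr)^2,
\qquad
F_{\mathrm{ID}} \;=\; \frac{(n-1)\,\chi_F^2}{n(k-1) - \chi_F^2},
\]
with $F_{\mathrm{ID}}$ referred to an $F$ distribution with $k-1$ and $(k-1)(n-1)$ degrees of freedom. The transformation proposed in the paper differs, but likewise shares the numerator degrees of freedom $k-1$ with the ANOVA F test.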
This study proposes a mathematical programming-based algorithm for the integrated selection of variable subsets and bandwidth estimation in geographically weighted regression, a local regression method that allows the kernel bandwidth and regression coefficients to vary across study areas. Unlike standard approaches in the literature, in which bandwidth and regression parameters are estimated separately for each focal point on the basis of different criteria, our model uses a single objective function for the integrated estimation of regression and bandwidth parameters across all focal points, based on the regression likelihood function and variance modeling. The proposed model further integrates a procedure to select a single subset of independent variables for all focal points, whereas existing approaches may return heterogeneous subsets across focal points. We then propose an alternating direction method to solve the nonconvex mathematical model and show that it converges to a partial minimum. The computational experiment indicates that the proposed algorithm provides competitive explanatory power with stable spatially varying patterns, with the ability to select the best subset and account for additional constraints.
Solving multiple related parametrised systems is an essential component of many numerical tasks. Borrowing strength from previously solved systems through learning can make this process faster. In this work, we propose a novel probabilistic linear solver over the parameter space. This leverages information from the solved linear systems in a regression setting to provide an efficient posterior mean and covariance. We advocate using this as a companion regression model for the preconditioned conjugate gradient method, and discuss the favourable properties of the posterior mean and covariance as the initial guess and preconditioner. We also provide several design choices for this companion solver. Numerical experiments showcase the benefits of using our novel solver in a hyperparameter optimisation problem.
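A toy sketch of the companion-regression idea (the system, kernel, and hyperparameters are illustrative assumptions, not the paper's design): solutions of previously solved systems are interpolated over the parameter space by a Gaussian-process posterior mean, which then serves as the initial guess for conjugate gradients at a new parameter value.

```python
import numpy as np
from scipy.sparse.linalg import cg

def rbf(t, s, ell=0.5):
    return np.exp(-(t[:, None] - s[None, :]) ** 2 / (2 * ell ** 2))

def A(theta):
    """Toy parametrised SPD system: diagonally dominant tridiagonal."""
    n = 50
    return (np.diag(2.0 + theta + 0.1 * np.arange(n))
            + 0.3 * np.eye(n, k=1) + 0.3 * np.eye(n, k=-1))

b = np.ones(50)
thetas = np.array([0.0, 0.5, 1.0, 1.5])
X = np.column_stack([np.linalg.solve(A(t), b) for t in thetas])  # solved systems

# GP posterior mean over the parameter space: predict the solution at a
# new theta from the previously solved systems (kernel regression form).
theta_new = np.array([0.8])
K = rbf(thetas, thetas) + 1e-8 * np.eye(len(thetas))
x0 = (X @ np.linalg.solve(K, rbf(thetas, theta_new))).ravel()

# Use the prediction as the initial guess for conjugate gradients.
iters = []
x, info = cg(A(theta_new[0]), b, x0=x0, callback=lambda xk: iters.append(1))
print("CG iterations with learned initial guess:", len(iters))
```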
The partitioning of data for estimation and calibration critically impacts the performance of propensity score based estimators like inverse probability weighting (IPW) and double/debiased machine learning (DML) frameworks. We extend recent advances in calibration techniques for propensity score estimation, improving the robustness of propensity scores in challenging settings such as limited overlap, small sample sizes, or unbalanced data. Our contributions are twofold: First, we provide a theoretical analysis of the properties of calibrated estimators in the context of DML. To this end, we refine existing calibration frameworks for propensity score models, with a particular emphasis on the role of sample-splitting schemes in ensuring valid causal inference. Second, through extensive simulations, we show that calibration reduces the variance of propensity score weighting estimators while also mitigating bias in IPW, even in small-sample regimes. Notably, calibration improves stability for flexible learners (e.g., gradient boosting) while preserving the doubly robust properties of DML. A key insight is that, even when methods perform well without calibration, incorporating a calibration step does not degrade performance, provided that an appropriate sample-splitting approach is chosen.
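One concrete sample-splitting scheme with a monotone calibration step, as a hedged sketch (the learners, split ratios, and clipping bounds are illustrative choices, not the paper's prescription):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n, d = 2000, 5
X = rng.normal(size=(n, d))
ps_true = 1 / (1 + np.exp(-1.5 * X[:, 0]))      # induces limited overlap
D = (rng.uniform(size=n) < ps_true).astype(int)

# Three-way split: train the learner, calibrate it, then use it --
# one simple sample-splitting scheme among those that can be compared.
i_tr, i_rest = train_test_split(np.arange(n), test_size=0.5, random_state=0)
i_cal, i_use = train_test_split(i_rest, test_size=0.5, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X[i_tr], D[i_tr])
raw_cal = gbm.predict_proba(X[i_cal])[:, 1]
iso = IsotonicRegression(out_of_bounds="clip", y_min=0.01, y_max=0.99)
iso.fit(raw_cal, D[i_cal])                       # monotone recalibration

ps_hat = iso.predict(gbm.predict_proba(X[i_use])[:, 1])
ipw = D[i_use] / ps_hat                          # IPW terms, clipped scores
print("mean IPW weight (should be ~1):", round(ipw.mean(), 2))
```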
We extend the celebrated Glivenko-Cantelli theorem, sometimes called the fundamental theorem of statistics, from its standard setting of total variation distance to all $f$-divergences. A key obstacle in this endeavor is to define $f$-divergence on a subcollection of a $\sigma$-algebra that forms a $\pi$-system but not a $\sigma$-subalgebra. This is a side contribution of our work. We will show that this notion of $f$-divergence on the $\pi$-system of rays preserves nearly all known properties of standard $f$-divergence, yields a novel integral representation of the Kolmogorov-Smirnov distance, and has a Glivenko-Cantelli theorem.
Quantum computers offer a promising route to tackling problems that are classically intractable, such as prime factorization, solving large-scale linear algebra and simulating complex quantum systems, but require fault-tolerant quantum hardware. On the other hand, variational quantum algorithms (VQAs) have the potential to provide a near-term route to quantum utility or advantage, and are usually constructed using parametrized quantum circuits (PQCs) in combination with a classical optimizer for training. Although VQAs have been proposed for a multitude of tasks such as ground-state estimation, combinatorial optimization and unitary compilation, there remain major challenges in their trainability and resource costs on quantum hardware. Here we address these challenges by adopting Hardware Efficient and dynamical LIe algebra Supported Ansatz (HELIA), and propose two training schemes that combine an existing g-sim method (that uses the underlying group structure of the operators) and the Parameter-Shift Rule (PSR). Our improvement comes from distributing the resources required for gradient estimation and training to both classical and quantum hardware. We numerically test our proposal for ground-state estimation using the Variational Quantum Eigensolver (VQE) and classification of quantum phases using quantum neural networks. Our methods show better accuracy and success of trials, and also need fewer calls to the quantum hardware on average than using only PSR (up to 60% reduction), which runs exclusively on quantum hardware. We also numerically demonstrate the capability of HELIA in mitigating barren plateaus, paving the way for training large-scale quantum models.
Effective visualisation of multidimensional data is crucial for generating insights. Glyph-based visualisations, which encode data dimensions onto multiple visual channels such as colour, shape, and size, provide an effective means of representing complex datasets. Pie-chart glyphs (pie-glyphs) are one such approach, where multiple data attributes are mapped to slices within a pie chart. This paper introduces the PieGlyph R package, which enables users to overlay any 2D plot with axis-invariant pie-glyphs, offering a compact and intuitive representation of multidimensional data. Unlike existing R packages such as scatterpie or ggforce, PieGlyph generates pie-glyphs independently of the plot axes by employing a nested coordinate system, ensuring they remain circular regardless of changes to the underlying coordinate system. This enhances interpretability, particularly when visualising spatial data, as users can select the most appropriate map projection without distorting the glyphs' shape. Pie-glyphs are also particularly well-suited for visualising compositional data, where there is a natural sum-to-one constraint on the data attributes. PieGlyph is developed under the Grammar of Graphics paradigm using the ggplot2 framework and supports the generation of interactive pie-glyphs through the ggiraph package. Designed to integrate seamlessly with all features and extensions offered by ggplot2 and ggiraph, PieGlyph provides users with full flexibility in customising every aspect of the visualisation. This paper outlines the conceptual framework of PieGlyph, compares it with existing alternatives, and demonstrates its applications through example visualisations.
A key trait of stochastic optimizers is that multiple runs of the same optimizer in attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics, and develop guidelines for experiment design to measure the optimizer's performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence interval of the metrics and how they are related to the number of runs of an experiment. We then derive a lower bound on the number of repeats in order to guarantee achieving a given accuracy in the metrics. Using this bound, we propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis and how it allows us to conduct reliable benchmarking as well as hyperparameter tuning and prevent us from drawing premature conclusions regarding the performance of stochastic optimizers.
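A sketch of an adaptive rule of this kind (the paper's exact rule and theoretical bound may differ): keep adding runs until the confidence interval for the mean metric is narrow enough.

```python
import numpy as np
from scipy import stats

def adaptive_repeats(run_once, eps, conf=0.95, n0=10, n_max=10_000):
    """Repeat a stochastic optimizer until the half-width of the
    t-based confidence interval for the mean performance metric falls
    below eps, so the metric is estimated to the requested accuracy."""
    vals = [run_once() for _ in range(n0)]
    while len(vals) < n_max:
        n = len(vals)
        half = (stats.t.ppf((1 + conf) / 2, n - 1)
                * np.std(vals, ddof=1) / np.sqrt(n))
        if half <= eps:
            break
        vals.append(run_once())
    return np.mean(vals), half, len(vals)

# Stand-in for one run of a stochastic optimizer returning a metric.
rng = np.random.default_rng(9)
run_once = lambda: rng.normal(loc=10.0, scale=2.0)
mean, half, n = adaptive_repeats(run_once, eps=0.25)
print(f"mean={mean:.2f} +/- {half:.2f} after {n} runs")
```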
Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high detection accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose the DEcision Feedback INcontExt Detection (DEFINED) solution as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts as pseudo-labels to improve the detection for subsequent symbols. Furthermore, we propose another detection method that combines ICL with Semi-Supervised Learning (SSL) to extract information from both labeled and unlabeled data during inference, thus avoiding the error propagation of the decision feedback process in the original DEFINED. Extensive experiments across a broad range of wireless communication settings demonstrate that a small Transformer trained with DEFINED or IC-SSL achieves significant performance improvements over conventional methods, in some cases needing only a single pilot pair to match the performance that the latter attains with more than 4 pilot pairs.
We identify various classes of neural networks that are able to approximate continuous functions locally uniformly subject to fixed global linear growth constraints. For such neural networks, the associated neural stochastic differential equations can approximate general stochastic differential equations of It\^o diffusion type arbitrarily well. Moreover, quantitative error estimates are derived for stochastic differential equations with sufficiently regular coefficients.
Nearly all identifiability results in unsupervised representation learning inspired by, e.g., independent component analysis, factor analysis, and causal representation learning, rely on assumptions of additive independent noise or noiseless regimes. In contrast, we study the more general case where noise can take arbitrary forms, depend on latent variables, and be non-invertibly entangled within a nonlinear function. We propose a general framework for identifying latent variables in nonparametric noisy settings. We first show that, under suitable conditions, the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. Furthermore, under structural or distributional variability conditions, we prove that latent variables of general nonlinear models are identifiable up to trivial indeterminacies. Based on the proposed theoretical framework, we have also developed corresponding estimation methods and validated them in various synthetic and real-world settings. Interestingly, our estimate of true GDP growth from alternative measurements provides more insight into the economies than official reports. We expect our framework to provide new insight into how both researchers and practitioners deal with latent variables in real-world scenarios.
Mitigating shortcuts, where models exploit spurious correlations in training data, remains a significant challenge for improving generalization. Regularization methods have been proposed to address this issue by enhancing model generalizability. However, we demonstrate that these methods can sometimes overregularize, inadvertently suppressing causal features along with spurious ones. In this work, we analyze the theoretical mechanisms by which regularization mitigates shortcuts and explore the limits of its effectiveness. Additionally, we identify the conditions under which regularization can successfully eliminate shortcuts without compromising causal features. Through experiments on synthetic and real-world datasets, our comprehensive analysis provides valuable insights into the strengths and limitations of regularization techniques for addressing shortcuts, offering guidance for developing more robust models.
Quantum kernels quantify similarity between data points by measuring the inner product between quantum states, computed through quantum circuit measurements. By embedding data into quantum systems, quantum kernel feature maps, which may be classically intractable to compute, could efficiently exploit high-dimensional Hilbert spaces to capture complex patterns. However, designing effective quantum feature maps remains a major challenge. Many quantum kernels, such as the fidelity kernel, suffer from exponential concentration, leading to near-identity kernel matrices that fail to capture meaningful data correlations and lead to overfitting and poor generalization. In this paper, we propose a novel strategy for constructing quantum kernels that achieve good generalization performance, drawing inspiration from benign overfitting in classical machine learning. Our approach introduces the concept of local-global quantum kernels, which combine two complementary components: a local quantum kernel based on measurements of small subsystems and a global quantum kernel derived from full-system measurements. Through numerical experiments, we demonstrate that local-global quantum kernels exhibit benign overfitting, supporting the effectiveness of our approach in enhancing quantum kernel methods.
Federated learning (FL) allows collaborative machine learning (ML) model training on decentralized clients' data while ensuring data privacy. The decentralized nature of FL means it must deal with non-independent and identically distributed (non-IID) data. This open problem has notable consequences, such as decreased model performance and longer convergence times. Despite its importance, experimental studies systematically addressing all types of data heterogeneity (a.k.a. non-IIDness) remain scarce. We aim to fill this gap by assessing and quantifying the non-IID effect through a thorough empirical analysis. We use the Hellinger Distance (HD) to measure differences in distribution among clients. Our study benchmarks four state-of-the-art strategies for handling non-IID data, including label, feature, quantity, and spatiotemporal skewness, under realistic and controlled conditions. This is the first comprehensive analysis of the spatiotemporal skew effect in FL. Our findings highlight the significant impact of label and spatiotemporal skew non-IID types on FL model performance, with notable performance drops occurring at specific HD thresholds. Additionally, FL performance is most heavily affected when the non-IIDness is extreme. Thus, we provide recommendations for FL research to tackle data heterogeneity effectively. Our work represents the most extensive examination of non-IIDness in FL, offering a robust foundation for future research.
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
The increasing integration of renewable energy sources has led to greater volatility and unpredictability in electricity generation, posing challenges to grid stability. Ancillary service markets, such as the German control reserve market, allow industrial consumers and producers to offer flexibility in their power consumption or generation, contributing to grid stability while earning additional income. However, many participants use simple bidding strategies that may not maximize their revenues. This paper presents a methodology for forecasting bidding prices in pay-as-bid ancillary service markets, focusing on the German control reserve market. We evaluate various machine learning models, including Support Vector Regression, Decision Trees, and k-Nearest Neighbors, and compare their performance against benchmark models. To address the asymmetry in the revenue function of pay-as-bid markets, we introduce an offset adjustment technique that enhances the practical applicability of the forecasting models. Our analysis demonstrates that the proposed approach improves potential revenues by 27.43% to 37.31% compared to baseline models. When analyzing the relationship between the model forecasting errors and the revenue, a negative correlation is measured for three markets; according to the results, a reduction of 1 EUR/MW in the model's price forecasting error (MAE) statistically leads to a yearly revenue increase between 483 EUR/MW and 3,631 EUR/MW. The proposed methodology enables industrial participants to optimize their bidding strategies, leading to increased earnings and contributing to the efficiency and stability of the electrical grid.
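A stylised sketch of why an offset helps in pay-as-bid markets and how it can be tuned (the prices, noise model, and acceptance rule below are simplified assumptions, not the paper's market model):

```python
import numpy as np

def revenue(bids, clearing):
    """Pay-as-bid: a bid is accepted (and paid its own price) only if
    it does not exceed the marginal accepted price -- overbidding by
    1 EUR loses everything, underbidding loses only the margin."""
    return np.where(bids <= clearing, bids, 0.0).sum()

rng = np.random.default_rng(10)
clearing = rng.uniform(50, 150, size=500)          # marginal accepted prices
forecast = clearing + rng.normal(0, 10, size=500)  # unbiased point forecast

# Offset adjustment: shift the forecast downward and pick the offset
# that maximises revenue on a validation window, trading a lower price
# per MW against a higher acceptance probability.
offsets = np.linspace(0, 40, 81)
revenues = [revenue(forecast - o, clearing) for o in offsets]
best = offsets[int(np.argmax(revenues))]
print(f"best offset: {best:.1f} EUR/MW, "
      f"revenue gain: {max(revenues) / revenue(forecast, clearing):.2f}x")
```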
We investigate differentially private estimators for individual parameters within larger parametric models. While generic private estimators exist, the estimators we provide rest on new local notions of estimand stability, and these notions allow procedures that provide private certificates of their own stability. By leveraging these private certificates, we provide computationally and statistically efficient mechanisms that release private statistics that are, at least asymptotically in the sample size, essentially unimprovable: they achieve instance optimal bounds. Additionally, we investigate the practicality of the algorithms both in simulated data and in real-world data from the American Community Survey and US Census, highlighting scenarios in which the new procedures are successful and identifying areas for future work.
Effective management of recreational fisheries requires accurate forecasting of future harvests and real-time monitoring of ongoing harvests. Traditional methods that rely on historical catch data to predict short-term harvests can be unreliable, particularly if changes in management regulations alter angler behavior. In contrast, statistical modeling approaches can provide faster, more flexible, and potentially more accurate predictions, enhancing management outcomes. In this study, we developed and tested models to improve predictions of Gulf of Mexico gag harvests for both pre-season planning and in-season monitoring. Our best-fitting model outperformed traditional methods (i.e., estimates derived from historical average harvest) for both cumulative pre-season projections and in-season monitoring. Notably, our modeling framework appeared to be more accurate in more recent, shorter seasons due to its ability to account for effort compression. A key advantage of our framework is its ability to explicitly quantify the probability of exceeding harvest quotas for any given season duration. This feature enables managers to evaluate trade-offs between season duration and conservation goals. This is especially critical for vulnerable, highly targeted stocks. Our findings also underscore the value of statistical models to complement and advance traditional fisheries management approaches.
With the growing interest in quantum machine learning, the perceptron -- a fundamental building block in traditional machine learning -- has emerged as a valuable model for exploring quantum advantages. Two quantum perceptron algorithms based on Grover's search were developed in arXiv:1602.04799 to accelerate training and improve statistical efficiency in perceptron learning. This paper points out and corrects a mistake in the proof of Theorem 2 in arXiv:1602.04799. Specifically, we show that the probability of sampling from a normal distribution for a $D$-dimensional hyperplane that perfectly classifies the data scales as $\Omega(\gamma^{D})$ instead of $\Theta({\gamma})$, where $\gamma$ is the margin. We then revisit two well-established linear programming algorithms -- the ellipsoid method and the cutting plane random walk algorithm -- in the context of perceptron learning, and show how quantum search algorithms can be leveraged to enhance the overall complexity. Specifically, both algorithms gain a sub-linear speed-up $O(\sqrt{N})$ in the number of data points $N$ as a result of Grover's algorithm, and an additional $O(D^{1.5})$ speed-up is possible for the cutting plane random walk algorithm employing quantum walk search.
Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts, including models that do in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
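A minimal sketch of the adaptation step under the stated observation (the feature maps and data are simulated stand-ins, not the paper's learned features): with general reward features fixed, personalising the reward model reduces to fitting a low-dimensional Bradley-Terry logistic regression on feature differences, so few comparisons suffice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
d, n_pairs = 8, 60

# General reward features phi(response), learned beforehand; here random
# stand-ins. An individual's reward is a linear combination w . phi.
phi_a = rng.normal(size=(n_pairs, d))   # features of response A
phi_b = rng.normal(size=(n_pairs, d))   # features of response B
w_person = rng.normal(size=d)           # this person's latent weights

# Bradley-Terry preferences: P(A preferred) = sigmoid(w . (phi_A - phi_B)).
diff = phi_a - phi_b
prefers_a = (rng.uniform(size=n_pairs)
             < 1 / (1 + np.exp(-diff @ w_person))).astype(int)

# Adapting the reward model to the person = logistic regression on the
# feature differences; only d weights need to be estimated.
clf = LogisticRegression(fit_intercept=False, C=10.0).fit(diff, prefers_a)
w_hat = clf.coef_.ravel()
cos = w_hat @ w_person / (np.linalg.norm(w_hat) * np.linalg.norm(w_person))
print("cosine similarity to true weights:", round(cos, 2))
```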