This paper investigates an approach to both speed up business decision-making and lower the cost of learning through experimentation by factorizing business policies and employing fractional factorial experimental designs for their evaluation. We illustrate how this method integrates with advances in the estimation of heterogeneous treatment effects, elaborating on its advantages and foundational assumptions. We empirically demonstrate the implementation and benefits of our approach and assess its validity in evaluating consumer promotion policies at DoorDash, which is one of the largest delivery platforms in the US. Our approach discovers a policy with 5% incremental profit at 67% lower implementation cost.
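For context, a fractional factorial design of the kind referenced above can be generated by aliasing extra factors with interactions of a small full factorial base. The sketch below is illustrative only (the factor names and generators are assumptions, not the paper's actual design at DoorDash).

```python
# Illustrative sketch (not the paper's actual design): a 2^(5-2) fractional
# factorial for five binary policy "factors" (e.g., promo depth, targeting rule).
# Factors D and E are aliased with interactions of the base factors A, B, C.
import itertools

import numpy as np

base = np.array(list(itertools.product([-1, 1], repeat=3)))  # full 2^3 design in A, B, C
A, B, C = base.T
D = A * B          # generator D = AB (assumed for illustration)
E = A * C          # generator E = AC (assumed for illustration)
design = np.column_stack([A, B, C, D, E])

print(design.shape)  # (8, 5): 8 policy cells instead of 2^5 = 32
print(design)
```

The cost saving comes precisely from running 8 policy cells rather than all 32 factor combinations, at the price of aliasing some higher-order interactions.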
Effectively modeling phenomena present in highly nonlinear dynamical systems whilst also accurately quantifying uncertainty is a challenging task, which often requires problem-specific techniques. We outline the deep latent force model (DLFM), a domain-agnostic approach to tackling this problem, which consists of a deep Gaussian process architecture where the kernel at each layer is derived from an ordinary differential equation using the framework of process convolutions. Two distinct formulations of the DLFM are presented which utilise weight-space and variational inducing points-based Gaussian process approximations, both of which are amenable to doubly stochastic variational inference. We provide evidence that our model is capable of capturing highly nonlinear behaviour in real-world multivariate time series data. In addition, we find that our approach achieves comparable performance to a number of other probabilistic models on benchmark regression tasks. We also empirically assess the negative impact of the inducing points framework on the extrapolation capabilities of LFM-based models.
The use of weather index insurance is subject to spatial basis risk, which arises from the fact that the location of the user's risk exposure is not the same as the location of any of the weather stations where an index can be measured. To gauge the effectiveness of weather index insurance, spatial interpolation techniques such as kriging can be adopted to estimate the relevant weather index from observations taken at nearby locations. In this paper, we study the performance of various statistical methods, ranging from simple nearest neighbor to more advanced trans-Gaussian kriging, in spatial interpolation of daily precipitation with data obtained from the US National Oceanic and Atmospheric Administration. We also investigate how spatial interpolation should be implemented in practice when the insurance is linked to popular weather indexes including annual consecutive dry days ($CDD$) and maximum five-day precipitation in one month ($MFP$). It is found that although spatially interpolating the raw weather variables on a daily basis is more sophisticated and computationally demanding, it does not necessarily yield superior results compared to direct interpolation of $CDD$/$MFP$ on a yearly/monthly basis. This intriguing outcome can be explained by the statistical properties of the weather indexes and the underlying weather variables.
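As a small worked illustration of the index side of this comparison, the consecutive-dry-days index can be computed directly from a daily precipitation series. The sketch below is a generic construction (the 1 mm dry-day threshold is a common convention, assumed here rather than taken from the paper).

```python
# Illustrative sketch: annual consecutive dry days (CDD) from daily precipitation.
# A "dry day" threshold of 1 mm is a common convention; treat it as an assumption here.
import numpy as np

def cdd(daily_precip_mm, dry_threshold=1.0):
    """Length of the longest run of days with precipitation below the threshold."""
    dry = np.asarray(daily_precip_mm) < dry_threshold
    longest = current = 0
    for is_dry in dry:
        current = current + 1 if is_dry else 0
        longest = max(longest, current)
    return longest

rng = np.random.default_rng(0)
one_year = rng.gamma(shape=0.3, scale=8.0, size=365)  # synthetic daily rainfall (mm)
print(cdd(one_year))
```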
In mortality modelling, cohort effects are often taken into consideration as they add insights about variations in mortality across different generations. Statistically speaking, models such as the Renshaw-Haberman model may provide a better fit to historical data compared to their counterparts that incorporate no cohort effects. However, when such models are estimated using an iterative maximum likelihood method in which parameters are updated one at a time, convergence is typically slow and may not even be reached within a reasonable maximum number of iterations. Among other issues, slow convergence hinders the study of parameter uncertainty through bootstrapping methods. In this paper, we propose an intuitive estimation method that minimizes the sum of squared errors between actual and fitted log central death rates. The complications arising from the incorporation of cohort effects are overcome by formulating part of the optimization as a principal component analysis with missing values. We also show how the proposed method can be generalized to variants of the Renshaw-Haberman model with further computational improvement, either with a simplified model structure or an additional constraint. Using mortality data from the Human Mortality Database (HMD), we demonstrate that our proposed method produces satisfactory estimation results and is significantly more efficient than the traditional likelihood-based approach.
Low-frequency time series (e.g., quarterly data) are often treated as benchmarks for interpolating to higher frequencies, since they generally exhibit greater precision and accuracy than their high-frequency counterparts (e.g., monthly data) reported by governmental bodies. An array of regression-based methods have been proposed in the literature which aim to estimate a target high-frequency series using higher-frequency indicators. However, in the era of big data and with the prevalence of large volumes of administrative data sources, there is a need to extend traditional methods to work in high-dimensional settings, i.e., where the number of indicators is similar to or larger than the number of low-frequency samples. The package DisaggregateTS includes both classical regression-based disaggregation methods and recent extensions to high-dimensional settings, cf. Mosley et al. (2022). This paper provides guidance on how to implement these methods via the package in R, and demonstrates their use in an application to disaggregating CO$_2$ emissions.
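To convey the flavor of the classical regression-based disaggregation methods the package implements, the following is a stylized Chow-Lin-type sketch written in Python for illustration (the package itself is in R). The AR(1) parameter is fixed here rather than estimated, and the aggregation matrix, indicators, and data are all synthetic assumptions.

```python
# Stylized Chow-Lin-type temporal disaggregation (illustration only; the
# DisaggregateTS package implements such methods in R, with the AR(1)
# parameter estimated rather than fixed as it is here).
import numpy as np

def chow_lin(y_low, X_high, agg, rho=0.75):
    """y_low: (m,) low-frequency series; X_high: (n, k) high-frequency indicators;
    agg: (m, n) aggregation matrix (each row sums one quarter's months)."""
    n = X_high.shape[0]
    V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1) covariance
    W = np.linalg.inv(agg @ V @ agg.T)
    CX = agg @ X_high
    beta = np.linalg.solve(CX.T @ W @ CX, CX.T @ W @ y_low)           # GLS regression
    resid_low = y_low - CX @ beta
    return X_high @ beta + V @ agg.T @ W @ resid_low                   # distribute residuals

# Toy example: 4 quarters disaggregated to 12 months using two indicators.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
agg = np.kron(np.eye(4), np.ones(3))
y_q = agg @ (X @ np.array([1.0, -0.5]) + rng.normal(scale=0.1, size=12))
print(chow_lin(y_q, X, agg).round(2))
```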
In this paper we review recent advances in statistical methods for the evaluation of the heterogeneity of treatment effects (HTE), including subgroup identification and estimation of individualized treatment regimens, from randomized clinical trials and observational studies. We identify several types of approaches using the features introduced in Lipkovich, Dmitrienko and D'Agostino (2017) that distinguish the recommended principled methods from basic methods for HTE evaluation that typically rely on rules of thumb and general guidelines (these basic methods are often referred to as common practices). We discuss the advantages and disadvantages of various principled methods as well as common measures for evaluating their performance. We use simulated data and a case study based on a historical clinical trial to illustrate several new approaches to HTE evaluation.
ANOVA decomposition of a function with random input variables provides ANOVA functionals (AFs), which contain information about the contributions of the input variables to the output variable(s). By embedding the distributions of AFs into an appropriate reproducing kernel Hilbert space, we propose an efficient statistical test of independence between the input variables and output variable(s). The resulting test statistic leads to new dependence measures of association between inputs and outputs that allow for i) dealing with any distribution of AFs, including the Cauchy distribution, and ii) accounting for the necessary or desirable moments of AFs and the interactions among the input variables. In uncertainty quantification for mathematical models, a number of existing measures are special cases of this framework. We then provide unified and general global sensitivity indices and their consistent estimators, including asymptotic distributions. For Gaussian-distributed AFs, we obtain Sobol' indices and dependent generalized sensitivity indices using quadratic kernels.
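For reference, the classical first-order Sobol' index that such generalized indices recover as a special case is (standard definition; the paper's own notation may differ):
\[
S_i \;=\; \frac{\operatorname{Var}\!\left(\mathbb{E}[Y \mid X_i]\right)}{\operatorname{Var}(Y)}, \qquad i = 1, \dots, d,
\]
i.e., the fraction of the output variance explained by input $X_i$ alone.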
In this study, we investigate the quantification of the statistical reliability of detected change points (CPs) in time series using a Recurrent Neural Network (RNN). Thanks to its flexibility, an RNN holds the potential to effectively identify CPs in time series characterized by complex dynamics. However, there is an increased risk of erroneously detecting random noise fluctuations as CPs. The primary goal of this study is to rigorously control the risk of false detections by providing theoretically valid p-values for the CPs detected by the RNN. To achieve this, we introduce a novel method based on the framework of Selective Inference (SI). SI enables valid inferences by conditioning on the event of hypothesis selection, thus mitigating selection bias. In this study, we apply the SI framework to RNN-based CP detection, where characterizing the complex process by which the RNN selects CPs is our main technical challenge. We demonstrate the validity and effectiveness of the proposed method through experiments on artificial and real data.
Identifying significant sites in sequence data and analogous data is of fundamental importance in many biological fields. Fisher's exact test is a popular technique; however, this approach is not appropriate for sparse count data because it leads to overly conservative decisions. Since count data in HIV studies are typically very sparse, it is crucial to incorporate additional information into statistical models to improve testing power. To incorporate biological information into the false discovery rate controlling procedure, we propose two models: one based on an empirical Bayes model under independence of amino acids, and the other using pairwise associations of amino acids based on a Markov random field built on the BLOSUM62 substitution matrix. We apply the proposed methods to HIV data and identify significant sites by incorporating the BLOSUM62 matrix, whereas the traditional method based on Fisher's test does not discover any site. These newly developed methods have the potential to address many biological problems in the studies of vaccine and drug trials and phenotype studies.
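For context, the baseline that the paper argues is overly conservative on sparse counts looks roughly like the sketch below: per-site Fisher's exact tests followed by Benjamini-Hochberg FDR control. The contingency tables here are synthetic, and the grouping of rows/columns is an illustrative assumption.

```python
# Baseline sketch only (the approach argued to be overly conservative for sparse
# counts): per-site Fisher's exact tests with Benjamini-Hochberg FDR control.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Synthetic 2x2 tables per site: rows = group (e.g., responder / non-responder),
# columns = amino-acid match / mismatch.  Counts are deliberately sparse.
tables = [rng.poisson(lam=[[1, 8], [2, 9]]) for _ in range(200)]

pvals = np.array([fisher_exact(t)[1] for t in tables])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("sites flagged:", int(reject.sum()))
```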
This study presents new closed-form estimators for the Dirichlet and the Multivariate Gamma distribution families, whose maximum likelihood estimator cannot be explicitly derived. The methodology builds upon the score-adjusted estimators for the Beta and Gamma distributions, extending their applicability to the Dirichlet and Multivariate Gamma distributions. Expressions for the asymptotic variance-covariance matrices are provided, demonstrating the superior performance of score-adjusted estimators over the traditional moment ones. Leveraging well-established connections between Dirichlet and Multivariate Gamma distributions, a novel class of estimators for the latter is introduced, referred to as "Dirichlet-based moment-type estimators". The general asymptotic variance-covariance matrix form for this estimator class is derived. To facilitate the application of these innovative estimators, an R package called estimators is developed and made publicly available.
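As a point of comparison, the classical moment estimator for the Dirichlet distribution (the traditional benchmark mentioned above, not the paper's score-adjusted estimator) can be written in a few lines; the formulas below are the standard textbook moment-matching expressions.

```python
# Classical moment estimator for the Dirichlet distribution (the benchmark the
# score-adjusted estimators are compared against; not the paper's new estimator).
import numpy as np

def dirichlet_moment_estimator(x):
    """x: (n, k) array of observations on the simplex."""
    m = x.mean(axis=0)                 # sample means E[X_i]
    m2_1 = np.mean(x[:, 0] ** 2)       # sample second moment of the first coordinate
    alpha0 = (m[0] - m2_1) / (m2_1 - m[0] ** 2)   # estimated concentration alpha_0
    return alpha0 * m                   # alpha_i = alpha_0 * E[X_i]

rng = np.random.default_rng(0)
sample = rng.dirichlet([2.0, 5.0, 3.0], size=5000)
print(dirichlet_moment_estimator(sample).round(2))   # roughly [2, 5, 3]
```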
In biomedical studies, it is often desirable to characterize the interactive mode of multiple disease outcomes beyond their marginal risk. The Ising model is one of the most popular choices for this purpose. Nevertheless, the learning efficiency of Ising models can be impeded by the scarcity of accurate disease labels, which is a prominent problem in contemporary studies driven by electronic health records (EHR). Semi-supervised learning (SSL) leverages the large unlabeled sample with auxiliary EHR features to assist the learning with labeled data only and is a potential solution to this issue. In this paper, we develop a novel SSL method for efficient inference of the Ising model. Our method first models the outcomes against the auxiliary features, then uses the fitted model to project the score function of the supervised estimator onto the EHR features, and incorporates the unlabeled sample to augment the supervised estimator for variance reduction without introducing bias. For the key step of conditional modeling, we propose strategies that can effectively leverage the auxiliary EHR information while maintaining moderate model complexity. In addition, we introduce approaches, including intrinsic efficient updates and ensembling, to overcome the potential misspecification of the conditional model that may cause efficiency loss. Our method is justified by asymptotic theory and shown to outperform existing SSL methods through simulation studies. We also illustrate its utility in a real example about several key phenotypes related to frequent ICU admission on the MIMIC-III data set.
The problem of quickest change detection in a sequence of independent observations is considered. The pre-change distribution is assumed to be known, while the post-change distribution is unknown. Two tests based on post-change density estimation are developed for this problem: the window-limited non-parametric generalized likelihood ratio (NGLR) CuSum test and the non-parametric window-limited adaptive (NWLA) CuSum test. Neither test assumes any knowledge of the post-change distribution, except that the post-change density satisfies certain smoothness conditions that allow for efficient non-parametric estimation; neither requires any pre-collected post-change training samples. Under certain convergence conditions on the density estimator, it is shown that both tests are first-order asymptotically optimal as the false alarm rate goes to zero. The analysis is validated through numerical results, where both tests are compared with baseline tests that have distributional knowledge.
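To give a feel for the structure of such tests, the rough sketch below runs a CuSum-type recursion in which the unknown post-change density is replaced by a kernel density estimate over a recent window. The window length, bandwidth choice, and threshold are illustrative assumptions and not the paper's exact constructions.

```python
# Rough structural sketch of an adaptive, window-limited CuSum where the unknown
# post-change density is replaced by a KDE over recent observations.
# Window length, bandwidth, and threshold are illustrative, not the paper's.
import numpy as np
from scipy.stats import gaussian_kde, norm

def adaptive_cusum(x, f0_logpdf, window=50, threshold=10.0):
    stat, stats = 0.0, []
    for n, xn in enumerate(x):
        if n >= window:
            g_hat = gaussian_kde(x[n - window:n])          # estimated post-change density
            llr = np.log(g_hat(xn)[0] + 1e-12) - f0_logpdf(xn)
            stat = max(0.0, stat + llr)                     # CuSum recursion
        stats.append(stat)
        if stat > threshold:
            return n, np.array(stats)                       # declare a change
    return None, np.array(stats)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(1.0, 1, 200)])  # change at t=300
alarm, _ = adaptive_cusum(data, f0_logpdf=lambda v: norm.logpdf(v, 0, 1))
print("alarm at:", alarm)
```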
Political scientists and sociologists study how individuals switch back and forth between public and private organizations, for example between regulator and lobbyist positions, a phenomenon called "revolving doors". However, they face an important issue of data missingness, as not all data relevant to this question are freely available. For example, the nomination of an individual to a given public-sector position of power might be publicly disclosed, but not their subsequent positions in the private sector. In this article, we adopt a Bayesian data augmentation strategy for discrete time series and propose measures of public-private mobility across the French state at large, mobilizing administrative and digital data. We relax the homogeneity hypotheses of traditional hidden Markov models and implement a version of a Markov switching model, which allows parameters to vary across individuals and time and accommodates auto-correlated behaviors. We describe how the revolving doors phenomenon varies across the French state and how it has evolved between 1990 and 2022.
The effective utilization of structural information in data while ensuring statistical validity poses a significant challenge in false discovery rate (FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning methods without relying on strong assumptions or highly idealized models. However, existing conformal methods have limitations in handling structured multiple testing. This is because their validity requires the deployment of symmetric rules, which assume the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome these limitations, we introduce the pseudo local index of significance (PLIS) procedure, which is capable of accommodating asymmetric rules and requires only pairwise exchangeability between the null conformity scores. We demonstrate that PLIS offers finite-sample guarantees in FDR control and the ability to assign higher weights to relevant data points. Numerical results confirm the effectiveness and robustness of PLIS and show improvements in power compared to existing model-free methods in various scenarios.
In this paper we study two important representations for extreme value distributions and their max-domains of attraction (MDA), namely the von Mises representation (vMR) and the variation representation (VR), which are convenient ways to obtain limit results. Both VR and vMR are defined via so-called auxiliary functions $\psi$. Up to now, however, the set of valid auxiliary functions for vMR has neither been characterized completely nor separated from those for VR. We contribute to the current literature by introducing ``universal'' auxiliary functions which are valid for both the VR and vMR representations for the entire MDA distribution families. We then identify exactly the sets of valid auxiliary functions for both VR and vMR. Moreover, we propose a method for finding appropriate auxiliary functions with analytically simple structure and provide them for several important distributions.
In this paper we introduce a novel statistical framework based on the first two quantile conditional moments that facilitates effective goodness-of-fit testing for one-sided L\'evy distributions. The scale-ratio framework introduced in this paper extends our previous results, in which we showed how to extract unique distribution features using the conditional variance ratio for the generic class of {\alpha}-stable distributions. We show that the conditional moment-based goodness-of-fit statistics are a good alternative to other methods introduced in the literature that are tailored to one-sided L\'evy distributions. The usefulness of our approach is verified using an empirical study of test power. For completeness, we also derive the asymptotic distributions of the test statistics and show how to apply our framework to real data.
Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of the clustering error, together with rigorous assessment through simulated and real datasets, suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
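A minimal sketch of the Median-of-Means idea referenced above is given below: split the points into blocks, average each block, and take the coordinate-wise median of the block means. The number of blocks and the synthetic data are illustrative choices.

```python
# Median-of-Means (MoM) estimate of a cluster centroid: split points into blocks,
# average each block, take the coordinate-wise median of the block means.
# The number of blocks is a tuning choice (more blocks -> more robustness).
import numpy as np

def median_of_means(points, n_blocks=10, rng=None):
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(points))
    blocks = np.array_split(points[idx], n_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
outliers = rng.normal(loc=[30.0, 30.0], scale=1.0, size=(10, 2))
data = np.vstack([cluster, outliers])

print("plain mean:     ", data.mean(axis=0).round(2))      # dragged towards the outliers
print("median-of-means:", median_of_means(data).round(2))  # stays near the true centre
```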
The last two centuries have seen a significant increase in life expectancy. Although past trends suggest that mortality will continue to decline in the future, uncertainty and instability about this development have greatly increased due to the ongoing COVID-19 pandemic. It is therefore of essential interest, particularly to annuity and life insurers, to predict the mortality of their members or policyholders with reliable accuracy. The goal of this study is to improve the state-of-the-art stochastic mortality models using machine learning techniques and to generalize them to a multi-population model. Detailed cross-country results for Finland, Germany, Italy, the Netherlands, and the United States show that the best forecasting performance is achieved by a generalized additive model that uses the framework of age-period-cohort (APC) analysis. Based on this finding, trend forecasts of mortality rates as a measure of longevity are produced for a range of COVID-19 scenarios, from mild to severe. By discussing and evaluating the plausibility of these scenarios, this study is useful for preparation, planning, and informed decision-making.
In the realm of medical research, the intricate interplay of epidemiological risk, genomic activity, adverse events, and clinical response necessitates a nuanced consideration of multiple variables. Clinical trials, designed to meticulously assess the efficacy and safety of interventions, routinely incorporate a diverse array of endpoints. While a primary endpoint is customary, supplemented by key secondary endpoints, statistical significance is typically evaluated independently for each. To address the inherent challenges in studying multiple endpoints, diverse strategies, including composite endpoints and global testing, have been proposed. This work stands apart by re-evaluating a clinical trial, deviating from the conventional single-endpoint approach to underscore the efficacy of a multiple-endpoint procedure. The trial was a double-blind study conducted to determine the efficacy and safety of zidovudine (AZT) versus didanosine (ddI), AZT plus ddI, and AZT plus zalcitabine (ddC) in preventing disease progression in adults infected with human immunodeficiency virus type 1 (HIV-1), featuring CD4 cell counts ranging from 200 to 500 per cubic millimeter. A total of 2467 HIV-1-infected patients (43 percent without prior antiretroviral treatment) were randomly assigned to one of four daily regimens: 600 mg of zidovudine; 600 mg of zidovudine plus 400 mg of didanosine; 600 mg of zidovudine plus 2.25 mg of zalcitabine; or 400 mg of didanosine. The primary endpoint comprised a >50 percent decline in CD4 cell count, development of acquired immunodeficiency syndrome (AIDS), or death. By jointly considering all endpoints, the multiple-endpoints approach yields results of greater significance than a single-endpoint approach.
This study is concerned with the estimation of fatigue damage for a 5-MW offshore wind turbine supported by a semi-submersible floating platform. The fore-aft bending moment at the turbine tower base and the fairlead tension in the windward mooring line are selected for evaluation. The selected wind turbine is sited in 200 meters of water. Metocean data provide information on joint statistics of the wind speed, wave height, and wave period along with their relative likelihoods for the installation site in the Mediterranean Sea, near the coast of Sicily. A frequency-domain (FD) model provides needed power spectra for the desired response processes. With the ultimate goal of efficient evaluation of fatigue limit states for such floating offshore wind turbine systems, a detailed computational framework is introduced and used to develop a surrogate model using Gaussian process regression. The surrogate model, at first, relies only on a small subset of representative sea states and, then, is supplemented by the evaluation of additional sea states that lead to efficient convergence and accurate prediction of fatigue damage. The proposed approach offers an efficient and accurate alternative to exhaustive evaluation of a larger number of sea states and, as such, avoids excessive response simulations.
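The surrogate step described above can be illustrated with a generic Gaussian-process regression over sea-state parameters. The sketch below uses scikit-learn with a Matérn kernel and a synthetic stand-in response; the actual inputs, responses, and enrichment criterion in the study may differ.

```python
# Generic sketch of a Gaussian-process surrogate over sea-state parameters
# (significant wave height Hs, peak period Tp, wind speed U).  The response is
# a synthetic stand-in for the fatigue-damage output of the frequency-domain model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)
X_train = rng.uniform([1.0, 4.0, 5.0], [8.0, 14.0, 25.0], size=(30, 3))   # (Hs, Tp, U)
y_train = np.log1p(X_train[:, 0] ** 2 * X_train[:, 2])   # placeholder "damage" response

kernel = ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0, 1.0], nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

X_new = rng.uniform([1.0, 4.0, 5.0], [8.0, 14.0, 25.0], size=(5, 3))
mean, std = gp.predict(X_new, return_std=True)
# Sea states with large predictive std are natural candidates for the next round
# of response simulations (the adaptive enrichment described above).
print(np.column_stack([mean, std]).round(3))
```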
Structural discovery amongst a set of variables is of interest in both static and dynamic settings. In the presence of lead-lag dependencies in the data, the dynamics of the system can be represented through a structural equation model (SEM) that simultaneously captures the contemporaneous and temporal relationships amongst the variables, with the former encoded through a directed acyclic graph (DAG) for model identification. In many real applications, a partial ordering amongst the nodes of the DAG is available, which makes it either beneficial or imperative to incorporate it as a constraint in the problem formulation. This paper develops an algorithm that can seamlessly incorporate a priori partial ordering information for solving a linear SEM (also known as Structural Vector Autoregression) under a high-dimensional setting. The proposed algorithm is provably convergent to a stationary point, and exhibits competitive performance on both synthetic and real data sets.
Based on the comprehensive national death registry of Mexico spanning from 1998 to 2022, a point and interval estimation method for the excess mortality in Mexico during the years 2020-2022 is proposed, based on illness-induced deaths only and using a polynomial regression model. The results estimate the excess mortality at around 788,000 people (39.3%), equivalent to a rate of 626 per 100,000 inhabitants. The male/female ratio is estimated to be 1.7. As a reference for comparison, for the whole period 2020-2022 Mexico's INEGI estimated an excess mortality of between 673,000 (with a quasi-Poisson model) and 808,000 (using endemic channels estimation).
We provide a simple and general solution to the fundamental open problem of inaccurate uncertainty quantification of Bayesian inference in misspecified or approximate models, and of generalized Bayesian posteriors more generally. While existing solutions are based on explicit Gaussian posterior approximations, or computationally onerous post-processing procedures, we demonstrate that correct uncertainty quantification can be achieved by substituting the usual posterior with an alternative posterior that conveys the same information. This solution applies to both likelihood-based and loss-based posteriors, and we formally demonstrate the reliable uncertainty quantification of this approach. The new approach is demonstrated through a range of examples, including generalized linear models, and doubly intractable models.
Adjustment of statistical significance levels for repeated analysis in group sequential trials has been understood for some time. Similarly, methods for adjustment accounting for testing multiple hypotheses are common. There is limited research on simultaneously adjusting for both multiple hypothesis testing and multiple analyses of one or more hypotheses. We address this gap by proposing adjusted sequential p-values that reject an elementary hypothesis when its adjusted sequential p-value is less than or equal to the family-wise Type I error rate (FWER) in a group sequential design. We also propose sequential p-values for intersection hypotheses as a tool to compute adjusted sequential p-values for elementary hypotheses. We demonstrate the application using weighted Bonferroni tests and weighted parametric tests, comparing adjusted sequential p-values to a desired FWER for inference on each elementary hypothesis tested.
Graphical and sparse (inverse) covariance models have found widespread use in modern sample-starved high dimensional applications. A part of their wide appeal stems from the significantly lower sample sizes required for the existence of estimators, especially in comparison with the classical full covariance model. For undirected Gaussian graphical models, the minimum sample size required for the existence of maximum likelihood estimators had been an open question for almost half a century, and has been recently settled. The very same question for pseudo-likelihood estimators has remained unsolved ever since their introduction in the '70s. Pseudo-likelihood estimators have recently received renewed attention as they impose fewer restrictive assumptions and have better computational tractability, improved statistical performance, and appropriateness in modern high dimensional applications, thus renewing interest in this longstanding problem. In this paper, we undertake a comprehensive study of this open problem within the context of the two classes of pseudo-likelihood methods proposed in the literature. We provide a precise answer to this question for both pseudo-likelihood approaches and relate the corresponding solutions to their Gaussian counterpart.
(Aim) Dragon Boat Racing, a popular aquatic folklore team sport, is traditionally held during the Dragon Boat Festival. Inspired by this event, we propose a novel human-based meta-heuristic algorithm called dragon boat optimization (DBO) in this paper. (Method) It models the unique behaviors of each crew member on the dragon boat during the race by introducing social psychology mechanisms (social loafing, social incentive). Throughout this process, the focus is on the interaction and collaboration among the crew members, as well as their decision-making in different situations. During each iteration, DBO implements different state updating strategies. By modelling the crew's behavior and adjusting the state updating strategies, DBO is able to maintain high-performance efficiency. (Results) We have tested the DBO algorithm with 29 mathematical optimization problems and 2 structural design problems. (Conclusion) The experimental results demonstrate that DBO is competitive with state-of-the-art meta-heuristic algorithms as well as conventional methods.
In this paper, we first study the fundamental limit of clustering networks when a multi-layer network is present. Under the mixture multi-layer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate takes an exponential form and is characterized by the Renyi divergence between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method, including a tensor-based initialization algorithm involving both node and sample splitting and a refinement procedure using a likelihood-based Lloyd algorithm. Network clustering must be accompanied by node community detection. Our proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under the MMSBM. Numerical simulations and real data experiments both validate that our method outperforms existing methods. Oftentimes, the edges of networks carry count-type weights. We then extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixtures of discrete distributions, including Binomial, Poisson, and multi-layer Poisson networks. The minimax optimal clustering error rates in these discrete mixtures all take the same exponential form characterized by the Renyi divergences. These optimal clustering error rates in discrete mixtures can also be achieved by our proposed two-stage clustering algorithm.
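For reference, the Renyi divergence of order $\rho \neq 1$ between discrete edge probability distributions $P$ and $Q$ is (standard definition; the particular order and constants appearing in the minimax rate are established in the paper itself):
\[
D_{\rho}(P \,\|\, Q) \;=\; \frac{1}{\rho - 1} \log \sum_{x} p(x)^{\rho}\, q(x)^{1-\rho}.
\]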
This study proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks. The proposed approach iteratively estimates each element of the topological ordering, from backward, and its parent set using the inverse of a partial covariance matrix. The proposed method successfully recovers the underlying structure when Bayesian regularization for the inverse covariance matrix with unequal shrinkage is applied. Specifically, we show that sample sizes of $n = \Omega( d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and $4m$-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph. The theoretical findings are supported by extensive simulation studies and real data analysis. Furthermore, the proposed method is demonstrated to outperform state-of-the-art frequentist approaches, such as the BHLSM, LISTEN, and TD algorithms, on synthetic data.
Emissions of nitric oxide and nitrogen dioxide, collectively referred to as NOx, are a major environmental and health concern. To respond to the climate crisis, the South Korean government has strengthened NOx emission regulations. An accurate NOx prediction model can help companies meet their NOx emission quotas and achieve cost savings. This study focuses on developing a model that forecasts the amount of NOx emissions in Pohang, a heavy industrial city in South Korea with serious air pollution problems. In this study, long short-term memory (LSTM) modeling is applied to predict the amount of NOx emissions, with missing data imputed using stochastic regression. Two parameters necessary to run the LSTM model (i.e., the time window and the learning rate) are tested and selected for the Adam optimizer, one of the popular optimization methods for LSTM. I found that the model I applied achieved acceptable prediction performance, since its Mean Absolute Scaled Error (MASE), the most important evaluation criterion, is less than 1. This means that applying the model I developed to predict future NOx emissions will perform better than a naive prediction, i.e., a model that simply predicts them based on the last observed data point.
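The MASE criterion invoked above compares the model's mean absolute error to that of a one-step naive (last-value) forecast; the tiny sketch below illustrates the computation on synthetic numbers.

```python
# Mean Absolute Scaled Error (MASE): the MAE of the model scaled by the in-sample
# MAE of a naive last-value forecast.  MASE < 1 means the model beats the naive
# prediction on average.  Numbers below are synthetic, for illustration only.
import numpy as np

def mase(y_true, y_pred, y_train):
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_naive = np.mean(np.abs(np.diff(y_train)))  # one-step naive forecast errors
    return mae_model / mae_naive

rng = np.random.default_rng(0)
y_train = np.cumsum(rng.normal(0, 1.0, 200)) + 50       # synthetic NOx-like series
y_true = y_train[-1] + np.cumsum(rng.normal(0, 1.0, 20))
y_pred = y_true + rng.normal(0, 0.5, 20)                # a model with smaller errors
print(round(mase(y_true, y_pred, y_train), 3))          # < 1 here
```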
In our previously published work, we introduced a supervised deep learning method for event detection in multivariate time series data, employing regression instead of binary classification. This simplification avoids the need for point-wise labels throughout the entire dataset, relying solely on ground truth events defined as time points or intervals. In this paper, we establish mathematically that our method is universal, and capable of detecting any type of event with arbitrary precision under mild continuity assumptions on the time series. These events may encompass change points, frauds, anomalies, physical occurrences, and more. We substantiate our theoretical results using the universal approximation theorem for feed-forward neural networks (FFN). Additionally, we provide empirical validations that confirm our claims, demonstrating that our method, with a limited number of parameters, outperforms other deep learning approaches, particularly for rare events and imbalanced datasets from different domains.
With the growing digital transformation of the worldwide economy, cyber risk has become a major issue. An estimated 1% of the world's GDP (around $1,000 billion) is lost to cybercrime every year, and IT systems continue to become increasingly interconnected, making them vulnerable to accumulation phenomena that undermine the pooling mechanism of insurance. As highlighted in the literature, Hawkes processes appear to be suitable models to capture the contagion phenomena and clustering features of cyber events. This paper extends the standard Hawkes modeling of cyber risk frequency by adding external shocks, modelled by the publication of cyber vulnerabilities that are deemed to increase the likelihood of attacks in the short term. The aim of the proposed model is to provide a better quantification of contagion effects: while the standard Hawkes model allocates all clustering phenomena to self-excitation, our model allows us to capture the external common factors that may explain part of the systemic pattern. We propose a Hawkes model with two kernels, one for the endogenous factor (the contagion from other cyber events) and one for the exogenous component (cyber vulnerability publications). We use parametric exponential specifications for both the endogenous and exogenous intensity kernels, and we compare different methods to tackle the inference problem based on public datasets containing features of cyber attacks found in the Hackmageddon database and cyber vulnerabilities from the Known Exploited Vulnerabilities database and the National Vulnerability Database. By refining the external excitation database selection, the degree of endogeneity of the model is nearly halved. We illustrate our model with simulations and discuss the impact of taking into account the external factor driven by vulnerabilities. Once an attack has occurred, response measures are implemented to limit its effects; these measures include patching vulnerabilities and reducing the attack's contagion. We use an augmented version of the model that adds a second phase modeling the reduction in the contagion pattern due to these remediation measures. Based on this model, we explore various scenarios and quantify the effect of mitigation measures for an insurance company that aims to limit the consequences of a cyber pandemic in its insured portfolio.
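With exponential specifications, the two-kernel conditional intensity described above can be written as (symbols chosen here for illustration; the paper's notation may differ):
\[
\lambda(t) \;=\; \mu \;+\; \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)} \;+\; \sum_{s_j < t} \tilde{\alpha}\, e^{-\tilde{\beta} (t - s_j)},
\]
where the $t_i$ are past cyber-attack times (self-excitation) and the $s_j$ are vulnerability publication times (external excitation).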
Understanding complex spatial dependency structures is a crucial consideration when attempting to build a modeling framework for wind speeds. Ideally, wind speed modeling should be very efficient, since wind speed can vary significantly from day to day or even hour to hour, but complex models usually require high computational resources. This paper illustrates how to construct and implement a hierarchical Bayesian model for wind speeds using the Weibull density function based on a continuously-indexed spatial field. For efficient (near real-time) inference, the proposed model is implemented in the R package R-INLA, based on the integrated nested Laplace approximation (INLA). Specific attention is given to the theoretical and practical considerations of including a spatial component within a Bayesian hierarchical model. The proposed model is then applied and evaluated using a large volume of real data sourced from the coastal regions of South Africa between 2011 and 2021. By projecting the mean and standard deviation of the Mat\'ern field, the results show that the spatial modeling component effectively captures variation in wind speeds which cannot be explained by the other model components. The mean of the spatial field varies between $\pm 0.3$ across the domain. These insights are valuable for the planning and implementation of green energy resources such as wind farms in South Africa. Furthermore, shortcomings in the spatial sampling domain are evident in the analysis, which is important for future sampling strategies. The proposed model, and the conglomerated dataset, can serve as a foundational framework for future investigations into wind energy in South Africa.
The accurate prediction of patient prognosis is a critical challenge in clinical practice. With the availability of various patient information, physicians can optimize medical care by closely monitoring disease progression and therapy responses. To enable better individualized treatment, dynamic prediction models are required to continuously update survival probability predictions as new information becomes available. This article aims to offer a comprehensive survey of current methods in dynamic survival analysis, encompassing both classical statistical approaches and deep learning techniques. Additionally, it discusses the limitations of existing methods and the prospects for future advancements in this field.
Selecting the best regularization parameter in inverse problems is a classical yet challenging problem. Recently, data-driven approaches have become popular to tackle this challenge. These approaches are appealing since they require less a priori knowledge, but their theoretical analysis is limited. In this paper, we propose and study a statistical machine learning approach, based on empirical risk minimization. Our main contribution is a theoretical analysis showing that, provided with enough data, this approach can reach sharp rates while being essentially adaptive to the noise and smoothness of the problem. Numerical simulations corroborate and illustrate the theoretical findings. Our results are a step towards grounding theoretically data-driven approaches to inverse problems.
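A minimal sketch of the supervised, empirical-risk-minimization idea is shown below: given training pairs of ground-truth signals and noisy observations, pick the Tikhonov regularization parameter that minimizes the empirical reconstruction risk over a grid. The forward operator, noise level, and grid are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: pick the Tikhonov regularization parameter by empirical risk
# minimization over a grid, using training pairs (x_i, y_i = A x_i + noise).
# The forward operator, noise level, and grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, d, n_train = 40, 60, 50
A = rng.normal(size=(m, d)) / np.sqrt(m)

X = rng.normal(size=(n_train, d))                       # ground-truth signals
Y = X @ A.T + 0.1 * rng.normal(size=(n_train, m))       # noisy observations

def tikhonov(y, lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

grid = np.logspace(-4, 1, 30)
risks = [np.mean([np.sum((tikhonov(y, lam) - x) ** 2) for x, y in zip(X, Y)])
         for lam in grid]
lam_hat = grid[int(np.argmin(risks))]
print("selected lambda:", lam_hat)
```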
Citizen science databases that consist of volunteer-led sampling efforts of species communities are relied on as essential sources of data in ecology. Summarizing such data across counties with frequentist-valid prediction sets for each county provides an interpretable comparison across counties of varying size or composition. As citizen science data often feature unequal sampling efforts across a spatial domain, prediction sets constructed with indirect methods that share information across counties may be used to improve precision. In this article, we present a nonparametric framework to obtain precise prediction sets for a multinomial random sample based on indirect information that maintain frequentist coverage guarantees for each county. We detail a simple algorithm to obtain prediction sets for each county using indirect information where the computation time does not depend on the sample size and scales nicely with the number of species considered. The indirect information may be estimated by a proposed empirical Bayes procedure based on information from auxiliary data. Our approach makes inference for under-sampled counties more precise, while maintaining area-specific frequentist validity for each county. Our method is used to provide a useful description of avian species abundance in North Carolina, USA based on citizen science data from the eBird database.
In this paper, we explore optimal treatment allocation policies that target distributional welfare. Most literature on treatment choice has considered utilitarian welfare based on the conditional average treatment effect (ATE). While average welfare is intuitive, it may yield undesirable allocations especially when individuals are heterogeneous (e.g., with outliers) -- the very reason individualized treatments were introduced in the first place. This observation motivates us to propose an optimal policy that allocates the treatment based on the conditional \emph{quantile of individual treatment effects} (QoTE). Depending on the choice of the quantile probability, this criterion can accommodate a policymaker who is either prudent or negligent. The challenge of identifying the QoTE lies in its requirement for knowledge of the joint distribution of the counterfactual outcomes, which is generally hard to recover even with experimental data. Therefore, we introduce minimax optimal policies that are robust to model uncertainty. We then propose a range of identifying assumptions under which we can point or partially identify the QoTE. We establish the asymptotic bound on the regret of implementing the proposed policies. We consider both stochastic and deterministic rules. In simulations and two empirical applications, we compare optimal decisions based on the QoTE with decisions based on other criteria.
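One natural way to formalize the QoTE criterion is (notation chosen here for illustration and possibly differing from the paper's):
\[
q_{\tau}(x) \;=\; \inf\bigl\{\delta \in \mathbb{R} : \Pr\bigl(Y(1) - Y(0) \le \delta \mid X = x\bigr) \ge \tau \bigr\},
\]
where $Y(1)$ and $Y(0)$ are potential outcomes; a policy of this type might assign treatment to individuals with covariates $x$ for which $q_{\tau}(x)$ exceeds a benefit threshold such as zero.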
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying that MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complements the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas the Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.
In this paper we consider pricing of insurance contracts for breast cancer risk based on three multiple state models. Using population data in England and data from the medical literature, we calibrate a collection of semi-Markov and Markov models. Considering an industry-based Markov model as a baseline model, we demonstrate the strengths of a more detailed model while showing the importance of accounting for duration dependence in transition rates. We quantify age-specific cancer incidence and cancer survival by stage along with type-specific mortality rates based on the semi-Markov model which accounts for unobserved breast cancer cases and progression through breast cancer stages. Using the developed models, we obtain actuarial net premiums for a specialised critical illness and life insurance product. Our analysis shows that the semi-Markov model leads to results aligned with empirical evidence. Our findings point out the importance of accounting for the time spent with diagnosed or undiagnosed pre-metastatic breast cancer in actuarial applications.
The recently proposed fixed-X knockoff is a powerful variable selection procedure that controls the false discovery rate (FDR) in any finite-sample setting, yet its theoretical guarantees are difficult to establish beyond Gaussian linear models. In this paper, we make the first attempt to extend the fixed-X knockoff to partially linear models by using generalized knockoff features, and propose a new stability generalized knockoff (Stab-GKnock) procedure by incorporating the selection probability as the feature importance score. We provide FDR control and power guarantees under some regularity conditions. In addition, we propose a two-stage method under high dimensionality by introducing a new joint feature screening procedure, with a guaranteed sure screening property. Extensive simulation studies are conducted to evaluate the finite-sample performance of the proposed method. A real data example is also provided for illustration.
Aberrant respondents are common yet extremely detrimental to the quality of social surveys or questionnaires. Recently, factor mixture models have been employed to identify individuals providing deceptive or careless responses. We propose a comprehensive factor mixture model that combines confirmatory and exploratory factor models to represent both the non-aberrant and aberrant components of the responses. The flexibility of the proposed solution allows for the identification of two of the most common aberrant response styles, namely faking and careless responding. We validated our approach by means of two simulations and two case studies. The results indicate the effectiveness of the proposed model in handling aberrant responses in social and behavioral surveys.
We introduce a new powerful scan statistic and an associated test for detecting the presence and pinpointing the location of a change point within the distribution of a data sequence where the data elements take values in a general separable metric space $(\Omega, d)$. These change points mark abrupt shifts in the distribution of the data sequence. Our method hinges on distance profiles, where the distance profile of an element $\omega \in \Omega$ is the distribution of distances from $\omega$ as dictated by the data. Our approach is fully non-parametric and universally applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. From a practical point of view, it is nearly tuning parameter-free, except for the specification of cut-off intervals near the endpoints where change points are assumed not to occur. Our theoretical results include a precise characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points and rigorous guarantees on the consistency of the test in the presence of change points under contiguous alternatives, as well as for the consistency of the estimated change point location. Through comprehensive simulation studies encompassing multivariate data, bivariate distributional data and sequences of graph Laplacians, we demonstrate the effectiveness of our approach in both change point detection power and change point location estimation. We apply our method to real datasets, including U.S. electricity generation compositions and Bluetooth proximity networks, underscoring its practical relevance.
We provide a new algorithmic framework for differentially private estimation of general functions that adapts to the hardness of the underlying dataset. We build upon previous work that gives a paradigm for selecting an output through the exponential mechanism based upon closeness of the inverse to the underlying dataset, termed the inverse sensitivity mechanism. Our framework slightly modifies the closeness metric and instead gives a simple and efficient application of the sparse vector technique. While the inverse sensitivity mechanism was shown to be instance optimal, it was only with respect to a class of unbiased mechanisms such that the most likely outcome matches the underlying data. We break this assumption in order to more naturally navigate the bias-variance tradeoff, which will also critically allow for extending our method to unbounded data. In consideration of this tradeoff, we provide strong intuition and empirical validation that our technique will be particularly effective when the distances to the underlying dataset are asymmetric. This asymmetry is inherent to a range of important problems, including fundamental statistics such as variance, as well as commonly used machine learning performance metrics for both classification and regression tasks. We efficiently instantiate our method in $O(n)$ time for these problems and empirically show that our techniques give substantially improved differentially private estimates.
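For background, the sparse vector technique referenced above is standardly instantiated as AboveThreshold for sensitivity-1 queries, as in the sketch below; the way the paper couples it with inverse-sensitivity distances is more involved, and the queries, threshold, and privacy parameter here are illustrative.

```python
# Standard AboveThreshold (sparse vector technique) for sensitivity-1 queries:
# release the index of the first query whose noisy value exceeds a noisy threshold.
# The coupling with inverse-sensitivity distances in the paper is more involved.
import numpy as np

def above_threshold(query_values, threshold, epsilon, rng=None):
    rng = np.random.default_rng(rng)
    noisy_threshold = threshold + rng.laplace(scale=2.0 / epsilon)
    for i, q in enumerate(query_values):
        if q + rng.laplace(scale=4.0 / epsilon) >= noisy_threshold:
            return i            # first (noisily) above-threshold query
    return None

# Example: queries indexed by increasing candidate outputs; stop at the first
# candidate whose (sensitivity-1) score clears the threshold.
scores = [0.0, 0.5, 1.0, 3.0, 7.0, 12.0]
print(above_threshold(scores, threshold=5.0, epsilon=1.0, rng=0))
```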
Our work presents two fundamental contributions. On the application side, we tackle the challenging problem of predicting day-ahead cryptocurrency prices. On the methodological side, a new dynamical modeling approach is proposed. Our approach retains the probabilistic formulation of the state-space model, which provides uncertainty quantification on the estimates, while adding the function approximation ability of deep neural networks. We call the proposed approach the deep state-space model. The experiments are carried out on established cryptocurrencies (with data obtained from Yahoo Finance), with the goal of predicting the price for the next day. Benchmarking has been done against both state-of-the-art and classical dynamical modeling techniques. Results show that the proposed approach yields the best overall results in terms of accuracy.
This paper explores the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques in cryptocurrency price forecasting, specifically Bitcoin (BTC) and Ethereum (ETH). Focusing on news and social media data, primarily from Twitter and Reddit, we analyse the influence of public sentiment on cryptocurrency valuations using advanced deep learning NLP methods. Alongside conventional price regression, we treat cryptocurrency price forecasting as a classification problem. This includes both the prediction of price movements (up or down) and the identification of local extrema. We compare the performance of various ML models, both with and without NLP data integration. Our findings reveal that incorporating NLP data significantly enhances the forecasting performance of our models. We discover that pre-trained models, such as Twitter-RoBERTa and BART MNLI, are highly effective in capturing market sentiment, and that fine-tuning Large Language Models (LLMs) also yields substantial forecasting improvements. Notably, the BART MNLI zero-shot classification model shows considerable proficiency in extracting bullish and bearish signals from textual data. All of our models consistently generate profit across different validation scenarios, with no observed decline in profits or reduction in the impact of NLP data over time. The study highlights the potential of text analysis in improving financial forecasts and demonstrates the effectiveness of various NLP techniques in capturing nuanced market sentiment.
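The zero-shot signal extraction described above can be sketched with the Hugging Face pipeline API and the public facebook/bart-large-mnli checkpoint; the candidate labels and example headline below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: zero-shot bullish/bearish signal extraction with BART MNLI via the
# Hugging Face pipeline API.  Candidate labels and example text are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

headline = "Bitcoin ETF approval expected to drive institutional inflows"
result = classifier(headline, candidate_labels=["bullish", "bearish", "neutral"])

# `result["labels"]` is sorted by descending score; the top label/score pair can be
# used as a sentiment feature for downstream forecasting models.
print(result["labels"][0], round(result["scores"][0], 3))
```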
Reinforcement Learning from Human Feedback (RLHF) has played a crucial role in the success of large models such as ChatGPT. RLHF is a reinforcement learning framework that incorporates human feedback to improve learning effectiveness and performance. However, obtaining preference feedback manually is quite expensive in commercial applications. Some statistical commercial indicators are often more valuable, yet they are typically ignored in RLHF, leaving a gap between commercial targets and model training. In our research, we attempt to fill this gap with statistical business feedback instead of human feedback, using AB testing, a well-established statistical method. We propose Reinforcement Learning from Statistical Feedback (RLSF) based on AB testing. Statistical inference methods are used to obtain preferences for training the reward network, which fine-tunes the pre-trained model in a reinforcement learning framework, achieving greater business value. Furthermore, we extend AB testing with double selections at a single time point to ANT testing with multiple selections at different feedback time points. Moreover, we design numerical experiments to validate the effectiveness of our algorithm framework.
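A minimal sketch of turning an AB test into a preference signal is given below, using a two-proportion z-test on a business metric. The metric (conversions), significance level, and mapping to a preference label are illustrative assumptions, not the paper's exact inference procedure.

```python
# Sketch: derive a preference between two model variants from an AB test using a
# two-proportion z-test on a business metric (here, conversions).  The metric,
# significance level, and mapping to a preference label are illustrative.
import numpy as np
from scipy.stats import norm

def ab_preference(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    if p_value < alpha:
        return ("B preferred" if z > 0 else "A preferred"), p_value
    return "no preference", p_value          # tie: no reward-label signal

print(ab_preference(conv_a=480, n_a=10_000, conv_b=545, n_b=10_000))
```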
Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, LCP for short) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.
I propose a new identification-robust test for the structural parameter in a heteroskedastic linear instrumental variables model. The proposed test statistic is similar in spirit to a jackknife version of the K-statistic and the resulting test has exact asymptotic size so long as an auxiliary parameter can be consistently estimated. This is possible under approximate sparsity even when the number of instruments is much larger than the sample size. As the number of instruments is allowed, but not required, to be large, the limiting behavior of the test statistic is difficult to examine via existing central limit theorems. Instead, I derive the asymptotic chi-squared distribution of the test statistic using a direct Gaussian approximation technique. To improve power against certain alternatives, I propose a simple combination with the sup-score statistic of Belloni et al. (2012) based on a thresholding rule. I demonstrate favorable size control and power properties in a simulation study and apply the new methods to revisit the effect of social spillovers in movie consumption.
A system of coupled oscillators on an arbitrary graph is locally driven by the tendency to mutual synchronization between nearby oscillators, but can and often does exhibit nonlinear behavior on the whole graph. Understanding such nonlinear behavior has been a key challenge in predicting whether all oscillators in such a system will eventually synchronize. In this paper, we demonstrate that, surprisingly, such nonlinear behavior of coupled oscillators can be effectively linearized in certain latent dynamic spaces. The key insight is that there is a small number of `latent dynamics filters', each with a specific association with synchronizing and non-synchronizing dynamics on subgraphs, so that any observed dynamics on subgraphs can be approximated by a suitable linear combination of such elementary dynamic patterns. Taking an ensemble of subgraph-level predictions provides an interpretable predictor for whether the system on the whole graph reaches global synchronization. We propose algorithms based on supervised matrix factorization to learn such latent dynamics filters. We demonstrate that our method performs competitively in synchronization prediction tasks against baselines and black-box classification algorithms, despite its simple and interpretable architecture.
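A canonical example of such graph-coupled oscillator dynamics is the Kuramoto model, used here only for concreteness (the framework above is not stated to be tied to this specific system):
\[
\frac{d\theta_i}{dt} \;=\; \omega_i \;+\; K \sum_{j=1}^{N} A_{ij} \sin(\theta_j - \theta_i), \qquad i = 1, \dots, N,
\]
where $A$ is the graph adjacency matrix, $\omega_i$ are natural frequencies, and $K$ is the coupling strength; global synchronization corresponds to the phases locking together over time.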
When only a few data samples are accessible, utilizing structural prior knowledge is essential for estimating covariance matrices and their inverses. One prominent example is knowing the covariance matrix to be Toeplitz structured, which occurs when dealing with wide sense stationary (WSS) processes. This work introduces a novel class of positive-definiteness-ensuring likelihood-based estimators for Toeplitz structured covariance matrices (CMs) and their inverses. In order to accomplish this, we derive positive-definiteness-enforcing constraint sets for the Gohberg-Semencul (GS) parameterization of inverse symmetric Toeplitz matrices. Motivated by the relationship between the GS parameterization and autoregressive (AR) processes, we propose hyperparameter tuning techniques, which enable our estimators to combine advantages from state-of-the-art likelihood and non-parametric estimators. Moreover, we present a computationally cheap closed-form estimator, which is derived by maximizing an approximate likelihood. Due to the ensured positive definiteness, our estimators perform well for both the estimation of the CM and the inverse covariance matrix (ICM). Extensive simulation results validate the proposed estimators' efficacy for several standard Toeplitz structured CMs commonly employed in a wide range of applications.
Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effect on the training trajectory remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning-rate warmup displays large catapults, driving the iterates towards flatter minima than those found by plain gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).
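As a hedged illustration of the two ingredients named above (heavy-ball momentum and linear learning-rate warmup), the toy sketch below runs them on the classical two-parameter loss $L(w)=\tfrac{1}{2}(w_1 w_2 - 1)^2$ and records the loss together with a sharpness proxy; the loss function, initialization, and hyperparameters are illustrative choices and not the paper's experimental setup, and may need tuning to produce a clean catapult.

```python
import numpy as np

def heavy_ball_with_warmup(lr_max=0.6, beta=0.9, warmup=50, steps=400):
    """Heavy-ball momentum on the toy loss L(w) = 0.5 * (w1*w2 - 1)^2 with
    linear learning-rate warmup; returns the loss and a sharpness proxy per step."""
    w, v = np.array([2.5, 0.1]), np.zeros(2)
    losses, sharpness = [], []
    for t in range(steps):
        lr = lr_max * min(1.0, (t + 1) / warmup)   # linear warmup of the step size
        r = w[0] * w[1] - 1.0
        grad = np.array([r * w[1], r * w[0]])
        v = beta * v - lr * grad                   # momentum buffer
        w = w + v
        losses.append(0.5 * r**2)
        # at any minimum of this toy loss, the top Hessian eigenvalue equals ||w||^2
        sharpness.append(w @ w)
    return np.array(losses), np.array(sharpness)
```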
Constrained optimization of the parameters of a simulator plays a crucial role in many design processes. These problems become challenging when the simulator is stochastic and computationally expensive, and when the parameter space is high-dimensional. Optimization can be performed efficiently only by exploiting the gradient with respect to the parameters, but such gradients are unavailable in many legacy, black-box codes. We introduce Scout-Nd (Stochastic Constrained Optimization for N dimensions), an algorithm that tackles these issues by efficiently estimating the gradient, reducing the noise of the gradient estimator, and applying multi-fidelity schemes to further reduce computational effort. We validate our approach on standard benchmarks, demonstrating its effectiveness in optimizing parameters and highlighting its better performance compared to existing methods.
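One generic way to obtain gradients from a black-box, stochastic simulator, used by many stochastic-optimization schemes (whether or not it matches Scout-Nd's internal estimator), is the score-function (REINFORCE) estimator with a baseline for variance reduction; the sketch below is a minimal illustration with a Gaussian search distribution over the parameters.

```python
import numpy as np

def score_function_grad(f, mu, sigma, n_samples=256, rng=None):
    """Monte Carlo gradient of E_{x ~ N(mu, sigma^2 I)}[f(x)] with respect to mu,
    using the score-function estimator with a mean baseline for variance reduction.
    f is the black-box (possibly noisy) simulator objective."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, mu.size))
    x = mu + sigma * eps
    fx = np.array([f(xi) for xi in x])
    baseline = fx.mean()                       # valid: E[(x - mu)/sigma^2] = 0
    # grad_mu log N(x; mu, sigma^2 I) = (x - mu) / sigma^2
    return ((fx - baseline)[:, None] * (x - mu) / sigma**2).mean(axis=0)
```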
In artificial-intelligence-aided signal processing, existing deep learning models often exhibit a black-box structure, and their validity and comprehensibility remain elusive. The integration of topological methods, despite its relatively nascent application, serves a dual purpose: making models more interpretable and extracting structural information from time-dependent data for smarter learning. Here, we provide a transparent and broadly applicable methodology, TopCap, to capture the most salient topological features inherent in time series for machine learning. Rooted in high-dimensional ambient spaces, TopCap captures features that are rarely detected in datasets with low intrinsic dimensionality. Applying time-delay embedding and persistent homology, we obtain descriptors that encapsulate information such as the vibration of a time series, in terms of the variability of its frequency, amplitude, and average line, as demonstrated with simulated data. This information is then vectorised and fed into multiple machine learning algorithms such as k-nearest neighbours and support vector machines. Notably, in classifying voiced and voiceless consonants, TopCap achieves an accuracy exceeding 96%, and the approach is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
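The two ingredients named above, time-delay embedding and persistent homology, can be sketched as follows; the `ripser` dependency and the specific descriptor (maximum and total persistence of 1-dimensional features) are assumptions for illustration and are not TopCap's actual feature construction.

```python
import numpy as np
from ripser import ripser   # assumed dependency for persistent homology

def delay_embed(x, dim=3, tau=5):
    """Time-delay (Takens) embedding of a 1-D series into R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def toy_topological_features(x, dim=3, tau=5):
    """Toy descriptor: max and total persistence of 1-D homology (loops) of the
    delay embedding, a proxy for dominant oscillatory structure in the series."""
    cloud = delay_embed(np.asarray(x, dtype=float), dim, tau)
    dgm_h1 = ripser(cloud, maxdim=1)['dgms'][1]
    pers = dgm_h1[:, 1] - dgm_h1[:, 0] if len(dgm_h1) else np.array([0.0])
    return np.array([pers.max(), pers.sum()])
```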
In this paper, we provide a fine-grained analysis of the local landscape of phase retrieval in the limited-sample regime. Our aim is to ascertain the minimal sample size necessary to guarantee a benign local landscape surrounding global minima in high dimensions. Let $n$ and $d$ denote the sample size and input dimension, respectively. We first explore the local convexity and establish that when $n=o(d\log d)$, for almost every fixed point in the local ball, the Hessian matrix must have negative eigenvalues as long as $d$ is sufficiently large. Consequently, the local landscape is highly non-convex. We next consider the one-point strong convexity and show that as long as $n=\omega(d)$, with high probability, the landscape is one-point strongly convex in the local annulus: $\{w\in\mathbb{R}^d: o_d(1)\leqslant \|w-w^*\|\leqslant c\}$, where $w^*$ is the ground truth and $c$ is an absolute constant. This implies that gradient descent initialized from any point in this domain can converge to an $o_d(1)$-loss solution exponentially fast. Furthermore, we show that when $n=o(d\log d)$, there is a radius of $\widetilde\Theta\left(\sqrt{1/d}\right)$ such that one-point convexity breaks in the corresponding smaller local ball. This indicates that it is impossible to establish convergence of gradient descent to the exact $w^*$ under limited samples by relying solely on one-point convexity.
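For reference, a standard squared-loss formulation of phase retrieval, together with the one-point strong convexity condition referred to above, can be written as follows (the paper's precise loss, constants, and variants may differ):
\[
L(w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big((x_i^\top w)^2 - (x_i^\top w^*)^2\Big)^2,
\qquad
\big\langle \nabla L(w),\, w - w^*\big\rangle \;\geq\; \mu\,\|w - w^*\|^2
\]
for all $w$ in the annulus $\{w\in\mathbb{R}^d: o_d(1)\leqslant \|w-w^*\|\leqslant c\}$; it is this inequality that allows gradient descent initialized in the annulus to contract towards $w^*$ at an exponential rate.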
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
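The variance-weighted regression scheme in item (3) typically takes the following generic form, shown here as a sketch of the standard weighted least-squares objective over a function class rather than MQL-UCB's exact update:
\[
\widehat{f}_k \;\in\; \arg\min_{f \in \mathcal{F}} \;\sum_{(s_h,\,a_h,\,y_h) \in \mathcal{D}_k} \frac{\big(f(s_h, a_h) - y_h\big)^2}{\bar\sigma_h^2} \;+\; \lambda\,\mathrm{pen}(f),
\]
where $\bar\sigma_h^2$ upper-bounds the conditional variance of the regression target $y_h$, so that low-noise transitions receive larger weight, and $\mathrm{pen}(f)$ is a regularizer appropriate for the function class.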
While physics-informed neural networks (PINNs) have proven effective for low-dimensional partial differential equations (PDEs), their computational cost remains a hurdle in high-dimensional scenarios. This is particularly pronounced when computing high-order and high-dimensional derivatives in the physics-informed loss. Randomized Smoothing PINN (RS-PINN) introduces Gaussian noise for stochastic smoothing of the original neural network model, enabling Monte Carlo methods for derivative approximation and eliminating the need for costly auto-differentiation. Despite its computational efficiency in high dimensions, RS-PINN introduces biases in both loss and gradients, negatively impacting convergence, especially when coupled with stochastic gradient descent (SGD). We present a comprehensive analysis of biases in RS-PINN, attributing them to the nonlinearity of the Mean Squared Error (MSE) loss and the PDE nonlinearity. We propose tailored bias correction techniques based on the order of PDE nonlinearity. The unbiased RS-PINN allows for a detailed examination of its pros and cons compared to the biased version. Specifically, the biased version has a lower variance and runs faster than the unbiased version, but it is less accurate due to the bias. To optimize the bias-variance trade-off, we combine the two approaches in a hybrid method that balances the rapid convergence of the biased version with the high accuracy of the unbiased version. In addition, we present an enhanced implementation of RS-PINN. Extensive experiments on diverse high-dimensional PDEs, including Fokker-Planck, HJB, viscous Burgers', Allen-Cahn, and Sine-Gordon equations, illustrate the bias-variance trade-off and highlight the effectiveness of the hybrid RS-PINN. Empirical guidelines are provided for selecting biased, unbiased, or hybrid versions, depending on the dimensionality and nonlinearity of the specific PDE problem.
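The derivative estimator that such Gaussian smoothing enables rests on Stein's identity: for $f_\sigma(x)=\mathbb{E}_{\delta\sim\mathcal{N}(0,\sigma^2 I)}[f(x+\delta)]$ one has $\nabla f_\sigma(x)=\mathbb{E}[\delta\, f(x+\delta)]/\sigma^2$, so derivatives of the smoothed model need only forward evaluations. The sketch below illustrates this estimator on a toy function; it is a generic illustration, not the RS-PINN implementation (which additionally uses the corrections discussed above).

```python
import numpy as np

def smoothed_grad(f, x, sigma=0.1, n_samples=10_000, rng=None):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed function
    f_sigma(x) = E_{d ~ N(0, sigma^2 I)}[f(x + d)] via Stein's identity:
    grad f_sigma(x) = E[d * f(x + d)] / sigma^2 (no auto-differentiation needed)."""
    rng = np.random.default_rng() if rng is None else rng
    d = sigma * rng.standard_normal((n_samples, x.size))
    fx = np.apply_along_axis(f, 1, x + d)
    return (d * fx[:, None]).mean(axis=0) / sigma**2

# sanity check on f(x) = ||x||^2, whose smoothed gradient is exactly 2x
x = np.array([1.0, -2.0])
print(smoothed_grad(lambda v: v @ v, x))   # approximately [2., -4.]
```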
Constrained clustering allows the training of classification models using only pairwise constraints, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. Although such models perform well even in the absence of the true underlying class labels, constrained clustering still requires large amounts of binary constraint annotations for training. In this paper, we consider a semi-supervised setting in which a large amount of \textit{unconstrained} data is available alongside a smaller set of constraints, and propose \textit{ConstraintMatch} to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning with full labels, several challenges prevent a naive application of the resulting methods in the constraint-based label setting. We therefore reason about and analyze these challenges, specifically: 1) proposing a \textit{pseudo-constraining} mechanism to overcome confirmation bias, a major weakness of pseudo-labeling; 2) developing new pseudo-labeling methods for selecting \textit{informative} unconstrained samples; and 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses, which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks, and provide analyses of its several components.
Although statistical learning theory provides a robust framework for understanding supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient-based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning from a learning-theory perspective. After a brief reminder on statistical learning theory and stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a general description of the mirror descent algorithm, showing how one may go back and forth between a parameter space and the corresponding function space for a given learning problem, and how the geometry of the learning problem may be represented by a metric tensor. Building on this framework, we provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks, showing how the loss function, the scale of parameters at initialization, and the depth of the network may lead to various forms of implicit bias, in particular a transition between kernel and feature learning.
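For orientation, the central object of the mirror descent part can be stated in its standard discrete and continuous-time forms (included here as background, following the usual conventions):
\[
w_{t+1} \;=\; \arg\min_{w}\; \eta\,\langle \nabla L(w_t),\, w\rangle + D_\psi(w, w_t),
\qquad
D_\psi(w, w') \;=\; \psi(w) - \psi(w') - \langle \nabla\psi(w'),\, w - w'\rangle,
\]
whose continuous-time limit is $\frac{\mathrm{d}}{\mathrm{d}t}\nabla\psi(w_t) = -\nabla L(w_t)$, i.e. $\dot w_t = -\big(\nabla^2\psi(w_t)\big)^{-1}\nabla L(w_t)$; the Hessian $\nabla^2\psi$ plays exactly the role of the metric tensor that represents the geometry of the learning problem.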
Information-theoretic image quality assessment (IQA) models such as Visual Information Fidelity (VIF) and Spatio-temporal Reduced Reference Entropic Differences (ST-RRED) have enjoyed great success by seamlessly integrating natural scene statistics (NSS) with information theory. The Gaussian Scale Mixture (GSM) model that governs the wavelet subband coefficients of natural images forms the foundation for these algorithms. However, the explosion of user-generated content on social media, which is typically distorted by one or more of many possible unknown impairments, has revealed the limitations of NSS-based IQA models that rely on the simple GSM model. Here, we seek to elaborate the VIF index by deriving useful properties of the Multivariate Generalized Gaussian Distribution (MGGD), and using them to study the behavior of VIF under a Generalized GSM (GGSM) model.
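For readers unfamiliar with the two models involved: under the GSM, a vector of wavelet subband coefficients is a Gaussian vector modulated by a random scalar scale, while the MGGD generalizes the multivariate Gaussian through a shape parameter. The forms below are standard (the MGGD normalizing constant is omitted); the exact GGSM construction follows the paper rather than this sketch.
\[
\text{GSM:}\quad C \;=\; \sqrt{Z}\,U, \quad U \sim \mathcal{N}(0, \Sigma),\ Z \geq 0 \ \text{independent of}\ U;
\qquad
\text{MGGD:}\quad p(u) \;\propto\; |\Sigma|^{-1/2}\exp\!\Big(-\tfrac{1}{2}\big(u^\top \Sigma^{-1} u\big)^{\beta}\Big),
\]
which recovers the multivariate Gaussian at $\beta=1$; the GGSM studied above generalizes the Gaussian factor of the GSM in this spirit.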
We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.
As black-box machine learning models grow in complexity and find applications in high-stakes scenarios, it is imperative to provide explanations for their predictions. Although Local Interpretable Model-agnostic Explanations (LIME) [22] is a widely adopted method for understanding model behaviors, it is unstable with respect to random seeds [35,24,3] and exhibits low local fidelity (i.e., how well the explanation approximates the model's local behaviors) [21,16]. Our study shows that this instability problem stems from small sample weights, leading to the dominance of regularization and slow convergence. Additionally, LIME's sampling neighborhood is non-local and biased towards the reference, resulting in poor local fidelity and sensitivity to reference choice. To tackle these challenges, we introduce GLIME, an enhanced framework extending LIME and unifying several prior methods. Within the GLIME framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity compared to LIME. GLIME explanations are independent of reference choice. Moreover, GLIME offers users the flexibility to choose a sampling distribution based on their specific scenarios.
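To make the instability argument concrete, the sketch below is a much-simplified tabular-LIME-style pipeline: perturbations are generated by switching coordinates of the explained point back to a reference, weighted by an exponential kernel on their distance to the explained point, and a weighted ridge surrogate is fitted. Perturbations with many flipped features receive near-zero weight, which is where the regularization-dominance issue originates. The helper is illustrative only and is neither the LIME nor the GLIME implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(f, x, reference, n_samples=1000, kernel_width=0.25,
                           alpha=1.0, rng=None):
    """Simplified LIME-style surrogate: f is the black-box model (takes an array of
    rows, returns one prediction per row), x is the point to explain, reference is
    the baseline point used for 'switched-off' features."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    mask = rng.integers(0, 2, size=(n_samples, d))        # 1 keeps x_j, 0 uses reference_j
    z = mask * x + (1 - mask) * reference
    # normalized distance from x in the binary interpretable space
    dist = np.sqrt(((1.0 - mask) ** 2).sum(axis=1) / d)
    w = np.exp(-dist**2 / kernel_width**2)                # small weights when many features flip
    surrogate = Ridge(alpha=alpha).fit(mask, f(z), sample_weight=w)
    return surrogate.coef_, w
```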
Despite the large research effort devoted to learning dependencies between time series, the state of the art still faces a major limitation: existing methods learn partial correlations but fail to discriminate across distinct frequency bands. Motivated by many applications in which this differentiation is pivotal, we overcome this limitation by learning a block-sparse, frequency-dependent partial correlation graph, in which layers correspond to different frequency bands and partial correlations can occur over just a few layers. To this end, we formulate and solve two nonconvex learning problems: the first has a closed-form solution and is suitable when there is prior knowledge about the number of partial correlations; the second hinges on an iterative solution based on successive convex approximation and is effective in the general case where no prior knowledge is available. Numerical results on synthetic data show that the proposed methods outperform the current state of the art. Finally, the analysis of financial time series confirms that partial correlations exist only within a few frequency bands, underscoring how our methods yield valuable insights that would remain undetected without discriminating across the frequency domain.
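One standard way to formalize frequency-dependent partial correlation (Dahlhaus-style partial spectral coherence) is through the inverse cross-spectral density; a frequency-dependent partial correlation graph then places an edge between two series in a band whenever this quantity is nonzero somewhere in the band. The population-level definition below is background only; the block-sparse estimators proposed above are, of course, different objects.
\[
\Pi_{ij}(f) \;=\; -\,\frac{\big[S(f)^{-1}\big]_{ij}}{\sqrt{\big[S(f)^{-1}\big]_{ii}\,\big[S(f)^{-1}\big]_{jj}}},
\]
where $S(f)$ is the cross-spectral density matrix of the multivariate time series; $\Pi_{ij}(f)=0$ for all $f$ in a band means that series $i$ and $j$ are partially uncorrelated within that band, conditional on all the other series.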
Given two Erd\H{o}s-R\'enyi graphs with $n$ vertices whose edges are correlated through a latent vertex correspondence, we study complexity lower bounds for the associated correlation detection problem for the class of low-degree polynomial algorithms. We provide evidence that any degree-$O(\rho^{-1})$ polynomial algorithm fails for detection, where $\rho$ is the edge correlation. Furthermore, in the sparse regime where the edge density $q=n^{-1+o(1)}$, we provide evidence that any degree-$d$ polynomial algorithm fails for detection, as long as $\log d=o\big( \frac{\log n}{\log nq} \wedge \sqrt{\log n} \big)$ and the correlation $\rho<\sqrt{\alpha}$, where $\alpha\approx 0.338$ is Otter's constant. Our results suggest that several state-of-the-art algorithms for correlation detection and exact matching recovery may be essentially the best possible.
Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insofar as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure can lead to pathological solutions when using neural networks, prove conditions under which it is well-behaved, and propose a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
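To make the contrast concrete, the two estimation targets can be written as follows; the change-of-variables step is a finite-dimensional, bijective-reparametrization sketch, whereas the paper treats the general neural-network case where these assumptions do not hold.
\[
\hat\theta_{\mathrm{MAP}} \;=\; \arg\max_{\theta}\ \log p(\mathcal{D}\mid\theta) + \log p(\theta),
\qquad
\hat f_{\mathrm{MAP}} \;=\; \arg\max_{f}\ \log p(\mathcal{D}\mid f) + \log p_{f}(f),
\]
where $p_f$ is the pushforward of the parameter prior through $\theta \mapsto f_\theta$. For a smooth bijective reparametrization $f = g(\theta)$, the change of variables gives $\log p_f(f_\theta) = \log p(\theta) - \log\lvert\det J_g(\theta)\rvert$, so the two maximizers generally differ, and the parameter-space MAP can be relocated to essentially any $\theta$ by a suitable choice of reparametrization.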
Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to their state-of-the-art performance in many generation tasks while relying on mathematical foundations such as stochastic differential equations (SDEs) and ordinary differential equations (ODEs). Empirically, it has been reported that ODE-based samples are inferior to SDE-based samples. In this paper we rigorously describe the range of dynamics and approximations that arise when training score-based diffusion models, including the true SDE dynamics, the neural approximations, the various approximate particle dynamics that result, as well as their associated Fokker--Planck equations and the neural network approximations of these Fokker--Planck equations. We systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models, and link it to an associated Fokker--Planck equation. We derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker--Planck residual. We also show numerically, using explicit comparisons, that conventional score-based diffusion models can exhibit significant differences between ODE- and SDE-induced distributions. Moreover, we show numerically that reducing the Fokker--Planck residual by adding it as an additional regularisation term closes the gap between ODE- and SDE-induced distributions. Our experiments suggest that this regularisation can improve the distribution generated by the ODE, although this can come at the cost of degraded SDE sample quality.
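For orientation, the standard objects involved are the forward noising SDE, its reverse-time SDE, the probability-flow ODE, and the Fokker--Planck equation satisfied by the marginals (standard forms from the score-based diffusion literature; the residual studied above measures how far the learned score is from satisfying the last equation):
\[
\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
\qquad
\mathrm{d}X_t = \big[f(X_t,t) - g(t)^2\,\nabla_x\log p_t(X_t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar W_t,
\]
\[
\frac{\mathrm{d}X_t}{\mathrm{d}t} = f(X_t,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x\log p_t(X_t),
\qquad
\partial_t p_t = -\nabla\!\cdot\!\big(f\,p_t\big) + \tfrac{1}{2}\,g(t)^2\,\Delta p_t.
\]
With the exact score $\nabla_x\log p_t$, the reverse SDE and the probability-flow ODE share the same marginals; once the score is replaced by a neural approximation, they generally do not, which is precisely the gap that the Fokker--Planck residual quantifies and the proposed regularisation shrinks.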
Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of NeuralCSA is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data.
The magnitude of a metric space was recently established as a novel invariant, providing a measure of the `effective size' of a space across multiple scales. By capturing both geometrical and topological properties of data, magnitude is poised to address challenges in unsupervised representation learning tasks. We formalise a novel notion of dissimilarity between magnitude functions of finite metric spaces and use it to derive a quality measure for dimensionality reduction tasks. Our measure is provably stable under perturbations of the data, can be efficiently calculated, and enables a rigorous multi-scale comparison of embeddings. We show the utility of our measure in an experimental suite that comprises different domains and tasks, including the comparison of data visualisations.
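Concretely, the magnitude function of a finite metric space admits an elementary computation from its distance matrix, sketched below; this is the standard definition, and the paper's contribution is the dissimilarity between magnitude functions and the stability analysis built on top of it.

```python
import numpy as np
from scipy.spatial.distance import cdist

def magnitude_function(X, ts):
    """Magnitude of the finite metric space given by the rows of X, evaluated at
    the scales in ts: Mag(tX) = sum of the entries of Z_t^{-1}, where
    Z_t[i, j] = exp(-t * d(x_i, x_j))."""
    D = cdist(X, X)
    values = []
    for t in ts:
        Z = np.exp(-t * D)
        values.append(np.linalg.inv(Z).sum())   # assumes Z_t is invertible at this scale
    return np.array(values)
```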