SyNPar: Synthetic Null Parallelism for High-Power FDR Control in Feature Selection

Changhu Wang
Department of Statistics and Data Science
University of California, Los Angeles
Ziheng Zhang
Department of Biostatistics
University of California, Los Angeles
Jingyi Jessica Li
Department of Statistics and Data Science
University of California, Los Angeles


Abstract

Balancing false discovery rate (FDR) and statistical power to ensure reliable discoveries is a key challenge in high-dimensional feature selection. Although several FDR control methods have been proposed, most involve perturbing the original data, either by concatenating knockoff variables or splitting the data into two halves, both of which can lead to a loss of power. In this paper, we introduce a novel approach called Synthetic Null Parallelism (SyNPar), which controls the FDR in high-dimensional feature selection while preserving the original data. SyNPar generates synthetic null data from a model fitted to the original data and modified to reflect the null hypothesis. It then applies the same estimation procedure in parallel to both the original and synthetic null data to estimate coefficients that indicate feature importance. By comparing the coefficients estimated from the null data with those from the original data, SyNPar effectively identifies false positives, functioning as a numerical analog of a likelihood ratio test. We provide theoretical guarantees for FDR control at any desired level while ensuring that the power approaches one with high probability asymptotically. SyNPar is straightforward to implement and can be applied to a wide range of statistical models, including high-dimensional linear regression, generalized linear models, Cox models, and Gaussian graphical models. Through extensive simulations and real data applications, we demonstrate that SyNPar outperforms state-of-the-art methods, including knockoffs and data-splitting methods, in terms of FDR control, power, and computational efficiency.


1 Introduction

Feature selection is a fundamental challenge in high-dimensional data analysis, aiming to identify a subset of relevant features from a large pool of candidates. This task is crucial in various fields, such as bioinformatics, genetics, and neuroscience, where the number of features often far exceeds the number of observations. For example, genome-wide association studies seek to identify genetic markers linked to specific diseases or traits, with the goal of selecting a few features (i.e., genetic markers) associated with the response variable (i.e., disease or trait) while disregarding irrelevant features. The feature selection problem is most rigorously defined under high-dimensional linear models, and numerous methods have been proposed to address it, including LASSO [1], elastic net [2], SCAD [3], and stability selection [4]. However, most of these methods concentrate on selecting relevant features without explicitly considering the false discovery rate (FDR)—the expected proportion of false discoveries among the selected features. To address this, researchers often first select features using these methods, followed by controlling the FDR through multiple testing correction procedures.

The Benjamini-Hochberg (BH) procedure [5] was the first and remains the most widely used method for controlling the FDR in multiple testing, assuming valid and independent \(p\)-values. To address the common dependencies among \(p\)-values (e.g., in high-dimensional feature selection where features are often correlated), the BHq [6] and adaptive BH [7] procedures were developed. Both methods are designed to control the FDR under the assumption of positive dependence among features while still requiring valid \(p\)-values. In high-dimensional feature selection, however, obtaining valid \(p\)-values is challenging. When feature selection depends on the data, applying classical inference methods to the selected features can introduce double-dipping bias, often resulting in invalid \(p\)-values. To tackle this challenge, various strategies have been proposed. For example, [8] and [9] utilized the debiased LASSO to compute asymptotically valid \(p\)-values for features in high-dimensional linear and logistic regression models, followed by the application of the BHq procedure for FDR control. [10] demonstrated that in high-dimensional logistic models, the likelihood-ratio test statistic deviates from the asymptotic chi-square distribution and proposed a framework to derive an accurate asymptotic distribution for valid \(p\)-value computation. Nonetheless, while these methods yield \(p\)-values that are asymptotically valid, they often exhibit significant non-uniformity under the null hypothesis in finite samples. In parallel with the methods relying on asymptotic distributions, \(p\)-values can be computed through conditional randomization testing when the feature distribution is assumed known [11]. However, this approach is computationally intensive and may become impractical in high-dimensional settings.

To address the challenges associated with \(p\)-values, [12] introduced the knockoff filter, a method that controls the FDR without relying on valid \(p\)-values in linear models under the fixed-X design, where the design matrix \(\mathbf{X}\) is treated as fixed. The knockoff filter (Fixed-X knockoff) constructs a set of “knockoff” features that mimic the correlation structure of the original features. By comparing the original features to their knockoff counterparts, it identifies relevant features with FDR control. However, a limitation of the knockoff filter is that it requires the number of observations to be greater than the number of features, limiting its applicability in high-dimensional settings. To overcome this limitation, [11] proposed the Model-X knockoff filter, which extends the knockoff approach to high-dimensional settings by assuming knowledge of the joint distribution of the features, \(\mathbf{X}\). However, if this distribution is unknown, studies in [13] and [14] demonstrate that Model-X knockoffs can lead to inflated FDR and reduced power. Even when the joint distribution of \(\mathbf{X}\) is known, constructing Model-X knockoffs remains challenging and computationally intensive due to the stringent exchangeability condition, which requires that swapping any subset of features with their knockoffs preserves the joint distribution of all features and their knockoffs. Recent advancements in generating high-quality knockoff features include methods using deep generative models [15], [16], sequential MCMC algorithms [17], robust knockoff generation [18], minimizing reconstructability [19], and derandomizing knockoffs [20], [21]. Additionally, the knockoff filter has been adapted for various models, including Gaussian graphical models [22] and Cox regression [23]. 
In addition to the challenges of knowing the joint distribution of \(\mathbf{X}\) and satisfying the exchangeability condition, a significant issue with both Fixed-X and Model-X knockoff procedures is that they double the size of the design matrix by concatenating the original features with their knockoffs. This effectively alters the original data and creates a linear model that differs from the one based solely on the original features, potentially reducing statistical power [24].

In parallel with the knockoff filter, the Gaussian Mirror (GM) approach [24] represents another line of research aimed at FDR control without relying on \(p\)-values. It computes per-feature mirror statistics by perturbing one feature at a time—adding and subtracting Gaussian noise to create a pair of “mirror variables”—while keeping other features unchanged, resulting in smaller modifications to the original data compared to knockoff methods. Since the GM method perturbs one feature at a time and requires \(2p\) separate linear model fittings, the computational cost can become substantial as the number of features \(p\) increases. To address this computational issue, a subsequent data splitting (DS) method [14] perturbs all features simultaneously by randomly splitting the data into two halves, which reduces computational demand to only two linear model fittings. However, the DS method inflates the variances of estimated regression coefficients, potentially leading to power loss. To mitigate this issue, the multiple data splitting (MDS) method [14] aggregates feature selection results from independent replications of DS. Nonetheless, the computational cost of MDS can become substantial due to the need for multiple replications. Similar to the knockoff filter, the DS approach has been extended beyond linear models to logistic regression [25] and Cox regression [26].

Motivated by the limitations of existing methods for FDR control without \(p\)-values—including power reduction caused by perturbation of the original data (through concatenation of knockoff features or data splitting) and high computational cost—we propose a novel framework called SyNPar (Synthetic Null Parallelism). This framework offers two key advantages over existing methods: (1) it achieves FDR control with high power by preserving the integrity of the original data, and (2) it is computationally efficient. We focus on high-dimensional feature selection under the following statistical model: \[\label{eq:model} \mathbf{y}\sim F( \cdot \mid \mathbf{X}; \,\boldsymbol{\beta}^*, \boldsymbol{\nu}^*), \tag{1}\] where \(\mathbf{y}\in \mathbb{R}^n\) represents the observable response, and \(\mathbf{X}\in \mathbb{R}^{n \times p}\) is the design matrix, with each row corresponding to an observation and each column representing a feature. Model 1 represents a fixed design, as it concerns the randomness of \(\mathbf{y}\) conditional on \(\mathbf{X}\). The coefficient vector \(\boldsymbol{\beta}^* \in \mathbb{R}^p\) represents the parameters of interest and captures the effects of the features on the response. The term \(\boldsymbol{\nu}^* \in \mathbb{R}^d\) contains the nuisance parameters, which account for additional model structure or potential sources of variability that are not the main focus of inference. Here, \(d\) can either be finite, \(d < \infty\), indicating a parametric model, or infinite, \(d = \infty\), representing the inclusion of a non-parametric component. In Section 2, we demonstrate that the linear model, generalized linear model, Cox regression model, and Gaussian graphical model can all be expressed in the form of Model 1.

In the context of the feature selection problem, we define the set of indices corresponding to the non-zero elements of \(\boldsymbol{\beta}^*\) as the signal set, denoted by \(\mathcal{S}\). Conversely, the set of indices where the coefficients of \(\boldsymbol{\beta}^*\) are zero is denoted by \(\mathcal{S}_0\). Our goal is to provide an estimate \(\widehat{\mathcal{S}}\) of the signal set \(\mathcal{S}\), while controlling the FDR, defined as \(\text{FDR} = \mathbb{E}\left[{V}/{\max\{R,1\}} \right],\) where \(V = \#\{\widehat{\mathcal{S}} \cap \mathcal{S}_0\}\) represents the number of false positives among the selected features, and \(R = \#\hat{\mathcal{S}}\) is the total number of selected features, and \({V}/{\max\{R,1\}}\) is referred to as the false discovery proportion (FDP).
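The FDR, FDP, and power defined above can be computed directly from a selected index set; a minimal sketch in Python (the function name and 0-based indexing are our own conventions):

```python
import numpy as np

def fdp_and_power(selected, signal_set, p):
    """Compute the false discovery proportion and power of a selection.

    selected:   indices in the estimated signal set S-hat
    signal_set: indices of the true non-zero coefficients S
    p:          total number of features (indices 0 .. p-1)
    """
    selected = set(selected)
    signal = set(signal_set)
    null_set = set(range(p)) - signal          # S_0: true null features
    V = len(selected & null_set)               # false positives
    R = len(selected)                          # total discoveries
    fdp = V / max(R, 1)                        # FDP = V / max(R, 1)
    power = len(selected & signal) / max(len(signal), 1)
    return fdp, power
```

The FDR is then the expectation of the FDP over repeated draws of the data.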

The SyNPar method generates synthetic null data from a “null model” \(F(\cdot \mid \mathbf{X}; \, \boldsymbol{\beta}_0, \boldsymbol{\nu}^*)\), where \(\boldsymbol{\beta}_0\) represents the parameters of interest under the null hypothesis. It then estimates the parameters of interest on both the synthetic and original data in parallel using the same estimation procedure. By comparing the parameters estimated from the null data to those obtained from the original data, SyNPar effectively detects false positives, serving as a numerical analog of a likelihood ratio test. A detailed algorithm for SyNPar is provided in Section 2. The SyNPar method is computationally efficient, making it particularly suitable for high-dimensional data analysis. Our contributions are as follows: (1) We introduce SyNPar, a conceptually novel and computationally efficient method for FDR control in high-dimensional feature selection, which achieves high power by preserving the integrity of the original data. (2) We evaluate SyNPar through extensive simulations and real data applications, comparing it with existing methods, including the Fixed-X knockoff, Model-X knockoff, GM, DS, and MDS, demonstrating its superior performance in terms of FDR control and statistical power. (3) We provide theoretical guarantees, showing that SyNPar controls the FDR at any desired level and achieves asymptotically optimal power under mild conditions on the tail behavior and weak dependence of the estimated coefficients.

We compare SyNPar with the knockoff filter, GM, and DS methods conceptually from two perspectives: their approach to creating contrasts from the original data and their strategy for fitting the model to the data. Table 1 summarizes the comparison. Both SyNPar and the knockoff filter generate synthetic data containing “null features” that have no effect on the response. However, they differ in how the model is fitted: SyNPar fits the model to the original and synthetic data in parallel, while the knockoff filter concatenates the original and knockoff features and fits the model to the concatenated data. In contrast, GM and DS do not generate synthetic data. Instead, they perturb the original data or split it into two datasets, fitting the model to these datasets in parallel.

Table 1: Comparison of SyNPar with the knockoff filter, GM, and DS methods.
                    Model Fitting to      Model Fitting to
                    Parallel Data         Concatenated Data
Data Synthesis      SyNPar                Knockoffs
Data Perturbation   GM
Data Splitting      DS

2 Method: Synthetic Null Parallelism (SyNPar)

The core idea of SyNPar involves generating synthetic null data and applying the same model fitting approach to both the original and synthetic null data in parallel to estimate the parameters of interest about feature importance. The parameter estimates from the synthetic null data serve as the negative control to those from the original data to identify important features with FDR control.

Definition 1 (Synthetic null data). The synthetic null data used in SyNPar retains the same design matrix \(\mathbf{X}\) and samples a synthetic null response \(\tilde{\mathbf{y}}\) from the null model: \[\label{eq:SyNPar} \tilde{\mathbf{y}} \sim F(\cdot \mid \mathbf{X}; \, \boldsymbol{\beta}_0, \hat{\boldsymbol{\nu}})\,, \qquad{(1)}\] where \(\hat{\boldsymbol{\nu}}\) is the nuisance parameter estimated jointly with the coefficient vector \(\hat{\boldsymbol{\beta}}\) from the original data \(\{\mathbf{y}, \mathbf{X}\}\). The vector \(\boldsymbol{\beta}_0\) represents the coefficient vector specified under the null hypothesis. For instance, \(\boldsymbol{\beta}_0 = (0, \dots, 0)^\top\) corresponds to the complete null hypothesis, where no features have an effect. Alternatively, \(\boldsymbol{\beta}_0\) with its \(j\)-th element specified as zero represents the individual null hypothesis that the \(j\)-th feature has no effect.

Let \(\mathcal{E}(\cdot, \cdot)\) denote an estimation procedure for \(\boldsymbol{\beta}\) such that \(\hat{\boldsymbol{\beta}} = \mathcal{E}(\mathbf{y}, {\mathbf{X}})\) estimates \(\boldsymbol{\beta}^*\), and \(\tilde{\boldsymbol{\beta}} = \mathcal{E}(\tilde{\mathbf{y}}, {{\mathbf{X}}})\) estimates \(\boldsymbol{\beta}_0\). Our goal is to use \(\tilde{\boldsymbol{\beta}}\) as a negative control for \(\hat{\boldsymbol{\beta}}\) to facilitate feature selection with FDR control. Selected features are those with large absolute coefficients in \(\hat{\boldsymbol{\beta}}\). Ideally, for any null feature \(j\) with \(\beta_j^* = 0\), we expect \(|\hat{\beta}_j|\) to be of similar or smaller magnitude than \(|\tilde{\beta}_j|\). This allows us to conclude that a non-zero \(|\hat{\beta}_j|\) is not significant enough to reject the null hypothesis \(\beta_j^* = 0\). Formally, we require \({\rm I}\kern-0.18em{\rm P}(|\hat{\beta}_j| \geq t) \leq {\rm I}\kern-0.18em{\rm P}(|\tilde{\beta}_j| \geq t)\), which implies \[\label{eq:gamma95intu} {\rm I}\kern-0.18em{\rm E}\left[\#\{j: j \in \mathcal{S}_0, |\hat{\beta}_j| \geq t\}\right] \leq {\rm I}\kern-0.18em{\rm E}\left[\#\{j: |\tilde{\beta}_j| \geq t\}\right]. \tag{2}\] With this inequality, the FDP can be approximated as \(\frac{\#\{ j : |\tilde{\beta}_j| \geq t \}}{\max\left( \#\{ j : |\hat{\beta}_j| \geq t \}, 1 \right)}\). However, achieving this inequality is not straightforward due to the differences between \(\tilde{\mathbf{y}}\) and \(\mathbf{y}\), which result in differences between \(\hat{\boldsymbol{\beta}}\) and \(\tilde{\boldsymbol{\beta}}\). Following Definition 1, we propose two approaches for generating \(\tilde{\mathbf{y}}\): one based on the complete null hypothesis and the other based on individual null hypotheses, each corresponding to a single feature. We refer to these approaches as “SyNPar (complete)” and “SyNPar (individual),” respectively.

2.1 Comparison of SyNPar (complete) and SyNPar (individual)

SyNPar (complete) is computationally efficient, requiring only a single synthetic null dataset generated under the complete null hypothesis \(H_0: \boldsymbol{\beta}^* = 0\) and a single model fitting for that dataset. In contrast, SyNPar (individual) is computationally intensive because it generates \(p\) synthetic null datasets, each corresponding to one of the \(p\) individual null hypotheses \(H_{0j}: \beta_{j}^* = 0\) for \(j = 1, \ldots, p\), and performs \(p\) separate model fittings on these datasets. While SyNPar (individual) is conceptually ideal, as it aligns with the individual null hypotheses that define the feature selection problem, its computational demands make it impractical. This parallels the distinction between GM and DS—GM perturbs one feature at a time, requiring \(2p\) separate model fittings, whereas DS splits the data into two halves once, requiring just two model fittings.

For SyNPar (complete), we set \(\boldsymbol{\beta}_0 = (0, \dots, 0)^{\mkern-1.5mu\mathsf{T}}\), with its detailed procedure described in Algorithm 1 and Section 3. On the other hand, SyNPar (individual) generates synthetic null data for the \(j\)-th feature by setting \(\boldsymbol{\beta}_0 = \boldsymbol{\beta}_0^j := \left(\hat{\boldsymbol{\beta}}_{1:(j-1)}^{-j}, \, 0, \, \hat{\boldsymbol{\beta}}_{j:(p-1)}^{-j}\right)^{\mkern-1.5mu\mathsf{T}}\), where \(\hat{\boldsymbol{\beta}}^{-j} = \left(\hat{\boldsymbol{\beta}}_{1:(j-1)}^{-j}, \, \hat{\boldsymbol{\beta}}_{j:(p-1)}^{-j}\right)^{\mkern-1.5mu\mathsf{T}}\) is the estimated coefficient vector based on \(\mathbf{y}\) and the design matrix \(\mathbf{X}^{-j}\), which excludes the \(j\)-th feature. Synthetic null data \(\tilde{\mathbf{y}}^j\) is then generated based on \(\boldsymbol{\beta}_0^j\), and the \(j\)-th negative control coefficient \(\tilde{\beta}_j\), corresponding to \(\hat{\beta}_j\), is extracted as the \(j\)-th element of \(\mathcal{E}(\tilde{\mathbf{y}}^j, \mathbf{X})\). Repeating this procedure for \(j = 1, \ldots, p\), the data-driven threshold for SyNPar (individual) is determined as the smallest \(t > 0\) satisfying the inequality \(\frac{\#\{ j : |\tilde{\beta}_j| \geq t \}}{\max\left( \#\{ j : |\hat{\beta}_j| \geq t \}, 1 \right)} \leq q\), where \(q\) is the target FDR level. The detailed procedure for SyNPar (individual) is provided in Algorithm 8 (Appendix B).
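The construction of \(\boldsymbol{\beta}_0^j\) amounts to inserting a zero at position \(j\) into the leave-one-out estimate \(\hat{\boldsymbol{\beta}}^{-j}\); a minimal sketch (the helper name and 0-based indexing are our own):

```python
import numpy as np

def null_beta_for_feature(beta_minus_j, j):
    """Build beta_0^j by inserting a zero at position j into the
    leave-one-out coefficient estimate beta^{-j} (length p - 1),
    yielding a length-p null coefficient vector."""
    return np.insert(np.asarray(beta_minus_j, dtype=float), j, 0.0)
```

The resulting vector is then passed to the null model to generate \(\tilde{\mathbf{y}}^j\), and the \(j\)-th element of \(\mathcal{E}(\tilde{\mathbf{y}}^j, \mathbf{X})\) serves as the negative control \(\tilde{\beta}_j\).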

Here, we numerically compare SyNPar (complete) and SyNPar (individual) to assess whether SyNPar (complete) achieves satisfactory performance in FDR control and power under the linear model \(\mathbf{y}= \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}\sim \mathcal{N}(0, \mathbf{I})\), using the following simulation.

Simulation Setting 1. We set \(n = 300\) and \(p = 200\). The design matrix \(\mathbf{X}\) consists of i.i.d. rows and AR(1) columns, generated from \(\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma}\) is a Toeplitz matrix with an autocorrelation parameter \(\rho \in (0,1)\), representing the correlation between two adjacent features in \(\mathbf{X}\). The first \(30\) elements of the coefficient vector \(\boldsymbol{\beta}^*\) are assigned values with amplitude \(A = 0.3\) and random signs, while the remaining \(170\) elements are set to zero.
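Simulation Setting 1 can be generated as follows; a sketch in Python (the helper name, seed handling, and use of scipy's `toeplitz` are our own choices):

```python
import numpy as np
from scipy.linalg import toeplitz

def simulate_setting1(n=300, p=200, s=30, amplitude=0.3, rho=0.5, seed=0):
    """Generate (X, y, beta_star) under Simulation Setting 1:
    rows of X are i.i.d. N(0, Sigma) with Toeplitz AR(1) covariance
    Sigma_{jk} = rho^{|j-k|}; the first s coefficients have amplitude
    A with random signs, the rest are zero; y = X beta + N(0, I) noise."""
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(rho ** np.arange(p))          # AR(1) covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:s] = amplitude * rng.choice([-1.0, 1.0], size=s)
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```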

The estimation procedure \(\mathcal{E}(\cdot, \cdot)\) is the LASSO with the regularization parameter selected by 10-fold cross-validation. Prior to applying the LASSO, we center and scale the columns of \(\mathbf{X}\) and center the response \(\mathbf{y}\). We compare the power and FDR of SyNPar (complete) and SyNPar (individual) at various \(\rho\) values under a target FDR of \(q = 0.1\). Each setting is evaluated using 100 replications. The results, summarized in Table 2, show that both methods perform similarly in terms of power and FDR, but SyNPar (complete) is computationally more efficient. Excluding the cross-validation time for determining the regularization parameter, SyNPar (complete) requires 0.03 seconds, compared to 1.96 seconds for SyNPar (individual). This computational advantage becomes more significant as \(p\) increases. Interestingly, as \(\rho\) increases, SyNPar (individual) initially outperforms SyNPar (complete) in power but later underperforms. Identifying the crossover point between the two approaches with respect to \(\rho\) could be a valuable theoretical topic for future research. Given its computational efficiency and strong performance, SyNPar (complete) is used as the default version of SyNPar in the following sections.

Table 2: Comparison of power and FDR at different \(\rho\) values, with a target FDR of \(q=0.1\).
\(\rho\)              0.0     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9

Power
  Indiv             0.961   0.950   0.941   0.908   0.865   0.769   0.643   0.506   0.330   0.186
  Compl             0.952   0.939   0.932   0.900   0.859   0.792   0.680   0.558   0.419   0.278
  Compl \(-\) Indiv -0.009  -0.011  -0.009  -0.008  -0.006   0.023   0.037   0.052   0.089   0.092

FDR
  Indiv             0.085   0.088   0.088   0.080   0.076   0.061   0.064   0.057   0.046   0.037
  Compl             0.068   0.062   0.071   0.065   0.077   0.077   0.079   0.083   0.093   0.075

2.2 SyNPar assumptions and algorithm

To ensure the false discovery rate (FDR) control of SyNPar using \(\hat{\boldsymbol{\beta}} = \mathcal{E}(\mathbf{y}, \mathbf{X})\) and \(\tilde{\boldsymbol{\beta}} = \mathcal{E}(\tilde{\mathbf{y}}, \mathbf{X})\), where \(\tilde{\mathbf{y}}\) is generated under the complete null hypothesis, we introduce an estimation error correction factor \(\gamma_{n,p}\) to model the underlying relationship between \(\hat{\boldsymbol{\beta}}\) and \(\tilde{\boldsymbol{\beta}}\). This factor is incorporated to ensure the validity of the inequality in \((\ref{eq:gamma95intu})\), which is critical for achieving FDR control. A straightforward approach is to define the factor based on \(\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \|_{\infty}\) and \(\|\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}_0 \|_{\infty}\), quantifying the maximum deviations of the two estimators from the respective true parameters. This motivates the following assumption.

Assumption 1 (High-probability upper bound on estimation error). If the nuisance parameter estimator \(\hat{\boldsymbol{\nu}}\) lies within a compact set with probability approaching one as \(n\) and \(p\) increase, assume that \[{\rm I}\kern-0.18em{\rm P}\left( \left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty} \geq \gamma_{n,p} \right) = o(p^{-1}) \quad \text{and} \quad {\rm I}\kern-0.18em{\rm P}\left( \left\|\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}_0 \right\|_{\infty} \geq \gamma_{n,p} \right) = o(p^{-1}), \] where \(\gamma_{n,p}\) is the estimation error correction factor.

Ensuring that the nuisance parameter estimator \(\hat{\boldsymbol{\nu}}\) lies within a compact set is straightforward by projecting \(\hat{\boldsymbol{\nu}}\) onto a pre-specified compact set. This condition ensures that no singularities arise when generating synthetic null data \(\tilde{\mathbf{y}}\). Essentially, Assumption 1 requires that \(\left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty}\) and \(\left\|\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}_0 \right\|_{\infty}\) satisfy a high-probability upper bound of \(\gamma_{n,p}\). In many statistical models, such as those in Examples 1–4, this bound can be established using high-dimensional statistical tools, such as concentration inequalities and empirical process theory. With \(\gamma_{n,p}\), to achieve the inequality in 2 , we modify \(\tilde{\boldsymbol{\beta}}\) as follows: \[\label{eq:perturb} |\tilde{\beta}_j^\prime| = |\tilde{\beta}_j| + \gamma_{n,p}, \; j = 1, \ldots, p\,. \tag{3}\] This modification is sufficient to ensure that 2 holds. Specifically, for \(j \in \mathcal{S}_0\), we have \(\beta^*_j = \beta_{0j} = 0\). Consequently, \(|\hat{\beta}_j|\) and \(|\tilde{\beta}_j|\) are both bounded by \(\gamma_{n,p}\) with high probability, as stated in Assumption 1. Therefore, it follows that \(|\hat{\beta}_j| \leq |\tilde{\beta}^\prime_j|\) with high probability. Then, each \(|\hat{\beta}_j|\) would be compared against \(|\tilde{\beta}_j^\prime|\) to decide if the \(j\)-th feature is a discovery. While this modification may make SyNPar more conservative, our simulation results across all four examples (Examples 1–4) demonstrate that SyNPar achieves the highest power compared to state-of-the-art approaches, including the knockoff and DS methods. Nevertheless, further research could focus on optimizing the bias term to enhance power.

Consider selecting the features with \(|\hat{\beta}_j| \ge t\); the FDP is then: \[\text{FDP}(t) = \frac{\#\{j: j \in \mathcal{S}_0, |\hat{\beta}_j| \geq t\}}{\max\left( \#\{ j : |\hat{\beta}_j| \geq t \}, 1 \right)}\,, \] and is expected to be bounded from above with high probability by: \[\label{eq:FDP} \widehat{\text{FDP}}(t) = \frac{\#\{ j : |\tilde{\beta}_j^\prime| \geq t \}}{\max\left( \#\{ j : |\hat{\beta}_j| \geq t \}, 1 \right)}\,, \tag{4}\] since the numerator of \(\widehat{\text{FDP}}(t)\) overestimates the unobservable numerator of \(\text{FDP}(t)\). Based on this rationale, SyNPar determines the threshold for \(|\hat{\beta}_j|\) as \(\tau_q = \min \{ t > 0 : \widehat{\text{FDP}}(t) \leq q \}\), where \(q\) represents the target FDR level, and selects the features in \(\widehat{\mathcal{S}} = \{ j : |\hat{\beta}_j| \geq \tau_q \}.\) The SyNPar procedure is summarized in Algorithm 1.

Algorithm 1: Feature Selection via SyNPar
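The threshold search in Algorithm 1 can be sketched as follows, scanning the observed \(|\hat{\beta}_j|\) values as candidate thresholds (an implementation assumption; any \(t\) between consecutive magnitudes yields the same selection):

```python
import numpy as np

def synpar_threshold(beta_hat, beta_null_prime, q):
    """Find tau_q = min{t > 0 : FDP-hat(t) <= q}, where
    FDP-hat(t) = #{j : |beta'_null_j| >= t} / max(#{j : |beta_hat_j| >= t}, 1),
    and return (tau_q, selected indices {j : |beta_hat_j| >= tau_q})."""
    abs_hat = np.abs(np.asarray(beta_hat))
    abs_null = np.abs(np.asarray(beta_null_prime))
    for t in np.sort(abs_hat[abs_hat > 0]):   # increasing candidate thresholds
        num = np.sum(abs_null >= t)           # estimated false-positive count
        den = max(int(np.sum(abs_hat >= t)), 1)
        if num / den <= q:
            return t, np.flatnonzero(abs_hat >= t)
    return np.inf, np.array([], dtype=int)    # no threshold attains level q
```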

An alternative approach to estimate the FDP is based on \(W_j = |\hat{\beta}_j| - |\tilde{\beta}_j^\prime|\), defined as: \[\label{eq:FDP2} \widehat{\text{FDP}}(t) = \frac{\#\{ j : W_j \leq -t \}}{\max\left( \#\{ j : W_j \geq t \}, 1 \right)}\,, \tag{5}\] which is widely used in the literature [11], [14], [27] and is applicable to SyNPar. However, it is important to note that, compared to \(|\hat{\beta}_j|\), \(W_j\) incorporates additional randomness from \(|\tilde{\beta}_j^\prime|\). By replacing \(\widehat{\text{FDP}}(t)\) in the definition of \(\tau_q\) in Algorithm 1 with 5 and selecting features in \(\widehat{\mathcal{S}}(\tau_q) = \{ j : W_j > \tau_q \}\), we define this SyNPar variant as SyNPar-Diff, where “Diff” refers to the difference \(W_j\). In our simulation studies (Section 3.1), we compare the performance of SyNPar with that of SyNPar-Diff. The results show that the FDR control and power achieved by SyNPar-Diff are slightly inferior to those achieved by SyNPar, supporting the choice of using the FDP estimate in 4 for SyNPar.
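The SyNPar-Diff variant applies the same threshold scan to the statistics \(W_j\); a sketch under the same implementation assumptions (candidates are the nonzero \(|W_j|\) values):

```python
import numpy as np

def synpar_diff_threshold(beta_hat, beta_null_prime, q):
    """SyNPar-Diff: with W_j = |beta_hat_j| - |beta'_null_j|, find the
    smallest t with #{j : W_j <= -t} / max(#{j : W_j >= t}, 1) <= q,
    and select {j : W_j > t} (note the strict inequality)."""
    W = np.abs(np.asarray(beta_hat)) - np.abs(np.asarray(beta_null_prime))
    for t in np.sort(np.abs(W[W != 0])):      # increasing candidate thresholds
        num = np.sum(W <= -t)                 # estimated false-positive count
        den = max(int(np.sum(W >= t)), 1)
        if num / den <= q:
            return t, np.flatnonzero(W > t)
    return np.inf, np.array([], dtype=int)
```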

Next, we provide theoretical guarantees for the FDR control and power of SyNPar.

Assumption 2 (Weak dependence among features). Given a target FDR level \(q \in (0,1)\) and the threshold \(\tau_q\) in Algorithm 1, assume that \[\mathrm{Var}\left(\sum_{j=1}^{p} \mathbb{I}(|\hat{\beta}_j| \geq \tau_q)\right) \big/ p^2 \rightarrow 0 \quad \text{and} \quad \mathrm{Var}\left(\sum_{j=1}^{p} \mathbb{I}( |\tilde{\beta}_j^\prime| \geq \tau_q )\right) \big/ p^2 \rightarrow 0 \quad \text{as } n, p \to \infty.\]

Theorem 1. Under Assumptions 1 and 2, given a target FDR level \(q \in (0,1)\), the threshold \(\tau_q\), and the selected feature set \(\widehat{\mathcal{S}}(\tau_q)\) from Algorithm 1, as \(n, p \to \infty\), SyNPar satisfies: \[\mathrm{FDR}(\tau_q) = \mathbb{E} \left[ \frac{\#\left\{\widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}_0 \right\}}{\max\{\#\widehat{\mathcal{S}}(\tau_q), 1\}} \right] \leq q + o(1)\,. \] Furthermore, if \(\min_{j \in \mathcal{S}} |{\beta}^*_{j}| > 3\gamma_{n,p}\), then \[\mathrm{Power}(\tau_q) = \mathbb{E} \left[ \frac{\#\left\{ \widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}\right\}}{\# \mathcal{S}} \right] = 1 - o(1)\,. \]

Theorem 1 provides a theoretical guarantee for controlling the FDR in SyNPar. Furthermore, it establishes that when the minimum signal strength satisfies \(\min_{j \in \mathcal{S}} |{\beta}^*_{j}| > 3\gamma_{n,p}\), the power of SyNPar approaches 1 as \(n\) and \(p\) tend to infinity. In other words, under Assumption 1, which ensures that the estimation procedure for \(\boldsymbol{\beta}\) is reliable, and Assumption 2, which guarantees that the \(p\) features exhibit weak dependence, SyNPar effectively controls the FDR in feature selection and achieves an asymptotic power of 1 when the minimum signal strength is sufficiently large.

3 SyNPar for High-Dimensional Linear Models

In this section, we outline the specific steps for applying SyNPar to perform feature selection in a high-dimensional linear model, \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}\). A crucial step in this process is estimating the distribution of \(\boldsymbol{\varepsilon}\) from the original data \(\{\mathbf{y}, \mathbf{X}\}\) to enable synthetic null data generation. For instance, the distribution of \(\boldsymbol{\varepsilon}\) can be assumed to be \(\mathcal{N}(0, \boldsymbol{\Phi})\), where \(\boldsymbol{\Phi}\) may be a diagonal matrix, a sparse matrix, or another structured covariance matrix.

Definition 2 (Synthetic null data for a linear model). For a linear model \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}\sim \mathcal{N}(0, \sigma^2\mathbf{I})\), SyNPar defines \(\tilde{\mathbf{y}} \in \mathbb{R}^n\) as \(\tilde{\mathbf{y}} = {\mathbf{X}} \boldsymbol{\beta}_0 + \tilde{\boldsymbol{\varepsilon}} = \tilde{\boldsymbol{\varepsilon}},\) where \(\boldsymbol{\beta}_0 = (0, \dots, 0)^{\mkern-1.5mu\mathsf{T}}\in \mathbb{R}^p\) is the coefficient vector under the complete null hypothesis, and \(\tilde{\boldsymbol{\varepsilon}} \sim \mathcal{N}(0, \hat{\sigma}^2\mathbf{I})\), where \((\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)\) are estimates of \((\boldsymbol{\beta}^*, \sigma^2)\) from the original data \(\{\mathbf{y}, \mathbf{X}\}\).

We consider using the LASSO as the estimation procedure for \(\boldsymbol{\beta}\): \[\hat{\boldsymbol{\beta}}= \operatorname*{argmin}_{\boldsymbol{\beta}\in \mathbb{R}^{p}} \frac{1}{2n} \| \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_n \|\boldsymbol{\beta}\|_1 \quad \text{and} \quad \tilde{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}\in \mathbb{R}^{p}} \frac{1}{2n} \| \tilde{\mathbf{y}} - {\mathbf{X}} \boldsymbol{\beta}\|_2^2 + \lambda_n \|\boldsymbol{\beta}\|_1, \] where \(\lambda_n\) is the same regularization parameter applied to both the original data and the synthetic null data. Other estimation procedures, such as the elastic net and SCAD, can also be used. Here, we focus on the LASSO for simplicity.
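Definition 2 and the parallel LASSO fits can be sketched with scikit-learn, whose `Lasso` objective \(\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1\) matches the formulation above with \(\alpha = \lambda_n\). The residual-based \(\hat{\sigma}\) and the correction factor with \(\kappa = 0.2\hat{\sigma}\) (Lemma 1) are included; details of the authors' implementation may differ:

```python
import numpy as np
from sklearn.linear_model import Lasso

def synpar_linear_fit(X, y, lam, kappa_scale=0.2, seed=0):
    """Fit the LASSO on the original data, generate synthetic null data
    (Definition 2), fit the same LASSO on the null data in parallel, and
    return the corrected negative-control magnitudes |beta_tilde| + gamma."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    lasso = Lasso(alpha=lam)                       # alpha plays the role of lambda_n
    beta_hat = lasso.fit(X, y).coef_.copy()
    resid = y - lasso.predict(X)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / n)    # noise scale from residuals
    y_tilde = rng.normal(0.0, sigma_hat, size=n)   # null response: X beta_0 = 0
    beta_tilde = Lasso(alpha=lam).fit(X, y_tilde).coef_
    # Estimation error correction factor (Lemma 1), kappa = 0.2 * sigma_hat:
    gamma = kappa_scale * sigma_hat * (lam + np.sqrt(np.log(p) / n))
    return beta_hat, np.abs(beta_tilde) + gamma, sigma_hat
```

The returned pair \((\hat{\boldsymbol{\beta}}, |\tilde{\boldsymbol{\beta}}^\prime|)\) is then passed to the thresholding step of Algorithm 1.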

Lemma 1. Under the conditions specified in Theorem 1 of [28], Assumption 1 holds for the LASSO estimator with \(\gamma_{n,p} = \kappa \left( \lambda_n + \sqrt{\frac{\log p}{n}} \right)\), where \(\kappa\) is a constant.

In our implementation of SyNPar, the hyperparameter \(\kappa\) in Lemma 1 is set to \(0.2 \hat{\sigma}\), where \(\hat{\sigma}\) is the estimate of \(\sigma\) derived from the residuals of the LASSO. In Table 7, we investigate the influence of \(\kappa\) on the FDR and power of SyNPar. The results show that the performance of SyNPar is robust to the choice of \(\kappa\).

While SyNPar assumes a simple distribution \(\mathcal{N}(0, \sigma^2\mathbf{I})\) for the noise term \(\boldsymbol{\varepsilon}\) in synthetic null data generation, incorporating the estimation error correction factor \(\gamma_{n,p}\) enhances SyNPar's robustness to misspecification of the distribution of \(\boldsymbol{\varepsilon}\). We verify that when \(\hat{\sigma}\) deviates from the true value \(\sigma = 1\) to varying degrees, SyNPar consistently maintains FDR control whenever \(\hat{\sigma} > \sigma\), i.e., when the estimated error distribution is more dispersed than the truth and thus conservative (see Table 8).
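The selection step itself can be sketched as follows. We assume, for illustration, that the inflated null coefficient takes the additive form \(|\tilde{\beta}_j^\prime| = |\tilde{\beta}_j| + \gamma_{n,p}\) (consistent with \(|\tilde{\beta}_j^\prime| \geq \gamma_{n,p}\) in the proofs, though the exact definition is given elsewhere in the paper), and that \(\tau_q\) is the smallest threshold whose estimated FDP is at most \(q\):

```python
import numpy as np

def synpar_select(beta_hat, beta_tilde, gamma, q):
    """SyNPar selection sketch: choose the most liberal threshold tau_q
    whose estimated FDP, computed from the inflated null coefficients,
    is at most q, then select features with |beta_hat_j| >= tau_q."""
    null_stat = np.abs(beta_tilde) + gamma      # |beta_tilde'_j| (assumed form)
    obs = np.abs(beta_hat)
    # Candidate thresholds: the observed magnitudes, in increasing order.
    for t in np.sort(np.unique(obs)):
        fdp_hat = (null_stat >= t).sum() / max((obs >= t).sum(), 1)
        if fdp_hat <= q:                        # first (smallest) t wins
            return np.flatnonzero(obs >= t)     # selected feature indices
    return np.array([], dtype=int)              # nothing passes: select none
```

Scanning thresholds from small to large and stopping at the first one with \(\widehat{\mathrm{FDP}}(t) \leq q\) mirrors the definition \(\tau_q = \min\{t : \widehat{\mathrm{FDP}}(t) \leq q\}\) used in the proofs.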

3.1 Simulation results on a toy dataset

In this subsection, we evaluate the performance of SyNPar in controlling the FDR and power on a toy dataset. We compare SyNPar with the \(p\)-value-free methods Fixed-X knockoff (Fixed-X), Model-X knockoff (Model-X), GM, DS, and MDS, as well as the \(p\)-value-based methods BH and BHq. The toy dataset is generated according to Simulation Setting 1, where \(n > p\); thus, the \(p\)-values for BH and BHq are computed using \(t\)-tests based on OLS.

SyNPar is compared with other methods in terms of FDR and power across varying feature correlation (\(\rho\)) values under a target FDR of \(q = 0.1\). The results, summarized in Figure 2, indicate that SyNPar achieves the highest power while effectively controlling the FDR, particularly in high-correlation scenarios. The knockoff filters (Fixed-X and Model-X) demonstrate conservative behavior, maintaining FDR control at the expense of reduced power. The DS, MDS, and \(p\)-value-based BH and BHq methods show slightly higher power than the knockoff filters but remain less powerful than SyNPar. The GM method shows a minor violation of FDR control and achieves power levels still lower than SyNPar. As reported in Section 2, SyNPar-Diff performs slightly worse than SyNPar, especially in high-correlation scenarios.

Figure 2: Empirical FDRs and power under Simulation Setting 1.

Table 3: Comparison of running times (s) under Simulation Setting 1.
SyNPar SyNPar-Diff Model-X Fixed-X GM DS MDS BH BHq
2.28 2.26 15.82 10.85 42.10 0.87 25.01 0.42 0.39

Table 3 compares the running times of SyNPar (including the cross-validation time for selecting the regularization parameter) with those of other methods. Among them, BH and BHq are the fastest, while GM is the slowest. SyNPar, SyNPar-Diff, and DS exhibit comparable computational efficiency. Due to its inefficiency, we exclude the GM method from further comparisons in the following sections.

3.2 Simulation results on medium-scale datasets

In this subsection, we examine a medium-scale simulation scenario.

Simulation Setting 2. We set \(n = 2000\). The design matrix \(\mathbf{X}\) is generated as in Simulation Setting 1. We consider four simulation parameters for adjustment: (a) the autocorrelation parameter \(\rho \in [0, 0.9]\), (b) the signal amplitude \(A \in [0.15, 0.35]\), (c) the target FDR level \(q \in [0.05, 0.4]\), and (d) the number of features \(p \in \{500, 1000, \cdots, 3500\}\). For each scenario where one parameter varies, the remaining parameters are held constant as: \[\label{eq:para} \rho = 0.8, \, A = 0.25, \, q = 0.1, \, \text{and } \, p = 1000. \qquad{(2)}\] The first \(30\) elements of the coefficient vector \(\boldsymbol{\beta}^*\) are randomly assigned values with amplitude \(A\) and random signs, while the remaining \(p - 30\) elements are set to zero.
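Simulation Setting 2 can be generated as below. The AR(1) row covariance \(\Sigma_{ij} = \rho^{|i-j|}\) is an assumption here, since Simulation Setting 1's design is not restated in this section; the autocorrelation parameter \(\rho\) in the text suggests this standard form.

```python
import numpy as np

def simulate_setting2(n=2000, p=1000, rho=0.8, A=0.25, s=30, rng=None):
    """Generate (X, y, beta) following Simulation Setting 2.
    Assumes Gaussian rows with AR(1) covariance rho**|i-j|."""
    rng = np.random.default_rng(rng)
    # AR(1) covariance matrix: Sigma_ij = rho^|i - j|.
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # First s coefficients: amplitude A with random signs; rest zero.
    beta = np.zeros(p)
    beta[:s] = A * rng.choice([-1.0, 1.0], size=s)
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```

Varying `rho`, `A`, or `p` while holding the other arguments at their defaults reproduces the four scenarios in (a)–(d).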

For each scenario under Simulation Setting 2, we compare the FDR and power of the different methods, using 100 replications. In Appendix C, we also evaluate the Area Under the Precision-Recall Curve (AUPR) for each method. For certain scenarios where \(p\) is large, we exclude Fixed-X from the comparison because it requires \(n \geq 2p\), and BH and BHq due to the high computational cost of using the debiased LASSO for \(p\)-value calculation.

The empirical FDR and power of the different methods are presented in Figures 3–6. The AUPR results are provided in Appendix C. Overall, the FDRs of most methods remain controlled across all scenarios, except for DS and BH, which sometimes slightly lose control. In all scenarios, SyNPar consistently demonstrates reliable FDR control and, more importantly, achieves higher powers and AUPRs than other methods. Compared to SyNPar, SyNPar-Diff exhibits slightly lower power, particularly in more challenging scenarios such as those with high autocorrelations and low signal amplitudes.

Figure 3: Empirical FDRs and power vs. autocorrelation (\(\rho\)) under Simulation Setting 2.

Figure 4: Empirical FDRs and power vs. signal amplitude (\(A\)) under Simulation Setting 2.

Figure 5: Empirical FDRs and power vs. target FDR level (\(q\)) under Simulation Setting 2.

Figure 6: Empirical FDRs and power vs. number of features (\(p\)) under Simulation Setting 2.

In Figure 3, where we increase the autocorrelation \(\rho\) between features, SyNPar’s power decreases much more slowly than that of the other methods. This demonstrates that SyNPar is more robust to high correlations among features. In Figure 4, where we vary the amplitude \(A\), we observe that once \(A\) increases to \(0.3\), the power and AUPR of SyNPar reach \(1\) and remain constant. In Figure 5, when varying the target FDR level \(q\), SyNPar consistently achieves higher power across all FDR levels compared to the other methods, and SyNPar-Diff also maintains relatively high power except at \(q = 0.05\). Furthermore, we observe that as \(q\) increases, DS tends to lose control over its actual FDR. In Figure 6, where we vary the number of features \(p\), all methods exhibit a slight improvement in performance at \(p = 2000\).

Table 4: Comparison of running times (s) under the parameter setting (2) in Simulation Setting 2.
SyNPar Model-X Fixed-X DS MDS
5.72 40.92 62.92 2.53 86.56

Our proposed method, SyNPar, demonstrates significantly faster running times compared to the competing methods. Table 4 summarizes the running times under the specific setting described in (2).

4 Real Data Analysis

4.1 Data Description

In this section, we apply SyNPar to the time-to-labor dataset, a longitudinal dataset collected from pregnant women receiving antepartum and postpartum care at Stanford’s Lucile Packard Children’s Hospital [29], which was also analyzed by [30]. The dataset includes 63 participants in their second or third trimester of a singleton, uncomplicated pregnancy, with 1–3 samples per participant (median of 3), collected up to the onset of spontaneous labor. Each sample comprises 6348 features, including 3529 metabolites, 1317 plasma proteins, and 1502 single-cell immune features derived from blood mass cytometry. The dataset is divided into training and validation sets. The training set contains data from 53 women, totaling 150 samples, while the validation set includes 10 participants with 27 samples. Due to incomplete data in the validation set, our analysis focuses exclusively on the training data. For preprocessing, we combined the three omics datasets (single-cell proteomic, plasma proteomic, and metabolomic data) and removed features that were zero across all observations. After preprocessing, the final dataset contained \(n=150\) observations and \(p=6331\) features.

The performance of SyNPar, Model-X, and MDS was evaluated using three metrics: sparsity, goodness-of-fit, and efficiency. Sparsity reflects the number of selected features, with simpler models (fewer features) preferred when accuracy is maintained. Goodness-of-fit is measured by the adjusted R-squared value, indicating how well the model explains variability in the response variable while accounting for feature usage. Efficiency is assessed by the running time. Each metric was averaged over 70 replications to account for randomization.
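The goodness-of-fit metric above is the adjusted R-squared, which penalizes the plain \(R^2\) by the number of fitted features: \(\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-k-1}\). The paper does not restate this formula, so the standard definition is assumed in the sketch below.

```python
import numpy as np

def adjusted_r2(y, y_pred, k):
    """Adjusted R-squared for a fit using k features (plus an intercept).
    Penalizes R^2 so that adding uninformative features lowers the score."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ss_res = ((y - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()        # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```

A sparser selection (smaller `k`) is rewarded only when it preserves predictive fit, which is why sparsity and adjusted R-squared are reported jointly.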

4.2 Analysis Results

The LASSO regularization parameter \(\lambda_n\) is selected using cross-validation. The estimation error correction factor is set to \(\gamma_{n,p} = 0.5\left(\lambda_n + \sqrt{\frac{\log p}{n}}\right)\) for SyNPar, a choice justified by our simulation results. The FDR level is fixed at \(q = 0.1\) for all methods. Figure 7 compares the performance of the three methods. MDS selects no features in all 70 replications, while Model-X selects features in only 17 out of 70 replications. In contrast, SyNPar selects features in every replication, consistent with its high power in feature selection in simulation studies.


Figure 7: Performance metrics of feature selection methods. (a) Number of selected features for each method, indicating sparsity. (b) Adjusted R-squared value for model fit quality.

Table 5: Comparison of running times (s) for the onset-of-labor example.
SyNPar Model-X MDS
17.81 11432.09 421.47

Specifically, although Model-X achieves better sparsity, its adjusted R-squared value is inferior to that of SyNPar, indicating a loss of statistical power. Additionally, Table 5 highlights a significant limitation of Model-X knockoff: its running time. SyNPar demonstrates clear computational efficiency with a significantly shorter runtime, while Model-X requires approximately 650 times longer. Overall, our proposed method, SyNPar, consistently outperforms other methods.

Next, we extract the features selected by our proposed method with an occurrence frequency exceeding 50% across 70 replications. From a biological perspective, approaching labor is associated with a decreased responsiveness to inflammatory stimuli, such as the pSTAT1 signaling response to IFN\(\alpha\) and the STAT3 response to IL-2, IL-4, and IL-6 in natural killer (NK) cells [31]–[33]. In addition, several proteins consistently appeared in all 70 replications, indicating their significance in relation to labor: as labor approaches, placental-derived proteins (e.g., Siglec-6 [34]) and immune regulatory plasma proteins (e.g., IL-1R4 [35] and SLPI [36]) change in abundance. These findings are all consistent with those reported in [30]. In addition, we identified that growth and differentiation proteins (e.g., Activin A [37]) experience an increase, accompanied by a synchronized decrease in hCG [38], a placental-derived protein. Our proposed method, SyNPar, enables the identification of crucial biological features associated with the onset of labor, as shown in Table 6, which could be used to predict labor timing. This provides a scientific foundation for developing a blood-based diagnostic tool to anticipate the timing of labor.

Table 6: Coefficients of key features identified by SyNPar across omic categories.
Feature NK (STAT1, IFN-\(\alpha\)) NK (STAT3, IL-2, IL-4, IL-6) Siglec-6 IL-1R4
Coefficient 3.025 2.095 3.326 3.132
SLPI Activin A hCG
3.119 1.194 -4.454

5 Discussion

In this paper, we propose a novel method, SyNPar, to address FDR control in high-dimensional feature selection. Compared to the knockoff filter and data splitting, SyNPar preserves the original data, leading to higher statistical power and improved computational efficiency. SyNPar requires two key components: the ability to generate synthetic null data and an estimation procedure for the parameter of interest. Regarding data generation, SyNPar shares a close relationship with the parametric bootstrap. However, the key distinction lies in the data generation mechanism: SyNPar employs a synthetic null data approach to construct the negative control, whereas the parametric bootstrap uses the estimated model to generate synthetic data and construct confidence intervals. After generating null data, SyNPar identifies false positives by comparing the parameter estimates from the original data and the synthetic null data, functioning as a numerical analog of a likelihood ratio test. We refer to this combination of parametric bootstrap and numerical likelihood ratio test as a simulation-based inference framework, with SyNPar being a special case of this framework. SyNPar demonstrates the potential of simulation-based inference to serve as a novel and general framework for statistical analysis in future research.

Future work will focus on several promising directions. First, SyNPar is a versatile framework that can be adapted to various statistical models, such as quantile regression, mixed-effects models, generalized linear mixed-effects models, and generalized additive models. Future research will focus on extending SyNPar to these models and exploring its application to emerging topics in statistics, including post-selection inference and conformal prediction. Second, for negative control \(\tilde{\boldsymbol{\beta}}\) construction, this paper primarily relies on synthetic null data. However, null data and true data are inherently different, and the deeper theoretical relationship between the zero-coefficient estimates derived from them warrants further investigation. To address this discrepancy, we introduce the estimation error correction factor as a potential solution. Alternatively, synthetic data can be generated under the alternative hypothesis, i.e., the true model \(\mathbf{y}\sim F(\cdot \mid \mathbf{X}; \boldsymbol{\beta}^*, \boldsymbol{\nu}^*)\). This approach could leverage a sparse estimator of \(\boldsymbol{\beta}^*\) to generate synthetic alternative data, offering a complementary perspective on bias estimation. Lastly, this paper estimates the estimation error correction factor using \(\left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty}\), which requires theoretical derivation and a hyperparameter \(\kappa\). Different model structures, however, may yield varying estimates of \(\left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty}\). A promising future direction is the development of data-driven, simulation-based methods for estimating the estimation error correction factor. Such methods could eliminate the need for theoretical derivation and manual hyperparameter selection, enhancing the flexibility and robustness of SyNPar.

6 Proof of Theorem 1

Lemma 2. Under Assumption 1, for any null \(j\), there exists a set \(\mathcal{G}\) such that within \(\mathcal{G}\), \[|\hat{\beta}_j| \leq |\tilde{\beta}_j^\prime|,\] with \(\mathbb{P}\left(\mathcal{G}^{c}\right) = o(p^{-1})\).

Proof of Lemma 2. By Assumption 1, it follows that \[\label{eq:tail1} \mathbb{P} \left( \left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty} \geq \gamma_{n,p} \right) = o(p^{-1}).\tag{6}\] Define the event \(\mathcal{G} = \left\{\left\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^* \right\|_{\infty} < \gamma_{n,p} \right\}\). By the definition of \(\tilde{\beta}_j^\prime\), we have \[|\tilde{\beta}_j^\prime| \geq \gamma_{n,p}.\] Consequently, on the event \(\mathcal{G}\), it holds that \[|\hat{\beta}_j| \leq |\tilde{\beta}_j^\prime|.\] Thus, the proof is complete. ◻

Let \(a \vee b\) denote the maximum of \(a\) and \(b\).

Lemma 3. Under Assumption 1, the marginal false discovery rate (mFDR) is given by \[\text{mFDR}(\tau_q) = \frac{\mathbb{E} \left[\#\left\{\widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}_0 \right\}\right]}{\mathbb{E} \left[\max\{\#\widehat{\mathcal{S}}(\tau_q), 1\} \right]} \leq q + o(1).\]

Proof of Lemma 3. By definition, the mFDR is given by: \[\begin{align} \mathrm{mFDR}(\tau_q) & = \frac{\mathbb{E} \left(\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\hat{\beta}_j| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} \\ & \leq \frac{\mathbb{E} \left(\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} + o(1) \\ & \leq \frac{\mathbb{E} \left(\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} + o(1), \end{align}\] where the first inequality follows from Lemma 2, and the second inequality comes from the fact that \[\mathbb{E} \left(\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q\right\}\right) \leq \mathbb{E} \left(\#\left\{j : j \in [p] \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q\right\}\right).\] By the definition of \(\tau_q\), we have \[\frac{\#\{ j : |\tilde{\beta}_j^\prime| \geq \tau_q \}}{\max\left( \#\{ j : |\hat{\beta}_j| \geq \tau_q \}, 1 \right)} \leq q.\] It follows that \[\mathbb{E}\left[ \#\{ j : |\tilde{\beta}_j^\prime| \geq \tau_q \} \right] \leq q \mathbb{E}\left[ \max\left( \#\{ j : |\hat{\beta}_j| \geq \tau_q \}, 1 \right)\right].\] Thus, we conclude that \[\mathrm{mFDR}(\tau_q) \leq q + o(1),\] completing the proof. ◻

Lemma 4. Under Assumption 1–2, the usual false discovery rate (FDR) is also controlled: \[\mathbb{E} \left[ \frac{\#\left\{\hat{S}(\tau_q) \cap S_0 \right\}}{\max\{\#\hat{S}(\tau_q), 1\}} \right] \leq q + o(1).\]

Proof of Lemma 4. It is clear that \[\begin{align} \mathrm{FDR}(\tau_q) & =\mathbb{E} \left( \frac{\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\hat{\beta}_j| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 } \right)\\ & \leq\mathbb{E} \left( \frac{\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 }\right) + o(1) \\ & \leq \mathbb{E} \left(\frac{\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 }\right) + o(1), \end{align}\] where the first inequality follows from Lemma 2, and the second inequality comes from the fact that \[\#\left\{j : j \in \mathcal{S}_0 \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q\right\} \leq \#\left\{j : j \in [p] \text{ and } |\tilde{\beta}_j^\prime| \geq \tau_q\right\}.\] By Lemma 3, we know that \[\frac{\mathbb{E} \left(\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} \leq q + o(1).\] By Assumption 2 and Chebyshev’s inequality, we have \[\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\big/p \rightarrow \mathbb{E} \left(\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)\big/p \text{ in probability,}\] and \[\left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1\right) \big/p \rightarrow \mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)\big/p \text{ in probability.}\] By Slutsky’s theorem, we have \[\frac{\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 } \rightarrow \frac{\mathbb{E} \left(\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} \text{ in probability.}\] Since \[\frac{\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 } \leq q,\] it follows from the Dominated Convergence Theorem that \[\mathbb{E}\left( \frac{\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}}{\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 } \right) \rightarrow \frac{\mathbb{E} \left(\#\left\{j : |\tilde{\beta}_j^\prime| \geq \tau_q \right\}\right)}{\mathbb{E} \left(\#\left\{j : |\hat{\beta}_j| \geq \tau_q \right\} \vee 1 \right)} \leq q + o(1),\] which completes the proof. ◻

Lemma 5. Under the same assumptions as Theorem 1, we have \[\text{Power}(\tau_q) = \mathbb{E} \left[ \frac{\#\left\{ \widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}\right\}}{\#\mathcal{S}} \right] = 1 + o(1).\]

Proof of Lemma 5. Since \(\min_{j \in \mathcal{S}} |{\beta}^*_{j}| > 3\gamma_{n,p}\), by Assumption 1, it follows that \[\mathbb{P} \left( \min_{j \in \mathcal{S}} |\hat{\beta}_{j}| \leq 2\gamma_{n,p} \right) = o(1).\] Using Assumption 1 again, we have \[\mathbb{P} \left( \|\tilde{\beta}\|_{\infty} \geq \gamma_{n,p} \right) = o(1),\] which further implies that \[\mathbb{P} \left( \|\tilde{\beta}^\prime\|_{\infty} \geq 2\gamma_{n,p} \right) = o(1).\]

Now, define the event \[\mathcal{G}_2 = \left\{ \|\tilde{\beta}^\prime\|_{\infty} < 2\gamma_{n,p} \right\} \cap \left\{ \min_{j \in \mathcal{S}} |\hat{\beta}_{j}| > 2\gamma_{n,p} \right\}.\] It follows that \(\mathbb{P}(\mathcal{G}_2^c) = o(1)\). On the event \(\mathcal{G}_2\), with \(t^* = 2\gamma_{n,p}\), we have \(\widehat{\text{FDP}}(t^*) = 0\), which implies that \(\tau_q \leq t^*\). Consequently, we have \(\mathcal{S}\subset \widehat{\mathcal{S}}(\tau_q)\), meaning \[\frac{\#\left\{ \widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}\right\}}{\#\mathcal{S}} = 1.\] Thus, since the ratio is bounded above by \(1\), we obtain \[\mathbb{E} \left[ \frac{\#\left\{ \widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}\right\}}{\#\mathcal{S}} \right] \geq \mathbb{E} \left[ \frac{\#\left\{ \widehat{\mathcal{S}}(\tau_q) \cap \mathcal{S}\right\}}{\#\mathcal{S}}\mathbb{I}(\mathcal{G}_2) \right] = \mathbb{E} \left[ \mathbb{I}(\mathcal{G}_2) \right] = 1 - o(1),\] which completes the proof. ◻

Proof of Theorem 1. Combining Lemmas 3–5 yields Theorem 1. ◻

6.1 Proof of Lemma 1

Lemma 1 follows from Theorem 1 of [28].

7 Algorithm of SyNPar-Indiv

Figure 8: Feature Selection with SyNPar-Indiv

8 Supplemental Figures

Figure 9: Empirical AUPRs for the linear regression model (Correlation).

Figure 10: Empirical AUPRs for the linear regression model (Amplitude).

Figure 11: Empirical AUPRs for the linear regression model (Number of features).

9 Supplemental Tables

9.1 Robustness of SyNPar with respect to the \(\kappa\) parameter

We validate the robustness of SyNPar by presenting empirical results for the FDR, statistical power, and AUPR under the specified setting \[\rho = 0.8, \, A = 0.25, \, q=0.1, \text{ and } p = 1000,\] considering varying values of the parameter \(\kappa\). The parameter \(\kappa\) governs how conservative the FDR control is: while previous simulations were conducted with \(\kappa = 0.2 \hat{\sigma}\), we extend our analysis by systematically varying \(\kappa\) from \(0.0\) to \(0.4 \hat{\sigma}\). A larger \(\kappa\) value makes the method more conservative. As shown in Table 7, when \(\kappa = 0.0\), the FDR slightly exceeds the target level, highlighting the importance of the estimation error correction factor \(\gamma_{n,p}\) in controlling the FDR. Increasing \(\kappa\) improves FDR control, while the power of the method remains largely unaffected. Importantly, SyNPar consistently achieves high power levels across a wide range of \(\kappa\) values. These results demonstrate the robustness and adaptability of SyNPar under diverse parameter settings.

Table 7: Performance results under varying \(\kappa\) parameters. The target FDR is 0.1. The simulation setting is \(\rho = 0.8, A = 0.25, q=0.1, \text{ and } p = 1000\).
\(\kappa\) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40
FDR 0.179 0.135 0.100 0.071 0.054 0.042 0.034 0.026 0.017
Power 0.969 0.964 0.960 0.956 0.951 0.946 0.940 0.937 0.931

9.2 Robustness of SyNPar with respect to the nuisance parameter

We examine the effect of varying the parameter \(\hat{\sigma}\), which is used for generating synthetic null data. In the previous simulations, we estimate \(\sigma\) from the residuals of the LASSO estimator. Here, we instead fix \(\hat{\sigma}\) at values ranging from \(0.8\) to \(2.0\) to evaluate the performance of SyNPar. Table 8, under the setting \(\rho = 0.8, A = 0.35, p = 1000,\text{ and } n = 2000\), demonstrates that the FDR is controlled in most settings. When \(\hat{\sigma} = 0.8\), compared to the true setting (\(\sigma = 1\)), the signal-to-noise ratio of the null data increases, making it easier to estimate \(\boldsymbol{\beta}_0\) from the null data. This leads to an underestimation of the negative control, causing a slight increase in FDR. However, the power remains consistently high. As \(\hat{\sigma}\) increases further, the FDR decreases below the target level, even reaching \(0\), while the power remains stable. Larger \(\hat{\sigma}\) values make SyNPar more conservative because overestimating \(\sigma\) leads to an overestimation of the negative control.

Table 8: Performance results under varying \(\hat{\sigma}\) estimates for synthetic null data generation, with the true \(\sigma =1\) and the target FDR level of \(0.1\). The simulation setting is \(\rho = 0.8, A = 0.35, p = 1000,\text{ and }n =4000\).
\(\hat{\sigma}\) 0.8 1.0 1.2 1.4 1.6 1.8 2.0
FDR 0.260 0.043 0.004 0.001 0.000 0.000 0.000
Power 1.000 1.000 1.000 1.000 1.000 1.000 1.000

References

[1]
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288.
[2]
Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology 67(2), 301–320.
[3]
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association  96(456), 1348–1360.
[4]
Meinshausen, N. and P. Bühlmann (2010). Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology 72(4), 417–473.
[5]
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology 57(1), 289–300.
[6]
Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 1165–1188.
[7]
Benjamini, Y., A. M. Krieger, and D. Yekutieli (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93(3), 491–507.
[8]
Javanmard, A. and H. Javadi (2019). False discovery rate control via debiased lasso. Electronic Journal of Statistics 13, 1212–1253.
[9]
Ma, R., T. Tony Cai, and H. Li (2021). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association  116(534), 984–998.
[10]
Sur, P. and E. J. Candès (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences  116(29), 14516–14525.
[11]
Candes, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold:‘model-X’knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology 80(3), 551–577.
[12]
Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 2055–2085.
[13]
Barber, R. F., E. J. Candès, and R. J. Samworth (2020). Robust inference with knockoffs. The Annals of Statistics 48(3), 1409–1431.
[14]
Dai, C., B. Lin, X. Xing, and J. S. Liu (2023a). False discovery rate control via data splitting. Journal of the American Statistical Association  118(544), 2503–2520.
[15]
Romano, Y., M. Sesia, and E. Candès (2020). Deep knockoffs. Journal of the American Statistical Association  115(532), 1861–1872.
[16]
Jordon, J., J. Yoon, and M. van der Schaar (2018). KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations.
[17]
Bates, S., E. Candès, L. Janson, and W. Wang (2021). Metropolized knockoff sampling. Journal of the American Statistical Association  116(535), 1413–1427.
[18]
Fan, Y., L. Gao, and J. Lv (2023). Ark: Robust knockoffs inference with coupling. arXiv preprint arXiv:2307.04400.
[19]
Spector, A. and L. Janson (2022). Powerful knockoffs via minimizing reconstructability. The Annals of Statistics 50(1), 252–276.
[20]
Ren, Z., Y. Wei, and E. Candès (2023). Derandomizing knockoffs. Journal of the American Statistical Association  118(542), 948–958.
[21]
Ren, Z. and R. F. Barber (2024). Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology 86(1), 122–154.
[22]
Li, J. and M. H. Maathuis (2021). . Journal of the Royal Statistical Society Series B: Statistical Methodology 83(3), 534–558.
[23]
Li, D., J. Yu, and H. Zhao (2023). Coxknockoff: Controlled feature selection for the Cox model using knockoffs. Stat 12(1), e607.
[24]
Xing, X., Z. Zhao, and J. S. Liu (2023). Controlling false discovery rate using Gaussian mirrors. Journal of the American Statistical Association  118(541), 222–241.
[25]
Dai, C., B. Lin, X. Xing, and J. S. Liu (2023b). A scale-free approach for false discovery rate control in generalized linear models. Journal of the American Statistical Association  118(543), 1551–1565.
[26]
Ge, Y., S. Zhang, and X. Zhang (2024). False discovery rate control for high-dimensional cox model with uneven data splitting. Journal of Statistical Computation and Simulation  94(7), 1462–1493.
[27]
Ge, X., Y. E. Chen, D. Song, M. McDermott, K. Woyshner, A. Manousopoulou, N. Wang, W. Li, L. D. Wang, and J. J. Li (2021). Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biology 22, 1–29.
[28]
Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of lasso and dantzig estimators. Electronic Journal of Statistics 2, 90–102.
[29]
Stelzer, I. A., M. S. Ghaemi, X. Han, K. Ando, J. J. Hédou, D. Feyaerts, L. S. Peterson, K. K. Rumer, E. S. Tsai, E. A. Ganio, et al. (2021). Integrated trajectories of the maternal metabolome, proteome, and immunome predict labor onset. Science Translational Medicine 13(592), eabd9898.
[30]
Hédou, J., I. Marić, G. Bellan, J. Einhaus, D. K. Gaudillière, F.-X. Ladant, F. Verdonk, I. A. Stelzer, D. Feyaerts, A. S. Tsai, et al. (2024). Discovery of sparse, reliable omic biomarkers with stabl. Nature Biotechnology, 1–13.
[31]
Shah, N. M., A. A. Herasimtschuk, A. Boasso, A. Benlahrech, D. Fuchs, N. Imami, and M. R. Johnson (2017). Changes in T cell and dendritic cell phenotype from mid to late pregnancy are indicative of a shift from immune tolerance to immune activation. Frontiers in Immunology 8, 1138.
[32]
Kraus, T. A., S. M. Engel, R. S. Sperling, L. Kellerman, Y. Lo, S. Wallenstein, M. M. Escribese, J. L. Garrido, T. Singh, M. Loubeau, et al. (2012). . Journal of Clinical Immunology 32, 300–311.
[33]
Ono, C. T., Z. Yu, T. Obara, M. Ishikuro, K. Murakami, M. Kikuya, S. Kikuchi, N. Kobayashi, H. Kudo, S. Ogishima, et al. (2023). Association between low levels of anti-inflammatory cytokines during pregnancy and postpartum depression. Psychiatry and Clinical Neurosciences 77(8), 434–441.
[34]
Brinkman-Van der Linden, E. C., N. Hurtado-Ziola, T. Hayakawa, L. Wiggleton, K. Benirschke, A. Varki, and N. Varki (2007). Human-specific expression of Siglec-6 in the placenta. Glycobiology 17(9), 922–931.
[35]
Huang, B., A. N. Faucette, M. D. Pawlitz, B. Pei, J. W. Goyert, J. Z. Zhou, N. G. El-Hage, J. Deng, J. Lin, F. Yao, et al. (2017). . Nature Medicine 23(1), 128–135.
[36]
Li, A., R. H. Lee, J. C. Felix, P. Minoo, and T. M. Goodwin (2009). Alteration of secretory leukocyte protease inhibitor in human myometrium during labor. American Journal of Obstetrics and Gynecology 200(3), 311–e1.
[37]
Petraglia, F., D. De Vita, A. Gallinelli, L. Aguzzoli, A. R. Genazzani, R. Romero, and T. K. Woodruff (1995). . The Journal of Clinical Endocrinology & Metabolism  80(2), 558–561.
[38]
Edelstam, G., C. Karlsson, M. Westgren, C. Löwbeer, and M.-L. Swahn (2007). Human chorionic gonadatropin (hCG) during third trimester pregnancy. Scandinavian Journal of Clinical and Laboratory Investigation 67(5), 519–525.

  1. Correspondence should be addressed to Jingyi Jessica Li (jli@stat.ucla.edu)↩︎

  2. The other elements of \(\boldsymbol{\beta}_0\) need to be estimated from the original data \(\{\mathbf{y}, \mathbf{X}\}\).↩︎