Neural Networks for Censored Expectile Regression Based on Data Augmentation


Abstract

Expectile regression neural networks (ERNNs) are powerful tools for capturing heterogeneity and complex nonlinear structures in data. However, most existing research has primarily focused on fully observed data, with limited attention paid to scenarios involving censored observations. In this paper, we propose a data augmentation–based ERNNs algorithm, termed DAERNN, for modeling heterogeneous censored data. The proposed DAERNN is fully data-driven, requires minimal assumptions, and offers substantial flexibility. Simulation studies and real-data applications demonstrate that DAERNN outperforms existing censored ERNNs methods and achieves predictive performance comparable to models trained on fully observed data. Moreover, the algorithm provides a unified framework for handling various censoring mechanisms without requiring explicit parametric model specification, thereby enhancing its applicability to practical censored data analysis.

Keywords: Censored data; Expectile regression; Neural networks; Data augmentation

1 Introduction

Expectile regression (ER), first proposed by [1], has been extensively studied as a flexible alternative to quantile regression (QR) for modeling heterogeneous distributions. By estimating expectiles across different levels, one can obtain an alternative characterization of the conditional distribution, similar to QR but with smoother behavior. Specifically, for \(\tau\in (0,1)\), the \(\tau\)th regression expectile of \(Y\) given \(X\) is defined as: \[\begin{align} \label{eq:er} e_{\tau}(Y|\boldsymbol{x})&=\mathop{\arg\min}_\theta \mathbb{E}\Big([\rho_{\tau}(Y-\theta)-\rho_{\tau}(Y)]|X=\boldsymbol{x}\Big), \end{align}\tag{1}\] where \(\rho_{\tau}(u)=\frac{1}{2}\cdot|\tau-\boldsymbol{1}(u<0)|\cdot u^2\) is the expectile check function and \(\boldsymbol{1}(\cdot)\) is the indicator function. Unlike QR, ER uses a differentiable squared loss function rather than the non-differentiable absolute loss, resulting in greater computational efficiency. ER also simplifies inference because its asymptotic distribution can be derived without estimating the error density function [1], [2]. Moreover, the asymmetric squared loss enables ER to capture both the probability and magnitude of tail behaviors, providing valuable insights into the distribution’s shape. This makes ER particularly effective for quantifying tail-related risks and estimating conditional expectiles at specified probability levels, supporting high-impact risk assessment and expectation-based evaluation. For example, [3] and [4] use expectiles to estimate Value-at-Risk (VaR) and to construct conditional autoregressive models for assessing extreme losses. Recent work has extended expectile-based methods to a range of applications, including risk measures [5][9], exchange rate volatility [10], stock index and portfolio modeling [11], [12], and high-frequency financial data analysis [13]. [14], [15], and [16] also apply ER to survival analysis, particularly for estimating mean residual life expectancy. Together, these developments reflect the growing interest in ER.
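To make the loss concrete, here is a minimal NumPy sketch of the expectile check function \(\rho_\tau(u)\) defined above; the function name `expectile_check` and the grid-search demonstration are illustrative, not part of the original paper.

```python
import numpy as np

def expectile_check(u, tau):
    """Expectile check function: rho_tau(u) = 0.5 * |tau - 1(u < 0)| * u^2."""
    weight = np.where(u < 0, 1.0 - tau, tau)  # asymmetric weight |tau - 1(u < 0)|
    return 0.5 * weight * np.square(u)

# The tau-th expectile of a sample minimizes the average check loss;
# a crude grid search recovers it for a standard normal sample.
rng = np.random.default_rng(0)
y = rng.normal(size=100_000)
grid = np.linspace(-2.0, 2.0, 801)
losses = [expectile_check(y - theta, 0.9).mean() for theta in grid]
print("0.9-expectile (grid search):", grid[np.argmin(losses)])  # approx 0.86
```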

With the rapid advancement of information technology, data acquisition has become increasingly efficient, but the relationships among variables have grown more complex. In many cases, linear models fail to capture these nonlinear structures. Machine learning methods have therefore gained significant attention for their flexibility and ability to model complex variable interactions without strong parametric assumptions. Building on this, numerous studies have extended ER with machine learning techniques to better capture nonlinear patterns. For example, [17] proposed a tree-based gradient boosting method for nonparametric multiple ER, while [18] introduced expectile regression neural networks (ERNNs). Similarly, [19] and [20] developed support vector machine–based ER models (ER-SVM), and [21] integrated random forests with ER to create the ERF model. More recently, [22] extended ERNNs to high-dimensional settings, enabling nonlinear variable selection. These approaches demonstrate strong performance in modeling complex, nonlinear data structures.

However, the above studies primarily focus on fully observed datasets. In real-world applications, data incompleteness is common, with censoring being one of the most frequent cases. Censoring occurs when the event of interest is not observed within the study period. Such data are common in fields like biomedicine [23][26], engineering management [27][29], and economics [30][32]. Censoring introduces discrepancies between observed and true population values, complicating the modeling of conditional distributions. To address this challenge, various statistical methods have been developed, including the Kaplan–Meier (KM) estimator [33], the Cox proportional hazards model [34], and the accelerated failure time (AFT) model [35], along with numerous extensions. More recently, machine learning techniques have been incorporated into survival analysis to capture complex nonlinear relationships, such as neural networks–based Cox models [36], hazard models [37], AFT models [38], and censored quantile regression models [39], [40].

Despite growing interest in survival analysis with deep learning, expectile regression neural networks (ERNNs) for censored data remain largely unexplored. [16] proposed a neural networks–based approach for right-censored data using inverse probability weighting (IPW), termed WERNN (Weighted Expectile Regression Neural Networks). However, WERNN is limited to right censoring, whereas many real-world applications involve other censoring types, such as left [41] or interval censoring [42]. Moreover, the IPW approach requires estimating the censoring distribution, which is computationally intensive in high dimensions and can be unreliable when the survival function is misspecified.

To our knowledge, apart from [16], no other studies have addressed censored ERNNs. Furthermore, the existing method can only accommodate a single censoring type. This highlights the need for a unified estimation framework for ERNNs that can flexibly handle multiple censoring mechanisms. Inspired by the data augmentation strategy of [43], we propose a novel algorithm, the Data Augmentation Expectile Regression Neural Networks (DAERNN). Starting from an initial model, DAERNN iteratively updates through three key steps—data augmentation, model updating, and prediction—and typically converges within a few iterations. Notably, the method is highly flexible and can be applied to various censoring mechanisms, including singly left-, right-, and interval-censored data.

The remainder of this paper is structured as follows. Section 2 introduces the proposed DAERNN method. Section 3 conducts Monte Carlo simulations to investigate the finite-sample performance of DAERNN. Real-data applications are presented in Section 4. Section 5 concludes the article and outlines potential research directions for ERNNs with censored data.

2 Methodology

In this section, we introduce the proposed DAERNN method, which is designed to effectively handle censored datasets exhibiting complex nonlinear relationships and heteroscedastic structures.

2.1 Model setup

Assume that \(\{y_i,\boldsymbol{x}_i\}_{i=1}^n\) are samples drawn from \((Y,X)\), where \(\boldsymbol{x}_i=(1,x_{i1},\cdots,x_{ip})^T\) includes the \(p\)-dimensional covariates. To capture the potentially intricate relationship between the covariates and the conditional expectile of the response, we assume the \(\tau\)th conditional expectile of \(Y\) given \(X\) defined in Eq. (1) takes the form: \[\begin{align} \label{eq:nlner} e_{\tau}(y_i|\boldsymbol{x}_i)&= m_\tau(\boldsymbol{x}_i), \;i=1,\cdots,n, \end{align}\tag{2}\] where \(m_\tau:\mathbb{R}^p \rightarrow \mathbb{R}\) is an unknown smooth function. Compared with traditional linear models, model (2) imposes no parametric assumptions on the functional form, thereby offering greater flexibility in capturing complex and nonlinear structures inherent in the data. However, in practical applications involving censored data, the estimation of model (2) faces two major challenges. First, although \(m_\tau(\cdot)\) enhances flexibility for modeling complex data, its nonparametric nature simultaneously increases the difficulty of estimation. In addition, in the presence of censoring, the response variable \(y_i\) is only partially observed, resulting in biased samples that cannot be directly used in standard expectile regression.

Due to censoring, the observed response \(t_i\) may not coincide with the true latent response \(y_i\). Let \(\delta_i\) denote the censoring type for the \(i\)th response, defined as: 1) \(\delta_i=0\): no censoring; 2) \(\delta_i=1\): right censoring at \(R_i\); 3) \(\delta_i=2\): left censoring at \(L_i\); and 4) \(\delta_i=3\): interval censoring between \((L_i,R_i)\). Here \(L_i\) and \(R_i\) denote the left and right censoring points, respectively. Thus, the observed response \(t_i\) can be expressed as \[\begin{align} t_i&=\left\{\begin{array}{ll} y_i, & \text{no censoring}\\ y_i \wedge R_i, & \text{right censoring at}\;R_i\\ y_i \vee L_i, & \text{left censoring at}\;L_i\\ L_i \vee (y_i\wedge R_i), & \text{interval censoring between}\;(L_i,R_i) \end{array}\right., \end{align}\] where \(\wedge\) and \(\vee\) denote the minimum and maximum operations. Consequently, model estimation can only be based on the available observations \(\{t_i, \boldsymbol{x}_i,\delta_i\}\) for \(i = 1, 2, \ldots, n\). To overcome the challenges discussed above, we develop a novel algorithm that integrates a neural network framework with data augmentation techniques; details are given in Section 2.2.
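As an illustration of this mapping, the short Python sketch below constructs the observed response \(t_i\) from a latent response under each censoring type; the function `observe` and its interface are hypothetical, written only to mirror the definition above.

```python
def observe(y, delta, L=None, R=None):
    """Map a latent response y to the observed response t under censoring.

    delta: 0 = uncensored, 1 = right-censored at R,
           2 = left-censored at L, 3 = interval-censored on (L, R).
    """
    if delta == 0:
        return y
    if delta == 1:
        return min(y, R)             # y AND R  (minimum)
    if delta == 2:
        return max(y, L)             # y OR L   (maximum)
    if delta == 3:
        return max(L, min(y, R))     # L OR (y AND R)
    raise ValueError("delta must be 0, 1, 2, or 3")

# Example: a latent y = 2.3 right-censored at R = 1.5 is observed as 1.5.
print(observe(2.3, delta=1, R=1.5))
```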

2.2 Data-Augmented Expectile Regression Neural Networks for Censored Data: DAERNN

In this section, we first introduce the DAERNN procedure for estimating \(m_\tau(\cdot)\) in (2) under censoring. The procedure consists of three main steps: data augmentation, model updating, and prediction. We then describe the estimation process for the model-updating step, focusing on the ERNNs model.

2.2.1 General framework of DAERNN

Estimating \(m_\tau(\cdot)\) under censoring is challenging because it requires recovering an unknown nonlinear function while correcting for bias introduced by censoring. Traditional nonparametric methods, such as kernel regression, local polynomials, and smoothing splines, can model nonlinear structures but struggle with highly complex relationships and suffer from the well-known “curse of dimensionality” [44]. Neural networks offer a flexible alternative, capable of capturing complex nonlinear patterns in high-dimensional settings. As composite functions of affine transformations and nonlinear activation maps, organized into multiple layers, neural networks can approximate highly complex functions [45], [46]. Deeper architectures further enhance their ability to represent intricate or high-dimensional structures [47], [48]. Building on this, we adopt the ERNNs model to estimate \(m_\tau(\cdot)\), which will be described in Section 2.2.2.

To address censoring, we integrate a data augmentation strategy that imputes censored observations, avoiding the need to explicitly estimate the survival function. This approach is naturally adaptable to various censoring mechanisms and offers both flexibility and computational efficiency. Prior studies have demonstrated that imputing event times can effectively reduce bias, particularly under complex censoring scenarios [43], [49][52].

Algorithm 1 summarizes the procedure of DAERNN. To implement the algorithm, two hyper-parameters must be specified: the imputation expectile levels \(\{\tau_k = \frac{k}{m+1}: k =1,\cdots,m\}\) with \(m\) being a prespecified integer, i.e., \(m=\max\{\lfloor \sqrt{n}\rfloor,99\}\) in our proposal, and the number of iterations \(H\). In addition, we define \(S(i)\) as the feasible set of the true response \(y_i\): \[S(i)=\left\{ \begin{array}{ll} \{y_i\}, & \text{no censoring}\\ (R_i,\infty), & \text{right censoring at}\;R_i\\ (-\infty,L_i) , & \text{left censoring at}\;L_i\\ (L_i,R_i) , & \text{interval censoring between}\;(L_i,R_i) \end{array} \right.,\] which is essential for the imputation step. As shown in Algorithm 1, DAERNN consists of two main stages: Initialization and Iterative Estimation.

Initialization. The goal of this step is to train an initial model that imputes censored outcomes and generates preliminary estimates for subsequent iterations. To improve computational efficiency, hyperparameter tuning is performed only on uncensored observations, and the selected parameters are then applied to train the ERNNs model for the remainder of the procedure. In this initialization step, we obtain \(m\) ERNNs models, denoted as \(M^{(0)}(\tau_k)\) for \(\tau_k, \, k = 1, \ldots, m\).

Iterative estimation at the \(h\)th step. Each iteration of DAERNN involves three sub-steps: data augmentation, model updating, and prediction.

  1. Data augmentation. For each censored sample, we impute a value from the predicted set \(\{\hat{t}_i^{(h-1)} : i = 1, \ldots, n,\, \delta_i \neq 0\}\) obtained in the \((h-1)\)-th iteration. The imputed value \(\tilde{t}_i^{(h)}\) is accepted if it satisfies the censoring constraint \(S(i)\); otherwise, the sampling is repeated until a feasible value is found. If no feasible value can be drawn, the censored sample is imputed using its censoring boundary, that is, \(L_i\), \(R_i\), or \((L_i+R_i)/2\) for left, right, and interval censoring, respectively.

  2. Model Updating. After imputation, the ERNNs models are retrained using the augmented dataset \(\{\tilde{t}_i^{(h)},\boldsymbol{x}_i\}\), yielding updated models \(M^{(h)}(\tau_k)\) for \(k = 1, \ldots, m\). These models will be used for prediction in the next iteration.

  3. Prediction. The updated models \(M^{(h)}(\tau_k)\) generate predictions for all expectile levels \(\tau_k\) on the test data.

Figure 1: DAERNN: Data-Augmented Expectile Regression Neural Networks for Censored Data

The above process is repeated until the predefined number of iterations \(H\) is reached, and the final prediction is obtained by averaging the results across all iterations.
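To make one iteration concrete, the following is a minimal Python sketch of a single data-augmentation pass, assuming each fitted model \(M^{(h-1)}(\tau_k)\) exposes a `predict` method; the helper names (`augment_once`, `in_feasible_set`) and the resampling scheme are illustrative simplifications of Algorithm 1, not the authors' exact implementation.

```python
import random

def in_feasible_set(v, delta, L, R):
    """Check whether a candidate value v lies in the feasible set S(i)."""
    return {1: v > R, 2: v < L, 3: L < v < R}[delta]

def augment_once(models, X, t, delta, L, R, max_tries=50):
    """One data-augmentation pass: impute each censored response from the
    expectile predictions of the previous iteration's models."""
    t_aug = list(t)
    for i, d in enumerate(delta):
        if d == 0:                     # uncensored: keep the observed response
            continue
        # Candidate pool: predictions at all m expectile levels for sample i.
        candidates = [m.predict(X[i]) for m in models]
        for _ in range(max_tries):
            v = random.choice(candidates)
            if in_feasible_set(v, d, L[i], R[i]):
                t_aug[i] = v
                break
        else:
            # No feasible draw: fall back to the censoring boundary.
            t_aug[i] = {1: R[i], 2: L[i], 3: (L[i] + R[i]) / 2}[d]
    return t_aug
```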

2.2.2 Estimation of the ERNNs model

Following the general framework above, we now present the estimation of \(m_\tau(\cdot)\) under the proposed DAERNN structure. We implement a multilayer perceptron (MLP), in which the hidden-layer activations \(g_{i,l}\) are computed as \[\label{eq:ERNN} \begin{align} g_{i,1}&=f_1\left(\boldsymbol{x}_i^\top w^{(1)}+b^{(1)}\right),\\ g_{i,l}&=f_l\left(g_{i,l-1}^\top w^{(l)}+b^{(l)}\right), \quad l=2,\cdots,L, \end{align}\tag{3}\] where \(w^{(l)}\) and \(b^{(l)}\) are the weight and bias of layer \(l\), respectively, and each hidden layer takes the previous layer's output as input. The activation function \(f_l\) is typically chosen as the sigmoid or the rectified linear unit (ReLU). With data augmentation, the censored dataset is reconstructed as \(\{\tilde{t}_i^{(h)}, \boldsymbol{x}_i\}\), which is then treated as pseudo fully observed data. Based on the neural network structure defined in Equation (3), the optimal model parameters are obtained by solving the following optimization problem: \[\label{eq:ERNN-sol} \begin{align} &\widehat{\boldsymbol{\theta}}=\mathop{\arg\min}\limits_{\boldsymbol{\theta}}\frac{1}{n}\sum_{i=1}^n\rho_{\tau}\big(\tilde{t}_i^{(h)}-\hat{t}_i^{(h)}\big),\\ &\hat{t}_i^{(h)}=f_{o}\left(g_{i,L}^\top w^{(o)}+b^{(o)}\right), \end{align}\tag{4}\] where \(\hat{t}_i^{(h)}\) is the predicted value at step \(h\), \(n\) is the total number of samples, \(L\) is the number of hidden layers, and \(w^{(o)}\) and \(b^{(o)}\) denote the weight and bias of the output layer, respectively. The parameter vector \(\widehat{\boldsymbol{\theta}}\) collects all weights and biases, \(\boldsymbol{\theta}=\big(w^{(1)},b^{(1)},\cdots,w^{(L)},b^{(L)},w^{(o)},b^{(o)}\big)\).

Unlike the QR check loss, the expectile loss \(\rho_{\tau}(\cdot)\) is differentiable everywhere for all expectile levels, which facilitates efficient optimization using gradient-based methods. In this work, we employ mini-batch gradient descent (MBGD), which balances the efficiency of stochastic gradient descent with the stability of full-batch methods [53]. MBGD computes gradients on small random batches, improving computational efficiency, reducing gradient variance, and enabling parallel or distributed training for large-scale problems [53], [54].
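As a concrete illustration, here is a minimal PyTorch sketch combining the MLP of Eq. (3), the expectile loss, and mini-batch gradient descent; the class and function names (`ERNN`, `fit`) are hypothetical, and details used in the paper such as dropout, initialization, and the output activation \(f_o\) are simplified.

```python
import torch
import torch.nn as nn

class ERNN(nn.Module):
    """MLP whose scalar output estimates the tau-th conditional expectile."""
    def __init__(self, p, hidden=32, n_layers=2):
        super().__init__()
        layers, d = [], p
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, 1))  # identity output activation f_o
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def expectile_loss(pred, target, tau):
    """Mean expectile check loss: 0.5 * |tau - 1(u < 0)| * u^2."""
    u = target - pred
    weight = torch.abs(tau - (u < 0).float())
    return (0.5 * weight * u**2).mean()

def fit(model, X, t, tau, epochs=100, batch=128, lr=0.01):
    """Minimize the expectile loss on the augmented (pseudo-complete)
    data with mini-batch gradient descent."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, t), batch_size=batch, shuffle=True)
    for _ in range(epochs):
        for xb, tb in loader:
            opt.zero_grad()
            expectile_loss(model(xb), tb, tau).backward()
            opt.step()
    return model
```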

In practice, hyperparameter tuning is crucial for effective neural network modeling. While increasing the number of hidden layers and the number of nodes per layer can enhance the model's capacity to capture complex nonlinear relationships, it also raises the risk of overfitting. In this study, we consider the following hyperparameters: the number of hidden layers, the number of nodes per layer, the dropout rate, and the number of training epochs. Additionally, we include the mini-batch size as a hyperparameter arising from the use of MBGD. We adopt a grid search with cross-validation to select the optimal combination, following prior studies [16], [18], as sketched below.
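A minimal grid-search skeleton over part of this search space is given below, reusing the hypothetical `ERNN`, `fit`, and `expectile_loss` helpers from the previous sketch; the grid values mirror those listed in Section 3.1, but the code is an illustration rather than the authors' tuning script.

```python
import itertools
import numpy as np
import torch
from sklearn.model_selection import KFold

def grid_search(X, t, tau, n_splits=5):
    """Select (L, J, learning rate) by cross-validated expectile loss."""
    grid = itertools.product([2, 3, 4],      # number of hidden layers L
                             [16, 32, 64],   # nodes per layer J
                             [0.01, 0.1])    # learning rate
    best, best_loss = None, float("inf")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for n_layers, hidden, lr in grid:
        fold_losses = []
        for tr, va in kf.split(np.arange(len(X))):
            tr, va = torch.as_tensor(tr), torch.as_tensor(va)
            model = fit(ERNN(X.shape[1], hidden, n_layers),
                        X[tr], t[tr], tau, lr=lr)
            with torch.no_grad():
                fold_losses.append(
                    expectile_loss(model(X[va]), t[va], tau).item())
        if np.mean(fold_losses) < best_loss:
            best, best_loss = (n_layers, hidden, lr), np.mean(fold_losses)
    return best
```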

The flowchart of the proposed algorithm is illustrated in Figure 2.

Figure 2: Flowchart of DAERNN Algorithm

3 Simulation Studies

In this section, we conduct simulation studies to evaluate the predictive performance of our proposed method. For comparison, we focus primarily on the DAERNN method and a neural network model that ignores censoring entirely, referred to as FULL. In addition, we compare against two existing methods for censored expectile regression: the DALinear method [52] and the Weighted Expectile Regression Neural Networks (WERNN) method [16].

3.1 Data generation

We consider the following homoscedastic and heteroscedastic cases, respectively:

  • (1) Model 1 (Homoscedastic Case): \[\label{m1} y_i=\sin(2x_{i1})+2\exp(-16x_{i2}^2)+0.5e_i,\tag{5}\] where \(x_{i1}\) and \(x_{i2}\) are independently generated from \(N(0,0.5^2)\).

  • (2) Model 2 (Heteroscedastic Case): \[\label{m2} y_i=1+\sin(x_{i1})+\exp\big({0.5x_{i1}^2-x_{i1}x_{i2}+0.2x_{i2}^2}\big)+\left|\frac{1+0.2(x_{i1}+x_{i2})}{5}\right|e_i,\tag{6}\] where \(x_{i1}\sim Unif(-1,1)\) and \(x_{i2} \sim N(0,1)\), respectively.

To illustrate the robustness of our proposed method, we consider two error distributions for each scenario: a normal and a heavy-tailed distribution. Specifically, for the normal case, the errors are generated as \(e_i \sim N(0, 1)\); for the heavy-tailed case, they follow a Student’s \(t\)-distribution with 3 degrees of freedom, \(t(3)\). We also consider two censoring rates, 25% and 50%. Detailed parameter settings are summarized in Table 1.

Table 1: The censoring distributions for the different censoring types and censoring rates in the two simulation scenarios.

| Model | Error term | Censoring rate | Right | Left | Interval |
|---|---|---|---|---|---|
| Model 1 | \(N(0,1)\) | 25% | \(N(1.4,2^2)\) | \(N(0,2^2)\) | \(L\sim N(-0.5,2^2)\), \(R\sim N(0,2^2)\) |
| | | 50% | \(N(0.6,2^2)\) | \(N(0.6,2^2)\) | \(L\sim N(0,2^2)\), \(R\sim N(1.5,2^2)\) |
| | \(t(3)\) | 25% | \(N(1.5,2^2)\) | \(N(0,2^2)\) | \(L\sim N(-0.5,2^2)\), \(R\sim N(0,2^2)\) |
| | | 50% | \(N(0.65,2^2)\) | \(N(0.6,2^2)\) | \(L\sim N(0,2^2)\), \(R\sim N(1.5,2^2)\) |
| Model 2 | \(N(0,1)\) | 25% | \(\exp(4)\) | \(\exp(2)\) | \(L\sim \exp(0.85)\), \(R \sim \exp(1.35)\) |
| | | 50% | \(\exp(3)\) | \(\exp(3)\) | \(L\sim \exp(0.55)\), \(R\sim \exp(1.45)\) |
| | \(t(3)\) | 25% | \(\exp(4)\) | \(\exp(2)\) | \(L\sim \exp(0.85)\), \(R \sim \exp(1.35)\) |
| | | 50% | \(\exp(3)\) | \(\exp(3)\) | \(L\sim \exp(0.55)\), \(R\sim \exp(1.45)\) |

In each simulation, we generate \(n = 1000\) samples and randomly split the data into 80% for training and 20% for evaluating predictive performance. The primary evaluation metrics are the Expectile Loss (EL) and the Expectile Loss Ratio (EL\(_{\text{ratio}}\)), defined as: \[\begin{align} \text{EL} = \frac{1}{n_\text{test}}\sum_{i=1}^{n_\text{test}}\rho_{\tau}\big(y_i-\hat{y}_i\big), \;\;\; &\text{EL}_\text{ratio} = \frac{\text{EL}_\text{DAERNN}}{\text{EL}_\text{Compete}}, \end{align}\] where \(n_\text{test}\) is the sample size of the test set, and \(\text{EL}_\text{DAERNN}\) and \(\text{EL}_\text{Compete}\) denote the \(\text{EL}\) values obtained by DAERNN and by the three competing methods (FULL, DALinear, and WERNN), respectively. We report the average values of these evaluation metrics under each scenario, based on \(200\) replications and across \(9\) expectile levels, i.e., \(\tau = \{0.1, 0.2, \ldots, 0.9\}\).
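For concreteness, these metrics can be computed with the short sketch below, which reuses the `expectile_check` helper from the introduction; the function names are illustrative.

```python
import numpy as np

def expectile_loss_metric(y_true, y_pred, tau):
    """Average expectile loss (EL) over a test set."""
    return expectile_check(np.asarray(y_true) - np.asarray(y_pred), tau).mean()

def el_ratio(y_true, pred_daernn, pred_compete, tau):
    """EL_ratio = EL(DAERNN) / EL(competitor); values below one favor DAERNN."""
    return (expectile_loss_metric(y_true, pred_daernn, tau)
            / expectile_loss_metric(y_true, pred_compete, tau))
```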

Additionally, our proposed method DAERNN, along with two other neural networks-based methods, requires hyperparameter tuning for model construction. The hyperparameters considered include the number of layers \(L \in \{2, 3, 4\}\), number of nodes per layer \(J \in \{16, 32, 64\}\), learning rate \(\in \{0.01, 0.1\}\), dropout rate \(\in \{0.1, 0.2, 0.3\}\), number of training epochs \(\in \{50, 100\}\), and batch size \(\in \{64, 128, 256\}\). The activation function is ReLU. Hyperparameters are tuned using \(5\)-fold cross-validation on the training set. To reduce computational cost, tuning for all three neural network methods is performed only on the uncensored samples.

3.2 Performance results

3.2.1 Results for the right-censoring case

Table 2 reports the EL\(_{\text{ratio}}\) values under both homoscedastic and heteroscedastic settings, across different error distributions and censoring rates of 25% and 50%. The EL\(_{\text{ratio}}\) is defined as the ratio of the expectile loss obtained from DAERNN to that from the FULL, DALinear, and WERNN methods. Hence, a ratio below one indicates that DAERNN performs better than the corresponding competitive method, while a ratio above one suggests inferior performance. Figure 3 illustrates the expectile loss (\(\text{EL}\)) across different model settings and methods, under varying censoring rates.

Our main findings are summarized as follows:

  • Overall, DAERNN consistently outperforms all competitive methods in terms of predictive accuracy. Compared with FULL and WERNN, its predictive accuracy improves by approximately 50% when the censoring rate is 25%. This finding highlights the importance of accounting for censoring effects and demonstrates the robustness of the data augmentation approach, which does not rely on the assumption of a correctly specified survival function.

  • The predictive accuracy also shows substantial improvement compared with DALinear, with gains of at least 20%. This demonstrates that replacing the linear model structure with a neural network can significantly enhance predictive performance, particularly for data with complex underlying relationships.

  • Notably, the performance remains comparable even under heavy-tailed error distributions and heterogeneous settings, demonstrating the robustness of the proposed method.

  • The DAERNN method demonstrates stable performance in terms of EL values across different censoring rates, except for Model 2 under a \(t(3)\) error distribution (Figure 3). Compared with the competing methods, this stability suggests that the censoring rate does not substantially influence the performance of DAERNN, especially in homoscedastic settings.

Figure 3: Expectile losses of the predicted responses for different methods and settings at \(\tau = 0.1, 0.3, 0.5, 0.7, 0.9\), under right censoring with different censoring rates.

In addition, we consider the ERNNs model fitted to the fully observed, uncensored data (referred to as Oracle) and compare its performance with our proposed method. Figure 4 and Figure 5 illustrate the distribution of the predicted values \(\hat{y}(\tau)\) from the Oracle and DAERNN models under Model 1 and Model 2, respectively. The distributions of the predicted responses are quite similar, indicating that our proposed method can achieve accuracy comparable to that of the model trained on fully observed data.

Figure 4: Boxplots of the predicted responses at \(\tau = 0.1, 0.3, 0.5, 0.7, 0.9\) from the Oracle models and DAERNN under Model 1, across different error distributions and censoring rates.
Figure 5: Boxplots of the predicted responses at \(\tau = 0.1, 0.3, 0.5, 0.7, 0.9\) from the Oracle models and DAERNN under Model 2, across different error distributions and censoring rates.
Table 2: The \(\text{EL}_{\text{ratio}}\) values for errors following \(N(0,1)\) and \(t(3)\) distributions, comparing the proposed DAERNN with the FULL, WERNN, and DALinear methods at \(\tau=0.1,0.3,0.5,0.7,0.9\). The results are based on right censoring with censoring rates of 25% and 50%.

| Method | \(\tau\) | Model 1, \(N(0,1)\), 25% | Model 1, \(N(0,1)\), 50% | Model 1, \(t(3)\), 25% | Model 1, \(t(3)\), 50% | Model 2, \(N(0,1)\), 25% | Model 2, \(N(0,1)\), 50% | Model 2, \(t(3)\), 25% | Model 2, \(t(3)\), 50% |
|---|---|---|---|---|---|---|---|---|---|
| FULL/DAERNN | 0.1 | 0.223 | 0.145 | 0.544 | 0.378 | 0.667 | 0.617 | 0.686 | 0.640 |
| | 0.3 | 0.292 | 0.182 | 0.555 | 0.377 | 0.690 | 0.641 | 0.699 | 0.659 |
| | 0.5 | 0.360 | 0.218 | 0.589 | 0.410 | 0.712 | 0.665 | 0.719 | 0.686 |
| | 0.7 | 0.430 | 0.272 | 0.642 | 0.468 | 0.746 | 0.698 | 0.741 | 0.711 |
| | 0.9 | 0.532 | 0.368 | 0.742 | 0.569 | 0.795 | 0.758 | 0.792 | 0.778 |
| WERNN/DAERNN | 0.1 | 0.223 | 0.145 | 0.543 | 0.377 | 0.668 | 0.619 | 0.687 | 0.640 |
| | 0.3 | 0.286 | 0.181 | 0.551 | 0.376 | 0.692 | 0.639 | 0.700 | 0.658 |
| | 0.5 | 0.359 | 0.220 | 0.581 | 0.410 | 0.711 | 0.666 | 0.720 | 0.684 |
| | 0.7 | 0.415 | 0.274 | 0.654 | 0.460 | 0.742 | 0.696 | 0.747 | 0.713 |
| | 0.9 | 0.540 | 0.371 | 0.738 | 0.571 | 0.792 | 0.761 | 0.790 | 0.773 |
| DALinear/DAERNN | 0.1 | 0.400 | 0.424 | 0.729 | 0.738 | 0.776 | 0.687 | 0.793 | 0.711 |
| | 0.3 | 0.367 | 0.391 | 0.628 | 0.651 | 0.759 | 0.661 | 0.765 | 0.681 |
| | 0.5 | 0.346 | 0.378 | 0.595 | 0.619 | 0.744 | 0.649 | 0.755 | 0.672 |
| | 0.7 | 0.334 | 0.362 | 0.585 | 0.606 | 0.733 | 0.633 | 0.738 | 0.651 |
| | 0.9 | 0.317 | 0.336 | 0.615 | 0.611 | 0.714 | 0.615 | 0.715 | 0.630 |

3.2.2 Results for other censoring types

Table 3 reports the EL\(_\text{ratio}\) values comparing DAERNN with both FULL and DALinear for left censoring. The EL\(_\text{ratio}\) remains lower than 1 in most cases when compared to both FULL and DALinear, confirming the advantage of DAERNN. Moreover, the results show only minor variations across different error distributions at the same expectile level, indicating the robustness of the proposed method. Results for the interval-censoring case are shown in Table 4 and are consistent with those for left censoring; we therefore omit redundant discussion.

Table 3: The \(\text{EL}_{\text{ratio}}\) values for errors following \(N(0,1)\) and \(t(3)\) distributions, comparing the proposed DAERNN with the FULL and DALinear methods at \(\tau=0.1,0.3,0.5,0.7,0.9\). The results are based on left censoring with censoring rates of 25% and 50%.

| Method | \(\tau\) | Model 1, \(N(0,1)\), 25% | Model 1, \(N(0,1)\), 50% | Model 1, \(t(3)\), 25% | Model 1, \(t(3)\), 50% | Model 2, \(N(0,1)\), 25% | Model 2, \(N(0,1)\), 50% | Model 2, \(t(3)\), 25% | Model 2, \(t(3)\), 50% |
|---|---|---|---|---|---|---|---|---|---|
| FULL/DAERNN | 0.1 | 0.580 | 0.387 | 0.740 | 0.610 | 0.988 | 0.781 | 0.943 | 1.050 |
| | 0.3 | 0.463 | 0.286 | 0.669 | 0.494 | 0.943 | 0.944 | 0.875 | 0.917 |
| | 0.5 | 0.376 | 0.238 | 0.618 | 0.435 | 0.803 | 0.613 | 0.970 | 0.722 |
| | 0.7 | 0.309 | 0.191 | 0.573 | 0.401 | 0.740 | 0.607 | 0.885 | 0.583 |
| | 0.9 | 0.229 | 0.148 | 0.548 | 0.398 | 0.895 | 0.740 | 0.995 | 0.624 |
| DALinear/DAERNN | 0.1 | 0.387 | 0.402 | 0.668 | 0.687 | 0.846 | 0.311 | 0.485 | 0.324 |
| | 0.3 | 0.356 | 0.371 | 0.595 | 0.627 | 0.547 | 0.203 | 0.425 | 0.259 |
| | 0.5 | 0.337 | 0.358 | 0.581 | 0.614 | 0.291 | 0.171 | 0.254 | 0.163 |
| | 0.7 | 0.326 | 0.345 | 0.581 | 0.616 | 0.181 | 0.127 | 0.174 | 0.128 |
| | 0.9 | 0.320 | 0.338 | 0.661 | 0.680 | 0.099 | 0.100 | 0.119 | 0.091 |
Table 4: The \(\text{EL}_{\text{ratio}}\) values for errors following \(N(0,1)\) and \(t(3)\) distributions, comparing the proposed DAERNN with the FULL and DALinear methods at \(\tau=0.1,0.3,0.5,0.7,0.9\). The results are based on interval censoring with censoring rates of 25% and 50%.

| Method | \(\tau\) | Model 1, \(N(0,1)\), 25% | Model 1, \(N(0,1)\), 50% | Model 1, \(t(3)\), 25% | Model 1, \(t(3)\), 50% | Model 2, \(N(0,1)\), 25% | Model 2, \(N(0,1)\), 50% | Model 2, \(t(3)\), 25% | Model 2, \(t(3)\), 50% |
|---|---|---|---|---|---|---|---|---|---|
| FULL/DAERNN | 0.1 | 0.638 | 0.499 | 0.890 | 0.810 | 0.974 | 1.122 | 0.943 | 1.194 |
| | 0.3 | 0.776 | 0.625 | 0.903 | 0.845 | 0.973 | 0.641 | 0.875 | 0.927 |
| | 0.5 | 0.821 | 0.711 | 0.921 | 0.878 | 0.996 | 0.877 | 0.970 | 0.924 |
| | 0.7 | 0.858 | 0.747 | 0.933 | 0.898 | 0.954 | 0.985 | 0.885 | 0.939 |
| | 0.9 | 0.851 | 0.717 | 0.941 | 0.898 | 1.008 | 1.159 | 0.995 | 0.932 |
| DALinear/DAERNN | 0.1 | 0.391 | 0.392 | 0.708 | 0.709 | 0.722 | 1.243 | 0.639 | 0.918 |
| | 0.3 | 0.350 | 0.354 | 0.614 | 0.624 | 0.668 | 0.750 | 0.594 | 0.684 |
| | 0.5 | 0.333 | 0.344 | 0.579 | 0.593 | 0.636 | 0.690 | 0.515 | 0.612 |
| | 0.7 | 0.320 | 0.334 | 0.579 | 0.595 | 0.620 | 0.678 | 0.520 | 0.597 |
| | 0.9 | 0.315 | 0.328 | 0.643 | 0.657 | 0.600 | 0.585 | 0.453 | 0.551 |

3.3 Computational cost

Here we compare the computation time of the different methods under various scenarios.

Table 5: Average computational time (seconds).

| Error term | Censoring rate | Model 1: FULL | Model 1: DAERNN | Model 1: WERNN | Model 2: FULL | Model 2: DAERNN | Model 2: WERNN |
|---|---|---|---|---|---|---|---|
| \(N(0,1)\) | 25% | 7.240 | 23.426 | 132.676 | 5.698 | 20.471 | 147.545 |
| | 50% | 6.641 | 21.368 | 131.773 | 5.208 | 17.687 | 150.272 |
| \(t(3)\) | 25% | 6.341 | 18.164 | 124.364 | 6.198 | 19.295 | 129.840 |
| | 50% | 6.198 | 19.295 | 129.840 | 5.522 | 18.779 | 156.787 |

Table 5 reports the average computational time of DAERNN and other competing methods. The results for DALinear are omitted, as it operates under a linear framework and is not directly comparable with the neural networks-based approaches. As observed, the proposed DAERNN requires slightly more time than the FULL model but is considerably more efficient than WERNN. This is reasonable since DAERNN involves iterative imputation of censored observations, which naturally increases computational cost. Nevertheless, the results indicate that DAERNN achieves higher accuracy than FULL with only a modest increase in computation time, and it significantly outperforms WERNN in both accuracy and efficiency, highlighting the advantages of data-augmentation-based strategies.

The optimal hyperparameters, the number of layers (\(L\)) and the number of nodes per layer (\(J\)), selected via 5-fold cross-validation for expectile levels \(\tau \in \{0.3, 0.5, 0.7\}\), are reported in Table 6. The results suggest that the proposed method performs well with a relatively simple neural network structure, indicating that DAERNN is practical and efficient for real applications.

Table 6: The selected hyperparameters, number of layers (\(L\)) and number of nodes per layer (\(J\)), for DAERNN, obtained via 5-fold cross-validation on the simulation data. Entries are reported as \(L/J\) for each censoring type.

| Error term | Censoring rate | \(\tau\) | Model 1, right | Model 1, left | Model 1, interval | Model 2, right | Model 2, left | Model 2, interval |
|---|---|---|---|---|---|---|---|---|
| \(N(0,1)\) | 25% | 0.3 | 3/32 | 3/16 | 2/32 | 4/16 | 2/16 | 3/32 |
| | | 0.5 | 4/16 | 4/16 | 3/32 | 3/32 | 3/32 | 3/32 |
| | | 0.7 | 3/32 | 2/16 | 4/32 | 4/64 | 2/32 | 4/32 |
| | 50% | 0.3 | 3/16 | 3/32 | 2/32 | 4/64 | 2/32 | 2/32 |
| | | 0.5 | 3/16 | 3/32 | 4/16 | 3/32 | 4/32 | 2/32 |
| | | 0.7 | 3/32 | 3/32 | 4/16 | 4/32 | 4/64 | 4/16 |
| \(t(3)\) | 25% | 0.3 | 3/32 | 2/16 | 2/32 | 4/32 | 3/32 | 2/32 |
| | | 0.5 | 4/16 | 4/32 | 3/16 | 3/32 | 4/32 | 3/16 |
| | | 0.7 | 2/16 | 2/16 | 2/32 | 3/16 | 4/32 | 4/16 |
| | 50% | 0.3 | 4/32 | 3/16 | 3/32 | 4/16 | 2/32 | 3/32 |
| | | 0.5 | 3/32 | 2/64 | 4/16 | 3/64 | 4/32 | 4/32 |
| | | 0.7 | 2/16 | 4/16 | 3/16 | 4/32 | 4/64 | 4/32 |

4 Empirical study

For illustration purposes, we apply the proposed method to analyze data sets from two practical examples.

4.1 WHAS dataset

The WHAS dataset was collected from the Worcester Heart Attack Study, conducted between 1975 and 2001. This study investigates the effects of various factors on survival time following hospital admission for acute myocardial infarction. The dataset contains 500 observations and 14 covariates, including gender, age, and congestive heart complications, among others. The censoring rate is approximately 57.0%. The data are available in the R package smoothHR and were also analyzed in [16]. Descriptions of the variables are given in Table 7.

Table 7: Description of variables for the WHAS dataset.

| Variable | Description | Coding |
|---|---|---|
| Survival time | Length of follow-up (days) | Dependent variable |
| Status | Status as of last follow-up | 1 = Dead, 0 = Alive |
| Gender | Gender | 0 = Male, 1 = Female |
| cvd | History of cardiovascular disease | 0 = No, 1 = Yes |
| afb | Atrial fibrillation | 0 = No, 1 = Yes |
| sho | Cardiogenic shock | 0 = No, 1 = Yes |
| chf | Congestive heart complications | 0 = No, 1 = Yes |
| av3 | Complete heart block | 0 = No, 1 = Yes |
| miord | MI order | 0 = First, 1 = Recurrent |
| mitype | MI type | 0 = non Q-wave, 1 = Q-wave |
| Age | Age at hospital admission (years) | |
| hr | Initial heart rate (beats per minute) | |
| los | Length of hospital stay | |
| sysbp | Initial systolic blood pressure (mmHg) | |
| diasbp | Initial diastolic blood pressure (mmHg) | |
| bmi | Body mass index | |

In this section, we compare the predictive performance of the proposed method with WERNN and DALinear using 10-fold cross-validation, computing the EL\(_\text{ratio}\) for each method across \(\tau\in\{0.1,0.2,\cdots,0.9\}\). The empirical results are summarized in Table 8. As shown, DAERNN generally achieves lower forecasting errors, further confirming the superior predictive capability of the proposed approach.

Table 8: Expectile loss ratios for the WHAS data by 10-fold cross-validation.

| Method | \(\tau\)=0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| DALinear | 0.717 | 0.760 | 0.828 | 0.856 | 0.911 | 0.895 | 0.961 | 0.981 | 1.035 |
| WERNN | 1.024 | 0.875 | 0.848 | 0.878 | 0.865 | 0.875 | 0.849 | 0.840 | 0.874 |

4.2 YVR dataset

To evaluate the performance of the proposed method under complex censoring mechanisms, we apply the DAERNN approach to a real-world dataset and artificially generate censored samples at different censoring rates. The data, available in the R package qrnn as the “YVRprecip” object, contain daily precipitation totals (in mm) recorded at Vancouver International Airport (YVR) from 1971 to 2000. The dataset consists of \(n = 10{,}958\) observations and includes three covariates: daily time series of sea-level pressure, 700-hPa specific humidity, and 500-hPa geopotential height. To account for seasonal effects relevant to precipitation downscaling, we also include sine and cosine transformations of the day of the year as additional predictors. We make the same comparisons as in the simulation studies.

Since the “YVRprecip” dataset is originally uncensored, we follow [55] and artificially introduce right, left, and interval censoring to assess model performance under different censoring scenarios. The specific settings for each censoring type and rate are summarized in Table [tab:YVR].

Table 9: The \(\text{EL}_{\text{ratio}}\) values for the YVR data, comparing the proposed DAERNN with the Oracle, WERNN, and DALinear methods at \(\tau=0.1,\ldots,0.9\), under censoring rates of 25% and 50%.

| Method | Censoring rate | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Oracle/DAERNN | 25% | 1.050 | 1.106 | 1.177 | 1.212 | 1.278 | 1.342 | 1.455 | 1.566 | 1.995 |
| | 50% | 1.088 | 1.132 | 1.176 | 1.276 | 1.379 | 1.611 | 1.821 | 2.064 | 2.732 |
| WERNN/DAERNN | 25% | 0.985 | 1.001 | 1.007 | 0.975 | 0.967 | 0.984 | 1.001 | 1.003 | 0.968 |
| | 50% | 0.722 | 0.761 | 0.788 | 0.802 | 0.820 | 0.839 | 0.853 | 0.858 | 0.897 |
| DALinear/DAERNN | 25% | 0.927 | 0.897 | 0.873 | 0.850 | 0.833 | 0.818 | 0.809 | 0.793 | 0.775 |
| | 50% | 0.943 | 0.923 | 0.911 | 0.902 | 0.895 | 0.889 | 0.883 | 0.870 | 0.859 |

We evaluate and compare the EL\(_\text{ratio}\) values using 10-fold cross-validation to assess the relative performance of the different methods; the results are shown in Table 9. The results show that the proposed method performs comparably to the Oracle, with \(\text{EL}_{\text{ratio}}\) values close to 1 in most cases. Compared to the other two censored expectile methods, DAERNN demonstrates better performance, especially when the censoring rate is high.

Figure 6 shows the expectile losses for the other censoring scenarios. The results show that DAERNN is superior to DALinear. Overall, the precipitation dataset demonstrates the high prediction efficiency of the proposed DAERNN method.

Figure 6: The expectile loss (\(\text{EL}\)) of the Oracle, DALinear, and DAERNN methods for the left-censoring case (top panel) and the interval-censoring scenario (bottom panel), with 25% and 50% censoring rates.

5 Conclusion and discussions

In this paper, we propose a unified estimation framework, DAERNN, for censored expectile regression neural network models under various censoring types. By integrating data augmentation techniques, DAERNN imputes censored outcomes while effectively capturing complex nonlinear and heterogeneous structures through expectile regression with neural networks. Both simulation and empirical studies demonstrate that DAERNN achieves performance comparable to the oracle model and significantly outperforms existing methods, including IPW-based ERNNs and linear data augmentation approaches.

Despite its advantages, several limitations merit future attention. First, like many neural network models, DAERNN lacks interpretability. Incorporating semiparametric structures, such as single-index or partially linear models, may enhance model transparency while preserving flexibility [37]. Second, although neural networks can accommodate high-dimensional data through feature extraction in hidden layers, their performance is constrained by limited sample sizes. Future work may integrate variable screening or marginal feature selection to improve model efficiency in high-dimensional settings.

Acknowledgments

This research was financially supported by the State Key Program of National Natural Science Foundation of China [No. 72531002].

References

[1]
W. K. Newey and J. L. Powell, “Asymmetric least squares estimation and testing,” Econometrica, vol. 55, no. 4, pp. 819–847, 1987.
[2]
B. Abdous and B. Remillard, “Relating quantiles and expectiles under weighted-symmetry,” Annals of the Institute of Statistical Mathematics, vol. 47, no. 2, pp. 371–384, 1995.
[3]
J. W. Taylor, “Estimating value at risk and expected shortfall using expectiles,” Journal of Financial Econometrics, vol. 6, no. 2, pp. 231–252, 2008.
[4]
C.-M. Kuan, J.-H. Yeh, and Y.-C. Hsu, “Assessing value at risk with CARE, the conditional autoregressive expectile models,” Journal of Econometrics, vol. 150, no. 2, pp. 261–270, 2009.
[5]
V. Chavez-Demoulin, P. Embrechts, and M. Hofert, “An extreme value approach for modeling operational risk losses depending on covariates,” Journal of Risk and Insurance, vol. 83, no. 3, pp. 735–776, 2016.
[6]
M. Kim and S. Lee, “Nonlinear expectile regression with application to value-at-risk and expected shortfall estimation,” Computational Statistics & Data Analysis, vol. 94, pp. 1–19, 2016.
[7]
A. Daouia, S. Girard, and G. Stupfler, “Estimation of tail risk based on extreme expectiles,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 80, no. 2, pp. 263–292, 2018.
[8]
M. Mohammedi, S. Bouzebda, and A. Laksaci, “The consistency and asymptotic normality of the kernel type expectile regression estimator for functional data,” Journal of Multivariate Analysis, vol. 181, p. 104673, 2021.
[9]
W. Xu, Y. Hou, and D. Li, “Prediction of extremal expectile based on regression models with heteroscedastic extremes,” Journal of Business & Economic Statistics, vol. 40, no. 2, pp. 522–536, 2022.
[10]
S. Xie, Y. Zhou, and A. T. Wan, “A varying-coefficient expectile model for estimating value at risk,” Journal of Business & Economic Statistics, vol. 32, no. 4, pp. 576–592, 2014.
[11]
M. Sahamkhadam, “Dynamic copula-based expectile portfolios,” Journal of Asset Management, vol. 22, no. 3, pp. 209–223, 2021.
[12]
R. Jiang, X. Hu, and K. Yu, “Single-index expectile models for estimating conditional value at risk and expected shortfall,” Journal of Financial Econometrics, vol. 20, no. 2, pp. 345–366, 2022.
[13]
R. Gerlach and C. Wang, “Bayesian semi-parametric realized conditional autoregressive expectile models for tail risk forecasting,” Journal of Financial Econometrics, vol. 20, no. 1, pp. 105–138, 2022.
[14]
A. Seipp, V. Uslar, D. Weyhe, A. Timmer, and F. Otto‐Sobotka, “Weighted expectile regression for right‐censored data,” Statistics in Medicine, vol. 40, no. 25, pp. 5501–5520, 2021.
[15]
G. Ciuperca, “Right-censored models by the expectile method,” Lifetime Data Analysis, vol. 31, no. 1, pp. 149–186, 2025.
[16]
F. Zhang, X. Chen, P. Liu, and C. Fan, “Weighted expectile regression neural networks for right censored data,” Statistics in Medicine, vol. 0, pp. 1–15, 2024.
[17]
Y. Yang and H. Zou, “Nonparametric multiple expectile regression via ER-boost,” Journal of Statistical Computation and Simulation, vol. 85, no. 7, pp. 1442–1458, 2015.
[18]
C. Jiang, M. Jiang, Q. Xu, and X. Huang, “Expectile regression neural network model with applications,” Neurocomputing, vol. 247, pp. 73–86, 2017.
[19]
M. Farooq and I. Steinwart, “An SVM-like approach for expectile regression,” Computational Statistics & Data Analysis, vol. 109, pp. 159–181, 2017.
[20]
H. Pei, Q. Lin, L. Yang, and P. Zhong, “A novel semi-supervised support vector machine with asymmetric squared loss,” Advances in Data Analysis and Classification, vol. 15, no. 1, pp. 159–191, 2021.
[21]
C. Cai, H. Dong, and X. Wang, “Expectile regression forest: A new nonparametric expectile regression model,” Expert Systems, vol. 40, no. 1, p. e13087, 2023.
[22]
R. Yang and Y. Song, “Nonparametric expectile regression meets deep neural networks: A robust nonlinear variable selection method,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 17, no. 6, p. e70002, 2024.
[23]
Y. Wu and G. Yin, “Cure rate quantile regression for censored data with a survival fraction,” Journal of the American Statistical Association, vol. 108, no. 504, pp. 1517–1531, 2013.
[24]
Y. Wu and G. Yin, “Multiple imputation for cure rate quantile regression with censored data,” Biometrics, vol. 73, no. 1, pp. 94–103, 2017.
[25]
T. Yu, L. Xiang, and H. J. Wang, “Quantile regression for survival data with covariates subject to detection limits,” Biometrics, vol. 77, no. 2, pp. 610–621, 2021.
[26]
N. Narisetty and R. Koenker, “Censored quantile regression survival models with a cure proportion,” Journal of Econometrics, vol. 226, no. 1, pp. 192–203, 2022.
[27]
C. Chen et al., “Predictive maintenance using cox proportional hazard deep learning,” Advanced Engineering Informatics, vol. 44, p. 101054, 2020.
[28]
Y. Xu, Y. Cai, and L. Song, “Lifespan prediction of electronic card in nuclear power plant based on few samples,” Journal of Shanghai Jiaotong University (Science), pp. 1–7, 2023.
[29]
R. Xiao, T. Zayed, M. A. Meguid, and L. Sushama, “Improving failure modeling for gas transmission pipelines: A survival analysis and machine learning integrated approach,” Reliability Engineering & System Safety, vol. 241, p. 109672, 2024.
[30]
N. M. Kiefer, S. J. Lundberg, and G. R. Neumann, “How long is a spell of unemployment? Illusions and biases in the use of CPS data,” Journal of Business & Economic Statistics, vol. 3, no. 2, pp. 118–128, 1985.
[31]
N. M. Kiefer, “Economic duration data and hazard functions,” Journal of economic literature, vol. 26, no. 2, pp. 646–679, 1988.
[32]
E. Lüdemann, R. A. Wilke, and X. Zhang, “Censored quantile regressions and the length of unemployment periods in West Germany,” Empirical Economics, vol. 31, no. 4, pp. 1003–1024, 2006.
[33]
E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete observations,” Journal of the American statistical association, vol. 53, no. 282, pp. 457–481, 1958.
[34]
D. R. Cox, “Regression models and life-tables,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 34, no. 2, pp. 187–220, 1972.
[35]
L. J. Wei, “The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis,” Statistics in Medicine, vol. 11, no. 14–15, pp. 1871–1879, 1992.
[36]
D. Faraggi and R. Simon, “A neural network model for survival data,” Statistics in medicine, vol. 14, no. 1, pp. 73–82, 1995.
[37]
Q. Zhong, J. W. Mueller, and J.-L. Wang, “Deep extended hazard models for survival analysis,” Advances in Neural Information Processing Systems, vol. 34, pp. 15111–15124, 2021.
[38]
G. Kim, J. Park, and S. Kang, “Deep neural network-based accelerated failure time models using rank loss,” Statistics in Medicine, vol. 43, no. 28, pp. 5331–5343, 2024.
[39]
Y. Jia and J.-H. Jeong, “Deep learning for quantile regression under right censoring: DeepQuantreg,” Computational Statistics & Data Analysis, vol. 165, p. 107323, 2022.
[40]
R. Hao, C. Weng, X. Liu, and X. Yang, “Data augmentation based estimation for the censored quantile regression neural network model,” Expert Systems with Applications, vol. 214, p. 119097, 2023.
[41]
Z. Wang, J. Ding, L. Sun, and Y. Wu, “Tobit quantile regression of left-censored longitudinal data with informative observation times,” Statistica Sinica, vol. 28, no. 1, pp. 527–548, 2018.
[42]
T. Choi, S. Park, H. Cho, and S. Choi, “Interval-censored linear quantile regression,” Journal of Computational and Graphical Statistics, vol. 34, no. 1, pp. 187–198, 2024.
[43]
X. Yang, N. N. Narisetty, and X. He, “A new approach to censored quantile regression estimation,” Journal of Computational and Graphical Statistics, vol. 27, no. 2, pp. 417–425, 2018.
[44]
W. Härdle and O. Linton, “Applied nonparametric methods,” Handbook of econometrics, vol. 4, pp. 2295–2339, 1994.
[45]
K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
[46]
K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.
[47]
M. Telgarsky, “Benefits of depth in neural networks,” in Conference on learning theory, 2016, pp. 1517–1539.
[48]
D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural networks, vol. 94, pp. 103–114, 2017.
[49]
C.-H. Hsu, J. M. Taylor, S. Murray, and D. Commenges, “Multiple imputation for interval censored data with auxiliary variables,” Statistics in Medicine, vol. 26, no. 4, pp. 769–781, 2007.
[50]
M. Lee, L. Kong, and L. Weissfeld, “Multiple imputation for left-censored biomarker data based on gibbs sampling method,” Statistics in medicine, vol. 31, no. 17, pp. 1838–1848, 2012.
[51]
R. Hao, Q. Han, L. Li, and X. Yang, “DAmcqrnn: An approach to censored monotone composite quantile regression neural network estimation,” Information Sciences, vol. 638, p. 118986, 2023.
[52]
W. Cao and S. Wang, “Expectile regression for censored data based on data augmentation,” Biometrical Journal (Under the 2nd round of review), 2024.
[53]
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction using mini-batches,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 165–202, 2012.
[54]
S. Khirirat, H. R. Feyzmahdavian, and M. Johansson, “Mini-batch gradient descent: Faster convergence under data sparsity,” in 2017 IEEE 56th annual conference on decision and control (CDC), 2017, pp. 2880–2887.
[55]
R. Hao, H. Zheng, and X. Yang, “Data augmentation based estimation for the censored composite quantile regression neural network model,” Applied Soft Computing, vol. 127, p. 109381, 2022.