November 17, 2023
Rolling bearings, often referred to as "the joints of industry," are crucial components of rotating machinery and find extensive applications across various industrial domains[1]–[3]. The least severe outcome of a bearing failure is equipment malfunction and the resulting economic loss; the most severe is a safety incident causing personal injury. Consequently, the ability to predict a bearing's remaining useful life (RUL) accurately and promptly is of considerable research interest in the domain of industrial safety. Generally speaking, RUL prediction methods can be categorized into model-based, data-driven, and hybrid methods[4]. Model-based approaches employ physical representations of the degradation trajectory, integrating them with empirical data to project the forthcoming decline. For example, Singleton et al.[5] introduce a method utilizing an extended Kalman filter predicated on continuous-time state equations to estimate RUL. Qian et al.[6] present an enhanced particle filter technique that uses a variable importance density function and a backpropagation neural network to increase particle diversity for accurate RUL prediction.
The acquisition of physical models can become a formidable challenge in the face of complex systems or elaborate operating conditions. In such contexts, the abundance of monitoring data paves the way for the application of data-driven methods[7]–[14] that harness artificial intelligence algorithms for the prediction of RUL. These methods are characterized by their lack of reliance on prior knowledge, enabling their flexible deployment across a spectrum of problems and situational contexts. As a result, when a substantial dataset related to operational failures is at hand, predictive efforts tend to pivot primarily towards data-driven methodologies.
Traditional machine learning methods include neural networks, support vector machines, relevance vector machines, Bayesian inference, and fuzzy systems[15]–[19]. These mapping models are usually obtained by conducting regression analysis to model the relationship between RUL and the monitored data. Beyond the aforementioned methods, hybrid approaches aim to integrate physical models with data-driven methods, combining the advantages of both while mitigating their respective limitations for RUL prediction[20]. For instance, Cheng et al.[21] combine an adaptive neuro-fuzzy inference system with particle filters, used respectively for learning the state transition functions and for predicting RUL based on the learned state transition functions.
Over several years of development, deep learning methods have achieved breakthroughs in many fields, including image classification, speech recognition, and natural language understanding. In the realm of fault diagnosis and prognostics as well, deep learning-based methods for bearing RUL prediction have gradually become mainstream. By exploiting large datasets and advanced computational capabilities, deep learning facilitates the extraction of complex patterns and features from data, thereby significantly enhancing the accuracy and reliability of RUL predictions compared to traditional machine learning approaches. Ren et al.[9] propose a new method for bearing RUL prediction based on a deep convolutional neural network (CNN). Guo et al.[22] study a recurrent neural network (RNN) based health indicator for RUL prediction of bearings. As an improvement on the RNN, a long short-term memory (LSTM) network was introduced by Yuan et al.[10] for effective fault diagnosis and RUL estimation of aero engines. Que et al.[11] study RUL prediction for bearings based on a gated recurrent unit (GRU) and propose an ensemble data-driven approach to predict the RUL of bearings.
The temporal convolutional network (TCN) is a recent improvement on the CNN architecture, utilizing dilated causal convolution (DCC) to extract information from historical data. Dilated causal convolution typically comprises fewer layers compared to classical CNNs, yet it is capable of capturing the same receptive field. Through the dilation operation, TCN expands the kernel’s reach without requiring additional parameters, hence maintaining computational efficiency while still being able to discern long-term dependencies in the data, making it a suitable choice for time-sensitive applications such as RUL prediction and time series forecasting[12]–[14]. Therefore, this paper adopts TCN as the backbone network for the experiments.
Additionally, it is now common practice to first convert vibration signals into health indicators (HI), which can guide the network in various ways. Kong et al.[23] use the mean absolute value of signal extremes (MAVE) to characterize signal energy, addressing the problem that the HI selected by existing models at the first prediction time (FPT) is insensitive to the initial failure. Nowadays, constructing the HI from the latent values of an autoencoder's (AE) hidden layer is becoming increasingly popular. Chen et al.[24] introduce a novel deep convolutional autoencoder based on a quadratic function for constructing health indicators from raw vibration signals to predict the RUL of bearings. Liu et al.[25] study a densely connected fully convolutional autoencoder for slewing bearing degradation trend prediction, verified on large-scale slewing bearing data from a highly accelerated life test. Despite the increasingly mature techniques for creating HIs, two shortcomings remain: (1) some methods perform well on specific datasets but lack generalization capability, leading to decreased performance when applied to different types of bearings or under varying operating conditions; (2) extracting useful features from raw vibration signals is challenging, and existing methods may not adequately capture all the information related to bearing health contained within the signals.
In order to solve the issues in health indicator generation, this paper establishes a new RUL prediction framework, with the specifics of this method illustrated in Figure 1. Upon normalization of all data, it is fed into VQ-VAE for label generation. As an improvement over the AE, the VQ-VAE[26] was initially developed for image generation tasks, providing a means to learn discrete latent representations, which are useful for generating high-quality images. The introduction of a discrete codebook in VQ-VAE enables the model to handle more complex data distributions than standard AEs, making it particularly powerful for tasks that benefit from compressed, yet expressive, representations of data. The created labels along with the normalized data are then sent to the predictive neural network TCN for training. The loss of the model is computed as the mean squared error (MSE) between the predicted results and the labels obtained through VQ-VAE. The contributions of this paper can be summarized in three main points:
An end-to-end HI construction method, VQ-VAE, is introduced for the prediction of bearing RUL. This method optimizes the shortcomings of HI metrics generated by traditional methods and resolves the need for dimensionality reduction of latent variables in conventional unsupervised learning methods such as Autoencoder.
In light of the inadequacy of traditional statistical metrics like MSE, MAE, Score, etc., in reflecting the fluctuation in curves accurately, two novel statistical metrics, mean absolute distance (MAD) and mean variance (MV), are proposed. These metrics can accurately depict the fluctuation patterns in curves, thereby reflecting the model’s accuracy in identifying similar features.
A novel RUL prediction framework is established, constituting an end-to-end prediction process that directly employs vibration signal inputs, eliminating the need for a priori knowledge for feature selection and dimensionality reduction.
This paper is structured as follows: Section 2 briefly describes the basic principles of temporal convolutional networks (TCN), autoencoder (AE), and vector quantised variational autoencoder (VQ-VAE). Section 3 introduces the basic network structures of ASTCN, AE, and VQ-VAE, as well as the associated label generation methods. Section 4 evaluates the practical performance of the RUL prediction framework through experiments. Section 5 concludes the paper and proposes some directions for future work.
Causal convolution is a type of convolutional operation used primarily in the context of time series data and sequence modeling, where the output at a given time is determined only by the current and past inputs, not by any future inputs. To achieve this, causal convolutions often use padding on the left side (past) of the input sequence but not on the right side (future), ensuring that the convolutional filter does not see future data[27]. Dilated causal convolution is an extension of causal convolution. The dilation aspect introduces a skip factor in the convolution operation, allowing the network to encompass a broader scope of input data without the necessity for increasing the number of parameters or layers, thus maintaining computational efficacy. Formally, the dilated convolution operation can be defined as follows:
\[y(t)=\sum_{i=0}^{k-1}f(i)\cdot x(t-d\cdot i)\] where \(y(t)\) is the output at time t, \(x(t)\) is the input at time t, \(f(i)\) is the filter value at position i, \(k\) is the size of the filter, and \(d\) is the dilation factor.
In traditional convolution, the dilation factor \(d\) is 1, meaning that the filter operates on consecutive input values. With a dilation factor greater than 1, however, the operation skips \(d-1\) positions between inputs, thereby extending the receptive field of the convolution without enlarging the filter size, as Figure 2 shows. The causal aspect ensures that the output at any time \(t\) depends only on the current and past input values, adhering to the temporal ordering and guaranteeing that future data does not influence the present output. This is crucial for tasks such as time series forecasting or RUL prediction[12]–[14], where the model's predictions should only be influenced by past and present data.
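As a concrete illustration, the following minimal PyTorch sketch implements a dilated causal convolution by left-padding the input with \((k-1)\cdot d\) zeros before a standard `Conv1d`. The module name and the layer sizes in the example are illustrative assumptions, not the exact settings used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal 1-D convolution: the output at time t depends only on
    inputs at times <= t, achieved by padding (k-1)*d zeros on the left."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad the past side only
        return self.conv(x)

# Example: a length-2560 vibration segment keeps its length after convolution.
y = CausalConv1d(1, 16, kernel_size=3, dilation=2)(torch.randn(4, 1, 2560))
print(y.shape)  # torch.Size([4, 16, 2560])
```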
In TCNs, the depth of the network directly affects the size of its receptive field, thereby influencing the model’s ability to learn and understand long-term dependencies in time series data. The utilization of residual connections[28] allows for the training of significantly deeper networks without encountering the challenges of increased training difficulty or the vanishing gradient problem. Within the TCN architecture, the incorporation of residual connections fosters a robust and efficient learning environment, especially over extended temporal sequences, ensuring that crucial information from earlier time steps is carried forward and not lost, thus substantively aiding in capturing long-term dependencies inherent in sequential data.
These connections, also known as skip connections, form shortcuts that bypass one or more layers in the network, facilitating an easier learning process and mitigating the vanishing gradient problem commonly encountered in deep networks. The residual connection is represented as \(Y=F(X)+X\), where \(Y\) is the output, \(X\) is the input, and \(F(X)\) denotes the transformation applied by the layer, including operations such as convolution and activation functions. The improved residual structure is presented in Figure 3 (a). Each TCN block consists of two dilated causal convolution blocks, each preceded by batch normalization, LeakyReLU, and dropout; the final output is the sum of the dilated causal convolution path and a 1x1 convolution on the skip path.
Figure 3: (a) Block diagram of the fully residual module.
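A minimal sketch of the residual block just described, reusing the `CausalConv1d` module from the previous snippet; the channel handling of the skip path and the default hyperparameters are assumptions based on the description above.

```python
import torch.nn as nn

class TCNBlock(nn.Module):
    """Residual TCN block: two dilated causal convolutions, each preceded by
    BatchNorm, LeakyReLU and Dropout; the output is F(X) plus a skip path
    (a 1x1 convolution when the channel count changes)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1,
                 leaky_rate=0.2, dropout=0.3):
        super().__init__()
        self.pre1 = nn.Sequential(nn.BatchNorm1d(in_ch),
                                  nn.LeakyReLU(leaky_rate), nn.Dropout(dropout))
        self.conv1 = CausalConv1d(in_ch, out_ch, kernel_size, dilation)
        self.pre2 = nn.Sequential(nn.BatchNorm1d(out_ch),
                                  nn.LeakyReLU(leaky_rate), nn.Dropout(dropout))
        self.conv2 = CausalConv1d(out_ch, out_ch, kernel_size, dilation)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        y = self.conv2(self.pre2(self.conv1(self.pre1(x))))
        return y + self.skip(x)              # Y = F(X) + X
```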
An autoencoder (AE) is a variant of artificial neural networks designed for the unsupervised learning of efficient encodings. The primary goal of an AE is to learn a representation (encoding) for a set of data, typically for dimensionality reduction[29]. An autoencoder is structured into two key segments: the encoder, which condenses the input into a compact latent representation, and the decoder, which strives to recreate the initial input from this compressed form. The fundamental principle of an AE's operation is to minimize the discrepancy between the original data and its reconstruction, which is achieved by employing backpropagation and gradient descent to refine the network's parameters. Mathematically, this can be represented as minimizing the loss function \(L(x,\hat{x})\), where \(x\) is the input data and \(\hat{x}\) is the reconstructed output. Formally, the encoding and decoding processes in an autoencoder are expressed as:
\[\begin{align} h&=f(W\cdot x+b)\\ \hat{x}&=g(W'\cdot h+b') \end{align}\] where \(h\) is the encoded representation, \(f\) and \(g\) are the activation functions for the encoding and decoding processes respectively, \(W\) and \(W'\) are the weight matrices, \(b\) and \(b'\) are the bias vectors.
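The encode-decode pair above maps directly onto a small PyTorch module; the layer widths below (a 1024-point segment compressed to a 32-dimensional code) are illustrative assumptions, not the exact architecture of Table 7.

```python
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    """Minimal AE: h = f(Wx + b), x_hat = g(W'h + b')."""
    def __init__(self, in_dim=1024, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)            # latent representation
        return self.decoder(h), h      # reconstruction and code

# Training minimizes the reconstruction loss L(x, x_hat),
# e.g. loss = nn.MSELoss()(x_hat, x).
```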
Building upon the fundamental principles of AE, VQ-VAE[26] introduces a discrete latent representation, aiming to address the limitations of continuous latent variables often encountered in variational autoencoders (VAEs) [30]. The core idea behind VQ-VAE is to employ vector quantization in the latent space, wherein each point in the latent space is mapped to the nearest point in a predefined grid of embeddings, as Figure 4 shows.
The quantization process in VQ-VAE is achieved by employing a codebook of vectors, and each element in the latent space is replaced by the nearest code in the codebook. The objective function of VQ-VAE comprises three terms: a reconstruction loss term ensuring the quality of reconstruction, a codebook loss ensuring the effective utilization of the codebook, and a commitment loss term encouraging the encoder to commit to a particular code, as the Equation (3) shows:
\[L=\log p(x|z_q(x))+\|sg[z_e(x)]-e\|_2^2+\beta \|z_e(x)-sg[e]\|_2^2\] where \(x\) is the input value, and \(z_q(x)\) is its reconstructed counterpart via the decoder. The term \(e\) represents the quantized output from the codebook, while \(z_e(x)\) is the encoder’s output. The ‘sg’ operation denotes the ‘stop gradient’ process. Lastly, \(\beta\) serves as a coefficient that weights the commitment loss in the model.
\(\log p(x\mid z_q(x))\) represents the reconstruction loss. It quantifies the divergence between the actual input \(x\) and its reconstructed version based on the quantized representation \(z_q(x)\). The purpose of this term is to ensure a faithful reproduction of the input post-encoding and decoding.
\(\left\lVert sg[z_e(x)] - e \right\rVert_2^2\) corresponds to the codebook loss. The ‘sg’ denotes the ‘stop gradient’ operation, which prevents certain gradients from updating during the backpropagation process. This loss component ensures that the encoder’s continuous output closely matches the discrete codebook entries.
\(\beta\left\lVert z_e(x) - sg[e]\right\rVert_2^2\) is the commitment loss. The coefficient \(\beta\) denotes the weight of this term, emphasizing the importance of the encoder’s commitment to specific codebook vectors. The ‘sg’ applied to \(e\) ensures that during optimization, the encoder’s weights are updated based on this loss, while the codebook remains unchanged.
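To make these three terms concrete, the sketch below implements the quantization step with a codebook of 32 embeddings of dimension 16 (matching the setting used later in this paper). The straight-through gradient estimator and the commitment weight \(\beta=0.25\) follow the original VQ-VAE formulation; the module name is ours and the snippet is a sketch rather than the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-code lookup plus the codebook and commitment terms of Equation (3)."""
    def __init__(self, num_codes=32, dim=16, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):                                  # z_e: (batch, dim) encoder output
        dists = torch.cdist(z_e, self.codebook.weight)       # distance to every code
        idx = dists.argmin(dim=1)                            # nearest code index
        z_q = self.codebook(idx)                             # quantized latent e
        codebook_loss = F.mse_loss(z_q, z_e.detach())        # ||sg[z_e(x)] - e||^2
        commit_loss = self.beta * F.mse_loss(z_e, z_q.detach())  # beta*||z_e(x) - sg[e]||^2
        z_q = z_e + (z_q - z_e).detach()                     # straight-through gradient to encoder
        return z_q, codebook_loss + commit_loss, idx

# Total training loss = reconstruction_loss(decoder(z_q), x) + codebook_loss + commit_loss.
```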
Compared to some feature extraction methods, using raw data directly can better reflect the network’s feature extraction capability[31]. Therefore, we chose ASTCN [32] as our backbone network.
As Figure 5 shows, the ASTCN block, in addition to the TCN block, includes the AS module and the soft thresholding function. The AS module takes in a feature map and calculates a threshold for each channel, which is then passed to the soft thresholding function. Its working mechanism is as follows: suppose the input tensor has \(N\) rows and \(M\) columns; the tensor undergoes an absolute-value operation followed by global average pooling, yielding a tensor \(h\) with dimensions \(N \times 1\). This tensor is then successively passed through a fully connected layer, a BN layer, and a sigmoid activation function to constrain its values between 0 and 1; the result is denoted \(h'\). Performing element-wise multiplication between \(h\) and \(h'\) gives the final result with \(N\) rows and \(1\) column, so each channel obtains a corresponding threshold value. The boundary of the soft thresholding function is controlled by the threshold \(\tau\) computed by the AS module. Finally, adding the value originally input to the block yields the block's output.
The soft thresholding function, given by Equation (4), is a fundamental tool utilized predominantly in the realm of sparse signal processing and compressive sensing. Given a threshold parameter \(\tau\), this function serves to shrink the magnitude of its input \(x\) towards zero. Specifically, for magnitudes less than or equal to \(\tau\), the function maps the input directly to zero, effectively introducing sparsity. On the other hand, for values exceeding \(\tau\) in either positive or negative direction, the function decreases the absolute magnitude by \(\tau\), ensuring that only significant coefficients persist while lesser important ones are suppressed. By using the soft thresholding function as an activation function, we can encourage the network to produce sparse activations. This means that in certain scenarios, the output of some neurons will be set to zero, which can help reduce the complexity of the network and enhance interpretability.
\[\mathrm{soft}(x,\tau)=y=\begin{cases}x+\tau, & x<-\tau\\ 0, & |x|\leq\tau\\ x-\tau, & x>\tau\end{cases}\] where \(y\) and \(x\) are the output and input feature, respectively.
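A minimal sketch of the channel-wise thresholding just described: global average pooling of \(|x|\) gives \(h\), a fully connected layer, BN and sigmoid give \(h'\), and the threshold \(\tau = h \odot h'\) is applied through Equation (4). The module name and the use of `BatchNorm1d` on the pooled vector are our assumptions based on the description above; in the full ASTCN block this result is then added back to the block input.

```python
import torch
import torch.nn as nn

class AdaptiveSoftThreshold(nn.Module):
    """Computes one threshold per channel and applies soft thresholding."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                              # x: (batch, channels, length)
        h = x.abs().mean(dim=2)                        # global average pooling of |x|
        h_prime = torch.sigmoid(self.bn(self.fc(h)))   # h' constrained to (0, 1)
        tau = (h * h_prime).unsqueeze(2)               # per-channel threshold tau
        # soft(x, tau): shrink towards zero, zero out |x| <= tau
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```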
The ASTCN[32] structure is shown in Figure 6, and its structural parameters are shown in Table 1. The network starts with an input layer for time series data, followed by a convolutional layer and a max pooling layer. After this initial processing, the data passes through a dropout layer for regularization and then sequentially through three ASTCN blocks. Post-processing includes a global average pooling layer, a flatten layer, a fully connected layer, a ReLU activation function, and a layer that outputs the predicted RUL. The network calculates the loss by comparing this prediction to the labeled RUL and uses backpropagation for optimization during training.
Layer | Parameter |
---|---|
Conv 1D | Kernel length = 12, channel = 16, stride = 4 |
Maxpooling 1D | Kernel length = 4, channel = 16, pooling size = 4 |
Dropout | Dropout rate = 0.3 |
AS-TCN Block 1 | Kernel length = 3, channel = 12, dilation rate = 1, leaky rate = 0.2, dropout =0.3 |
AS-TCN Block 2 | Kernel length = 3, channel = 6, dilation rate = 2, leaky rate = 0.2, dropout = 0.3 |
AS-TCN Block 3 | Kernel length = 3, channel = 4, dilation rate = 4, leaky rate = 0.2, dropout = 0.3 |
Dense | channel = 1 |
We use an AE [33] to produce health indicators. Its structure is quite simple, as shown in Table 7. The encoder consists of four convolutional blocks followed by another two 1x1 convolutions. Each block is composed of a convolution layer, a BN layer, a ReLU layer, and a max pooling layer. The structure of the decoder is symmetrical to that of the encoder, placing the latter convolutions at the forefront and replacing the max pooling layer in each block with an upsampling layer. The size of all convolution kernels in the autoencoder is 3x1.
Layer name | Kernel size | Stride | Number of kernels | Output size |
---|---|---|---|---|
Input | – | – | – | 1024 x 1 |
Convolution1 | 1 x 3 | 1 x 1 | 4 | 1024 x 4 |
Pooling1 | 1 x 2 | 1 x 2 | – | 512 x 4 |
Convolution2 | 1 x 3 | 1 x 1 | 4 | 512 x 4 |
Pooling2 | 1 x 2 | 1 x 2 | – | 256 x 4 |
Convolution3 | 1 x 3 | 1 x 1 | 8 | 256 x 8 |
Pooling3 | 1 x 2 | 1 x 2 | – | 128 x 8 |
Convolution4 | 1 x 3 | 1 x 1 | 8 | 128 x 8 |
Pooling4 | 1 x 2 | 1 x 2 | – | 64 x 8 |
Convolution5 | 1 x 3 | 1 x 1 | 16 | 64 x 16 |
Pooling5 | 1 x 2 | 1 x 2 | – | 32 x 16 |
Convolution6 | 1 x 3 | 1 x 1 | 16 | 32 x 16 |
VQ-VAE[26] and AE have fundamentally similar structures, but there is a change in the number of channels in the convolutional kernels, as shown in Table 2. Moreover, VQ-VAE[26] introduces an additional component called the codebook. The dimensions of the codebook are 32x16, meaning there are 32 embeddings, and each embedding is 16-dimensional.
In addition to directly receiving vibration data, the VQ-VAE and AE in this paper also accept 38-dimensional data that has undergone feature selection. For handling 38-dimensional data, modifications in the network architecture have been made by reducing two convolutional layers along with their corresponding max-pooling layers, as shown in Table 8 and Table 9. This reduction is aimed at maintaining a balance between the model’s complexity and its capability to capture essential information from the data.
Currently, most remaining useful life prediction approaches still adopt supervised learning, which means the ability to label data accurately directly affects the final prediction accuracy. There are two primary labeling methods [34]–[38]. One uses a linear function, illustrated in Figure 7a and defined in Equation (5). The value of RUL decreases linearly, with the degradation rate remaining constant. However, during the actual operation of a bearing, the degradation rate does not remain constant, so the experimental results of this labeling method may not be optimal.
\[y(t_i)=-\left(\frac{1}{t_n}*t_{i}\right)+1\]
where \(t_n\) is the total number of samples, and \(t_i\) denotes the sequence number \(i\).
The other employs a piecewise function to model the degradation trend of bearings. As depicted in Figure 7b, the RUL remains constant up to time \(t_j\) (0.4 in the figure) and subsequently diminishes linearly until the bearing reaches complete failure at time \(t_n\). The primary hurdle of this technique is pinpointing the precise onset of the FPT. Additionally, this strategy tends to neglect faint patterns present in early data, and the degradation rate is not linear during the rapid deterioration phase.
\[y(t_i)=\begin{cases}1&t_i\leq t_j\\\left(\frac{1}{t_j-t_n}\right)*t_i+\left(\frac{t_n}{t_n-t_j}\right)&t_i>t_j\end{cases}\] where \(t_n\) is the total number of samples, \(t_i\) denotes the sequence number \(i\) and \(t_j\) is the FPT.
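For reference, a small sketch generating both label types from Equation (5) and the piecewise formula above; the sample count and FPT index in the example are derived from the bearing 1-1 entries in Tables 3 and 4 (10-second sampling) and are illustrative.

```python
import numpy as np

def linear_labels(t_n):
    """Equation (5): RUL decreasing linearly from just below 1 to 0."""
    t = np.arange(1, t_n + 1)
    return 1.0 - t / t_n

def piecewise_labels(t_n, t_j):
    """Piecewise label: constant 1 until the FPT index t_j, then linear decay to 0."""
    t = np.arange(1, t_n + 1)
    y = np.ones(t_n)
    late = t > t_j
    y[late] = t[late] / (t_j - t_n) + t_n / (t_n - t_j)
    return y

# Example: bearing 1-1 has 2803 samples with an FPT around sample 1142.
labels = piecewise_labels(2803, 1142)
```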
Given the issues associated with the preceding two methods, it is common practice to construct HI curves as labels. An optimal HI exhibits non-linear behavior, maintains a certain degree of monotonicity, and experiences minimal fluctuations within a confined range, with a value domain spanning from 0 to 1. HI construction usually involves first transforming the waveform into corresponding features, followed by feature selection to eliminate features with poor monotonicity and trend. Subsequently, feature dimensionality reduction is carried out using methods such as PCA (principal component analysis) or an autoencoder. In this study, a suite of traditional time- and frequency-domain feature extraction methods is utilized. The selected time-domain features are root mean square, variance, peak value, peak-to-peak value, skewness, kurtosis, peak index, margin index, and waveform index, alongside frequency-domain attributes such as root mean square frequency and frequency center of gravity. In addition, the original signal undergoes a three-level wavelet packet decomposition, yielding eight distinct sub-bands; the energy content of each sub-band is quantified and used as a feature. The vibration data are collected in two principal orientations, horizontal and vertical. From these, a set of 38 distinct features is derived, which is subsequently condensed via PCA to delineate the bearing's degradation trajectory. This method is referred to as the "feature+PCA" method in this paper, abbreviated as "F+PCA".
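A sketch of the 19-feature extraction for one channel (9 time-domain, 2 frequency-domain, and 8 wavelet-packet energies) followed by PCA. The exact definitions of the margin index and the frequency-domain indices vary in the literature, so the formulas below are common conventions rather than the paper's verified implementation; `pywt` and `scikit-learn` are assumed to be available.

```python
import numpy as np
import pywt
from scipy import stats
from sklearn.decomposition import PCA

def channel_features(x, fs=25600):
    """19 features for one vibration channel (two channels -> 38 features)."""
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    feats = [rms, np.var(x), peak, np.ptp(x),
             stats.skew(x), stats.kurtosis(x),
             peak / rms,                                    # peak (crest) index
             peak / np.mean(np.sqrt(np.abs(x))) ** 2,       # margin index
             rms / np.mean(np.abs(x))]                      # waveform index
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    feats += [np.sqrt(np.sum(freqs ** 2 * power) / np.sum(power)),  # RMS frequency
              np.sum(freqs * power) / np.sum(power)]                # frequency centre of gravity
    wp = pywt.WaveletPacket(x, wavelet="db4", maxlevel=3)           # 3-level decomposition
    feats += [np.sum(node.data ** 2)                                # 8 sub-band energies
              for node in wp.get_level(3, order="natural")]
    return np.asarray(feats)

# feature_matrix: (num_samples, 38) from the horizontal and vertical channels;
# PCA(n_components=1).fit_transform(feature_matrix) then yields the "F+PCA" trajectory.
```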
In a similar vein, features can be extracted using an autoencoder. This method is referred to as "feature+AE", abbreviated as "F+AE"; the process results in a 1x9 vector, and its structure is depicted in Table 8. In the case of directly employing an autoencoder (AE), the encoder generates a 1x32 vector. Subsequently, a SOM (self-organizing map) network[33] is trained using the vector obtained from the autoencoder. This network, trained through unsupervised learning, produces a low-dimensional (typically two-dimensional) discretized representation of the input space of the training samples. Each node on the SOM grid competes to be the closest to the input vector, and over time the nodes self-organize to form a map in which spatial proximity corresponds to feature similarity. After training, the best matching node for each sample is identified, and the HI value for each sample is derived by computing the distance between the sample and its corresponding node. The formula can be expressed as:
\[HI =||x_i-w_i||\] where \(x_i\) represents the vector representation of the \(i^{th}\) sample after being processed through the encoder of AE, and \(w_i\) denotes the vector representation of the best matching node corresponding to \(x_i\) in the SOM network.
For the VQ-VAE network, the codebook serves the same clustering function as the SOM; accordingly, we also have an approach called "feature+VQVAE", abbreviated as "F+VQVAE". The dimensions of the latent variables produced by the direct use of VQ-VAE and by "F+VQVAE" are 32x16 and 9x4, respectively, with the corresponding structures presented in Table 2 and Table 9. Notably, the integration of the clustering process within the neural network framework in VQ-VAE, unlike in SOM, facilitates end-to-end training. This, along with its principled regularization through variational inference, not only eases its integration into larger neural network architectures but also potentially enhances generalization across varying data scenarios. We use the same formula to compute the HI: \[HI =||x_i-w_i||\] where \(x_i\) represents the vector representation of the \(i^{th}\) sample after being processed through the encoder of VQ-VAE, and \(w_i\) denotes the vector representation of the nearest code corresponding to \(x_i\) in the codebook.
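Given a trained encoder and codebook, the HI formula above reduces to a nearest-code distance. The sketch below assumes the encoder outputs and the codebook weights are already available as tensors; the function name is ours.

```python
import torch

def codebook_health_indicator(z_e, codebook):
    """HI_i = ||x_i - w_i||: distance from each encoded sample to its nearest code.
    z_e:      (num_samples, embedding_dim) encoder outputs
    codebook: (num_codes,   embedding_dim) learned VQ-VAE embeddings"""
    dists = torch.cdist(z_e, codebook)      # pairwise Euclidean distances
    nearest = dists.argmin(dim=1)           # index of the best-matching code w_i
    return torch.linalg.norm(z_e - codebook[nearest], dim=1)  # one HI value per sample
```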
Our method was validated using a run-to-failure bearing dataset [39]. This dataset, presented at the IEEE PHM 2012 Data Challenge, was captured with the PRONOSTIA test rig, depicted in Figure 8. Data were gathered by two accelerometers, oriented horizontally and vertically. With a sampling rate of 25.6 kHz, readings were taken every 10 seconds for a duration of 0.1 seconds, resulting in 2560 data points per instance. For safety, testing was halted once the vibration amplitude exceeded 20 g, and the time at which this threshold was crossed was taken as the bearing's failure time. Our predictive framework was assessed using the run-to-failure data under two conditions, summarized in Table 3.
During the prediction model training, the AdamW optimizer is utilized to minimize the loss function, with a learning rate of 0.001 and a batch size of 128. The network is trained for 100 epochs, and an early stopping mechanism with a patience of 15 is employed to prevent overfitting. For training the label model, the AdamW optimizer with a learning rate of 0.001 and a batch size of 256 is used; the network is trained for 150 epochs with an early stopping patience of 20. Results presented in this paper, covering both the label and prediction models, are averaged over three independent runs using random seeds 15, 16, and 25 to ensure consistency and reproducibility. All experiments are conducted on an NVIDIA GeForce 3090 (24GB) GPU, and the models are implemented using the PyTorch framework.
An alternating leave-one-out scheme is employed to divide the dataset into training and testing sets; that is, one of the seven bearings, selected by its identifier, serves as the test set while the remaining six bearings form the training set. The label model is trained using the data from all seven bearings under operating condition 1. The results and comparisons presented in this paper primarily focus on bearings 1-1 and 1-4 to demonstrate the performance of the various label generation methods. The life cycle data are shown in Figure 9.
Operating Conditions | Working Load | Rotating Speed (r/min) | Bearing Number | Actual Life (s) |
---|---|---|---|---|
condition one | 4000 N | 1800 | Bearing1-1 | 28,030 |
 | | | Bearing1-2 | 8,710 |
 | | | Bearing1-3 | 23,750 |
 | | | Bearing1-4 | 14,280 |
 | | | Bearing1-5 | 24,630 |
 | | | Bearing1-6 | 24,480 |
 | | | Bearing1-7 | 22,590 |
condition two | 4200 N | 1650 | Bearing2-1 | 9,110 |
 | | | Bearing2-2 | 7,970 |
 | | | Bearing2-3 | 19,550 |
 | | | Bearing2-4 | 7,510 |
 | | | Bearing2-5 | 23,110 |
We have chosen the most commonly used RMSE (root mean square error) metric to evaluate the difference between the predicted values and the labels. It represents the square root of the quadratic mean of the differences between predicted and observed values, as Equation (9) shows. \[RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i-x_{i}\right)^{2}}\] where \(\hat{x}_i\) denotes the actual RUL of the \(i^{th}\) predictive object, \(x_i\) signifies the predicted RUL of the \(i^{th}\) predictive object, and \(N\) denotes the number of samples.
To ensure the validity of the prediction results, we also adopt the score metric proposed by the official PHM2012 challenge[39]; the formula is as follows:
\[\begin{align} E_{i}&=\hat{x}_{i}-x_{i}\\ A_i&=\begin{cases} \exp\left(-\frac{E_{i}}{13}\right) &\text{ if }E_{i}\leq0\\ \exp\left(\frac{E_{i}}{10}\right) &\text{ if }E_{i}>0 \end{cases} \\ Score&=\frac{1}{N}\sum_{i=1}^{N}A_{i} \end{align}\] where \(\hat{x}_i\) denotes the actual RUL of the \(i^{th}\) predictive object, \(x_i\) signifies the predicted RUL of the \(i^{th}\) predictive object, and \(N\) denotes the total number of samples.
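Both evaluation metrics are straightforward to compute; in the sketch below `actual` and `predicted` are arrays of normalized RUL values, following the sign convention of Equation (9) and the score formula above.

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

def phm2012_score(actual, predicted):
    """Asymmetric PHM2012 score: errors in the two directions are penalized
    with different exponential rates (13 vs. 10)."""
    e = np.asarray(actual) - np.asarray(predicted)               # E_i
    a = np.where(e <= 0, np.exp(-e / 13.0), np.exp(e / 10.0))    # A_i
    return np.mean(a)
```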
In order to conduct a quantitative analysis of the labels, this paper selects monotonicity and trendency indicators from commonly used health indicator evaluation metrics. Monotonicity measures the monotonous variation trend of the HI curve, while trendiness reflects the correlation between degradation stages.
\[\begin{align} mon&=\left|\frac{Num~of~\delta x>0}{N-1}-\frac{Num~of~\delta x<0}{N-1}\right|\\\\ corr&=\frac{\left|\sum_{i=1}^N\left(x_i-\bar{x}\right)(l_i-\bar{l})\right|}{\sqrt{\sum_{i=1}^N\left(x_i-\bar{x}\right)^2\sum_{i=1}^N\left(l_i-\bar{l}\right)^2}} \end{align}\] where \(\delta x\) denotes the difference between two adjacent points on the HI curve, \(x_i\) represents the predicted RUL for the \(i^{th}\) sample, and \(l_i\) is the index number corresponding to that sample. Furthermore, \(\bar{x}\) is the mean of the predicted RUL values, \(\bar{l}\) is the mean of the sample indices, and \(N\) represents the total number of samples.
The aforementioned indicators, monotonicity and trendiness, may not adequately reflect the fluctuation characteristics of the data. Therefore, we propose two metrics: mean absolute distance between two adjacent points and mean variance between windows, to provide a more nuanced analysis of the data.
The mean absolute distance (MAD) method is straightforward, involving the computation of the absolute distance between every pair of adjacent points, followed by summing these distances and taking the average. The core idea hinges on the notion that an ideal HI progresses smoothly from 1 to 0, so the cumulative movement distance of all points stays close to its minimum. Conversely, if the points exhibit more significant fluctuations, implying a more pronounced up-and-down movement, the cumulative distance increases, suggesting a less accurate HI. Its equation is shown as follows:
\[mad=\frac{1}{n-1}\sum_{i=2}^n|x_i-x_{i-1}|\] where \(x_i\) denotes the predicted RUL of the \(i^{th}\) predictive object and \(n\) represents the total number of samples.
The primary concept of the variance method revolves around partitioning consecutive data points into designated windows, computing the variance of the data within each window, and averaging the variances across all windows. The windows overlap: after each computation, the window is shifted forward by one data point. This overlapping, incremental shifting facilitates a continuous and smooth analysis across the data stream, ensuring that the evaluation captures the data's behavior comprehensively. Its equations are shown as follows: \[\begin{align} \mu_i&=\frac{1}{k}\sum_{j=0}^{k-1}x_{i+j} \\ \sigma_i&=\frac{1}{k}\sum_{j=0}^{k-1}(x_{i+j}-\mu_i)^2 \\ mv&=\frac{1}{N-k+1}\sum_{i=1}^{N-k+1}\sigma_i \end{align}\] where \(k\) denotes the length of each window, \(x_{i+j}\) represents the predicted value of the \(j^{th}\) predictive object in the \(i^{th}\) window, \(N\) signifies the total number of data points in the sample, \(\sigma_i\) denotes the variance of the window starting from the \(i^{th}\) data point, and \(\mu_i\) represents the mean of the window starting from the \(i^{th}\) data point.
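The two proposed metrics can be computed directly from a predicted (or label) curve; the window length `k` in the sketch below is an illustrative choice, not a value fixed by the paper.

```python
import numpy as np

def mean_absolute_distance(x):
    """MAD: average absolute step between adjacent points."""
    x = np.asarray(x)
    return np.mean(np.abs(np.diff(x)))

def mean_variance(x, k=20):
    """MV: average variance over overlapping windows of length k,
    shifted forward one point at a time."""
    x = np.asarray(x)
    windows = np.lib.stride_tricks.sliding_window_view(x, k)   # shape (N-k+1, k)
    return float(np.mean(windows.var(axis=1)))
```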
In this section, we conduct a comprehensive comparison of different label creation methods. For the selection of FPT points in segmented labels, we employ the method proposed by Zhang et al.[40], which is based on the \(3\sigma\) technique; the resulting FPTs are listed in Table 4.
For HI, apart from the previously mentioned "feature+PCA" , "feature+AE" and "feature+VQVAE" methods, we have also introduced three additional methods, namely RMS[23], direct AE and direct VQ-VAE, for comparison. All these HI curves are smoothed using the Savitzky-Golay filter with a window length of 21.
Bearings | FPT[40] | Actual Life (s) |
---|---|---|
B1 | 11420 | 28030 |
B2 | 8220 | 8710 |
B3 | 9600 | 23750 |
B4 | 10180 | 14280 |
B5 | 24070 | 24630 |
B6 | 16270 | 24480 |
B7 | 22040 | 22590 |
Figure 10 displays six distinct label creation methods for bearing 1-1. In sub-figure (a), the prediction performance of the three methods remains relatively stable over most of the time intervals; however, after t=2200, a certain level of fluctuation is evident in the two methods other than VQ-VAE. In sub-figure (b), the "feature+AE" method exhibits a trough at t=1250, while the "feature+VQVAE" method maintains smoothness throughout the entire process.
In Figure 11, the decline for bearing 1-4 is markedly more severe compared to Figure 10. The two non-neural-network methods demonstrate fairly good performance. In Figure 11 (a), a pronounced trough is observed at t=1250 for the AE method, while the other two methods remain relatively smooth. In Figure 11 (b), the "feature+PCA" method maintains smoothness throughout, whereas the initial value predicted by "feature+VQVAE" exhibits a certain degree of offset.
The test results of these label creation methods on bearings 1 to 7 are shown in Table 5. The "feature+VQVAE" method achieves the lowest MAD, while the AE method achieves the lowest MV value, with the MV value of VQ-VAE not differing significantly from it. The RMS method, which has the highest monotonicity and trendiness, performs poorly on both MAD and MV, indicating that these two traditional metrics cannot accurately measure the fluctuation of the curves.
Metrics | AE | F+VQVAE | VQVAE(Our) | F+AE | RMS | PCA |
---|---|---|---|---|---|---|
Mon | 0.0310 | 0.0203 | 0.0277 | 0.0176 | 0.0438 | 0.0379 |
Tend | -0.309 | -0.292 | -0.318 | -0.330 | -0.408 | -0.247 |
MAD | 0.0293 | 0.0215 | 0.0310 | 0.0373 | 0.0336 | 0.0479 |
MV | 0.00103 | 0.00135 | 0.00108 | 0.00205 | 0.00388 | 0.00616 |
In this section, the ASTCN model is employed as the prediction model, and the alternating training method is utilized to test different label methods on bearings 1 to 7. In addition to the labels mentioned in the previous subsection, two commonly used labels, linear and segmented, are also tested. The test results are illustrated in Figures 12 and 13, and Table 6.
Figure 12a displays the prediction results for bearing 1-1, where VQ-VAE method lags in reflecting the degradation trend in the latter part, failing to accurately capture the deteriorating trajectory. In Figure 12b, "F+VQVAE" and VQ-VAE methods exhibit very similar patterns, with the only relatively accurate depiction coming from the "F+AE" method, albeit with a degradation trend differing from that of the labels.
Figure 13, showcasing the predictions for bearing 1-4, presents a larger discrepancy in the outcomes of different labeling methods, thus better reflecting the variances among different labels. Mirroring the findings in Figure 12, VQ-VAE method again exhibits aberrant performance in the latter half, and the result of "F+VQVAE" method closely resembles VQ-VAE method. The prediction of AE method demonstrates an initial value deviation in both figures; however, upon inspecting its prediction results for other bearings in condition 1, no such initial misjudgment as seen in bearing1-1 and bearing1-4 was found. As for the linear and piecewise methods depicted in Figures 12c and 13c, their performances are comparatively dismal.
Two intriguing conclusions emerge from these figures. First, the "F+AE" method performs well in both, and its predictions reflect the degradation trend of the bearings even more accurately than the labels do, indicating that to some extent the prediction model has transcended the constraints posed by the labels. Second, the FPT points of bearing 1-4 in the labels are all situated earlier than those predicted by the ASTCN model; in conjunction with the RMS and "feature+PCA" non-neural-network labels from Figure 11, it can be posited that these models have overfit to a certain degree.
As shown in Table 6, the ASTCN model trained with VQ-VAE labels achieved the optimal performance on the MAD and MV metrics, and its RMSE, MAE, and Score values differed only slightly from those of the optimal "feature+VQVAE" method. This indicates that models trained with VQ-VAE labels are able to make more accurate predictions of the remaining useful life of bearings. The models trained with piecewise labels also performed well on MAD and MV but had very high RMSE and MAE values, as shown in Table 11, being significantly affected by the outliers in bearing 1-3. This suggests that the piecewise labels can be too rigid at times, deviating significantly from the actual values. The ASTCN model trained with "Feature+AE" labels, which performed well in Figures 12 and 13, exhibited only average performance on the metrics, likely due to inaccuracies in the labels.
Metrics | RMSE | MAE | MAD | MV | Score |
---|---|---|---|---|---|
Linear | 0.219 | 0.172 | 0.1002 | 0.00884 | 32.0 |
Piecewise | 0.210 | 0.144 | 0.0287 | 0.00236 | 27.2 |
PCA | 0.214 | 0.172 | 0.0566 | 0.00643 | 30.3 |
AE | 0.156 | 0.142 | 0.0575 | 0.00437 | 26.8 |
F+AE | 0.128 | 0.097 | 0.0468 | 0.00484 | 17.9 |
F+VQVAE | 0.077 | 0.060 | 0.0377 | 0.00348 | 11.3 |
VQVAE(Our) | 0.086 | 0.063 | 0.0275 | 0.00122 | 11.4 |
A proficient model often exhibits commendable performance in cross-domain scenarios. Therefore, under condition 2, bearings 2-1 to 2-4 are selected as the test set. The AE, VQ-VAE, and ASTCN models, having been trained on bearings 1-1 to 1-6, are tested directly without fine-tuning. Both the prediction and label models show good generalization ability, with the results illustrated in Figures 14-17, which display the performance across the different test bearings.
In Figure 14, a trough is observed in the prediction results of ASTCN around t=250, while the prediction values of VQ-VAE exhibit more stable behavior. Similar phenomena are encountered in Figures 15 and 16, where both bearings 2-2 and 2-3 display a concurrent trough in the labels and the prediction results. This arises because both the label and prediction models lack the capability for temporal modeling. In Figure 17, a certain deviation in initial values between the label model and prediction model is noted, along with issues concerning the trend direction. This problem could be ameliorated through cross-domain methodologies.
This paper proposes an end-to-end health indicator (HI) construction method using a vector quantised variational autoencoder (VQ-VAE) for predicting the remaining useful life (RUL) of bearings. This method addresses the issue of suboptimal HI metrics generated by traditional statistical methods such as RMS and kurtosis, and removes the need for dimensionality reduction of latent variables in traditional unsupervised learning methods such as the autoencoder. Through empirical testing, it is demonstrated that traditional statistical metrics do not adequately reflect the fluctuation in the curves. Consequently, two novel statistical metrics, mean absolute distance (MAD) and mean variance (MV), are introduced. These metrics accurately depict the fluctuation patterns in the curves, thereby reflecting the model's accuracy in identifying similar features. Both the label and prediction models were evaluated using these metrics. On the PHM2012 dataset, both the direct application of VQ-VAE and the combined method of feature selection followed by VQ-VAE for label construction achieved lower MAD and MV values. Furthermore, the ASTCN prediction model trained with VQ-VAE labels also exhibited promising results, attaining the lowest values for MAD and MV.
In subsequent work, the network architecture of VQ-VAE will continue to be optimized to reduce the up-and-down fluctuations in the curves, with the aim of constructing a more accurate label for training. Consideration will be given to employing some cross-domain methods to enhance its generalization ability under different operating conditions or even across different datasets.
Layer name | Kernel size | Stride | Number of kernels | Output size |
---|---|---|---|---|
Input | – | – | – | 1024 x 1 |
Convolution1 | 1 x 3 | 1 x 1 | 16 | 1024 x 16 |
Pooling1 | 1 x 2 | 1 x 2 | – | 512 x 16 |
Convolution2 | 1 x 3 | 1 x 1 | 32 | 512 x 32 |
Pooling2 | 1 x 2 | 1 x 2 | – | 256 x 32 |
Convolution3 | 1 x 3 | 1 x 1 | 64 | 256 x 64 |
Pooling3 | 1 x 4 | 1 x 4 | – | 64 x 64 |
Convolution4 | 1 x 3 | 1 x 1 | 128 | 64 x 128 |
Pooling4 | 1 x 2 | 1 x 2 | – | 32 x 128 |
Convolution5 | 1 x 3 | 1 x 1 | 32 | 32 x 32 |
Convolution6 | 1 x 3 | 1 x 1 | 1 | 32 x 1 |
Layer name | Kernel size | Stride | Number of kernels | Output size |
---|---|---|---|---|
Input | – | – | – | 38 x 1 |
Convolution1 | 1 x 3 | 1 x 1 | 16 | 36 x 16 |
Pooling1 | 1 x 2 | 1 x 2 | – | 18 x 16 |
Convolution2 | 1 x 3 | 1 x 1 | 32 | 18 x 32 |
Pooling2 | 1 x 2 | 1 x 2 | – | 9 x 32 |
Convolution3 | 1 x 3 | 1 x 1 | 64 | 9 x 64 |
Convolution4 | 1 x 3 | 1 x 1 | 128 | 9 x 128 |
Convolution5 | 1 x 3 | 1 x 1 | 32 | 9 x 32 |
Convolution6 | 1 x 3 | 1 x 1 | 1 | 9 x 1 |
Layer name | Kernel size | Stride | Number of kernels | Output size |
---|---|---|---|---|
Input | – | – | – | 38 x 1 |
Convolution1 | 1 x 3 | 1 x 1 | 4 | 36 x 4 |
Pooling1 | 1 x 2 | 1 x 2 | – | 18 x 4 |
Convolution2 | 1 x 3 | 1 x 1 | 4 | 18 x 4 |
Pooling2 | 1 x 2 | 1 x 2 | – | 9 x 4 |
Convolution3 | 1 x 3 | 1 x 1 | 8 | 9 x 8 |
Convolution4 | 1 x 3 | 1 x 1 | 16 | 9 x 16 |
Convolution5 | 1 x 3 | 1 x 1 | 4 | 9 x 4 |
Method | Metric | Bearing1-1 | Bearing1-2 | Bearing1-3 | Bearing1-4 | Bearing1-5 | Bearing1-6 | Bearing1-7 |
---|---|---|---|---|---|---|---|---|
AE | Mon | 0.0143 | 0.0437 | 0.0194 | 0.0708 | 0.0219 | 0.0176 | 0.0292 | |
Tend | -0.376 | -0.245 | -0.552 | -0.668 | -0.009 | -0.049 | -0.268 | ||
MAD | 0.0174 | 0.0429 | 0.0070 | 0.0353 | 0.0314 | 0.0387 | 0.0322 | ||
MV | 0.00034 | 0.00190 | 0.00041 | 0.00147 | 0.00080 | 0.00136 | 0.00096 | ||
Feature+VQVAE | Mon | 0.0128 | 0.0000 | 0.0093 | 0.0329 | 0.0268 | 0.0298 | 0.0301 | |
Tend | -0.268 | -0.231 | -0.463 | -0.523 | -0.339 | -0.123 | -0.095 | ||
MAD | 0.0051 | 0.0326 | 0.0190 | 0.0196 | 0.0199 | 0.0256 | 0.0290 | ||
MV | 0.00022 | 0.00282 | 0.00068 | 0.00065 | 0.00058 | 0.00356 | 0.00095 | ||
VQVAE(Our) | Mon | 0.0371 | 0.0621 | 0.0312 | 0.0399 | 0.0041 | 0.0086 | 0.0106 | |
Tend | -0.525 | -0.393 | -0.558 | -0.729 | 0.212 | 0.002 | -0.231 | ||
MAD | 0.0202 | 0.0454 | 0.0222 | 0.0160 | 0.0406 | 0.0474 | 0.0253 | ||
MV | 0.00045 | 0.00204 | 0.00085 | 0.00050 | 0.00129 | 0.00183 | 0.00060 | ||
Feature+AE | Mon | 0.0121 | 0.0276 | 0.0447 | 0.0007 | 0.0292 | 0.0061 | 0.0027 | |
Tend | -0.596 | -0.033 | -0.691 | -0.617 | -0.188 | -0.031 | -0.156 | ||
MAD | 0.0159 | 0.0730 | 0.0402 | 0.0352 | 0.0257 | 0.0418 | 0.0289 | ||
MV | 0.00054 | 0.00612 | 0.00163 | 0.00093 | 0.00079 | 0.00354 | 0.00081 | ||
RMS | Mon | 0.0264 | 0.0713 | 0.0455 | 0.0834 | 0.0471 | 0.0037 | 0.0292 | |
Tend | -0.383 | -0.145 | -0.421 | -0.714 | -0.185 | -0.270 | -0.741 | ||
MAD | 0.0084 | 0.1423 | 0.0160 | 0.0100 | 0.0143 | 0.0294 | 0.0146 | ||
MV | 0.00020 | 0.02114 | 0.00068 | 0.00023 | 0.00056 | 0.00408 | 0.00025 | ||
PCA | Mon | 0.0264 | 0.0575 | 0.0101 | 0.0638 | 0.0203 | 0.0135 | 0.0735 | |
Tend | -0.549 | -0.318 | -0.849 | -0.693 | 0.077 | -0.247 | 0.848 | ||
MAD | 0.0132 | 0.1869 | 0.0665 | 0.0068 | 0.0103 | 0.0220 | 0.0295 | ||
MV | 0.00026 | 0.03346 | 0.00677 | 0.00009 | 0.00028 | 0.00166 | 0.00061 |
Method | Metric | Bearing1-1 | Bearing1-2 | Bearing1-3 | Bearing1-4 | Bearing1-5 | Bearing1-6 | Bearing1-7 |
---|---|---|---|---|---|---|---|---|
Feature+AE | RMSE | 0.060 | 0.144 | 0.174 | 0.077 | 0.178 | 0.074 | 0.191 |
MAE | 0.040 | 0.117 | 0.084 | 0.061 | 0.144 | 0.061 | 0.168 | ||
MAD | 0.0023 | 0.0525 | 0.0040 | 0.0234 | 0.1567 | 0.0358 | 0.0530 | ||
MV | 0.00043 | 0.00199 | 0.00008 | 0.00327 | 0.02137 | 0.00307 | 0.00368 | ||
Score | 9.1 | 22.5 | 9.5 | 5.3 | 27.3 | 16.6 | 35.2 | ||
AE | RMSE | 0.069 | 0.127 | 0.230 | 0.064 | 0.111 | 0.262 | 0.229 | |
MAE | 0.042 | 0.118 | 0.220 | 0.038 | 0.097 | 0.256 | 0.221 | ||
MAD | 0.0029 | 0.0570 | 0.0495 | 0.1004 | 0.1003 | 0.0647 | 0.0278 | ||
MV | 0.00032 | 0.00323 | 0.00192 | 0.01059 | 0.00876 | 0.00441 | 0.00134 | ||
Score | 9.5 | 22.7 | 24.5 | 3.0 | 18.7 | 56.0 | 53.0 | ||
VQVAE(Our) | RMSE | 0.084 | 0.094 | 0.166 | 0.067 | 0.094 | 0.055 | 0.046 | |
MAE | 0.077 | 0.085 | 0.086 | 0.053 | 0.082 | 0.037 | 0.020 | ||
MAD | 0.0442 | 0.0443 | 0.0044 | 0.0294 | 0.0423 | 0.0135 | 0.0145 | ||
MV | 0.00141 | 0.00174 | 0.00014 | 0.00185 | 0.00185 | 0.00075 | 0.00082 | ||
Score | 13.6 | 16.3 | 10.6 | 4.4 | 20.3 | 10.5 | 4.3 | ||
Feature+VQVAE | RMSE | 0.055 | 0.066 | 0.124 | 0.079 | 0.106 | 0.037 | 0.071 | |
MAE | 0.039 | 0.052 | 0.100 | 0.064 | 0.093 | 0.023 | 0.051 | ||
MAD | 0.0525 | 0.0537 | 0.0145 | 0.0770 | 0.0137 | 0.0045 | 0.0481 | ||
MV | 0.00187 | 0.00188 | 0.00066 | 0.01020 | 0.00190 | 0.00029 | 0.00757 | ||
Score | 8.4 | 10.0 | 13.9 | 5.5 | 22.9 | 6.3 | 12.1 | ||
Linear | RMSE | 0.280 | 0.269 | 0.140 | 0.283 | 0.254 | 0.130 | 0.178 | |
MAE | 0.225 | 0.231 | 0.106 | 0.214 | 0.188 | 0.097 | 0.143 | ||
MAD | 0.0715 | 0.0649 | 0.0759 | 0.1331 | 0.1855 | 0.0780 | 0.0926 | ||
MV | 0.00385 | 0.00288 | 0.00468 | 0.01679 | 0.02267 | 0.00514 | 0.00586 | ||
Score | 48.4 | 47.0 | 14.1 | 18.0 | 43.9 | 23.0 | 29.5 | ||
Piecewise | RMSE | 0.070 | 0.060 | 0.226 | 0.129 | 0.765 | 0.128 | 0.095 | |
MAE | 0.010 | 0.035 | 0.100 | 0.027 | 0.713 | 0.086 | 0.041 | ||
MAD | 0.0022 | 0.0267 | 0.0015 | 0.0021 | 0.0485 | 0.0782 | 0.0420 | ||
MV | 0.00045 | 0.00127 | 0.00031 | 0.00076 | 0.00173 | 0.00800 | 0.00400 | ||
Score | 2.2 | 6.8 | 11.3 | 2.4 | 138.9 | 19.6 | 9.3 | ||
PCA | RMSE | 0.162 | 0.420 | 0.110 | 0.327 | 0.100 | 0.088 | 0.288 | |
MAE | 0.134 | 0.417 | 0.052 | 0.292 | 0.058 | 0.056 | 0.194 | ||
MAD | 0.0864 | 0.1140 | 0.0111 | 0.0735 | 0.0344 | 0.0280 | 0.0487 | ||
MV | 0.00512 | 0.00934 | 0.00065 | 0.00635 | 0.01234 | 0.00551 | 0.00570 | ||
Score | 23.7 | 80.4 | 6.6 | 25.9 | 13.4 | 15.0 | 47.1 |