A Review on Self-Supervised Learning for Time Series Anomaly Detection: Recent Advances and Open Challenges

Aitor Sánchez-Ferrera
University of the Basque Country UPV/EHU
Manuel Lardizabal Ibilbidea 1
Donostia/San-Sebastián, Spain
aitor.sanchezf@ehu.eus
Borja Calvo
University of the Basque Country UPV/EHU
Manuel Lardizabal Ibilbidea 1
Donostia/San-Sebastián, Spain
borja.calvo@ehu.eus
Jose A. Lozano
University of the Basque Country UPV/EHU
Manuel Lardizabal Ibilbidea 1
Donostia/San-Sebastián, Spain
Basque Center for Applied Mathematics (BCAM)
Mazarredo Zumarkalea 14
Bilbao, Spain
ja.lozano@ehu.eus


Abstract

Time series anomaly detection presents various challenges due to the sequential and dynamic nature of time-dependent data. Traditional unsupervised methods frequently encounter difficulties in generalization, often overfitting to known normal patterns observed during training and struggling to adapt to unseen normality. In response to this limitation, self-supervised techniques for time series have garnered attention as a potential way to overcome this obstacle and enhance the performance of anomaly detectors. This paper presents a comprehensive review of the recent methods that make use of self-supervised learning for time series anomaly detection. A taxonomy is proposed to categorize these methods based on their primary characteristics, facilitating a clear understanding of their diversity within this field. The information contained in this survey, along with additional details that will be periodically updated, is available on the following GitHub repository: https://github.com/Aitorzan3/Awesome-Self-Supervised-Time-Series-Anomaly-Detection.

1 Introduction

Time series refer to collections of measurements or recordings arranged in chronological order [1]. The categorization of time series depends on the number of dimensions involved. A time series is classified as univariate if it is based on the evolution of a single variable over time. Conversely, a time series is referred to as multivariate if it is composed of multiple univariate time series that collectively describe a single system or entity. Mining time series data involves various tasks such as classification, regression, clustering, and anomaly detection, which yield valuable insights in domains like economy, health, and industry [2]–[4]. In recent years, the advent of powerful machine learning techniques has sparked significant interest in time series anomaly detection, enabling effective solutions for tasks like financial fraud detection [5], [6], network traffic monitoring [7], [8], disease detection [9], [10], and fault detection in Internet of Things (IoT) devices [11], [12], among others.

Time series anomalies, also known as novelties or outliers [13], are defined as abnormal events that deviate from the expected behavior in the system under analysis [14]. As mentioned in [15], these phenomena can be interpreted in two distinct ways. They can either signify abnormal events that we aim to identify, or they can be inaccurate and noisy measurements that we seek to eliminate or rectify in order to enhance the quality of our dataset. For the purpose of this study, we will refer to anomalies as the former. The literature identifies various types of anomalies based on their context [16]:

  • Local Context Anomalies: these deviations manifest at the level of observations within the context of an individual time series. This type of anomaly detection zooms in on localized patterns of abnormality, allowing for a more detailed examination of deviations at the granular level. We identify two primary types of local anomalies:

    • Point Outliers: these anomalies pertain to specific timestamps where the values within a time series exhibit a substantial deviation from other values or neighboring points. Essentially, point outliers stand out as isolated data points that significantly differ from their surrounding data, contributing to irregularities in the overall temporal pattern.

    • Subsequence Outliers: this category of anomalies comprises consecutive points in time that collectively display atypical behavior, forming a subsequence of abnormality within the time series. In other words, subsequence outliers manifest as clusters of consecutive data points exhibiting unusual patterns or trends, contributing to an identifiable deviation from the expected temporal behavior.

  • Global Context Anomalies: this anomaly category involves entire time series that function as outliers. Specifically, tasks related to anomaly detection in this context revolve around identifying complete time series describing anomalous events within a database that comprises numerous time series. In essence, the focus is on detecting overarching patterns or sequences of abnormal behavior across the entirety of a time series dataset, emphasizing the comprehensive nature of the anomalies being addressed.

Figure 1 depicts real examples of the types of time series anomalies described above.

Figure 1: Examples of time series anomalies. Top left: a point anomaly belonging to the Yahoo-TSA dataset. Bottom left: a subsequence anomaly belonging to the UCR_BIDMC1_2500_5400_5600 dataset. Right: a sequence anomaly in the global context of the ECG5000 dataset.

In the field of machine learning for anomaly detection in time series data, two main perspectives are distinguished: supervised learning-based approaches and unsupervised learning-based approaches [17]. In the supervised setting, models are trained using datasets where normal and abnormal samples are labeled. Typically, a supervised predictor in an anomaly detection scenario functions as a binary classifier, learning to predict whether a new sample is an anomaly based on labeled samples. While these models can achieve high accuracy in anomaly detection, training a supervised classifier for anomalies can be problematic. Anomalies, by their nature, are infrequent occurrences, resulting in an extremely imbalanced classification problem that is challenging to solve. Additionally, creating labeled datasets for training requires significant time and specialized knowledge related to the specific problem, making supervised training of machine learning models often unfeasible in real-world contexts [18].

Unsupervised approaches, unlike supervised methods, do not rely on labeled datasets. Instead, they are trained using datasets composed solely of normal samples, aiming to capture the underlying patterns and characteristics of the data distribution in the problem at hand. Once trained, these approaches typically compare the properties of new samples with the learned notion of normality from the training phase to assess their abnormality degree or anomaly score [19]. If the anomaly score of a new sample exceeds a predetermined threshold, it is classified as an outlier. One popular example of an unsupervised method for time series anomaly detection is the use of autoencoders [20]. By training an autoencoder exclusively on normal samples, we assume that it learns the latent subspace representing the normal patterns and characteristics of the data. During the inference phase, the reconstruction error of the trained autoencoder serves as the anomaly score for detecting anomalies within new samples.
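
The scoring-and-thresholding procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: a linear autoencoder fitted in closed form (equivalent to PCA) stands in for a neural autoencoder, and the synthetic sine signal, the window width, the latent size and the quantile used for the threshold are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sliding_windows(series, width):
    """Stack all length-`width` windows of a 1-D series as rows."""
    return np.lib.stride_tricks.sliding_window_view(series, width)

# Hypothetical "normal" training signal: a noisy sine wave.
t = np.arange(2000)
train = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(t.size)
X_train = sliding_windows(train, 25)

# Linear autoencoder in closed form (PCA): encoding projects a window onto
# the top-k principal directions, decoding maps it back to input space.
mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = vt[:3]  # k = 3 latent dimensions (arbitrary choice)

def anomaly_score(windows):
    """Mean squared reconstruction error of each window."""
    centered = windows - mean
    reconstructed = centered @ components.T @ components
    return np.mean((centered - reconstructed) ** 2, axis=1)

# Threshold set to a high quantile of the scores observed on normal data.
threshold = np.quantile(anomaly_score(X_train), 0.999)

# Test series with an injected spike (a point outlier) at index 300.
test = np.sin(2 * np.pi * np.arange(600) / 50) + 0.05 * rng.standard_normal(600)
test[300] += 3.0
scores = anomaly_score(sliding_windows(test, 25))
flagged = np.flatnonzero(scores > threshold)  # start indices of flagged windows
```

Every window that contains the injected spike receives a reconstruction error far above the errors seen on normal windows, so it exceeds the threshold; how the threshold is chosen in practice depends on the application.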

Unsupervised methods, despite their flexibility and independence from labeled datasets, can indeed suffer from overfitting to the observed normality during the training phase. To mitigate this issue, it is crucial for unsupervised approaches to be trained on datasets that encompass a wide range of diverse normal samples. This allows them to capture the complete spectrum of normal characteristics present in the data distribution. Otherwise, during the inference phase, if unsupervised methods encounter normal samples that exhibit slight deviations from the ones in the training set, they may struggle to generalize and mistakenly classify them as anomalies. This can lead to an increased false-positive rate in anomaly detection. Therefore, the effectiveness of unsupervised methods relies heavily on the quality and representativeness of the normal training data they are exposed to [21].

New perspectives are emerging as a solution to the limitations of unsupervised methods, with self-supervised learning gaining significant attention. Self-supervised learning is an unsupervised learning paradigm that focuses on representation learning, that is, learning representations of the data that make it easier to extract useful information for building classifiers and predictors [22]. Thus, self-supervised learning aims to enhance the ability of machine learning models to capture the underlying distribution of the data and to improve their performance across various tasks. Note that the representations learned by means of this methodology are general and to some extent "agnostic" to the final downstream task(s) that we want our models to accomplish (anomaly detection in our case) [23].

Recently, there has been a substantial increase in the number of contributions utilizing self-supervised learning for time series anomaly detection. While existing literature includes reviews on machine learning for anomaly detection [19], [24], [25], as well as some surveys focusing on anomaly detection in temporal data [26] or time series specifically [16], [27], [28], in addition to a review on self-supervised learning for time series analysis [29], to the best of our knowledge, no survey has specifically addressed the growing use of self-supervised learning for time series anomaly detection. Furthermore, none of the previous surveys cover the extensive range of recent contributions in this area. Therefore, there is a need for a comprehensive review that categorizes these contributions and aids researchers in identifying future research directions in self-supervised time series anomaly detection. This paper aims to address this gap and presents four key research contributions:

  • A review of self-supervised learning-based time series anomaly detection approaches: the paper presents a comprehensive review of the contributions that utilize self-supervised learning methods for time series anomaly detection. To the best of our knowledge, this is the first survey that focuses exclusively on the use of self-supervised learning approaches for anomaly detection in time series data, filling a gap in the existing literature.

  • A taxonomy for self-supervised time series anomaly detection approaches: we propose a taxonomy for self-supervised approaches used in anomaly detection for time series data. This taxonomy is developed based on the key characteristics of self-supervised methods found in the literature, and it enhances the overall understanding of the existing contributions in this field by capturing the properties of each method.

  • An analysis of research challenges and future directions: the paper examines the current research challenges in the field of self-supervised learning for time series anomaly detection. It also identifies and discusses future research directions to improve the performance of anomaly detection using self-supervised learning in time series data.

  • A compilation of available software and datasets: we collect the source code of methods associated with the contributions examined in this study, along with the time series datasets utilized in their experiments, and provide information on where they can be accessed.

It is worth noting that self-supervised learning is often employed to pre-train models that are afterwards fine-tuned using labeled data or combined with semi-supervised learning [30], [31]. However, as annotated data about anomalies is not commonly available, unsupervised approaches for anomaly detection are the most practical option [32]. For this reason, in this survey we focus solely on unsupervised methods that do not rely on labeled anomalies.

The rest of the article is organized as follows. Section 2 provides an overview of self-supervised learning. In Section 3, we present a taxonomy for self-supervised time series anomaly detection approaches. Sections 4 and 5 explore the works that build upon self-supervised time series anomaly detection and analyze their properties. Section 6 compiles available software relevant to self-supervised learning in this context, along with the most popular datasets used for evaluating these methods. Finally, Section 7 presents a discussion of findings and conclusions, summarizing the contributions of the research and suggesting future avenues of exploration. In the interest of ensuring reproducibility, the methodology employed to conduct this survey is presented in Appendix 8.

2 An overview of self-supervised learning

In Yann LeCun’s statement at AAAI 2020, self-supervised learning was originally described as a process where ‘a machine predicts parts of its input based on observed parts’ by making predictions about information within the input that is assumed to be unknown. However, self-supervised learning has evolved to encompass a broader set of techniques and objectives, including predicting certain data attributes, relationships, or transformations within the data without relying on human annotations [23]. This approach falls within the realm of unsupervised learning and leverages the abundance of unlabeled data to capture the underlying structure and patterns of the data distribution, thereby enhancing the learned latent representations of machine learning models [33]. Traditional self-supervised pipelines typically consist of two main components: the self-supervised pretext task and the downstream task [34].

The self-supervised pretext task, also known as proxy task, involves predicting specific attributes or characteristics of data that are already known. Essentially, the pretext task is automatically created based on certain data attributes, relationships, or transformations within the data without relying on human annotations, leveraging large-scale unlabeled data. In self-supervised learning, a model learns to solve the pretext task under the assumption that doing so will give it a deep understanding of valuable data information and the data’s inherent structure, which will result in the model capturing useful representations of data [31]. These learned representations can enhance the model’s performance when applied to the final downstream task, such as classification, regression, or anomaly detection, which is the ultimate goal of the learning process. The model can learn the pretext task together with the downstream task in a multi-task way, or it can be pre-trained on the proxy task and then fine-tuned for the downstream task. Note that, while the pretext task may not directly align with the ultimate downstream task, a well-designed pretext task is expected to effectively guide the learning of valuable and versatile data representations that benefit the performance in the final downstream task [35].

As mentioned before, the self-supervised proxy task can be introduced in the learning process of the model in two primary manners: i) through pre-training a model on the proxy task or ii) by learning it simultaneously with the downstream task. In both scenarios, the goal is to leverage the knowledge obtained from the self-supervised pretext task for one (or more) downstream task(s), but the execution of the proxy task and the transfer of this knowledge are achieved in different ways.

The approaches considering self-supervised pre-training involve two primary steps. First, a module \(f_{p}\), often referred to as the ‘proxy module’ (typically a neural network with multiple hidden layers), is trained to solve the proxy task. This module learns to map the input space of unlabeled data to the output space of the proxy task, \(f_{p} : X \rightarrow Y_{p}\). This initial phase is known as self-supervised pre-training. Once the self-supervised pre-training is complete, we proceed to train the ‘downstream module’ \(f_{d}\). This module is trained by leveraging the representations learned by the proxy module to solve the downstream task, a process referred to as fine-tuning. Specifically, in this phase, we use the input data \(X\) of the downstream task as input for the proxy module \(f_{p}\), extracting its hidden representations from one or more hidden layers and discarding the outputs related to the proxy task. This generates a new representation of the input data \(\phi(X)\) that is employed as input for the downstream module, which is trained for the specific downstream task at hand, resulting in \(f_{d}: \phi(X) \rightarrow Y_{d}\). It is worth noting that the earlier layers of the proxy module capture more general data patterns, while the later layers tend to capture task-specific attributes [31]. The fundamental assumption behind this process is that representations learned by effectively solving a well-designed proxy task hold value and can be subsequently applied to various downstream tasks related to the input data within the same dataset [30].

In the literature, the hidden layers of the proxy module used to extract useful representations for the downstream module are commonly referred to as the ‘feature extractor’ \(\phi(\cdot )\), while the remaining layers of the proxy module and the downstream module are termed the ‘task-specific heads’. These task-specific heads are responsible for mapping the hidden representations of the input data extracted from the pre-trained feature extractor to the output space corresponding to their respective tasks. It is important to note that in the second step, the parameters of the feature extractor are typically frozen to ensure that the learning of the downstream task does not affect the representations learned during the pre-training phase. However, with the emergence of new techniques in the field of transfer learning, there are contributions that also allow for adjustments to the feature extractor’s parameters during the fine-tuning step. This enables the updating of the hidden representations learned during the initial step to make them more specific to the chosen downstream task.

As an alternative to pre-training, there are other approaches where the pretext task and the downstream task are jointly learned in a multi-task learning manner [36]. In this setup, the downstream task takes precedence as the primary objective, while the proxy task serves as an auxiliary task aimed at enhancing data representations to facilitate the learning of the downstream task. The central assumption here is that well-designed proxy tasks share structural similarities with the downstream task, facilitating the transfer of knowledge from the learned representations of the auxiliary task to enhance the performance of models on the downstream task [37]. In broad terms, both the pretext task and the downstream task are trained using the same foundational or base model, typically a neural network with multiple hidden layers referred to as the feature extractor. Additionally, each task is equipped with its respective task-specific head module (\(f_{p}\) and \(f_{d}\)) and the overall loss of the model is computed as a linear combination of the losses associated with the downstream task and the pretext task. Note that in this scenario, unlike in the second step of the pre-training approach where the parameters of the feature extractor are typically frozen, the learning of the downstream task influences the representations captured by the feature extractor when learning the proxy task. Consequently, the representations learned by the feature extractor will always be tailored to the downstream task at hand and are not necessarily suitable for other downstream tasks.
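
As a hedged illustration of the linear combination of losses mentioned above, the joint objective can be written as follows. The weights \(\alpha\) and \(\beta\) and the loss symbols \(\mathcal{L}_{d}\) and \(\mathcal{L}_{p}\) are notation assumed here for illustration and are not taken from the cited works; \(\phi\) denotes the shared feature extractor and \(f_{d}\), \(f_{p}\) the task-specific heads.

\[
\mathcal{L}_{\text{total}} = \alpha \, \mathcal{L}_{d}\big(f_{d}(\phi(X)),\, Y_{d}\big) + \beta \, \mathcal{L}_{p}\big(f_{p}(\phi(X)),\, Y_{p}\big), \qquad \alpha, \beta \geq 0 .
\]

Because both terms depend on \(\phi\), the gradients of the downstream loss also shape the shared representations, which is the coupling between the two tasks that distinguishes this setting from pre-training with a frozen feature extractor.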

Overall, in the pre-training scenario the objective is to devise a precise self-supervised proxy task capable of capturing the most valuable representations that enhance the performance of the subsequent downstream module(s) (ideally, representations general enough to be useful for a wide range of downstream tasks). Conversely, in the multi-task setting, the aim is to select auxiliary pretext tasks that enhance the performance of a specific downstream task that we want our model to accomplish. It is worth noting that self-supervised approaches allow for the exploration of multiple proxy tasks to enhance the quality and suitability of acquired representations. Consequently, the incorporation of multiple proxy tasks can significantly improve the overall performance of the model in the respective downstream task(s) [38].

Self-supervised learning was initially introduced in the field of artificial vision [37], [39]. Following its success in learning effective visual representations, the utilization of self-supervised learning was afterwards extended to diverse domains such as medicine and healthcare [40], graph learning [41], audiovisual data [42], [43], recommender systems [44], and even remote sensing [45]. Be aware that the specific design of the proxy task influences the model’s ability to capture local patterns, as well as significant properties and dependencies within the data. In addition, it is important to note that proxy tasks in self-supervised learning are often tailored to specific domains. Therefore, one of the major challenges researchers have encountered is devising new pretext tasks that can be applied to other types of data, such as time series data [32].

2.1 Types of self-supervised pretext tasks

Existing literature on self-supervised learning identifies two main types of self-supervised pretext tasks: self-predictive tasks and contrastive learning-based tasks [23]. Both types of proxy tasks can be used whether self-supervised learning is applied as pre-training or as multi-task learning. Recall that the concept of self-supervised learning originated in the artificial vision domain; consequently, the explanations and examples provided in this section will predominantly focus on its application to images. Later, in the literature review, we will focus on time series data.

2.1.1 Self-predictive pretext tasks

The pretext tasks in the field of self-prediction involve constructing self-supervised prediction tasks at the individual data sample level. Typically, a portion of the data is used to predict another part that is assumed to be unknown. This is often achieved by applying one or more transformations to each sample in the training set for generating a set of augmented views of the samples. The tasks associated with self-predictive self-supervised learning typically revolve around either predicting the specific transformation applied to the original sample to generate each augmented view or reconstructing the original samples based on their augmented views. Broadly speaking, self-predictive tasks can be divided into three primary classes: self-supervised classification, self-supervised reconstruction and self-supervised forecasting [23].

2.1.1.1 Self-supervised classification

The pretext tasks related to this category involve applying a family of transformations \(\mathcal{T} = \lbrace T_1, ..., T_K \rbrace\) to each input sample \(x_i\) in the training set. This results in a series of augmented views specific to that sample \(\mathcal{T}(x_i) = \lbrace T_1(x_i), ..., T_K(x_i) \rbrace\), where \(T_k(x_i)\) denotes the augmented view obtained by applying the \(k\)-th transformation from \(\mathcal{T}\) to \(x_i\), and \(k\) serves as the pseudo-label associated with \(T_k(x_i)\). Then, the self-supervised classification-based pretext task is based on training a classifier \(f(\cdot)\) to discriminate among the \(K\) transformations applied to each sample for generating the corresponding augmented views, such that ideally \(f(T_k(x_i)) = k\) for \(1 \leq k \leq K\). This is achieved by minimizing a classification loss, such as cross-entropy. These proxy tasks heavily depend on the adequacy of the transformations considered for generating the augmented views of samples. The most popular choice in this field is the use of geometric transformations [46], [47].
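
The construction of augmented views and pseudo-labels can be sketched as below. This is only an illustration: the four transformations are hypothetical choices for one-dimensional series (not the geometric image transformations of the cited works), the pseudo-labels run from 0 to K-1 rather than 1 to K because of zero-based indexing, and the classifier itself is omitted; only the cross-entropy loss it would minimize is shown.

```python
import numpy as np

# Hypothetical transformation family T = {T_1, ..., T_K} for 1-D series.
TRANSFORMS = [
    lambda x: x,                        # identity
    lambda x: -x,                       # sign flip
    lambda x: x[::-1],                  # time reversal
    lambda x: np.roll(x, len(x) // 2),  # circular shift by half the length
]

def make_pretext_dataset(samples):
    """Apply every transformation to every sample; the label is the index k."""
    views, labels = [], []
    for x in samples:
        for k, transform in enumerate(TRANSFORMS):
            views.append(transform(x))
            labels.append(k)
    return np.stack(views), np.array(labels)

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
samples = rng.standard_normal((5, 32))      # 5 toy series of length 32
views, labels = make_pretext_dataset(samples)

# A classifier that outputs uniform logits incurs a loss of log(K).
uniform_loss = cross_entropy(np.zeros((len(labels), len(TRANSFORMS))), labels)
```

A classifier trained on `views` and `labels` would be pushed below the `log(K)` baseline only by learning features that distinguish the transformations, which is the source of the learned representations.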

2.1.1.2 Self-supervised reconstruction

The proxy tasks within this category involve applying a transformation \(T\) to each input sample \(x_i\) in the training set, resulting in an augmented view of that particular sample \(T ( x_i )\). The pretext tasks within the realm of self-supervised reconstruction are based on training a model \(f(\cdot)\) to reconstruct, from the augmented view of each sample in the training set, its corresponding original sample, such that ideally \(f(T(x_i)) = x_i\). Thus, the model is trained by minimizing the reconstruction error between the model’s output and the original sample. The methods proposed within this field differ primarily in the transformation considered for generating the augmented views and performing the self-supervised reconstruction. Among the prominent choices, we find masking a portion of the data to restore the original sample [48], generating the grayscale version of an image to predict the corresponding color channels of the original image [49], or adding noise to images and reconstructing clean data from noisy inputs, as is done with denoising autoencoders [50]. These tasks, among others, are all oriented to enabling models to grasp the complete content of an image, generating plausible hypotheses for missing information, and understanding the relationship between image semantics and textures.
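
A minimal sketch of the masking variant of this task is given below, adapted to a one-dimensional series. Everything specific here is an assumption made for illustration: the masking ratio, the zero fill value, and in particular the use of linear interpolation as a stand-in for the trained model \(f(\cdot)\).

```python
import numpy as np

def mask_view(x, mask_ratio, rng):
    """Transformation T: hide a random subset of positions by zeroing them."""
    mask = rng.random(x.shape) < mask_ratio
    return np.where(mask, 0.0, x), mask

def interpolate_masked(x_masked, mask):
    """Placeholder for the model f: fill hidden positions by interpolation."""
    idx = np.arange(x_masked.size)
    return np.interp(idx, idx[~mask], x_masked[~mask])

def reconstruction_loss(x, x_hat):
    """Mean squared error between the original sample and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * np.arange(128) / 32)          # original sample x_i
x_masked, mask = mask_view(x, mask_ratio=0.25, rng=rng)  # augmented view T(x_i)
x_hat = interpolate_masked(x_masked, mask)            # f(T(x_i))

loss_interp = reconstruction_loss(x, x_hat)
loss_masked = reconstruction_loss(x, x_masked)         # error of doing nothing
```

In a real pipeline the interpolation would be replaced by a trainable model whose parameters are updated to minimize `reconstruction_loss`; comparing against `loss_masked` simply shows how much of the hidden content the reconstruction recovers.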

2.1.1.3 Self-supervised forecasting

In this category, we find pretext tasks that leverage the temporal dependencies that are inherent in some types of data. Let us consider a time-evolving sample \(x \in \mathbb{R} ^{d \times L}\), where \(d\) is the number of dimensions, \(L\) is the total number of timestamps, and \(x_t \in \mathbb{R}^d\) is the value of the sample at timestamp \(t\). The pretext tasks belonging to this category involve training models to, on the basis of a context window \(x_{t-w, t} = (x_{t-w}, x_{t-w+1}, ..., x_t)\), predict the future \(p\) values \(x_{t+1, t+p} = (x_{t+1}, ..., x_{t+p})\) of that sample. The model is trained by minimizing the prediction error between the model’s outputs and the ground-truth values of the window to be predicted. Naturally, this type of pretext task is only applicable to time-evolving data. In the case of videos, this is applied by predicting future frames on the basis of a series of context frames [43], and for natural language processing, this is achieved by predicting future sentences [51].
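
The construction of context and target windows can be sketched as follows. The function name and the toy input are hypothetical; indices are zero-based, so the context \(x_{t-w, t}\) contains \(w+1\) timestamps and the target \(x_{t+1, t+p}\) contains \(p\) timestamps, following the definitions above.

```python
import numpy as np

def forecasting_pairs(x, w, p):
    """Build (context, target) pairs from a sample x of shape (d, L).

    For each valid t, the context is x[:, t-w : t+1] and the target is
    x[:, t+1 : t+p+1].
    """
    _, length = x.shape
    contexts, targets = [], []
    for t in range(w, length - p):
        contexts.append(x[:, t - w : t + 1])
        targets.append(x[:, t + 1 : t + p + 1])
    return np.stack(contexts), np.stack(targets)

# Toy sample with d = 2 dimensions and L = 10 timestamps.
x = np.arange(20).reshape(2, 10)
contexts, targets = forecasting_pairs(x, w=3, p=2)
```

A forecasting model would then be trained to map each entry of `contexts` to the corresponding entry of `targets` by minimizing a prediction error such as the mean squared error.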

2.1.2 Contrastive pretext tasks

Contrastive learning is a self-supervised learning paradigm that aims at training models by emphasizing the similarities and differences (or contrasts) between pairs of data samples or parts of them. The models trained by means of these pretext tasks map the input data into a more compact and abstract latent feature space by means of a network that acts as an encoder. The primary objective is to learn representations in which similar samples are brought closer together in the latent space, while dissimilar samples are pushed apart. The core idea is to encourage the model to learn meaningful representations by maximizing the similarity between positive pairs (similar instances) and minimizing the similarity between negative pairs (dissimilar instances) [52]. In this context, the term "similarity" includes metrics that evaluate the resemblance between latent representations of data samples or components, such as cosine similarity, as well as distance metrics like Euclidean or Mahalanobis distances.

Contrastive learning employs an iterative procedure, through which the following steps are repeated for each sample in a batch.

  1. Pair Selection. For the current sample (the anchor), positive and negative samples are chosen for creating the positive and the negative pairs. This selection can be done according to different criteria, which will be explained later on.

  2. Latent Representations. The latent representations of the anchor, the positive samples and the negative samples are extracted by means of a neural network that acts as an encoder. In the realm of contrastive frameworks, the utilization of ‘siamese network’ architectures is prevalent for this purpose [53].

  3. Contrastive Loss. The contrastive loss is computed and minimized to facilitate learning the proxy task. Popular contrastive losses include InfoNCE (Noise Contrastive Estimation) [54] and NT-Xent (normalized temperature-scaled cross-entropy) [55] losses. Notably, these losses involve processing more than one negative sample per iteration of the contrastive task. Another noteworthy loss function in this context is the triplet loss [56], which, unlike InfoNCE and NT-Xent, operates with distance measures instead of similarity measures.
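
The third step can be illustrated with the following sketch, which operates directly on latent vectors and leaves the encoder out. The first function is an InfoNCE-style loss for a single anchor that uses cosine similarity and a temperature, which is the form NT-Xent takes per anchor; the second is a triplet loss based on Euclidean distance. The function names, the temperature, the margin and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity along the last axis (supports broadcasting)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: anchor and positive are (dim,), negatives (n, dim)."""
    pos = cosine_similarity(anchor, positive) / temperature
    neg = cosine_similarity(anchor[None, :], negatives) / temperature
    logits = np.concatenate(([pos], neg))
    logits = logits - logits.max()            # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on distances: positive should be closer than negative by a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Toy latent representations.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])
negatives = np.array([[-1.0, 0.0], [0.0, 1.0]])

loss_good = info_nce(anchor, positive, negatives)
# Swapping the roles of the positive and one negative should raise the loss.
loss_bad = info_nce(anchor, negatives[0], np.stack([positive, negatives[1]]))
```

Both losses decrease as the positive moves towards the anchor and the negatives move away from it; the InfoNCE-style loss handles several negatives at once, whereas the triplet loss compares one positive with one negative.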

Models trained with contrastive pretext tasks acquire a deep understanding of the meaningful and informative similarity patterns within the data. This knowledge proves to be highly valuable, as it can enhance the model’s performance in a wide range of downstream tasks [52]. Note that certain methods exclude negative samples altogether, relying solely on positive samples for the contrastive proxy task.

The main differences among the existing contrastive tasks in the literature lie in the way they select the positive and negative pairs for each sample. Although many strategies exist, two are predominant and, thus, we distinguish two principal methodologies for contrastive tasks: augmentation contrast and sampling contrast.

2.1.2.1 Augmentation contrast

This is the most popular strategy for generating pairs in contrastive learning. Augmentation transformations are used at some stage of the process to either generate positive samples, negative samples or both. Commonly, an augmentation transformation \(T_{p}\) is applied to the anchor \(x_i\), generating an augmented view \(x_p\) that serves as its positive sample. The pretext task aims to maximize the similarity between the latent representations of the anchor \(x_i\) and the positive sample \(x_p\), while minimizing the similarity between the anchor and negative samples \(x_n\). These negative samples are obtained either by augmenting the anchor with a transformation \(T_{n}\) or by selecting other samples from the batch. Typically, authors choose positive transformations with the goal of enforcing model invariance to perturbations associated with the specific augmentation. For example, a common positive transformation involves introducing Gaussian noise to samples, fostering invariance in the model to noisy samples, as noise often does not alter the class or nature of the samples in many contexts (depending on the problem). Conversely, negative transformations, when employed, aim to disrupt the inherent nature of the data, creating "corrupted" views of the anchor that are presumed to exhibit distinct properties and characteristics compared to the original sample [52]. Common negative transformations in contrastive learning include masking, resizing, color distortion, rotation, and the application of filters [55]. However, the selection of these transformations for generating positive and negative samples depends on the context of the problem and the assumptions made by the authors about the data [57]. It is noteworthy that some studies combine multiple transformations to augment samples and generate pairs in contrastive tasks [55].
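
One common configuration of this strategy, Gaussian noise as the positive transformation \(T_{p}\) and the remaining samples of the batch as negatives, can be sketched as follows. The function names, the noise level and the random toy batch are hypothetical, and the pairs are formed in input space; in practice the similarity would be computed between encoder outputs.

```python
import numpy as np

def jitter(x, sigma, rng):
    """Positive transformation T_p: add zero-mean Gaussian noise."""
    return x + sigma * rng.standard_normal(x.shape)

def augmentation_pairs(batch, sigma, rng):
    """For each anchor (row) return its positive view and its negatives.

    positives[i] is the jittered view of batch[i];
    negatives[i] holds all other rows of the batch.
    """
    positives = jitter(batch, sigma, rng)
    n = len(batch)
    negatives = np.stack([np.delete(batch, i, axis=0) for i in range(n)])
    return positives, negatives

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 64))     # 8 toy series of length 64
positives, negatives = augmentation_pairs(batch, sigma=0.1, rng=rng)
```

Using other batch samples as negatives implicitly assumes that they differ from the anchor, an assumption that may not hold when the batch contains several near-identical normal windows.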

2.1.2.2 Sampling contrast

This approach avoids dependence on transformations for the selection of pairs in contrastive proxy tasks. Instead, it relies on leveraging specialized knowledge about the data to make assumptions regarding the similarity of various instances or components of the data to select positive and negative samples. For instance, in enhancing image classification, positive samples for the anchor can be chosen based on whether they belong to the same class in the classification problem [58]. In the context of video representation learning, an assumption can be made that consecutive frames of a video should exhibit a similar representation in the latent space. Consequently, contrastive pairs for the anchor can be generated by considering distances between frames in a video [59].
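
The frame-distance idea above can be carried over to windows of a time series, as in the hedged sketch below: windows close in time to the anchor are treated as positives and distant windows as negatives. The function name, the neighborhood radius `near` and the number of negatives are hypothetical, and the underlying assumption (temporal proximity implies similarity) is a modelling choice rather than a general fact.

```python
import numpy as np

def sample_temporal_pairs(num_windows, anchor, near, rng, num_negatives=4):
    """Pick one positive within `near` steps of the anchor and distant negatives."""
    idx = np.arange(num_windows)
    dist = np.abs(idx - anchor)
    positive_pool = idx[(dist > 0) & (dist <= near)]
    negative_pool = idx[dist > near]
    positive = rng.choice(positive_pool)
    negatives = rng.choice(negative_pool, size=num_negatives, replace=False)
    return positive, negatives

rng = np.random.default_rng(0)
positive, negatives = sample_temporal_pairs(num_windows=50, anchor=10, near=2, rng=rng)
```

The returned indices select which windows form the positive and negative pairs for the anchor; no transformation of the data is involved, which is what separates this strategy from augmentation contrast.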

The techniques outlined in this section revolve around different types of methods within self-supervised learning. It is evident that the choice of pretext tasks in self-supervised learning must align with the specific data type, rendering it unfeasible to employ all proxy tasks universally for all data types. In the following sections, we focus on the established methods that employ self-supervised proxy tasks for handling time series data and do not rely on the use of labeled data, considering anomaly detection as the downstream task at hand.

3 A taxonomy for Self-supervised learning-based anomaly detection in time series↩︎

Self-supervised techniques for identifying anomalies in time series data can be categorized based on two primary axes derived from the content of the previous sections. First, the focus of the time series anomaly detector concerning the context in which anomalies are detected. Second, the type of pretext tasks considered during model training.

Figure 2: Taxonomy for self-supervised time series anomaly detection approaches.

3.1 Context of the anomaly detection task↩︎

As noted in the introduction, time series anomaly detection tasks are commonly categorized into two main scenarios. This axis distinguishes approaches that address the time series anomaly detection task in the following two contexts:

  • Anomaly detection in the local context. Anomaly detection in the local context refers to tasks focused on identifying unusual patterns that differ from the typical patterns and characteristics within an individual time series. Anomalies within this context may consist of individual points that are abnormal or subsequences of the time series that exhibit abnormal behavior.

  • Anomaly detection in the global context. This type of time series anomaly detection involves identifying abnormalities within a dataset containing multiple complete time series. Here, the anomalies being detected are instances of entire time series that display anomalous behavior compared to the normal properties exhibited by the rest of the samples in the dataset.

It is worth noting that certain studies recast an anomaly detection task in the local context as a task in the global context. One such approach involves dividing an individual time series (local context) into various subsequences and employing a method that treats each resulting subsequence as a sample within a dataset containing all the subsequences (global context). In this survey, we classify each method along this axis according to the problem it aims to address. For instance, techniques following the aforementioned methodology are classified as anomaly detectors in the local context, as they aim to identify abnormal subsequences.

3.2 Type of pretext tasks↩︎

As mentioned before, there exist two types of pretext tasks, namely self-predictive and contrastive. Furthermore, multiple self-supervised pretext tasks can be combined during the training of the models for better capturing the normality of data. There are some approaches that consider only a single type of proxy task, while others combine both self-predictive and contrastive pretext tasks in the training of the models. Based on this, we can distinguish between two primary categories of approaches:

  • Single-type approaches. These approaches exclusively rely on a single type of pretext task. Specifically, any approach that exclusively employs a single pretext task falls within this category. Moreover, within the set of approaches that incorporate multiple pretext tasks, those that utilize only one type of pretext task also fall into this same group. In this category, we discern methods that rely exclusively on self-predictive pretext tasks and those that exclusively focus on contrastive pretext tasks.

  • Multi-type approaches. In this scenario, both self-predictive and contrastive learning objectives are integrated in the training of anomaly detection models. It is important to note that, unlike the single-type approaches, where models can be trained using one or more pretext tasks, in this case, models always incorporate a minimum of two proxy tasks in their training.

Considering these two fundamental dimensions and the various values they encompass, we can delineate six distinct categories of contributions. The taxonomy established for self-supervised time series anomaly detection methods is visually depicted in Figure 2, together with the number of analyzed contributions belonging to each category. In the subsequent sections, we will expound upon each of these potential categories. Specifically, the following two sections are structured based on the first axis of the taxonomy, addressing the context of the anomaly detection task. Within these sections, we further partition the subsections according to the remaining axis of the taxonomy, which relates to the types of pretext tasks considered in the approach.

4 Anomaly Detection Methods in the Local Context↩︎

Most of the self-supervised methods for time series anomaly detection focus on identifying anomalous patterns within the local context of individual time series \(X = \lbrace x_1, x_2, ..., x_L \rbrace \in \mathbb{R} ^ {d \times L}\), where \(d\) denotes the number of variables considered (\(d=1\) for univariate time series) and \(L\) represents the sequence length. The primary objective of these methods is to detect anomalous points \(x_t \in \mathbb{R}^d\), anomalous subsequences \(X_{t,w} = \lbrace x_{t-w+1}, ..., x_t \rbrace \in \mathbb{R}^{d \times w}\), or both, within the target time series \(X\).

To achieve this goal, these models typically learn one or more proxy tasks tailored to capture the normality of data based on a designated normal region of \(X\), which serves as the training set. Typically, time series are segmented by using sliding windows and the proxy tasks are computed based on the resulting subsequences. Subsequently, the knowledge acquired through these proxy tasks is utilized to determine the anomaly score associated with new points and/or subsequences within \(X\), constituting the evaluation set.
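The sliding-window segmentation mentioned above can be sketched as follows. This is a minimal illustration, assuming a series stored as a NumPy array of shape \(d \times L\); the function name `sliding_windows` and the `stride` parameter are hypothetical and not taken from any of the reviewed works.

```python
import numpy as np

def sliding_windows(X, w, stride=1):
    """Split a series X of shape (d, L) into windows of shape (n, d, w)."""
    d, L = X.shape
    if w > L:
        raise ValueError("window length w must not exceed series length L")
    starts = range(0, L - w + 1, stride)
    return np.stack([X[:, s:s + w] for s in starts])

# Toy series with d=2 variables and L=10 time steps.
X = np.arange(20, dtype=float).reshape(2, 10)
windows = sliding_windows(X, w=4, stride=2)
print(windows.shape)  # (4, 2, 4)
```

Each resulting window can then be fed to the proxy task, and scores computed on windows are mapped back to the timestamps they cover.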

The studies falling into this category, along with their specific properties, are summarized in Table 1. Specifically, the table outlines the method employed in the self-supervised task(s) considered, the type of outliers that the method identifies, the measure considered for computing anomaly scores, and whether the approach addresses univariate or multivariate time series. The anomaly score column distinguishes three approaches for its computation: evaluating the model’s classification performance (CP in the table, encompassing classification losses and misclassifications), assessing residual errors (RE, covering reconstruction and prediction errors), and utilizing metrics associated with the latent representations produced by the model (RM, involving representation similarities and distances). We include the use of contrastive losses within the last subgroup, as these compute similarity and distance measures between the latent representations of data parts or samples.

Table 1: Summary of self-supervised methods for time series anomaly detection in the local context
Paper Type of PT Anomaly Type Anomaly Score Dim (U / M)
Self-predictive
[60] Rec Point RE
[61] Rec Point RE
[62] Rec Point RE
[63] Rec Point RE
[64] Rec Subseq RE
[65] Rec Point, Subseq RE
[66] Rec Point, Subseq RE
[67] Rec Point, Subseq RE
[68] Rec Point, Subseq RE
[69] For Point RE
[70] For Point RE
[71] For Point RE
[72] For Point, Subseq RE
[73] Class Point CP
[74] Class Subseq CP
[75] Class Subseq CP
[76] Rec + For Point RE
[77] Rec + Class Point RE
Contrastive
[78] SC Point RM
[79] SC Point RM
[80] SC Point RM
[81] AC Point RM
[82] AC Point RM
[83] AC Subseq RM
[84] AC Point RM
[85] AC Point RM
[86] AC Point RM
[87] SC + AC Point RM
[88] SC + AC Point RM
Multi-type
[89] Rec + SC Point RE + RM
[90] Rec + AC Point RE
[91] Rec + AC Point, Subseq RM
[92] Rec + SC + AC Point RE
[93] For + SC Point RE
[94] Class + SC Subseq RM
[95] Class + SC + AC Point RM

4.1 Single-type self-supervised approaches↩︎

Single-type self-supervised learning approaches make use of one or more pretext tasks, always of the same type (either self-predictive or contrastive), to detect local anomalies within a target time series.

4.1.1 Self-predictive approaches↩︎

Methods in this category propose self-predictive proxy tasks to capture the normality of individual time series. Most methods in this group utilize self-supervised reconstruction as a proxy task. This involves applying a transformation to the samples in the training set to generate augmented views for each sample. Subsequently, the proxy tasks revolve around attempting to reconstruct the original samples from their corresponding augmented views. The augmented views of the training samples can be viewed as ‘corrupted’ versions of the original input data. By learning to reconstruct the original samples from their augmentations, the model captures the underlying patterns and structures of normal data during the training phase, identifying the expected features and representations of the ‘non-corrupted’ data. The primary difference among these approaches lies in the type of transformation considered for the self-supervised reconstruction-based proxy task.

Drawing from the concept of ‘masked language models’ introduced in [51], [96], one of the prevailing strategies involves employing masking as a transformation for self-supervised reconstruction [60], [61], [66]–[68]. These studies propose a pretext task centered around ‘masked time series modeling’ to capture meaningful data representations. The authors utilize a random masking technique to replace a portion of the normal input subsequences with either random values or predefined constants. Subsequently, the models are trained to reconstruct the original non-masked subsequences using their masked versions as inputs, thereby minimizing reconstruction errors. Once a self-supervised reconstruction-based model is trained, it is assumed that difficulty in faithfully reconstructing a portion of the data indicates that this portion diverges from the patterns learned during training. This divergence can strongly indicate anomalies or irregularities in the input. Consequently, during inference, the reconstruction errors of points and subsequences are utilized as anomaly scores to categorize them as normal or abnormal.

The works by [60], [61], [66] conduct inference by inputting non-masked subsequences into their trained models. They then utilize the reconstruction errors associated with specific timestamps to assess the anomaly score of points in the time series. It is worth noting that in the case of [61], [66], their methods are adapted to perform point outlier detection in time series streaming data, which is updated in real time in an online fashion. Other works, such as [67], [68], reproduce the proxy task during inference by masking parts of the subsequences to be evaluated and reconstructing them using their models. Subsequently, they compute the anomaly scores of both points and subsequences based on their reconstruction errors.
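The following toy sketch illustrates masked reconstruction with inference on non-masked inputs. It rests on several assumptions that are not drawn from the cited works: the data are synthetic unit-amplitude sine windows, masked values are replaced by the constant 0, and a plain linear least-squares map stands in for the neural networks used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
w, n = 16, 500
t = np.arange(w)

# Normal training windows: unit-amplitude sines with random phase (toy data).
phases = rng.uniform(0, 2 * np.pi, size=(n, 1))
train = np.sin(2 * np.pi * t / w + phases)

# Pretext task: replace a random 25% of each window with the constant 0.
mask = rng.random(train.shape) < 0.25
masked = np.where(mask, 0.0, train)

# Stand-in model: a linear map fitted to reconstruct originals from masked views.
W, *_ = np.linalg.lstsq(masked, train, rcond=None)

def point_scores(window):
    """Per-timestamp reconstruction error of a non-masked window."""
    return np.abs(window @ W - window)

# Test window with an injected point anomaly at index 5.
test = np.sin(2 * np.pi * t / w + 1.0)
test[5] += 5.0
scores = point_scores(test)
print(int(scores.argmax()))
```

The timestamp with the largest reconstruction error is the candidate point anomaly; averaging the errors over a window would give a subsequence-level score.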

In addition to masking, some studies have explored the use of noise injection as a transformation for self-supervised reconstruction. For instance, the works of [62], [65] introduce denoising autoencoders for time series anomaly detection. These models inject Gaussian noise to corrupt input samples and aim to reconstruct their original, non-corrupted versions. During the inference phase, the anomaly score for new timestamps and subsequences is computed based on the model’s reconstruction errors.

To end with self-supervised reconstruction, regular autoencoders have been widely used for identifying point anomalies [63] and subsequence anomalies [64] in time series. Autoencoders encode time series data into a more compact latent space and then reconstruct the original data from its latent representation. Since the encoding process removes part of the information in the original data through a neural network-based transformation (the encoder), these methods are included in the category of self-supervised reconstruction.

Within this category, we also include the techniques associated with self-supervised forecasting. In this case, the temporal patterns inherent in time series data are leveraged to construct the self-supervised pretext task. In these works, models are trained to predict future timestamps (next-step prediction) and future subsequences (multi-step ahead prediction) with respect to the current window [70]–[72]. Then, during inference, the anomaly scores of new points and subsequences are computed by assessing the prediction errors of the models. As a novelty, [69] introduces a pretext task centered around next-step prediction using partially masked windows as inputs. The authors argue that, by masking a part of the context window, the representations learned by the models are more robust to noise and better capture the normality of the data.
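As a minimal sketch of forecasting-based scoring, the example below fits a linear autoregressive predictor by least squares and uses the absolute next-step prediction error as the anomaly score. The AR model, the function names and the synthetic sine data are assumptions chosen for brevity; the reviewed works rely on neural forecasters.

```python
import numpy as np

def lagged(series, p):
    """Stack the p previous values for every predictable timestamp."""
    return np.stack([series[i:i + p] for i in range(len(series) - p)])

def fit_ar(series, p):
    """Least-squares AR(p): predict x_t from the previous p values."""
    coef, *_ = np.linalg.lstsq(lagged(series, p), series[p:], rcond=None)
    return coef

def forecast_scores(series, coef):
    """Absolute next-step prediction error; the first p scores stay at 0."""
    p = len(coef)
    scores = np.zeros(len(series))
    scores[p:] = np.abs(series[p:] - lagged(series, p) @ coef)
    return scores

train = np.sin(0.2 * np.arange(400))   # normal region used for training
coef = fit_ar(train, p=2)

test = np.sin(0.2 * np.arange(100))
test[60] += 3.0                        # injected point anomaly
scores = forecast_scores(test, coef)
print(np.round(scores[58:64], 3))
```

Note that the timestamps immediately after the anomaly also receive high scores, because the corrupted value enters the context window used for their predictions.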

Alternative approaches in anomaly detection focus on employing self-supervised classification to identify anomalous points and subsequences within individual time series. The approaches within this category primarily differ in the choice of transformations considered for the self-supervised classification.

The works by [74], [75] employ self-supervised classification for identifying anomalous subsequences. In [74], downsampling is utilized at various rates, with each downsampling rate representing a distinct class for the self-supervised classification task. Conversely, [75] proposes employing jittering (adding Gaussian noise) and vertically flipping the subsequences to construct a binary self-supervised classification task. This binary classification aims to distinguish between the two resulting augmented views of each original sample. During training, the model is trained to differentiate between these augmented views by minimizing the mean cross-entropy value across the augmented views for each sample. For the detection stage, both approaches replicate the self-supervised classification task and utilize the classification loss as the anomaly score for new subsequences.
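The binary pretext task based on jittering and flipping can be sketched as below. This is an illustration under stated assumptions: vertical flipping is interpreted as negating the values, `sigma` is a hypothetical noise level, and the classifier itself is omitted, so only the construction of pseudo-labelled views and the per-sample cross-entropy are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pretext_batch(windows, sigma=0.05):
    """Return augmented views and the pseudo-label of the transformation used."""
    jittered = windows + rng.normal(0.0, sigma, size=windows.shape)  # class 0
    flipped = -windows                      # class 1 (assumed: sign flip)
    views = np.concatenate([jittered, flipped])
    labels = np.concatenate([np.zeros(len(windows), dtype=int),
                             np.ones(len(windows), dtype=int)])
    return views, labels

def cross_entropy(probs, labels):
    """Per-sample cross-entropy given predicted class probabilities."""
    return -np.log(probs[np.arange(len(labels)), labels])

windows = rng.normal(size=(8, 32))          # toy batch of univariate windows
views, labels = make_pretext_batch(windows)
print(views.shape, labels.sum())            # (16, 32) 8
```

At detection time, the same views would be built for each new subsequence and the classifier's cross-entropy on them used as the anomaly score.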

The approach presented in [73] adapts self-supervised classification to identify abnormal timestamps in the local context of a time series. To construct the proxy task, the authors propose four different transformations, known as degradations, which involve introducing noise and randomly substituting some values within a time series in different ways. During training, they degrade some segments within the time series to simulate abnormalities. The model is then trained to distinguish between the normal and the degraded timestamps in the time series, that is, performing a binary classification. In the inference phase, no degradation is applied to the new samples, and the model is utilized to classify if each timestamp of the input is anomalous or not.

So far, we have explored the three types of self-predictive pretext tasks for detecting anomalies in local time series data. However, some approaches combine different types of tasks to train their models. After model training, one or more of these proxy tasks are replicated during inference, and the anomaly scores for new segments of the time series data are derived from the model’s performance in reproducing these tasks.

For example, [76] combines masked reconstruction and masked forecasting to extract more generalized features for pinpointing anomalies at specific timestamps within time series data. They simultaneously train both pretext tasks in a multi-task manner, calculating the overall loss as the sum of the reconstruction errors and the prediction errors of the model. During inference, they repeat this process, determining the anomaly score as a weighted sum of these errors, with the weights adjusted by a fixed hyperparameter.
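As an illustration only, one possible form of such a weighted score is given below. The symbols \(\lambda\), \(\mathrm{RE}\) and \(\mathrm{PE}\) are notation introduced here and are not taken from [76], whose exact weighting scheme may differ:

\[
\mathrm{AS}(x_t) = \lambda \, \mathrm{RE}(x_t) + (1 - \lambda) \, \mathrm{PE}(x_t), \qquad \lambda \in [0, 1],
\]

where \(\mathrm{RE}(x_t)\) and \(\mathrm{PE}(x_t)\) denote the reconstruction and prediction errors of the model at timestamp \(t\), and \(\lambda\) is the fixed hyperparameter balancing both terms.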

Finally, in the study conducted by [77], the proposed method involves learning a regular reconstruction alongside a self-supervised classification task. To better capture spatial correlations across various dimensions in multivariate time series data, the authors advocate for the utilization of graph convolutional networks. The model is designed to encode and reconstruct the original subsequences while simultaneously learning the following self-supervised classification task. Each subsequence is augmented once through random permutations, and the model is trained to distinguish between the original subsequence and its corrupted augmented view. During the inference phase, the anomaly score for new timestamps is determined solely by evaluating the reconstruction errors generated by the model. It is important to note that during this phase, the classification task is disregarded and not reproduced for computing the anomaly score of new timestamps.

This last method differs from the approach presented by [76] where all proxy tasks are reproduced during inference. Here, the self-supervised classification is introduced during the training phase of the models to enhance the learning of better representations of data in addition to the reconstruction task. Nevertheless, it is the reconstruction task that plays an active role in inferring the anomaly score for new timestamps. This strategy, which involves integrating auxiliary proxy tasks that facilitate the learning of enhanced representations, is employed by several works discussed in the following sections of this paper. These auxiliary tasks are not considered during the inference phase of the models but contribute to the overall improvement of the learned representations.

4.1.2 Contrastive approaches↩︎

The works in this subsection utilize contrastive proxy tasks to capture the local dynamics of individual time series and conduct anomaly detection to identify unusual points and subsequences. Commonly, time series are preprocessed by splitting them into time windows, and the pretext tasks are computed based on the resulting subsequences. We primarily distinguish between approaches employing sampling contrast and those considering augmentation contrast instead.

A few works build upon sampling contrast, where no transformations are considered for the contrastive proxy task. Instead, authors make various assumptions about data similarities and dissimilarities to generate the contrastive pairs. The first proposal in this field is [80]. Two distinct representations are derived from the input subsequences through self-attention mechanisms: the patch-wise representation, capturing the relationships among points at the same position in each segment, and the in-patch representation, depicting the relations between sample points within the same segment. The contrastive objective aims to bring these two representations together in a shared latent space. Finally, the anomaly score for new timestamps is calculated as the discrepancy between the in-patch and patch-wise representations of the testing data.

The remaining studies employing sampling contrast adopt a methodology known as Contrastive Predictive Coding (CPC), which falls under a subcategory termed ‘prediction contrast’ in the literature [29]. CPC focuses on training an encoder to produce representations of subsequences that are useful for predicting future values. In the works by [78], [79], an encoder, an autoregressive model, and a non-linear projection head are utilized. Initially, the encoder generates the latent representation of each input subsequence. Subsequently, the autoregressive model predicts a context vector for a prediction window extending one or more steps into the future from the current window. Finally, the non-linear projection head estimates the latent representations of the prediction window, and the InfoNCE contrastive loss is employed for model training. Essentially, this approach resembles self-supervised forecasting, but with a focus on estimating the latent representation of the future window using a contrastive learning objective, rather than explicitly predicting its content. Put simply, the prediction window is treated as a positive sample for the anchor, while other windows serve as negative samples.
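A minimal sketch of the InfoNCE objective for a single anchor is given below. It assumes cosine similarity and a hypothetical `temperature` parameter as simplifications; CPC-style models typically learn the scoring function, and the encoder, autoregressive model and projection head are omitted here.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor, using cosine similarity as the score."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # The positive pair is placed first, followed by all negative pairs.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, neg) for neg in negatives]) / temperature
    logits = logits - logits.max()          # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

anchor = np.array([1.0, 0.0])
# Positive aligned with the anchor: low loss.
loss_good = info_nce(anchor, np.array([1.0, 0.0]),
                     [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
# Positive opposed to the anchor while a negative is aligned: high loss.
loss_bad = info_nce(anchor, np.array([-1.0, 0.0]),
                    [np.array([0.0, 1.0]), np.array([1.0, 0.0])])
print(loss_good < loss_bad)  # True
```

The loss decreases as the anchor becomes more similar to its positive sample than to the negatives, which is the behavior the pretext task encourages for predicted and actual future windows.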

After model training, [78] represents normal sample latent representations in the latent space using a Gaussian distribution. For each new subsequence, they calculate the probability density of its representation with respect to the Gaussian distribution to identify abnormal points. On the other hand, [79] determines abnormal points in time series by modeling the anomaly score through a variant of the CPC loss.
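The density-based scoring described for [78] can be sketched as follows, assuming latent representations are already available as rows of a matrix. The function names, the small ridge term added to the covariance and the use of the negative log-density as the score are assumptions made for this illustration.

```python
import numpy as np

def fit_gaussian(Z):
    """Fit a multivariate Gaussian to normal latent representations Z (n, k)."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # ridge for stability
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def neg_log_density(z, mu, prec, logdet):
    """Negative Gaussian log-density: higher values suggest abnormality."""
    diff = z - mu
    k = len(mu)
    return float(0.5 * (diff @ prec @ diff + logdet + k * np.log(2 * np.pi)))

rng = np.random.default_rng(0)
Z_train = rng.normal(size=(1000, 3))        # stand-in for encoder outputs
mu, prec, logdet = fit_gaussian(Z_train)

s_typical = neg_log_density(np.zeros(3), mu, prec, logdet)
s_unusual = neg_log_density(np.full(3, 6.0), mu, prec, logdet)
print(s_typical < s_unusual)  # True
```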

Apart from the previous approaches, there are various studies that build upon using augmentation contrast. In augmentation contrast, positive transformations are typically employed to encourage models to capture invariance with respect to those transformations, assuming they do not alter the normal nature of the data. Conversely, negative transformations, when employed, aim to enable models to detect patterns associated with those transformations, assumed to be indicative of unusual behavior. The primary distinction among works in this field lies in the selection of transformations considered to generate the contrastive pairs.

The work by [83] employs Gaussian noise injection into time windows as a positive transformation, while generating negatives through shuffling, scaling, and changing trends. The proposal of [81] combines various transformations, such as noise addition, signal magnitude adjustment, and time warping, to create positive samples. However, in this case other windows in the batch are used as negatives. In a different vein, [82] generates negative samples by introducing point anomalies in time windows. Unlike the previous approaches, this work does not employ any transformation to generate positive samples but assumes that neighboring windows should share similar representations in the latent space, designating them as positive pairs. All of these works utilize similarity measures, such as the contrastive loss or distances between latent representations, to compute anomaly scores for new points [81], [82] and subsequences [83].
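The kind of positive and negative transformations described for [83] can be sketched as below for a univariate window. The parameter values (`sigma`, `n_segments`, `scale`, `slope`) and the specific way each transformation is implemented are assumptions for illustration, not the settings of the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_view(x, sigma=0.05):
    """Positive sample: the window with small Gaussian noise added."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def negative_views(x, n_segments=4, scale=3.0, slope=0.1):
    """Negative samples: segment shuffling, scaling, and an added linear trend."""
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    shuffled = np.concatenate([segments[i] for i in order])
    scaled = scale * x
    trended = x + slope * np.arange(len(x))
    return [shuffled, scaled, trended]

x = np.sin(np.linspace(0, 2 * np.pi, 32))   # toy anchor window
pos = positive_view(x)
negs = negative_views(x)
print(pos.shape, [v.shape for v in negs])
```

The anchor, its positive view and its negative views would then be encoded and compared with a contrastive loss.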

Another popular method in this category is TS2Vec [84], a contrastive framework for time series representation learning. Each window is randomly masked and cropped twice to generate two augmented views. Then, two proxy tasks are learned on the basis of the overlapping regions of the augmented views:

  • Temporal contrasting: it brings together the representations of the same timestamp from the two augmented views. At the same time, it pushes away the representations of different timestamps of the same time series.

  • Instance-wise contrasting: it also treats the representations of the same timestamp from the two augmented views as a positive pair. Simultaneously, it minimizes the similarity with respect to the representations of the same timestamp in other time series in the batch.

Both proxy tasks are learned concurrently to enable the model to capture both low-level and high-level features. In the inference phase, TS2Vec targets the identification of anomalous timestamps in time series. To achieve this, the new time series is split into segments, and the latent representation of each segment is computed twice: first with the last observation (the timestamp under evaluation) masked, and then without any mask applied. The anomaly score for that point is determined by the distance between these two representations. Because the score of a timestamp depends only on that observation and the ones preceding it, TS2Vec can be employed for anomaly detection in streaming scenarios. In the work presented by [85], a closely related contrastive approach is put forth. What sets it apart from TS2Vec is the introduction of noise into overlapping windows instead of masking specific portions. Like TS2Vec, the model is trained with both the temporal and the instance-wise contrastive losses. The anomaly score computation remains the same as in TS2Vec, serving as the basis for detecting abnormal timestamps. Notably, this method for calculating anomaly scores for timestamps is adopted by several works reviewed later in this survey.

The previous works typically rely on manually defined transformations for constructing contrastive pairs. However, alternative approaches have emerged that advocate automatic neural transformations, which do not rely on assumptions about the most suitable transformations for the nature of the data. Studies such as [86]–[88] adopt this strategy, which originated from the proposal of NeuTraL AD [97] (presented in the subsequent section as a method for global time series anomaly detection).

A notable aspect of this line of research is its independence from manually defined augmentations for time series. Instead, these studies leverage neural networks to generate augmented views, which are trained concurrently with other modules in the model. For instance, in [86], each subsequence is divided into two overlapping windows: the context window and the suspect window. The suspect window undergoes augmentation through trainable neural transformations. The authors then employ the ‘Deterministic Contrastive Loss’, as introduced in NeuTraL AD, to pull the representations of the augmented views toward that of the original input while simultaneously pushing the embeddings of the augmented views away from one another. Additionally, they bring together the representations of the context window and the augmentations of the suspect window. During inference, anomaly scores for new timestamps are inferred from the training loss.

The two remaining contributions combine sampling and augmentation contrast methodologies in their proposals [87], [88]. In the study by [87], the authors combine the Deterministic Contrastive Loss of NeuTraL AD with Contrastive Predictive Coding during model training. After training, the model evaluates the anomaly score of new timestamps utilizing the Dynamic Deterministic Contrastive Loss (DDCL) inherited from NeuTraL AD. In a similar vein, [88] employs neural transformations to augment time series data for contrastive tasks. The approach involves generating augmented views for each subsequence using neural networks, which serve as the positive samples for the augmentation contrast-based task, while the representations of other windows in the batch are pushed apart. Concurrently, the model learns a task based on sampling contrast, which entails subdividing the augmented view of the anchor into various subsequences to promote similarity between closely located subsequences and reduce it for those that are more distant. After training, the same strategy as TS2Vec is employed for computing the anomaly score of new data points.

4.2 Multi-type approaches↩︎

This category encompasses techniques employing self-predictive and contrastive proxy tasks for detecting anomalies within a local context. These approaches leverage pretext tasks akin to those discussed earlier. Their primary distinction lies in the choice and combination of pretext tasks used to extract pertinent patterns and features from individual time series.

Most of the studies in this category integrate various proxy tasks with traditional reconstruction using autoencoders. For instance, [89] incorporates Contrastive Predictive Coding alongside autoencoder-based reconstruction. During training, both tasks are simultaneously learned, and during inference, the same process is reiterated. The anomaly score for new timestamps is computed by combining the reconstruction error with a modified version of the Contrastive Predictive Coding loss function.

In the study conducted by [93], focusing on point anomaly detection within sensor networks, models are simultaneously trained on a regular reconstruction-based task, a self-supervised forecasting task, and a sampling contrast-based task. The self-predictive task focuses on next-step prediction relying on the current sliding window. Given the nature of the problem, the contrastive task aims at bringing together the representations of signals from the same sensor while pushing away the representations of signals from other sensors. During training, the model combines the losses from the three tasks to facilitate simultaneous learning. In the inference phase, reconstruction errors are used as the anomaly score for new timestamps.

The proposal by [91] presents an approach for detecting abnormal points and subsequences. The method operates under the assumption that data representations should remain invariant to warping-based perturbations for effective anomaly detection. To achieve this, the authors employ two twin networks (similar to siamese networks but with potentially different weights) to learn two self-supervised reconstruction tasks and an augmentation contrast-based proxy task. Each input subsequence undergoes augmentation through transformations related to time series warping. One of the reconstruction tasks is based on a regular autoencoder, while the other reconstructs the original input using its augmented view. The augmentation-based contrastive task forms contrastive pairs using the anchor and its augmented view as the positive pair. These tasks are simultaneously learned in a multi-task manner during network training. In the inference phase, the anomaly score for new subsequences and points is calculated as the distance between their embeddings and those of their K nearest neighbors.
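A nearest-neighbor score of this kind can be sketched as below. Averaging the Euclidean distances to the `k` nearest training embeddings is an assumption made here; the exact aggregation used in [91] is not specified in this survey, and random vectors stand in for the encoder outputs.

```python
import numpy as np

def knn_score(z, train_Z, k=5):
    """Mean Euclidean distance from embedding z to its k nearest training embeddings."""
    dists = np.linalg.norm(train_Z - z, axis=1)
    return float(np.sort(dists)[:k].mean())

rng = np.random.default_rng(0)
train_Z = rng.normal(size=(200, 4))         # stand-in for normal embeddings

s_near = knn_score(np.zeros(4), train_Z)
s_far = knn_score(np.full(4, 10.0), train_Z)
print(s_near < s_far)  # True
```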

In the case of [92], regular reconstruction is combined with a sampling contrast task and an augmentation contrast task. The first contrastive task employs contextual contrasting, bringing together the latent representations of adjacent timestamps. In the second task, Gaussian noise is added to the original windows and their frequency spectra to create two augmented views from each subsequence, which constitute the positive pair. The three tasks are learned concurrently. During inference, point abnormalities are determined primarily by the model’s reconstruction errors, with the contrastive tasks excluded from the anomaly detection stage.

The last contribution that combines reconstruction with other pretext tasks is an augmentation contrast-based pre-training framework [90]. The method operates on multivariate time series data, where a variable within the series is randomly selected (the anchor) and augmented through random masking to create a positive sample. Another variable is then chosen as the negative sample. By computing a latent representation that captures both spectral and temporal information, the framework aims to align the positive sample’s representation with the anchor while distancing the negative sample’s representation. Following pre-training, a decoder module is added to the model, and fine-tuning is conducted based on regular reconstruction. During inference, anomalous timestamps are detected by evaluating the reconstruction errors of new data points.

Other works in this category do not consider the use of regular reconstruction for multi-type self-supervised learning [94], [95]. [95] builds upon the TS2Vec approach, incorporating pretext tasks from that work along with two additional tasks: temporal consistency and transformation consistency. Temporal consistency is a self-supervised classification task. Given an anchor window, a neighboring window, and a non-neighboring window, the task aims to differentiate between the neighboring and non-neighboring windows. Transformation consistency augments anchor windows using two types of augmentations: weak and strong. Weak augmentation introduces slight variations, while strong augmentation produces substantially different views. This task considers the augmentations of the same anchor as positive samples to the anchor, while treating the weak augmentations of other samples as negatives. The model concurrently learns these tasks, and anomaly scores for new timestamps are computed following TS2Vec’s methodology.

Finally, another method belonging to this category is COUTA [94]. This approach combines self-supervised classification with Deep Support Vector Data Description (SVDD), a widely recognized sampling contrast method for anomaly detection [98]. SVDD is a deep one-class classification technique that maps normal training data into a higher-dimensional latent space to delineate a hypersphere that encloses normal instances and separates them from potential anomalies. The model is trained to position normal instances close to the center of the hyperspherical latent space, where the center is presumed to embody the normal behavior of the data. Subsequently, outlier detection relies on measuring the distance of a data sample from the hypersphere’s center to compute its anomaly score. In COUTA [94], an encoder is employed to map time series subsequences to a hyperspherical latent space, aiming to minimize their distance from the center. Unlike in SVDD, the distances of normal representations from the center are modeled as a Gaussian distribution. This addition imposes a penalty on predictions associated with high model uncertainty, effectively addressing issues related to anomaly contamination in the training set. Simultaneously, the model learns the self-supervised classification task, where the transformations considered aim to simulate point and subsequence anomalies (e.g., replacing timestamps with extreme values or substituting segments with random subsequences). Hence, a binary classifier learns to distinguish between normal inputs and their abnormal augmented views. In the inference phase, the model encodes new time series subsequences, and the distance between their latent representations and the center of the hyperspherical latent space is used as their anomaly score.
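
The distance-based scoring described above can be illustrated with a minimal sketch. It assumes that an encoder has already produced the latent representations (random arrays stand in for them here), estimates the center as the mean of the normal training representations, and uses the squared Euclidean distance as the score. The function names `fit_center` and `svdd_anomaly_score` are hypothetical, and the Gaussian uncertainty modeling of COUTA is not covered.

```python
import numpy as np

def fit_center(train_embeddings: np.ndarray) -> np.ndarray:
    """Estimate the hypersphere center as the mean embedding of normal training data."""
    return train_embeddings.mean(axis=0)

def svdd_anomaly_score(embeddings: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance of each embedding (row) to the center."""
    return np.sum((embeddings - center) ** 2, axis=1)

# Stand-in embeddings (assumption: a trained encoder would produce these).
rng = np.random.default_rng(0)
train_z = rng.normal(0.0, 0.1, size=(200, 8))   # normal training subsequences
center = fit_center(train_z)

test_z = np.vstack([
    rng.normal(0.0, 0.1, size=(1, 8)),          # behaves like the training data
    np.full((1, 8), 3.0),                       # lies far from the center
])
scores = svdd_anomaly_score(test_z, center)     # larger score = more anomalous
```

In this toy setting, the second test representation receives a much larger score than the first, since it lies far from the region occupied by the normal representations.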

5 Anomaly Detection Methods in the Global Context↩︎

Self-supervised anomaly detection approaches in the global context aim to identify complete time series that deviate from the expected normal patterns and characteristics of a collection of time series samples. Models are trained on a dataset consisting solely of normal time series \(\mathbf{X} = \lbrace X_i \rbrace ^N_{i=1}\), where \(N\) is the total number of sequences in the dataset, and \(X_i \in \mathbb{R}^{d \times L}\) with \(d\) variables or dimensions (\(d=1\) for univariate time series) and length \(L\). The objective in this scenario is to train a model on the dataset \(\mathbf{X}\) by learning one or more self-supervised proxy tasks designed to capture the normality of the data samples. Once the model is trained, the acquired knowledge is utilized to assess the degree of abnormality in new samples, which are then categorized as either normal or anomalous. Specifically, for each new sample \(X_{new}\), the trained model is used to calculate its associated anomaly score \(\textrm{AS}(X_{new})\). Thus, the higher \(\textrm{AS}(X_{new})\) is, the more likely \(X_{new}\) is to be an anomalous time series.

The studies categorized here, along with their specific characteristics, are presented in Table 2. This table provides details on the self-supervised task(s) utilized, the metric used for calculating anomaly scores, and whether the method is applied to univariate or multivariate time series data. As in the previous section, the column referring to the anomaly scores outlines different methods for their computation: evaluating the model’s classification performance (through classification losses and misclassifications), assessing residual errors (such as reconstruction and prediction errors), and utilizing metrics associated with the model’s latent representations (involving representation similarities and distances). Contrastive losses are also included in this last category, as they calculate similarity and distance measures between the latent representations of data segments or samples.

Table 2: Summary of self-supervised methods for time series anomaly detection in the global context
Paper Type of PT Anomaly Score Dim (U, M)
Self-predictive
[99] Class CP
[100] Class CP
[101] Class CP
[102] Class CP
[103] Class RM
[104] Rec RE
[105] Rec + Class RE
[106] Rec + Class RE
[107] Rec + Class RM
[108] Rec + Class CP + RE
Contrastive
[97] AC RM
[109] AC RM
[110] SC + AC RM
Multi-type
[111] Class + AC RM
[112] Class + AC RM
[113] Class + Rec + SC RE + RM

5.1 Single-type self-supervised learning↩︎

Single-type self-supervised learning methods in this section employ a single type of task, such as self-predictive or contrastive, to identify outlier time series.

5.1.1 Self-predictive approaches↩︎

Recall from previous sections that self-predictive proxy tasks operate at the data-sample level. In the context of global time series anomaly detection, the models are fed with samples representing complete time series and capture their normality by means of one or more proxy tasks. Once the pretext tasks are learned from normal samples, the models leverage the acquired knowledge to compute the anomaly score of new samples, which are then classified as normal or abnormal.

Most methods in this category utilize self-supervised classification-based proxy tasks. In this approach, a set of \(K\) transformations is applied to each input sample, resulting in \(K\) augmented views per sample in the training set. The model is trained to predict the transformation applied to each input sample for generating its augmented views, thus creating a self-supervised classification with \(K\) possible classes (one for each transformation). The augmented views of training samples serve as inputs to the model, while their associated transformations serve as pseudo-labels for the self-supervised classification.
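
As an illustration of how such a pretext dataset can be assembled, the following minimal sketch applies \(K=4\) transformations to every training sample and records the index of the transformation as the pseudo-label. The transformation set (identity, time reversal, sign flip, amplitude scaling) and the name `make_pretext_dataset` are assumptions made for this example and do not correspond to any specific reviewed method.

```python
import numpy as np

# Hypothetical transformation set (K = 4); the reviewed works define their own.
TRANSFORMS = [
    lambda x: x,               # identity
    lambda x: x[..., ::-1],    # time reversal along the last (time) axis
    lambda x: -x,              # sign flip
    lambda x: 2.0 * x,         # amplitude scaling
]

def make_pretext_dataset(X: np.ndarray):
    """Build the self-supervised classification dataset.

    X has shape (N, d, L). Returns the augmented views with shape (N*K, d, L)
    and the pseudo-labels with shape (N*K,), where label k means that
    transformation k generated the view.
    """
    views, labels = [], []
    for k, transform in enumerate(TRANSFORMS):
        views.append(np.stack([transform(x) for x in X]))
        labels.append(np.full(len(X), k))
    return np.concatenate(views), np.concatenate(labels)

# Toy data: N = 5 bivariate series of length 16.
X = np.random.default_rng(0).normal(size=(5, 2, 16))
views, labels = make_pretext_dataset(X)
```

Any classifier can then be trained on `views` and `labels`; no human annotation is involved at any point.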

The main difference among the approaches that employ self-supervised classification for time series anomaly detection lies in the choice of the transformations used to augment the input samples. In this context, the transformations must fulfill two criteria:

  • Break the normality. The transformations considered need to generate augmented views that disrupt the normal nature of the data, generating ‘corrupted’ versions of the samples that do not follow the normal patterns and characteristics of the data distribution. Note that the selection of a transformation that breaks the normality of the data depends on the context of the problem at hand. Therefore, the augmentations that fulfill this requirement in some problems might not be suitable for performing anomaly detection in other problems.

  • Diversity. The transformations considered should generate augmented views that share relevant semantic information with the original data. Ideally, the resulting augmented views should be diverse, without redundancy in the information they retain.

The literature presents various approaches employing manually predefined transformations for self-supervised classification in global time series anomaly detection. For example, [101] proposes resizing normal training time series at different scales, with each scale representing a distinct class. Other works suggest applying \(K\) different affine transformations to input samples to generate one augmented view per transformation [99], [102]. In another approach, [103] proposes augmenting normal samples with two transformations: one altering the amplitude and the other modifying the frequency of the time series. The proxy task involves distinguishing between the original samples, the amplitude-changed augmentations, and the frequency-modified views.

Following model training, new samples undergo augmentation with the same transformations used during training, followed by the reapplication of the self-supervised classification task. Subsequently, the model’s classification performance concerning the augmented views of the new sample is evaluated to compute its anomaly score. In the approaches of [101], [102], the classification loss utilized during training is employed in the inference phase to measure the model’s classification performance and compute the anomaly score of new samples. Conversely, [99] quantifies the number of misclassifications of augmented views of new samples, generating a discrete anomaly score for time series anomaly detection. In contrast, [103] utilizes the latent representations of new samples to compute the anomaly score. During inference, latent representations of normal time series in the training set are derived from the model’s last hidden layer and represented by a multivariate Gaussian distribution. Subsequently, the anomaly score of new sequences is calculated as the Mahalanobis distance between their latent representations and the Gaussian distribution of normal representations in the model’s latent space.
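
The three scoring strategies described above can be sketched as follows. The sketch assumes that a trained classifier has already produced class probabilities for the \(K\) augmented views of a sample, and that latent representations are available as arrays. All function names are hypothetical, and the small regularization added to the covariance matrix is an assumption introduced for numerical stability.

```python
import numpy as np

def classification_loss_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over the K augmented views of one sample.

    probs: (K, K) predicted probabilities; labels: (K,) applied transformations.
    """
    eps = 1e-12  # avoids log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))

def misclassification_score(probs: np.ndarray, labels: np.ndarray) -> int:
    """Number of augmented views whose transformation is predicted incorrectly."""
    return int(np.sum(np.argmax(probs, axis=1) != labels))

def fit_gaussian(train_z: np.ndarray):
    """Fit a multivariate Gaussian to the latent representations of normal data."""
    mean = train_z.mean(axis=0)
    cov = np.cov(train_z, rowvar=False) + 1e-6 * np.eye(train_z.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(z: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance between one representation and the fitted Gaussian."""
    diff = z - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy classifier outputs for K = 3 transformations.
labels = np.arange(3)
confident = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]])
confused = np.array([[0.2, 0.5, 0.3], [0.4, 0.3, 0.3], [0.3, 0.3, 0.4]])
loss_normal = classification_loss_score(confident, labels)
loss_anomalous = classification_loss_score(confused, labels)
errors_normal = misclassification_score(confident, labels)
errors_anomalous = misclassification_score(confused, labels)

# Toy latent representations for the Mahalanobis score.
rng = np.random.default_rng(0)
mean, cov_inv = fit_gaussian(rng.normal(size=(500, 4)))
near = mahalanobis_score(np.zeros(4), mean, cov_inv)
far = mahalanobis_score(np.full(4, 5.0), mean, cov_inv)
```

The loss-based score is continuous, whereas the misclassification count is discrete and bounded by \(K\), which makes the former better suited to ranking samples by their degree of abnormality.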

While less prevalent, some studies diverge from utilizing transformations to formulate self-supervised classification tasks. A notable example is the research conducted by [100], which opts for leveraging domain-specific knowledge pertaining to the targeted problem: the DCASE2020 challenge [114]. Prior to presenting the contribution, we provide a concise overview of the particularities of this problem and its configuration. The DCASE2020 challenge revolves around acoustic anomaly detection, specifically identifying irregular sounds in various types of operating machines. Each machine from which the sounds are recorded is assigned a unique ID. Thus, each time series in the training and test datasets has an associated ID representing the specific machine from which it was generated [114]. Multiple contributions exploit this property, leveraging the "privileged information" inherent in machine IDs to construct self-supervised pretext tasks for anomaly detection. In this scenario, the authors of [100] utilize the machine IDs to form a self-supervised classification task centered on predicting the machine ID associated with each time series in the training set. In the inference phase, as in the previous approaches, anomalous time series are detected by assessing the model’s performance in predicting the ID of new time series.

In addition to self-supervised classification, we also introduce the method proposed by [104] within the realm of self-predictive approaches for detecting anomalous sequences. In this study, the authors aim to transform each time series in the training set into a predetermined target signal, which remains fixed as a hyperparameter. To achieve this, they employ a neural network to model the black box function that links the normal time series in the training data with the target signal, thereby capturing the normal distribution of the data. Subsequently, during the inference phase, they repeat the same procedure for new time series, using the prediction error as the anomaly score. We emphasize the resemblance this method shares with self-supervised reconstruction. However, in this instance, the task’s objective is not to reconstruct the original time series but to reconstruct a predefined target signal.

As we have seen, multiple works rely solely on self-supervised classification for time series anomaly detection, whereas only one method relies solely on self-supervised reconstruction for this aim. However, we find various studies in which self-supervised reconstruction is employed alongside self-supervised classification to enhance the capabilities of the models to detect abnormal time series.

The works of [105], [106] advocate for a fusion of autoencoder-based unsupervised reconstruction with an auxiliary self-supervised classification task. They utilize an autoencoder to reconstruct input time series while simultaneously learning the self-supervised classification task. The transformations considered for the self-supervised classification include jittering, reversing and scaling, each resulting in a different class to be distinguished. Additionally, representational memory modules are introduced into the models to mitigate issues stemming from noisy information in the inputs and to improve their ability to reconstruct normal samples. Once the tasks are learned concurrently, the anomaly score for new time series is based on the reconstruction errors of the autoencoders in both approaches.

Similarly, the authors in [107] also utilize both self-supervised reconstruction and self-supervised classification. For self-supervised classification, they employ transformations such as shifting and up-scaling segment values, introducing noise, and swapping two random segments. Concurrently, the model adopts an encoder-decoder structure to learn to reconstruct the original time series. Additionally, they concatenate the reconstruction error and the latent representation of the original time series and fit them to a Gaussian Mixture Model. These tasks are learned simultaneously, and during inference, the anomaly score of new time series depends on the density of the concatenation of the reconstruction error and the latent representation with respect to the Gaussian Mixture Model. Notably, this study incorporates automated machine learning, employing techniques to automatically select optimal hyperparameters for the machine learning model. It is important to highlight that in this method the anomaly scores are not derived from the proxy tasks during inference but from the Gaussian Mixture Model, which is introduced during model training to better capture the normal characteristics of the data.

Finally, another work combines self-supervised reconstruction and self-supervised classification to solve the DCASE2020 challenge and find abnormal sequences in the global context of time series anomaly detection [108]. Given the nature of the problem, they extract the spectrograms from the normal time series sequences. To augment the training data, they mix spectrograms from different machine IDs by creating randomized linear combinations. The self-supervised reconstruction task involves masking parts of these augmented views and then reconstructing them. In addition, the self-supervised classification task aims to predict the weights of the machine IDs associated with the spectrograms used in generating the augmented views. These two proxy tasks are simultaneously learned using a multi-task learning approach. For inference, each new time series serves as input for the model, and the anomaly score is computed as a weighted sum of the reconstruction error of that input and the prediction error for the associated machine ID.
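
The mixing step and the final weighted score can be sketched as follows. This is an illustrative example under several assumptions that are not taken from [108]: the mixing weights are drawn from a Dirichlet distribution so that they sum to one, the trade-off parameter `alpha` is a free choice, and the names `mix_spectrograms` and `combined_score` are hypothetical.

```python
import numpy as np

def mix_spectrograms(specs: np.ndarray, ids: np.ndarray, num_ids: int, rng):
    """Mix spectrograms from different machine IDs with random weights.

    specs: (M, F, T) spectrograms; ids: (M,) integer machine IDs.
    Returns the mixed spectrogram (F, T) and the per-ID weight vector (num_ids,)
    that the classification head would be trained to predict.
    """
    weights = rng.dirichlet(np.ones(len(specs)))     # assumption: convex weights
    mixed = np.tensordot(weights, specs, axes=1)     # weighted sum over M inputs
    target = np.zeros(num_ids)
    np.add.at(target, ids, weights)                  # accumulate weights per ID
    return mixed, target

def combined_score(recon_error: float, id_error: float, alpha: float = 0.5) -> float:
    """Weighted sum of the reconstruction error and the ID prediction error."""
    return alpha * recon_error + (1.0 - alpha) * id_error

rng = np.random.default_rng(0)
specs = rng.random((2, 4, 6))            # two toy spectrograms (F = 4, T = 6)
ids = np.array([0, 2])                   # recorded from machines 0 and 2
mixed, target = mix_spectrograms(specs, ids, num_ids=3, rng=rng)
score = combined_score(recon_error=2.0, id_error=4.0, alpha=0.25)
```

Predicting a weight vector rather than a single ID turns the classification task into a soft-label problem, so the model has to recognize which machines contributed to each mixed view and in what proportion.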

5.1.2 Contrastive approaches↩︎

These methods utilize one or more contrastive pretext tasks to train the models. The representations learned from these tasks are then utilized to determine the anomaly score of new time series, distinguishing them as normal or abnormal. Specifically, all methods in this category employ augmentation contrast, where transformations are applied to input time series samples to generate pairs for the contrastive task. Positive transformations aim to impart invariance to the model regarding normality-preserving changes, while negative samples are created by applying transformations that disrupt normality or selecting other samples from the dataset. The primary distinction among these approaches lies in the choice of transformations used to augment samples, which depends on the assumptions regarding the normal patterns and characteristics of the data made by the authors for each specific problem.

The proposal of [109] introduces a contrastive pre-training framework based on three augmentation-based contrastive tasks for time series representation learning, followed by time series anomaly detection. For each time series, this method extracts a time-based representation and a frequency-based representation by means of two encoders. The first two contrastive tasks are based on time- and frequency-consistency. Specifically, the time-based and the frequency-based representations are augmented by means of transformations applied in the temporal domain (jittering, scaling and time shifts) and in the frequency domain (removing and adding frequency components), respectively. Then, these two tasks take each of the two representations with their respective augmented views as the positive pairs, while considering the augmented views of other sequences in the batch as the negatives to the anchor. The third task enforces consistency between the two domains: the distance between the time-based and frequency-based representations of the anchor should be smaller than the distances between representation pairs that involve their augmented views. These tasks are concurrently learned in a multi-task manner during pre-training. For fine-tuning, a one-class SVM is incorporated at the top of the model to handle the anomaly detection task. The one-class SVM establishes a boundary encapsulating the majority of normal instances in the feature space, classifying sequences falling outside that boundary as anomalies during inference.

The previous approach utilizes augmented views of different samples as negative samples for the contrastive task. However, some approaches, such as COCA [110], do not incorporate negative samples in their contrastive tasks. COCA is a deep contrastive approach specifically designed for time series anomaly detection. This method integrates a contrastive pretext task with SVDD, which has been explained in the previous section. SVDD serves as a popular sampling contrast-based method that does not rely on negative pairs. Nevertheless, like other contrastive methods, it may suffer from a drawback known as ‘hypersphere collapse’, where representations of all normal instances converge to a constant. COCA proposes to combine the SVDD task with an augmentation contrast-based contrastive task. To construct the positive pairs of the anchor sequence, jittering and scaling transformations are used. Concurrently, COCA employs the SVDD loss by additionally incorporating a variance term to prevent hypersphere collapse. During inference, the anomaly score for a new time series is computed based on the distance between the representations of the original input and its augmented view relative to the center of the hypersphere.

The methods we have seen in this section make use of manually predefined transformations to generate the pairs to learn the contrastive tasks. However, there is a line of research on developing automatic augmentations that are not manually defined. In this line, we find NeuTraL AD [97]. In this work, the authors introduce a contrastive framework that augments the inputs by means of neural network-based transformations, which can be represented by any parameterized function amenable to gradient-based optimization. For each time series in the training dataset, they apply a set of learnable transformations to create multiple augmented views of each input. In this case, they implement the transformations by means of feed-forward neural networks. Then, they employ what they term the ‘Deterministic Contrastive Loss’ to bring together the representations of the augmented views and the original input while simultaneously pushing the embeddings of the augmented views away from each other. Note that the augmentation neural networks are learned during the training phase as well. During the inference phase, the training loss serves as the anomaly score for identifying anomalous time series.
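
A loss of the kind described above can be sketched for a single sample as follows. The sketch assumes cosine similarity with a temperature parameter and treats the embeddings of the original input and of its \(K\) transformed views as given; the exact formulation should be checked against [97]. The names `cosine_sim` and `dcl_score` are hypothetical.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def dcl_score(z_orig: np.ndarray, z_views: np.ndarray, temperature: float = 0.1) -> float:
    """Contrastive score for one sample.

    z_orig: (D,) embedding of the original input.
    z_views: (K, D) embeddings of the K transformed views.
    Each view is pulled towards the original and pushed away from the other views.
    """
    K = len(z_views)
    score = 0.0
    for k in range(K):
        pos = np.exp(cosine_sim(z_views[k], z_orig) / temperature)
        neg = sum(
            np.exp(cosine_sim(z_views[k], z_views[l]) / temperature)
            for l in range(K) if l != k
        )
        score += -np.log(pos / (pos + neg))
    return float(score)

# Views that stay close to the original but differ from each other: low score.
diverse_views = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
low = dcl_score(np.array([1.0, 1.0, 0.0]), diverse_views)

# Identical views that are unrelated to the original: high score.
collapsed_views = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
high = dcl_score(np.array([0.0, 1.0, 0.0]), collapsed_views)
```

Because the same quantity is minimized during training and evaluated at inference, no additional scoring module is required once the transformations and the encoder have been learned.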

5.2 Multi-type approaches↩︎

This section reviews methods that utilize both self-predictive and contrastive proxy tasks for global time series anomaly detection. Concretely, they all aim to tackle the previously outlined DCASE2020 challenge. Within this category, we identify three works that integrate various approaches discussed earlier to more effectively capture the normal patterns within collections of time series.

The works by [111], [112] combine self-supervised classification with an augmentation contrast-based proxy task. Considering transformations like pitch shift, time stretch, and time shifting, two random transformations are selected and applied to input sequences to create two augmented views for each sequence. Subsequently, mel-spectrograms are extracted from these augmented views, forming the basis for jointly learning the tasks. The self-supervised classification task involves predicting the transformation applied to the input for generating each of its two augmented views. The contrastive task pairs the augmented views of the same input as positive, while the rest of the samples in the batch serve as negative samples. After model training, the anomaly score for a new time series is determined by computing the Mahalanobis distance between its representation and the representations of normal training instances.

Lastly, [113] presents another method to tackle the DCASE2020 challenge. This work builds upon previous contributions that leverage machine IDs to extract information from acoustic time series for self-supervised representation learning. In this study, the authors perform self-supervised reconstruction by masking original sequences and reconstructing them to pre-train the model. Following this, they concurrently learn a self-supervised classification and a contrastive task. For the self-supervised classification, each input in the training set is encoded and reconstructed, and a binary classifier is employed to distinguish between the original time series and their corresponding reconstructions. To enhance the model’s discriminative ability with respect to original and reconstructed samples, a contrastive task is introduced, considering samples with the same ID as positive pairs and their reconstructed counterparts as negative samples. During inference, the model’s reconstruction error and the similarity of samples with the same ID are utilized to compute the anomaly score of new sequences.

6 Available Software↩︎

Within this section, we compile the openly accessible software related to the self-supervised time series anomaly detection approaches delineated in the preceding sections. An overview of this software is presented in Table 3. The organization of the table is structured in accordance with the context in which anomaly detection is performed and the type of anomalies they aim to identify. The method’s name contains a hyperlink leading to the URL for access. The summary of the most used datasets for evaluating the performance of self-supervised time series anomaly detectors is presented in Table [tab:datasets] (see Appendix 9).

Table 3: Summary of the publicly available software associated to self-supervised time series anomaly detection methods.
Name Related Research Anomaly Type
Local anomaly detection
AnomalyBERT [73] Point
SMLR [76] Point
TS2Vec [84] Point
LNT [87] Point
DCdetector [80] Point
CoInception [85] Point
USAD [63] Point
MSCRED [64] Subsequence
COUTA [94] Subsequence
ContrastAD [83] Subsequence
DeepAnt [72] Point, Subsequence
WaRTEm-AD [91] Point, Subsequence
Global anomaly detection
OCSTN [104] Sequence
SSDPT [108] Sequence
NeuTraL AD [97] Sequence
AADCL [112] Sequence
- [103] Sequence
COCA [110] Sequence
AMSL [106] Sequence
TF-C [109] Sequence

7 General Discussion and Conclusions↩︎

Time series anomaly detection is an expanding area of research due to its relevance in several application domains. With the emergence of novel machine learning techniques, numerous researchers have incorporated self-supervised learning into their methodologies to overcome the limitations of traditional unsupervised methods and improve the efficacy of anomaly detection algorithms. In this paper, we have explored the current literature of self-supervised methods designed for time series anomaly detection. In addition, we have introduced a taxonomy that categorizes these approaches based on the context in which they solve the anomaly detection task, and the type of pretext tasks considered. This final section offers general remarks on the examined works and presents specific conclusions organized according to the primary axes of the proposed taxonomy. Together with the conclusions, we provide some insights about possible future research directions in the field.

To begin, all the works we have explored are rooted in a one-class classification perspective, in which the training set is assumed to comprise only normal samples. The main advantage of these methods is that they do not rely on human annotations about normal and abnormal samples for model training. However, this is a strong assumption, especially in real problems, as the training set might contain samples or data segments that are anomalous. Thus, as unsupervised models tend to be sensitive to training set contamination due to the presence of unrecognized anomalous patterns, there is room for the design of methods that are robust to the existence of abnormalities in the data used for model training. Consequently, as a potential direction for future research, exploring whether self-supervised approaches demonstrate reduced sensitivity to the contamination of the normal training set would be interesting.

Concerning the inference phase of anomaly detection models, one of the key contributions of the articles discussed in this paper lies in introducing novel methods for computing anomaly scores for new data leveraging self-supervised learning. These scores measure the degree of abnormality in new samples, aligning with the ultimate goal in anomaly detection. Nevertheless, numerous works refrain from explicitly defining a threshold value for identifying anomalies due to the absence of a precise and clear methodology. Hence, there is potential value in exploring new research endeavors in this aspect.

As a final general conclusion, we observe that the proposed methods exhibit diverse and intriguing properties, showcasing a variety of approaches to time series anomaly detection. Nevertheless, as outlined in Section 6, only a limited number of works have shared the source code of their proposed methods for public access. To encourage ongoing progress in this field, it is important to emphasize the significance of releasing the source code of new methods. This practice facilitates other researchers in comprehending these approaches thoroughly and refining their own proposals. Moreover, it would be valuable to create new datasets for evaluating newly proposed approaches. Since many contributions undergo testing on the same datasets, there is a risk of creating an inaccurate perception of progress in the advancement of techniques for time series anomaly detection [115].

We now shift our focus to each axis individually, beginning with the first one. The approaches we have analyzed address the challenge of anomaly detection in two contexts: (i) identifying abnormal points and subsequences at the local level of individual time series, and (ii) detecting anomalous sequences within the global context of a dataset composed of a number of time series. Most of the analyzed methods fall into the former category, where long time series are typically segmented into subsequences to analyze local data dynamics. While numerous works aim to identify abnormal points in the local context, there are a few approaches focused on detecting anomalous subsequences. Both types of approaches aim to capture the temporal dynamics and characteristics of individual time series. Thus, by adjusting the computation of anomaly scores from points to subsequences, many of the methods examined within this framework may effectively identify both abnormal points and subsequences in the local context of time series. It would be interesting to test whether the approaches falling under this category are capable of accurately detecting both point and subsequence outliers by means of the previously mentioned adjustment.

In the realm of local time series anomaly detection, there exists a particular task based on the identification of point and subsequence anomalies in streaming scenarios. Detecting anomalies in real-time or streaming settings involves recognizing abnormal patterns or events in an ongoing data flow. In this context, data points arrive sequentially, emphasizing the need to promptly detect anomalies as they unfold. This is particularly critical in applications where timely identification of unusual behavior is crucial. In the literature, only a limited number of works address the local anomaly detection task in streaming scenarios [61], [66], [84]. In this sense, as a future research direction, the proposal of self-supervised methods that address time series anomaly detection in streaming contexts is crucial for improving the robustness and reliability of time series anomaly detection in real systems.

In the context of global time series anomaly detection, these approaches strive to identify overarching patterns and characteristics representing the normality of data across datasets comprising diverse time series. Even if there are not many, there are global anomaly detection-based approaches that divide the original sequence into subsequences and treat them as complete sequences within the dataset. This strategy, presented by works such as [60], [74], [75], suggests that many global time series anomaly detection methods can be adapted for local anomaly detection by segmenting the original time series into subsequences and treating each subsequence as a distinct time series.

Concerning the second axis of the taxonomy, which pertains to the types of pretext tasks in the self-supervised setting, a notable prevalence of single-type approaches is observed in comparison to multi-type approaches, particularly methods based on self-predictive tasks rather than on contrastive tasks. Let us analyze the fundamental attributes of the proxy tasks inherent in these two categories of self-supervised learning.

In reference to self-predictive learning, the explored approaches usually rely on the use of transformations for constructing proxy tasks. Within this field, we specifically remark the utilization of self-supervised classification and self-supervised reconstruction tasks. Based on the analyzed works, it is notable that self-supervised classification tasks (alone or in combination with self-supervised reconstruction) are predominantly utilized in global anomaly detection to capture general patterns that describe the normality of collections of samples. In contrast, self-supervised reconstruction and forecasting are more commonly employed to capture local patterns and characteristics for identifying abnormal points and subsequences.

In the context of self-supervised classification, we see that the proposed transformations are typically chosen to satisfy the following criteria: i) disrupting the normality of the data to capture anomalous behavior when subjected to transformations and performing the self-supervised classification task, and ii) ensuring that the resulting augmented views are as diverse as possible [97]. Many methods in this field advocate for manually designed transformations to fulfill these two conditions depending on the problems they aim to tackle [73][75], [99], [101][103]. The most popular choices include resizing and scaling, applying affine transformations or altering the frequency of the time series. The performance of these methods heavily relies on their chosen transformation accurately fitting the data, so a transformation effective for one problem may not be suitable for another. Motivated by the proposal of NeuTraL AD [97], there is a line of research based on generating augmented views of samples for contrastive tasks on the basis of neural network-based transformations, which are jointly learned with the remaining modules of the proposed model [86][88]. Inspired by these works, a possible approach could be to employ neural networks to learn automatic transformations that meet the previous two conditions and employ them in self-supervised classification-based time series anomaly detection. This could lead to self-supervised classification-based methods applicable to many different problems without depending on the manual design of effective transformations for each of them.

Turning to self-supervised reconstruction and forecasting, the most widely used transformations are masking [60], [62], [66][69] and jittering [62], [65]. Although these two ways of generating augmented views yield good anomaly detection results, it would be interesting to propose further transformations for self-supervised reconstruction. Moreover, different transformations could be combined to generate augmented views, which could lead to methods that are more robust to perturbations in the data and that achieve better performance.
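To make the two transformations concrete, the sketch below shows one possible way to implement masking and jittering, and to combine them into a single augmented view. The masking ratio, the noise level, the choice of zero as the mask value, and all function names are illustrative assumptions rather than settings reported by the surveyed works.

```python
import numpy as np

def mask(x, ratio=0.2, rng=None):
    """Set a random subset of time steps to zero; return the view and the mask."""
    rng = np.random.default_rng() if rng is None else rng
    n_masked = max(1, int(round(ratio * len(x))))
    idx = rng.choice(len(x), size=n_masked, replace=False)
    view = x.copy()
    view[idx] = 0.0
    masked = np.zeros(len(x), dtype=bool)
    masked[idx] = True
    return view, masked

def jitter(x, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every value."""
    rng = np.random.default_rng() if rng is None else rng
    return x + sigma * rng.standard_normal(x.shape)

def mask_and_jitter(x, ratio=0.2, sigma=0.05, rng=None):
    """Combine both transformations: jitter first, then mask."""
    rng = np.random.default_rng() if rng is None else rng
    return mask(jitter(x, sigma, rng), ratio, rng)

def reconstruction_error(x, x_hat):
    """Pointwise squared error, usable as a local anomaly score."""
    return (x - x_hat) ** 2
```

In a reconstruction task, a model would receive the augmented view and be trained to recover the original `x`; the pointwise error on new data could then serve as a score for abnormal points and subsequences.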

Concerning contrastive tasks, methods based on augmentation contrast are more prevalent than those based on sampling contrast. Augmentation contrast typically involves manually predefined transformations, with the goal of learning representations that are invariant to positive transformations while remaining distinguishable from negative ones [81][83]. In addition, as mentioned before, recent advancements in deep learning have led to the adoption of neural transformations in the contrastive setting [86][88]. Sampling contrast is primarily adopted in methods focused on local anomaly detection, with the most common approach being CPC [78], [79], which shares some similarities with self-supervised forecasting but belongs to the contrastive setting. Additionally, certain works leverage "privileged information" specific to the problem, such as machine IDs in the DCASE2020 problem [113], to construct contrastive pairs for learning these tasks. To compute anomaly scores for new samples, most approaches use measures derived from the latent representations extracted by the model.
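The following sketch illustrates, under simplifying assumptions, how an augmentation-contrast loss and a latent-space anomaly score might look. It uses an InfoNCE-style loss in which the other samples of the batch act as negatives (a SimCLR-like simplification, not the negative transformations used by specific surveyed methods), and it scores new samples by their cosine dissimilarity to the closest training embedding. Both choices, and all names, are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: z_a[i] and z_b[i] embed two views of sample i.

    The remaining rows of z_b act as in-batch negatives for z_a[i].
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def latent_anomaly_score(z_new, z_train):
    """One minus the highest cosine similarity to any training embedding."""
    sims = l2_normalize(z_new) @ l2_normalize(z_train).T
    return 1.0 - sims.max(axis=1)
```

Here `z_a`, `z_b`, `z_new`, and `z_train` stand for embeddings produced by an encoder that is not shown; higher values of `latent_anomaly_score` indicate samples whose representations lie far from those seen during training.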

Finally, only a few works opt for multi-type self-supervised learning in time series anomaly detection. These approaches employ multiple proxy tasks, including both self-predictive and contrastive tasks, during model training. After training, some tasks may still be used, while others are discarded because they serve only as auxiliary objectives for learning better representations. In the literature, the most common combination involves using autoencoders together with other pretext tasks [90][93].

Following the above, multi-type approaches have the potential to harness the benefits of both self-predictive learning, which focuses on acquiring robust representations at the data sample level, and contrastive learning, which excels at capturing patterns related to relationships between different data parts and samples. Therefore, further exploration of methods combining self-predictive and contrastive tasks is merited, as they can extract both high- and low-level features that are crucial for capturing data normality in anomaly detection tasks.
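As a hedged illustration, one simple way to combine a self-predictive and a contrastive task during training is a weighted sum of their losses. The symbols below are assumptions introduced for this example and do not correspond to the notation of any particular surveyed method:

\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rec}} + \lambda \, \mathcal{L}_{\text{con}}, \qquad \lambda \geq 0,
\]

where \(\mathcal{L}_{\text{rec}}\) denotes a self-predictive (for example, reconstruction) loss, \(\mathcal{L}_{\text{con}}\) a contrastive loss, and \(\lambda\) a hyperparameter that balances the two tasks. Setting \(\lambda = 0\) recovers a purely self-predictive objective.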

Acknowledgements

The authors acknowledge the financial support of the Ministerio de Economía, Industria y Competitividad (MINECO) of the Spanish Central Government [PID2019-104933GB-10/AEI/10.13039/501100011033 and PID2022-137442NB-I00] and of the Departamento de Industria of the Basque Government [IT1504-22 and ELKARTEK Programme]. A. S. F. acknowledges the financial support of the Departamento de Educación of the Basque Government under grant PRE_2022_1_0103. J. A. L. acknowledges the financial support of the Basque Government through the BERC 2022-2025 program and of the Ministry of Science and Innovation through the BCAM Severo Ochoa accreditation CEX2021-001142-S / MICIN / AEI / 10.13039/501100011033.

8 Methodology

To conduct this survey and ensure its reproducibility, we have implemented a systematic methodology consisting of four key processes: Database Selection, Literature Search, Selection of Relevant Studies, and Analysis of the Studies. Each of these stages contributes to the comprehensive investigation of the research topic under consideration.

Database Selection. For this survey, we have chosen several scientific research databases to gather relevant publications. The selected databases are: ACM Digital Library, IEEE Xplore Digital Library, ScienceDirect, SpringerLink, DBLP, Google Scholar, and Semantic Scholar.

Literature Search. To find articles that are relevant to the subject of this survey, we employ a literature search strategy using different queries. These queries are created by combining various strings that represent the methodology, task, and type of data considered. Specifically, we generate all possible queries resulting from the concatenation of the following strings:

(Self-Supervised Learning OR Contrastive Learning)
AND
(Anomaly Detection OR Outlier Detection)
AND
"Time Series Data"

Additionally, we also examine the papers referenced by the selected studies and those that cite them, as long as they are relevant to the subject of our survey.
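The query construction described in the Literature Search step can be sketched as follows. This is only an illustration of enumerating every combination of one methodology string, one task string, and one data-type string; quoting each term and joining with AND are assumptions about query syntax, which in practice differs between databases.

```python
from itertools import product

# Search strings taken from the three groups of the literature search.
METHODOLOGY = ["Self-Supervised Learning", "Contrastive Learning"]
TASK = ["Anomaly Detection", "Outlier Detection"]
DATA = ["Time Series Data"]

def build_queries(*groups):
    """Return one AND-joined query per combination of terms across groups."""
    return [
        " AND ".join(f'"{term}"' for term in combo)
        for combo in product(*groups)
    ]

queries = build_queries(METHODOLOGY, TASK, DATA)
for query in queries:
    print(query)
```

With the strings listed above, this enumeration produces four distinct queries.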

Selection of Relevant Studies. In our analysis, we consider contributions that explicitly state the use of self-supervised learning for detecting anomalies in time series data in their titles or abstracts. We include these studies in our survey to examine their findings and approaches.

Analysis of the Studies. Upon identifying and selecting the relevant contributions for analysis, we proceed to extract their primary characteristics in order to construct a comprehensive taxonomy for classifying these studies. In addition to this, we conduct a detailed examination of other properties that characterize each of the studies.

9 Dataset list

Many methods use well-known time series datasets to evaluate models in anomaly detection tasks. Table [tab:datasets] presents the main characteristics of the most commonly used datasets for evaluating self-supervised time series anomaly detection approaches. These properties include the dataset’s name, the associated scientific research, the types of anomalies it contains, and the number of dimensions of the time series. The dataset’s name contains a hyperlink leading to the URL for access.

| Name | Related research | Anomaly type | Dim |
|---|---|---|---|
| Yahoo-TSA | [116] | Point | \(1\) |
| Tennessee Eastman Process | [117] | Point | \(52\) |
| SMAP | [118] | Point, Subsequence | \(25\) |
| SMD | [119] | Point, Subsequence | \(38\) |
| SWaT | [120] | Point, Subsequence | \(51\) |
| MSL | [118] | Point, Subsequence | \(55\) |
| WaDi | [121] | Point, Subsequence | \(123\) |
| MIMII | [122] | Sequence | \(1\) |
| ToyADMOS | [123] | Sequence | \(1\) |
| UPenn and Mayo Clinic’s Seizure | [124] | Sequence | \(16\) |
| UCR Time Series Classification Archive\(^*\) | [125] | Sequence | \(-\) |
| UCR Anomaly Archive | [115] | Point, Subsequence, Sequence | \(-\) |

References

[1]
James Douglas Hamilton.1994. Time series analysis. Princeton university press.
[2]
Jonathan D Cryer.1986. Time series analysis. Vol. 286. Duxbury Press Boston.
[3]
Tak-chung Fu.2011. . Engineering Applications of Artificial Intelligence24, 1(2011), 164–181.
[4]
Philippe Esling and Carlos Agon.2012. . ACM Computing Surveys (CSUR)45, 1(2012), 1–34.
[5]
John O Awoyemi, Adebayo O Adetunmbi, and Samuel A Oluwadare.2017. . In 2017 international conference on computing networking and informatics (ICCNI). IEEE, 1–9.
[6]
Khyati Chaudhary, Jyoti Yadav, and Bhawna Mallick.2012. . International Journal of Computer Applications45, 1(2012), 39–44.
[7]
Mohiuddin Ahmed, Abdun Naser Mahmood, and Jiankun Hu.2016. . Journal of Network and Computer Applications60(2016), 19–31.
[8]
Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C Suh, Ikkyun Kim, and Kuinam J Kim.2019. . Cluster Computing22(2019), 949–961.
[9]
Arijit Ukil, Soma Bandyoapdhyay, Chetanya Puri, and Arpan Pal.2016. . In 2016 IEEE 30th international conference on advanced information networking and applications (AINA). IEEE, 994–997.
[10]
Yuequan Bao, Zhiyi Tang, Hui Li, and Yufeng Zhang.2019. . Structural Health Monitoring18, 2(2019), 401–421.
[11]
Nanda Kumar Thanigaivelan, Ethiopia Nigussie, Rajeev Kumar Kanth, Seppo Virtanen, and Jouni Isoaho.2016. . In 2016 13th IEEE annual consumer communications & networking conference (CCNC). IEEE, 319–320.
[12]
Xin-Xue Lin, Phone Lin, and En-Hau Yeh.2020. . IEEE Network35, 1(2020), 212–218.
[13]
Ander Carreño, Iñaki Inza, and Jose A Lozano.2020. . Artificial Intelligence Review53(2020), 3575–3594.
[14]
Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock.2022. . Proceedings of the VLDB Endowment15, 9(2022), 1779–1797.
[15]
Charu C Aggarwal.2017. An introduction to outlier analysis. Springer.
[16]
Ane Blázquez-Garcı́a, Angel Conde, Usue Mori, and Jose A Lozano.2021. . ACM Computing Surveys (CSUR)54, 3(2021), 1–33.
[17]
Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu C Aggarwal, and Mahsa Salehi.2022. . arXiv preprint arXiv:2211.05244(2022).
[18]
Kamran Shaukat, Talha Mahboob Alam, Suhuai Luo, Shakir Shabbir, Ibrahim A Hameed, Jiaming Li, Syed Konain Abbas, and Umair Javed.2021. . In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 1. Springer, 865–877.
[19]
Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel.2021. . ACM computing surveys (CSUR)54, 2(2021), 1–38.
[20]
Dor Bank, Noam Koenigstein, and Raja Giryes.2020. . arXiv preprint arXiv:2003.05991(2020).
[21]
Gopinath Muruti, Fiza Abdul Rahim, and Zul-Azri bin Ibrahim.2018. . In 2018 IEEE conference on application, information and network security (AINS). IEEE, 81–86.
[22]
Yoshua Bengio, Aaron Courville, and Pascal Vincent.2013. . IEEE Transactions on Pattern Analysis and Machine Intelligence35, 8(2013), 1798–1828.
[23]
Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang.2021. . IEEE Transactions on Knowledge and Data Engineering35, 1(2021), 857–876.
[24]
Varun Chandola, Arindam Banerjee, and Vipin Kumar.2009. . ACM computing surveys (CSUR)41, 3(2009), 1–58.
[25]
Raghavendra Chalapathy and Sanjay Chawla.2019. . arXiv preprint arXiv:1901.03407(2019).
[26]
Manish Gupta, Jing Gao, Charu C Aggarwal, and Jiawei Han.2013. . IEEE Transactions on Knowledge and data Engineering26, 9(2013), 2250–2267.
[27]
Mohammad Braei and Sebastian Wagner.2020. . arXiv preprint arXiv:2004.00433(2020).
[28]
Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon.2021. . IEEE Access9(2021), 120043–120065.
[29]
Kexin Zhang, Qingsong Wen, Chaoli Zhang, Rongyao Cai, Ming Jin, Yong Liu, James Zhang, Yuxuan Liang, Guansong Pang, Dongjin Song, et al.2023. . arXiv preprint arXiv:2306.10125(2023).
[30]
Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, and Jingdong Wang.2023. . arXiv preprint arXiv:2301.11915(2023).
[31]
Jie Gui, Tuo Chen, Jing Zhang, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao.2023. . arXiv preprint arXiv:2301.05712(2023).
[32]
Hadi Hojjati, Thi Kieu Khanh Ho, and Narges Armanfard.2022. . arXiv preprint arXiv:2205.05173(2022).
[33]
Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer.2019. . In Proceedings of the IEEE/CVF international conference on computer vision. 1476–1485.
[34]
Longlong Jing and Yingli Tian.2020. . IEEE transactions on pattern analysis and machine intelligence43, 11(2020), 4037–4058.
[35]
Ishan Misra and Laurens van der Maaten.2020. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6707–6717.
[36]
Yu Zhang and Qiang Yang.2018. . National Science Review5, 1(2018), 30–43.
[37]
Amr Ahmed, Kai Yu, Wei Xu, Yihong Gong, and Eric Xing.2008. . In Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part III 10. Springer, 69–82.
[38]
Carl Doersch and Andrew Zisserman.2017. . In Proceedings of the IEEE international conference on computer vision. 2051–2060.
[39]
Carl Doersch, Abhinav Gupta, and Alexei A Efros.2015. . In Proceedings of the IEEE international conference on computer vision. 1422–1430.
[40]
Rayan Krishnan, Pranav Rajpurkar, and Eric J Topol.2022. . Nature Biomedical Engineering(2022), 1–7.
[41]
Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip S Yu.2022. . IEEE Transactions on Knowledge and Data Engineering35, 6(2022), 5879–5900.
[42]
Shuo Liu, Adria Mallol-Ragolta, Emilia Parada-Cabaleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, and Bjoern W Schuller.2022. . Patterns3, 12(2022), 100616.
[43]
Madeline C Schiappa, Yogesh S Rawat, and Mubarak Shah.2022. . Comput. Surveys(2022).
[44]
Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Jundong Li, and Zi Huang.2023. . IEEE Transactions on Knowledge and Data Engineering(2023).
[45]
Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, Lichao Mou, and Xiao Xiang Zhu.2022. . arXiv preprint arXiv:2206.13188(2022).
[46]
Izhak Golan and Ran El-Yaniv.2018. . Advances in neural information processing systems31(2018).
[47]
Spyros Gidaris, Praveer Singh, and Nikos Komodakis.2018. . arXiv preprint arXiv:1803.07728(2018).
[48]
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros.2016. . In Proceedings of the IEEE conference on computer vision and pattern recognition. 2536–2544.
[49]
Richard Zhang, Phillip Isola, and Alexei A Efros.2016. . In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14. Springer, 649–666.
[50]
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol.2008. . In Proceedings of the 25th international conference on Machine learning. 1096–1103.
[51]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.2019. . In Proceedings of NAACL-HLT, Vol. 1. 2.
[52]
Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon.2020. . Technologies9, 1(2020), 2.
[53]
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah.1993. . Advances in neural information processing systems6(1993).
[54]
Aaron van den Oord, Yazhe Li, and Oriol Vinyals.2018. . arXiv preprint arXiv:1807.03748(2018).
[55]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton.2020. . In International conference on machine learning. PMLR, 1597–1607.
[56]
Florian Schroff, Dmitry Kalenichenko, and James Philbin.2015. . In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
[57]
Tiffany Tianhui Cai, Jonathan Frankle, David J Schwab, and Ari S Morcos.2020. arXiv preprint arXiv:2010.06682(2020).
[58]
Han Yang and Jun Li.2023. Soft Computing(2023), 1–10.
[59]
Hossein Mobahi, Ronan Collobert, and Jason Weston.2009. . In Proceedings of the 26th Annual International Conference on Machine Learning. 737–744.
[60]
Hengyu Meng, Yuanxiang Li, Yuxuan Zhang, and Honghua Zhao.2019. . In 2019 Prognostics and System Health Management Conference (PHM-Qingdao). IEEE, 1–7.
[61]
Junsheng Chen, Jian Li, Weigen Chen, Youyuan Wang, and Tianyan Jiang.2020. . Renewable Energy147(2020), 1469–1480.
[62]
Mayu Sakurada and Takehisa Yairi.2014. . In Proceedings of the MLSDA 2014 2nd workshop on machine learning for sensory data analysis. 4–11.
[63]
Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga.2020. . In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 3395–3404.
[64]
Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V Chawla.2019. . In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 1409–1416.
[65]
Erik Marchi, Fabio Vesperini, Florian Eyben, Stefano Squartini, and Björn Schuller.2015. . In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 1996–2000.
[66]
Guoqian Jiang, Ping Xie, Haibo He, and Jun Yan.2017. . IEEE/Asme transactions on mechatronics23, 1(2017), 89–100.
[67]
Yiwei Fu and Feng Xue.2022. . In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[68]
Minhao Liu, Zhijian Xu, and Qiang Xu.2021. . arXiv preprint arXiv:2112.06247(2021).
[69]
Jinwei Pan, Wendi Ji, Bo Zhong, Pengfei Wang, Xiaoling Wang, and Jin Chen.2022. . IEEE Sensors Journal(2022).
[70]
Gang Li, Zeyu Yang, Honglin Wan, and Min Li.2022. . Electronics11, 23(2022), 3955.
[71]
Lifeng Shen, Zhuocong Li, and James Kwok.2020. . Advances in Neural Information Processing Systems33(2020), 13016–13026.
[72]
Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed.2018. . Ieee Access7(2018), 1991–2005.
[73]
Yungi Jeong, Eunseok Yang, Jung Hyun Ryu, Imseong Park, and Myungjoo Kang.2023. . arXiv preprint arXiv:2305.04468(2023).
[74]
Desen Huang, Lifeng Shen, Zhongzhong Yu, Zhenjing Zheng, Min Huang, and Qianli Ma.2022. . Neurocomputing491(2022), 261–272.
[75]
Duc Hoang Tran, Van Linh Nguyen, Huy Nguyen, and Yeong Min Jang.2022. . Electronics11, 14(2022), 2146.
[76]
Qiucheng Miao, Chuanfu Xu, Jun Zhan, Dong Zhu, and Chengkun Wu.2022. . arXiv preprint arXiv:2208.09240(2022).
[77]
Panpan Qi, Dan Li, and See-Kiong Ng.2022. . In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1232–1244.
[78]
Theivendiram Pranavan, Terence Sim, Arulmurugan Ambikapathi, and Savitha Ramasamy.2022. . arXiv preprint arXiv:2202.03639(2022).
[79]
Minseo Kang and Byunghan Lee.2023. . IEEE Access(2023).
[80]
Yiyuan Yang, Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun.2023. . arXiv preprint arXiv:2306.10347(2023).
[81]
Guillaume Chambaret, Laure Berti-Equille, Frédéric Bouchara, Emmanuel Bruno, Vincent Martin, and Fabien Chaillan.2022. . In International Conference on Pattern Recognition and Artificial Intelligence. Springer, 306–317.
[82]
Chris U Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus.2021. . arXiv preprint arXiv:2107.07702(2021).
[83]
Bin Li and Emmanuel Müller.2023. . In 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[84]
Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu.2022. . In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 8980–8987.
[85]
Anh Duy Nguyen, Trang H Tran, Hieu H Pham, Phi Le Nguyen, and Lam M Nguyen.2023. . arXiv preprint arXiv:2306.06579(2023).
[86]
Katrina Chen, Mingbin Feng, and Tony S Wirjanto.2023. . arXiv preprint arXiv:2304.07898(2023).
[87]
Tim Schneider, Chen Qiu, Marius Kloft, Decky Aspandi Latif, Steffen Staab, Stephan Mandt, and Maja Rudolph.2022. . arXiv preprint arXiv:2202.03944(2022).
[88]
Xu Zheng, Tianchun Wang, Samin Yasar Chowdhury, Ruimin Sun, and Dongsheng Luo.2023. . In 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 363–369.
[89]
Kexin Zhang, Qingsong Wen, Chaoli Zhang, Liang Sun, and Yong Liu.2022. . In NeurIPS 2022 Workshop: Self-Supervised Learning-Theory and Practice.
[90]
Ling Yang and Shenda Hong.2022. . In International Conference on Machine Learning. PMLR, 25038–25054.
[91]
S Abilasha, Sahely Bhadra, P Deepak, and Anish Mathew.2022. . Neurocomputing511(2022), 22–33.
[92]
Hao Zhou, Ke Yu, Xuan Zhang, Guanlin Wu, and Anis Yazidi.2022. . Information Sciences610(2022), 266–280.
[93]
Zhenyu Zhang, Lin Zhao, Dongyang Cai, Shuming Feng, Jiawei Miao, Yirun Guan, Haicheng Tao, and Jie Cao.2022. . In 2022 IEEE International Conference on Knowledge Graph (ICKG). IEEE, 392–397.
[94]
Hongzuo Xu, Yijie Wang, Songlei Jian, Qing Liao, Yongjun Wang, and Guansong Pang.2022. . arXiv preprint arXiv:2207.12201(2022).
[95]
Heejeong Choi and Pilsung Kang.2023. . arXiv preprint arXiv:2303.01034(2023).
[96]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.2019. . arXiv preprint arXiv:1909.11942(2019).
[97]
Chen Qiu, Timo Pfrommer, Marius Kloft, Stephan Mandt, and Maja Rudolph.2021. . In International Conference on Machine Learning. PMLR, 8703–8714.
[98]
Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft.2018. . In International conference on machine learning. PMLR, 4393–4402.
[99]
Ane Blázquez-Garcı́a, Angel Conde, Usue Mori, and Jose A Lozano.2021. . Information Sciences574(2021), 528–541.
[100]
Miseul Kim, Minh Tri Ho, and Hong-Goo Kang.2021. . In 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 586–590.
[101]
Junjie Xu, Yaojia Zheng, Yifan Mao, Ruixuan Wang, and Wei-Shi Zheng.2020. . In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 363–368.
[102]
Jiaxin Zhang, Kyle Saleeby, Thomas Feldhausen, Sirui Bi, Alex Plotkowski, and David Womble.2021. . In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
[103]
Yaojia Zheng, Zhouwu Liu, Rong Mo, Ziyi Chen, Wei-shi Zheng, and Ruixuan Wang.2022. . In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 193–203.
[104]
Toshitaka Hayashi, Dalibor Cimr, Filip Studnička, Hamido Fujita, Damián Bušovsk, and Richard Cimler.2022. . Information Sciences614(2022), 71–86.
[105]
Renjun Wang, Xiushan Xia, and Zhenyi Xu.2022. . In Asian Simulation Conference. Springer, 419–430.
[106]
Yuxin Zhang, Jindong Wang, Yiqiang Chen, Han Yu, and Tao Qin.2022. . IEEE Transactions on Knowledge and Data Engineering(2022).
[107]
Yang Jiao, Kai Yang, Dongjing Song, and Dacheng Tao.2022. . IEEE Transactions on Network Science and Engineering9, 3(2022), 1604–1619.
[108]
Jisheng Bai, Jianfeng Chen, Mou Wang, Muhammad Saad Ayub, and Qingli Yan.2023. . Digital Signal Processing135(2023), 103939.
[109]
Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik.2022. . Advances in Neural Information Processing Systems35(2022), 3988–4003.
[110]
Rui Wang, Chongwei Liu, Xudong Mou, Kai Gao, Xiaohui Guo, Pin Liu, Tianyu Wo, and Xudong Liu.2023. . In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). SIAM, 694–702.
[111]
Jiajia Li.2022. . In Journal of Physics: Conference Series, Vol. 2414. IOP Publishing, 012011.
[112]
Hadi Hojjati and Narges Armanfard.2022. . In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3253–3257.
[113]
Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, and Ian McLoughlin.2023. . In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[114]
Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, and Yohei Kawaguchi.2022. . arXiv preprint arXiv:2206.05876(2022).
[115]
Renjie Wu and Eamonn Keogh.2021. . IEEE Transactions on Knowledge and Data Engineering(2021).
[116]
Nikolay Laptev, Saeed Amizadeh, and Youssef Billawala.2015. A Benchmark Dataset for Time Series Anomaly Detection.
[117]
Cory A. Rieth, Ben D. Amsel, Randy Tran, and Maia B. Cook.2017. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation. ://doi.org/10.7910/DVN/6C3JR1.
[118]
Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom.2018. . In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 387–395.
[119]
Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei.2019. . In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2828–2837.
[120]
Aditya P Mathur and Nils Ole Tippenhauer.2016. . In 2016 international workshop on cyber-physical systems for smart water networks (CySWater). IEEE, 31–36.
[121]
Chuadhry Mujeeb Ahmed, Venkata Reddy Palleti, and Aditya P Mathur.2017. . In Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks. 25–28.
[122]
Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi.2019. . arXiv preprint arXiv:1909.09347(2019).
[123]
Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Noboru Harada, and Keisuke Imoto.2019. . In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 313–317.
[124]
Andriy Temko, Achintya Sarkar, and Gordon Lightbody.2015. . In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 6582–6585.
[125]
Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh.2019. . IEEE/CAA Journal of Automatica Sinica6, 6(2019), 1293–1305.

  1. The methods outlined below are presented within the perspective of deep learning, as the majority of self-supervised learning techniques are rooted in this field. Nevertheless, note that self-supervised learning can find application in broader contexts that do not exclusively depend on deep learning.↩︎