Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy
Abstract
Unsupervised detection of anomaly points in time series is a challenging problem, which requires the model to learn informative representations and derive a distinguishable criterion. Prior methods mainly detect anomalies based on the recurrent network representation of each time point. However, the pointwise representation is less informative for complex temporal patterns and can be dominated by normal patterns, making rare anomalies less distinguishable. We find that in each time series, each time point can also be described by its associations with all time points, presenting as a pointwise distribution that is more expressive for temporal modeling. We further observe that due to the rarity of anomalies, it is harder for anomalies to build strong associations with the whole series, and their associations shall mainly concentrate on the adjacent time points. This observation implies an inherently distinguishable criterion between normal and abnormal points, which we highlight as the Association Discrepancy. Technically, we propose the Anomaly Transformer with an Anomaly-Attention mechanism to compute the association discrepancy. A minimax strategy is devised to amplify the normal-abnormal distinguishability of the association discrepancy. Anomaly Transformer achieves state-of-the-art performance on six unsupervised time series anomaly detection benchmarks for three applications: service monitoring, space & earth exploration, and water treatment.
1 Introduction
Real-world systems always work in a continuous way and generate successive measurements monitored by multiple sensors, as in industrial equipment and space probes. Discovering malfunctions from large-scale system monitoring data can be reduced to detecting the abnormal time points in time series, which is quite meaningful for ensuring security and avoiding financial loss. But anomalies are usually rare and can be hidden by vast numbers of normal points, making data labeling hard and expensive. Thus, we focus on time series anomaly detection under the unsupervised setting.
Unsupervised time series anomaly detection is extremely challenging in practice. The model should not only learn informative representations from complex temporal dynamics through unsupervised tasks, but also derive a distinguishable criterion that can detect the rare anomalies among plenty of normal time points. Various classic anomaly detection methods have provided many unsupervised paradigms, such as the density-estimation methods proposed in local outlier factor (LOF, Breunig et al. (2000)), clustering-based methods presented in one-class SVM (OC-SVM, Schölkopf et al. (2001)) and SVDD (Tax and Duin, 2004). But these classic methods do not consider the temporal information and are difficult to generalize to unseen real-world scenarios. Benefiting from the great representation learning capability of neural networks, recent deep models (Su et al., 2019b; Shen et al., 2020; Li et al., 2021) have made remarkable advances. They mainly focus on learning temporal representations through well-designed recurrent networks, self-supervised by the reconstruction task, in which the most practical anomaly criterion is the reconstruction error per time point based on the learned representations. However, due to the rarity of anomalies, the pointwise representation is less informative for complex temporal patterns and can be dominated by normal time points, making anomalies less distinguishable. Also, the reconstruction error is calculated point by point, which cannot provide a comprehensive description of the temporal context.
From a new perspective, we find that in each time series, each time point can also be represented by its associations with all the time points, presenting as a distribution of association weights along the horizon. The association distribution of each time point can provide a more informative description of the temporal context, indicating dynamic patterns such as the period or trend. This association distribution is referred to as the series-association, which can be discovered from the raw series.
Further, we observe that due to the rarity of anomalies and the dominance of normal patterns, it is harder for anomalies to build strong associations with the whole series. The associations of anomalies shall concentrate on the adjacent time points, which are more likely to contain similar abnormal patterns due to continuity. Such an adjacent-concentration inductive bias is referred to as the prior-association. In contrast, the dominating normal time points can discover informative associations with the whole series, not limited to the adjacent area. Based on this observation, we try to utilize the inherent normal-abnormal distinguishability of the association distribution. This leads to a new anomaly criterion for each time point, quantified by the distance between each time point's prior-association and its series-association, named the Association Discrepancy. As noted above, because the associations of anomalies are more likely to be adjacent-concentrating, anomalies will present a smaller association discrepancy than normal time points.
Taking advantage of the great modeling capability of Transformers (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020), we introduce them to unsupervised time series anomaly detection and propose the Anomaly Transformer for association learning. To compute the Association Discrepancy, we renovate the self-attention mechanism to the Anomaly-Attention, which contains a two-branch structure to model the prior-association and series-association of each time point respectively. The prior-association employs a learnable Gaussian distribution to present the adjacent-concentration inductive bias of each time point, while the series-association corresponds to the self-attention weights learned from raw series. Besides, a minimax strategy is applied between the two branches, which can amplify the normal-abnormal distinguishability of the Association Discrepancy and further derive a new association-based criterion. Anomaly Transformer achieves strong results on six benchmarks, covering three real applications. The contributions are summarized as follows:

Based on the key observation of Association Discrepancy, we propose the Anomaly Transformer with an Anomaly-Attention mechanism, which can model the prior-association and series-association simultaneously to embody the Association Discrepancy.

We propose a minimax strategy to amplify the normal-abnormal distinguishability of the Association Discrepancy and further derive a new association-based detection criterion.

Anomaly Transformer achieves state-of-the-art anomaly detection results on six benchmarks for three real applications. Extensive ablations and insightful case studies are given.
2 Related Work
2.1 Unsupervised Time Series Anomaly Detection
As a vital real-world problem, unsupervised time series anomaly detection has been widely explored. Categorized by anomaly determination criterion, the paradigms roughly include the density-estimation, clustering-based and reconstruction-based methods.
As for the density-estimation methods, the classic methods local outlier factor (LOF, Breunig et al. (2000)) and connectivity outlier factor (COF, Tang et al. (2002)) calculate the local density and local connectivity respectively as the metrics for outlier determination. DAGMM from Zong et al. (2018) integrates the deep Autoencoder (AE) with a Gaussian Mixture Model (GMM), obtaining latent representations from the AE and estimating the density of the representations using the GMM.
In clustering-based methods, SVDD (Tax and Duin, 2004) and Deep SVDD (Ruff et al., 2018) try to gather the representations from normal data into a compact cluster, in which the anomaly score of an instance is formalized as the distance to the cluster center. THOC (Shen et al., 2020) fuses the multi-scale temporal features from intermediate layers by a hierarchical clustering mechanism and determines the anomalies by the weighted sum of distances to the cluster centers of each layer.
The reconstruction-based models attempt to detect the anomalies by the reconstruction error. Park et al. (2018) presented the LSTM-VAE model, which employs the LSTM backbone for temporal modeling and the Variational AutoEncoder (VAE) for reconstruction. OmniAnomaly, proposed by Su et al. (2019b), further extends the LSTM-VAE model with a normalizing flow and uses the reconstruction probabilities for detection. InterFusion from Li et al. (2021) renovates the backbone to a hierarchical VAE to model the inter- and intra-dependency among multiple series simultaneously. GANs (Goodfellow et al., 2014) are also used for reconstruction-based anomaly detection (Schlegl et al., 2019; Li et al., 2019a; Zhou et al., 2019) and act as an adversarial regularization.
This paper is characterized by a new association-based anomaly detection criterion, which is embodied by a co-design of the temporal models for learning more informative time-point associations.
2.2 Transformers for Time Series Analysis
Recently, Transformers (Vaswani et al., 2017) have shown great power in sequential data processing, such as natural language processing (Devlin et al., 2019; Brown et al., 2020), audio processing (Huang et al., 2019) and computer vision (Dosovitskiy et al., 2021; Liu et al., 2021). For time series analysis, benefiting from the advantage of the self-attention mechanism, Transformers are used to discover reliable long-range temporal dependencies (Kitaev et al., 2020; Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021). Especially for time series anomaly detection, GTA proposed by Chen et al. (2021) employs the graph structure to learn the relationship among multiple IoT sensors, as well as the Transformer for temporal modeling and the reconstruction criterion for anomaly detection. Unlike the previous usage of Transformers, Anomaly Transformer renovates the self-attention mechanism to the Anomaly-Attention based on the key observation of association discrepancy and detects the anomalies based on our proposed association-based criterion.
3 Method
Suppose we monitor a system of successive measurements and record equally spaced observations over time. The observed time series $\mathcal{X}$ is denoted by a set of time points $\{x_1, x_2, \cdots, x_N\}$, where $x_t \in \mathbb{R}^{d}$ represents the observation at time $t$. The unsupervised time series anomaly detection problem is to determine whether $x_t$ is anomalous or not without labels.
As noted above, we highlight the key to unsupervised time series anomaly detection as learning informative representations and finding a distinguishable criterion. We propose the Anomaly Transformer to discover more informative associations and tackle this problem by learning the Association Discrepancy, which is inherently normal-abnormal distinguishable. Technically, we propose the Anomaly-Attention to embody the prior-association and series-association, along with a minimax optimization strategy to obtain a more distinguishable association discrepancy. Co-designed with the architecture, we derive an association-based criterion based on the learned association discrepancy.
3.1 Anomaly Transformer
Given the limitation of Transformers (Vaswani et al., 2017) for anomaly detection, we renovate the vanilla architecture to the Anomaly Transformer (Figure 1) with an Anomaly-Attention mechanism.
Overall Architecture
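Anomaly Transformer is characterized by stacking the Anomaly-Attention blocks and feed-forward layers alternately. This stacking structure is conducive to learning underlying associations from deep multi-level features. Suppose the model contains $L$ layers with length-$N$ input time series $\mathcal{X} \in \mathbb{R}^{N \times d}$. The overall equations of the $l$-th layer are formalized as:
$$\begin{aligned} \mathcal{Z}^{l} &= \text{Layer-Norm}\big(\text{Anomaly-Attention}(\mathcal{X}^{l-1}) + \mathcal{X}^{l-1}\big) \\ \mathcal{X}^{l} &= \text{Layer-Norm}\big(\text{Feed-Forward}(\mathcal{Z}^{l}) + \mathcal{Z}^{l}\big) \end{aligned} \tag{1}$$
where $\mathcal{X}^{l} \in \mathbb{R}^{N \times d_{\text{model}}}$, $l \in \{1, \cdots, L\}$, denotes the output of the $l$-th layer with $d_{\text{model}}$ channels. The initial input $\mathcal{X}^{0} = \text{Embedding}(\mathcal{X})$ represents the embedded raw series. $\mathcal{Z}^{l} \in \mathbb{R}^{N \times d_{\text{model}}}$ is the $l$-th layer's hidden representation. $\text{Anomaly-Attention}(\cdot)$ is to compute the association discrepancy.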
Anomaly-Attention
Note that the single-branch self-attention mechanism (Vaswani et al., 2017) cannot model the prior-association and series-association simultaneously. We propose the Anomaly-Attention with a two-branch structure (Figure 1). For the prior-association, we adopt a learnable Gaussian distribution, centered at the corresponding position index. Benefiting from the unimodal property of the Gaussian family, this design inherently pays more attention to the adjacent horizon. We also use a learnable variance parameter for the Gaussian prior, making the prior-associations adapt to various time series patterns, such as different lengths of anomaly segments. The series-association branch learns the associations from raw series, which can find the most effective associations adaptively. Note that these two forms maintain the temporal dependencies of each time point, which are more informative than the pointwise representation. They also reflect the adjacent-concentration prior and the learned real associations respectively, whose discrepancy shall be normal-abnormal distinguishable. The Anomaly-Attention in the $l$-th layer is formalized as:
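$$\begin{aligned} \text{Initialization:}\quad & \mathcal{Q}, \mathcal{K}, \mathcal{V}, \sigma = \mathcal{X}^{l-1} W_{\mathcal{Q}}^{l},\ \mathcal{X}^{l-1} W_{\mathcal{K}}^{l},\ \mathcal{X}^{l-1} W_{\mathcal{V}}^{l},\ \mathcal{X}^{l-1} W_{\sigma}^{l} \\ \text{Prior-Association:}\quad & \mathcal{P}^{l} = \text{Rescale}\left(\left[\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left(-\frac{|j-i|^{2}}{2\sigma_i^{2}}\right)\right]_{i,j \in \{1,\cdots,N\}}\right) \\ \text{Series-Association:}\quad & \mathcal{S}^{l} = \text{Softmax}\left(\frac{\mathcal{Q}\mathcal{K}^{\mathsf{T}}}{\sqrt{d_{\text{model}}}}\right) \\ \text{Reconstruction:}\quad & \hat{\mathcal{Z}}^{l} = \mathcal{S}^{l}\mathcal{V} \end{aligned} \tag{2}$$
where $\mathcal{Q}, \mathcal{K}, \mathcal{V} \in \mathbb{R}^{N \times d_{\text{model}}}$ represent the query, key and value respectively in the self-attention, and $\mathcal{S}^{l} \in \mathbb{R}^{N \times N}$ denotes the series-association. Prior-association $\mathcal{P}^{l} \in \mathbb{R}^{N \times N}$ is generated based on the learned variance parameter $\sigma \in \mathbb{R}^{N \times 1}$, and $\sigma_i$ corresponds to the $i$-th time point. Concretely, for the $i$-th time point, its association weight for the $j$-th point is calculated from the Gaussian distribution with respect to the relative distance $|j-i|$. $\text{Rescale}(\cdot)$ is to transform the association weights to discrete distributions by dividing by the row sum. $W_{\mathcal{Q}}^{l}, W_{\mathcal{K}}^{l}, W_{\mathcal{V}}^{l}, W_{\sigma}^{l}$ represent the linear projectors of the $l$-th layer. $\hat{\mathcal{Z}}^{l} \in \mathbb{R}^{N \times d_{\text{model}}}$ is the hidden representation after the Anomaly-Attention in the $l$-th layer. We use $\text{Anomaly-Attention}(\cdot)$ to summarize Equation 2. See Appendix B for pseudo code.
In the multi-head version that we use, the learned variance is $\sigma \in \mathbb{R}^{N \times h}$ for $h$ heads. $\mathcal{Q}_m, \mathcal{K}_m, \mathcal{V}_m \in \mathbb{R}^{N \times \frac{d_{\text{model}}}{h}}$ denote the query, key and value of the $m$-th head respectively. The block concatenates the outputs $\{\hat{\mathcal{Z}}^{l}_m\}_{1 \le m \le h}$ from multiple heads and gets the final result $\hat{\mathcal{Z}}^{l} \in \mathbb{R}^{N \times d_{\text{model}}}$.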
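As a rough illustration of the two branches described above, the following plain-Python sketch builds a prior-association and a series-association matrix for a single head. It is a minimal sketch under stated assumptions, not the authors' implementation (Appendix B holds their pseudo code): the helper names `prior_association` and `series_association` are hypothetical, `sigmas` stands for the per-point learned parameter used as the width of the Gaussian, and the series branch is assumed to be the standard scaled dot-product softmax of Vaswani et al. (2017).

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def prior_association(sigmas):
    """Row i: Gaussian kernel over the distance |j - i| with width sigmas[i],
    rescaled by the row sum so that each row is a discrete distribution."""
    n = len(sigmas)
    rows = []
    for i, sigma in enumerate(sigmas):
        kernel = [
            math.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma)
            for j in range(n)
        ]
        row_sum = sum(kernel)
        rows.append([k / row_sum for k in kernel])
    return rows

def series_association(queries, keys):
    """Row i: softmax over the scaled dot products of query i with every key."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]

# Toy input: 5 time points, 2 channels (values chosen only for illustration).
sigmas = [0.5, 1.0, 3.0, 1.0, 0.5]
queries = [[0.1, 0.2], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.3, 0.1]]
keys = queries  # hypothetical choice: reuse the same vectors as keys

P = prior_association(sigmas)
S = series_association(queries, keys)
```

Each row of `P` and `S` is a distribution over the 5 time points. A small width, as for the first point, concentrates the prior on the point itself and its neighbours, while a large width, as for the middle point, spreads it over the window.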
Association Discrepancy
We formalize the Association Discrepancy as the symmetrized KL divergence between prior- and series-associations, which represents the information gain between these two distributions (Neal, 2007). We average the association discrepancy from multiple layers to combine the associations from multi-level features into a more informative measure as:
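$$\text{AssDis}(\mathcal{P}, \mathcal{S}; \mathcal{X}) = \left[\frac{1}{L}\sum_{l=1}^{L}\Big(\text{KL}\big(\mathcal{P}^{l}_{i,:} \,\|\, \mathcal{S}^{l}_{i,:}\big) + \text{KL}\big(\mathcal{S}^{l}_{i,:} \,\|\, \mathcal{P}^{l}_{i,:}\big)\Big)\right]_{i=1,\cdots,N} \tag{3}$$
where $\text{AssDis}(\mathcal{P}, \mathcal{S}; \mathcal{X}) \in \mathbb{R}^{N \times 1}$ means the pointwise association discrepancy of $\mathcal{X}$ with respect to prior-association $\mathcal{P}$ and series-association $\mathcal{S}$ from multiple layers. The $i$-th element of the result corresponds to the $i$-th time point of $\mathcal{X}$. From the previous observation, abnormal time points will present a smaller $\text{AssDis}(\mathcal{P}, \mathcal{S}; \mathcal{X})$ than normal time points, which makes $\text{AssDis}$ inherently distinguishable.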
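The layer-averaged, symmetrized KL divergence described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, each association is assumed to be given as a list (over layers) of row-stochastic $N \times N$ matrices, and the small `eps` is an added numerical guard that the text does not mention.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def association_discrepancy(priors, series):
    """priors, series: lists over layers of N x N row-stochastic matrices.
    Returns one value per time point: the symmetrized KL between row i of the
    prior- and series-association, averaged over layers."""
    n_layers = len(priors)
    n_points = len(priors[0])
    result = []
    for i in range(n_points):
        total = 0.0
        for P, S in zip(priors, series):
            total += kl(P[i], S[i]) + kl(S[i], P[i])
        result.append(total / n_layers)
    return result

# Toy example with one layer and three time points.
prior = [[[0.8, 0.15, 0.05], [0.2, 0.6, 0.2], [0.05, 0.15, 0.8]]]
series = [[[0.8, 0.15, 0.05], [1 / 3, 1 / 3, 1 / 3], [0.3, 0.3, 0.4]]]
discrepancy = association_discrepancy(prior, series)
```

In this toy input the first time point has identical prior- and series-association rows, so its discrepancy is zero, which is the adjacent-concentrated pattern the text attributes to anomalies. The other two points attend more broadly than their prior and receive a positive discrepancy.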
3.2 Minimax Association Learning
As an unsupervised task, we employ the reconstruction loss for optimizing our model. The reconstruction loss will guide the series-association to find the most informative associations, such as the adjacent time points of anomalies. To further amplify the difference between normal and abnormal time points, we also use an additional loss to enlarge the association discrepancy. Due to the unimodal property of the prior-association, the discrepancy loss will guide the series-association to pay more attention to the non-adjacent area, which makes the reconstruction of anomalies harder and makes anomalies more identifiable. The loss function for input series $\mathcal{X} \in \mathbb{R}^{N \times d}$ is formalized as:
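$$\mathcal{L}_{\text{Total}}(\hat{\mathcal{X}}, \mathcal{P}, \mathcal{S}, \lambda; \mathcal{X}) = \|\mathcal{X} - \hat{\mathcal{X}}\|_2^2 - \lambda \times \|\text{AssDis}(\mathcal{P}, \mathcal{S}; \mathcal{X})\|_1 \tag{4}$$
where $\hat{\mathcal{X}} \in \mathbb{R}^{N \times d}$ denotes the reconstruction of $\mathcal{X}$, $\|\cdot\|_2$ means the L2-norm and $\|\cdot\|_1$ the L1-norm. $\lambda$ is to trade off these two loss terms. When $\lambda > 0$, the optimization target is to enlarge the association discrepancy. We propose a new minimax strategy to make the association discrepancy more distinguishable.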
Minimax Strategy
Note that directly maximizing the association discrepancy will extremely reduce the variance of the Gaussian prior (Neal, 2007), making the prior-association meaningless. Towards a better control of association learning, we propose a minimax strategy (Figure 2). Concretely, for the minimize phase, we drive the prior-association to approximate the series-association that is learned from raw series. This process will make the prior-association adapt to various temporal patterns. For the maximize phase, we optimize the series-association to enlarge the association discrepancy. This process forces the series-association to pay more attention to the non-adjacent horizon. Thus, integrating the reconstruction loss, the total loss functions of these two phases are:
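$$\begin{aligned} \text{Minimize Phase:}\quad & \mathcal{L}_{\text{Total}}(\hat{\mathcal{X}}, \mathcal{P}, \mathcal{S}_{\text{detach}}, -\lambda; \mathcal{X}) \\ \text{Maximize Phase:}\quad & \mathcal{L}_{\text{Total}}(\hat{\mathcal{X}}, \mathcal{P}_{\text{detach}}, \mathcal{S}, \lambda; \mathcal{X}) \end{aligned} \tag{5}$$
where $\lambda > 0$ and $*_{\text{detach}}$ means to stop the gradient backpropagation of the association (Figure 1). As $\mathcal{P}$ approximates $\mathcal{S}_{\text{detach}}$ in the minimize phase, the maximize phase will impose a stronger constraint on the series-association, forcing the time points to pay more attention to the non-adjacent area. Under the reconstruction loss, this is much harder for anomalies to achieve than normal time points, thereby amplifying the normal-abnormal distinguishability of the association discrepancy.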
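To make the sign flip between the two phases explicit, the expansion below substitutes $-\lambda$ and $\lambda$ into the total loss. It is a worked sketch that assumes the total loss has the form of a squared reconstruction error minus $\lambda$ times the L1-norm of the pointwise association discrepancy; that form is a reading of the loss described in this section, not a quotation of it.

```latex
\begin{aligned}
\text{Minimize phase:}\quad
  & \mathcal{L}_{\text{Total}}(\hat{\mathcal{X}}, \mathcal{P}, \mathcal{S}_{\text{detach}}, -\lambda; \mathcal{X})
    = \|\mathcal{X} - \hat{\mathcal{X}}\|_2^2
    + \lambda \,\|\text{AssDis}(\mathcal{P}, \mathcal{S}_{\text{detach}}; \mathcal{X})\|_1 \\
\text{Maximize phase:}\quad
  & \mathcal{L}_{\text{Total}}(\hat{\mathcal{X}}, \mathcal{P}_{\text{detach}}, \mathcal{S}, \lambda; \mathcal{X})
    = \|\mathcal{X} - \hat{\mathcal{X}}\|_2^2
    - \lambda \,\|\text{AssDis}(\mathcal{P}_{\text{detach}}, \mathcal{S}; \mathcal{X})\|_1
\end{aligned}
```

Under this reading, descending on the first loss shrinks the discrepancy through the prior-association only, since the series-association is detached. Descending on the second loss grows the discrepancy through the series-association only, since the prior-association is detached.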
Association-based Anomaly Criterion
We incorporate the normalized association discrepancy into the reconstruction criterion, which takes the benefits of both temporal representation and the distinguishable association discrepancy. The final anomaly score of $\mathcal{X} \in \mathbb{R}^{N \times d}$ is shown as follows:
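$$\text{AnomalyScore}(\mathcal{X}) = \text{Softmax}\big(-\text{AssDis}(\mathcal{P}, \mathcal{S}; \mathcal{X})\big) \odot \left[\|\mathcal{X}_{i,:} - \hat{\mathcal{X}}_{i,:}\|_2^2\right]_{i=1,\cdots,N} \tag{6}$$
where $\odot$ is the element-wise multiplication and $\text{AnomalyScore}(\mathcal{X}) \in \mathbb{R}^{N \times 1}$ denotes the pointwise anomaly criterion of $\mathcal{X}$. The anomalies need to pay more attention to adjacent time points for a better reconstruction, which will make the association discrepancy decrease and derive a higher anomaly score. Thus, this design can make the reconstruction error and the association discrepancy collaborate to improve detection performance.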
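A small numeric sketch may help to see how the two signals combine. The code below assumes that the normalization of the association discrepancy is a softmax over the negated discrepancy along the time axis and that the reconstruction criterion is the squared error per time point; both are assumptions about the exact form, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def anomaly_score(discrepancy, series, reconstruction):
    """discrepancy: N values; series and reconstruction: N x d nested lists.
    Weights the squared reconstruction error of each time point by the
    softmax of the negated association discrepancy."""
    weights = softmax([-d for d in discrepancy])
    errors = [
        sum((x - y) ** 2 for x, y in zip(row, rec))
        for row, rec in zip(series, reconstruction)
    ]
    return [w * e for w, e in zip(weights, errors)]

# Toy window of 4 univariate points; the third one is poorly reconstructed
# and has a small association discrepancy.
discrepancy = [2.0, 2.0, 0.5, 2.0]
series = [[0.0], [0.1], [3.0], [0.2]]
reconstruction = [[0.0], [0.0], [1.0], [0.1]]
scores = anomaly_score(discrepancy, series, reconstruction)

# With equal reconstruction errors, the smaller discrepancy scores higher.
tied = anomaly_score([2.0, 0.5], [[1.0], [1.0]], [[0.0], [0.0]])
```

The third point receives the highest score because both factors agree. In the `tied` example the reconstruction errors are identical, so the ranking is decided by the association discrepancy alone.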
4 Experiments
We extensively evaluate Anomaly Transformer on six benchmarks for three practical applications.
Datasets
Here is a description of the six experiment datasets: (1) SMD (Server Machine Dataset, Su et al. (2019b)) is a 5-week-long dataset that is collected from a large Internet company with 38 dimensions. (2) PSM (Pooled Server Metrics, Abdulaal et al. (2021)) is collected internally from multiple application server nodes at eBay with 25 dimensions. (3) Both MSL (Mars Science Laboratory rover) and SMAP (Soil Moisture Active Passive satellite) are public datasets from NASA (Su et al., 2019b) with 55 and 25 dimensions respectively, which contain the telemetry anomaly data derived from the Incident Surprise Anomaly (ISA) reports of spacecraft monitoring systems. (4) SWaT (Secure Water Treatment, Mathur and Tippenhauer (2016)) is obtained from 51 sensors of the critical infrastructure system under continuous operations. (5) NeurIPS-TS (NeurIPS 2021 Time Series Benchmark) is a dataset proposed by Lai et al. (2021) and includes five time series anomaly scenarios categorized by behavior-driven taxonomy as point-global, pattern-contextual, pattern-shapelet, pattern-seasonal and pattern-trend. Each dataset includes training, validation and testing subsets. Anomalies are only labeled in the testing subset. The statistical details are summarized in Table 1.
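Table 1: Details of benchmarks.

| Benchmarks | Applications | Dimension | Window | #Training | #Validation | #Test | AR (Truth) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SMD | Server | 38 | 100 | 566,724 | 141,681 | 708,420 | 0.042 |
| PSM | Server | 25 | 100 | 105,984 | 26,497 | 87,841 | 0.278 |
| MSL | Space | 55 | 100 | 46,653 | 11,664 | 73,729 | 0.105 |
| SMAP | Space | 25 | 100 | 108,146 | 27,037 | 427,617 | 0.128 |
| SWaT | Water | 51 | 100 | 396,000 | 99,000 | 449,919 | 0.121 |
| NeurIPS-TS | Various Anomalies | 1 | 100 | 20,000 | 10,000 | 20,000 | 0.018 |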
Implementation details
Following the well-established protocol in Shen et al. (2020), we adopt a non-overlapped sliding window to obtain a set of sub-series. The sliding window has a fixed size of 100 for all datasets, as shown in Table 1. We label the time points as anomalies if their anomaly scores (Equation 6) are larger than a certain threshold $\delta$. The threshold $\delta$ is determined so that a proportion $r$ of the validation data is labeled as anomalies. For the main results, we set $r = 0.5\%$ for SMD, 0.1% for SWaT and 1% for other datasets. We adopt the widely-used adjustment strategy (Xu et al., 2018; Su et al., 2019a; Shen et al., 2020): if a time point in a certain successive abnormal segment is detected, all anomalies in this abnormal segment are viewed as correctly detected. This strategy is justified by the observation that an abnormal time point will cause an alert and further make the whole segment noticed in real-world applications. Experimentally, Anomaly Transformer contains 3 layers. We set the channel number of hidden states $d_{\text{model}}$ as 512 and the number of heads $h$ as 8. The hyperparameter $\lambda$ (Equation 4) is set as 3 for all datasets to trade off the two parts of the loss function. We use the ADAM (Kingma and Ba, 2015) optimizer with an initial learning rate of $10^{-4}$. The training process is early stopped within 10 epochs with a batch size of 32. All the experiments are implemented in PyTorch (Paszke et al., 2019) and conducted on a single NVIDIA TITAN RTX 24GB GPU. We provide the analysis of hyperparameter sensitivity in Appendix A.
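The evaluation protocol above (non-overlapped windows, a threshold chosen from a target anomaly proportion on validation scores, and the segment-level adjustment) can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions, not the authors' evaluation code: all function names are hypothetical, a trailing partial window is assumed to be dropped, and ties among validation scores are not handled specially.

```python
def windows(series, size=100):
    """Split a series into non-overlapping windows; a trailing partial
    window is dropped (an assumption made for this sketch)."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]

def threshold_for_ratio(validation_scores, ratio):
    """Pick a threshold so that about `ratio` of the validation scores are
    strictly larger than it (exact when there are no ties)."""
    ordered = sorted(validation_scores, reverse=True)
    k = min(int(len(ordered) * ratio), len(ordered) - 1)
    return ordered[k]

def point_adjust(pred, truth):
    """If any point of a contiguous true-anomaly segment is detected,
    mark every point of that segment as detected."""
    adjusted = list(pred)
    i, n = 0, len(truth)
    while i < n:
        if truth[i]:
            j = i
            while j < n and truth[j]:
                j += 1
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def precision_recall_f1(pred, truth):
    """Pointwise precision, recall and F1-score for binary labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

chunks = windows(list(range(250)), size=100)
delta = threshold_for_ratio([float(v) for v in range(10)], ratio=0.2)

truth = [0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
adjusted = point_adjust(pred, truth)
precision, recall, f1 = precision_recall_f1(adjusted, truth)
```

In the toy labels, the first abnormal segment is hit once and is therefore counted as fully detected, the second segment is missed entirely, and the isolated false alarm is left untouched. The adjustment raises recall without hiding false positives.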
Baselines
We extensively compare our model with 10 baselines, including the reconstruction-based models: InterFusion (2021), BeatGAN (2019), OmniAnomaly (2019b), LSTM-VAE (2018); the density-estimation models: LOF (2000), DAGMM (2018); the clustering-based methods: Deep-SVDD (2018), THOC (2020); and the classic methods: OC-SVM (2004) and IsolationForest (2008). InterFusion (2021) and THOC (2020) are the state-of-the-art deep models.
4.1 Main Results
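Table 2: Precision (P), recall (R) and F1-score (F1) results (as %) on the five real-world datasets.

| Model | SMD P | SMD R | SMD F1 | MSL P | MSL R | MSL F1 | SMAP P | SMAP R | SMAP F1 | SWaT P | SWaT R | SWaT F1 | PSM P | PSM R | PSM F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OC-SVM | 44.34 | 76.72 | 56.19 | 59.78 | 86.87 | 70.82 | 53.85 | 59.07 | 56.34 | 45.39 | 49.22 | 47.23 | 62.75 | 80.89 | 70.67 |
| IsolationForest | 42.31 | 73.29 | 53.64 | 53.94 | 86.54 | 66.45 | 52.39 | 59.07 | 55.53 | 49.29 | 44.95 | 47.02 | 76.09 | 92.45 | 83.48 |
| LOF | 56.34 | 39.86 | 46.68 | 47.72 | 85.25 | 61.18 | 58.93 | 56.33 | 57.60 | 72.15 | 65.43 | 68.62 | 57.89 | 90.49 | 70.61 |
| Deep-SVDD | 78.54 | 79.67 | 79.10 | 91.92 | 76.63 | 83.58 | 89.93 | 56.02 | 69.04 | 80.42 | 84.45 | 82.39 | 95.41 | 86.49 | 90.73 |
| DAGMM | 67.30 | 49.89 | 57.30 | 89.60 | 63.93 | 74.62 | 86.45 | 56.73 | 68.51 | 89.92 | 57.84 | 70.40 | 93.49 | 70.03 | 80.08 |
| LSTM-VAE | 75.76 | 90.08 | 82.30 | 85.49 | 79.94 | 82.62 | 92.20 | 67.75 | 78.10 | 76.00 | 89.50 | 82.20 | 73.62 | 89.92 | 80.96 |
| BeatGAN | 72.90 | 84.09 | 78.10 | 89.75 | 85.42 | 87.53 | 92.38 | 55.85 | 69.61 | 64.01 | 87.46 | 73.92 | 90.30 | 93.84 | 92.04 |
| OmniAnomaly | 83.68 | 86.82 | 85.22 | 89.02 | 86.37 | 87.67 | 92.49 | 81.99 | 86.92 | 81.42 | 84.30 | 82.83 | 88.39 | 74.46 | 80.83 |
| InterFusion | 87.02 | 85.43 | 86.22 | 81.28 | 92.70 | 86.62 | 89.77 | 88.52 | 89.14 | 80.59 | 85.58 | 83.01 | 83.61 | 83.45 | 83.52 |
| THOC | 79.76 | 90.95 | 84.99 | 88.45 | 90.97 | 89.69 | 92.06 | 89.34 | 90.68 | 83.94 | 86.36 | 85.13 | 88.14 | 90.99 | 89.54 |
| Ours | 89.40 | 95.45 | 92.33 | 92.09 | 95.15 | 93.59 | 94.13 | 99.40 | 96.69 | 91.55 | 96.73 | 94.07 | 96.91 | 98.90 | 97.89 |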
Real-world datasets
We extensively evaluate our model on five real-world datasets with ten competitive baselines. As shown in Table 2, Anomaly Transformer achieves consistent state-of-the-art results on all benchmarks. We observe that deep models generally beat the classic statistical models, benefiting from the powerful non-linear modeling capability of neural networks. Also, deep models that consider the temporal information outperform the general anomaly detection models, such as Deep-SVDD (Ruff et al., 2018) and DAGMM (Zong et al., 2018), which verifies the effectiveness of temporal modeling. Our proposed Anomaly Transformer goes beyond the pointwise representation learned by RNNs and models the more informative associations. The results in Table 2 support the advantage of association learning in time series anomaly detection. In addition, we plot the ROC curve in Figure 3 for a complete comparison. Anomaly Transformer has the highest AUC values for all five datasets, which means that our model is more distinguishable and robust under various pre-selected thresholds. See Appendix C for showcases.
NeurIPS-TS benchmark
This benchmark is generated from well-designed rules proposed by Lai et al. (2021), which completely includes all types of anomalies, covering both the pointwise and pattern-wise anomalies. As shown in Figure 4, Anomaly Transformer can still achieve state-of-the-art performance on various anomalies, which means that our model is robust to all types of anomalies. We provide some showcases in Appendix C.
Ablation study
As shown in Table 3, we further investigate the effect of each part in our model. Our association-based criterion outperforms the widely-used reconstruction criterion consistently. Specifically, the association-based criterion brings a remarkable 18.76% (76.20→94.96) averaged absolute F1-score promotion. Also, directly taking the association discrepancy as the criterion still achieves a good performance (F1-score: 91.55%) and surpasses the previous state-of-the-art model THOC (F1-score: 88.01%, calculated from Table 2). Besides, the learnable prior-association (corresponding to $\sigma$ in Equation 2) and the minimax strategy can further improve our model and get 8.43% (79.05→87.48) and 7.48% (87.48→94.96) averaged absolute promotions respectively. Finally, our proposed Anomaly Transformer surpasses the pure Transformer by an 18.34% (76.62→94.96) absolute improvement. These verify that each module of our design is effective and necessary. More ablations of association discrepancy can be found in Appendix D.
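Table 3: Ablation results (F1-score, as %).

| Architecture | Anomaly Criterion | Prior-Association | Optimization Strategy | SMD | MSL | SMAP | SWaT | PSM | Avg F1 (as %) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer | Recon | - | - | 79.72 | 76.64 | 73.74 | 74.56 | 78.43 | 76.62 |
| Anomaly Transformer | Recon | Learnable | Minimax | 71.35 | 78.61 | 69.12 | 81.53 | 80.40 | 76.20 |
| Anomaly Transformer | AssDis | Learnable | Minimax | 87.57 | 90.50 | 90.98 | 93.21 | 95.47 | 91.55 |
| Anomaly Transformer | Assoc | Fix | Max | 83.95 | 82.17 | 70.65 | 79.46 | 79.04 | 79.05 |
| Anomaly Transformer | Assoc | Learnable | Max | 88.88 | 85.20 | 87.84 | 81.65 | 93.83 | 87.48 |
| Anomaly Transformer (*final) | Assoc | Learnable | Minimax | 92.33 | 93.59 | 96.90 | 94.07 | 97.89 | 94.96 |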
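The averaged figures quoted in the ablation discussion can be re-derived from the reported numbers. The short check below uses only values copied from Table 2 and Table 3; the variable names are illustrative.

```python
# THOC F1-scores per dataset, copied from Table 2.
thoc_f1 = {"SMD": 84.99, "MSL": 89.69, "SMAP": 90.68, "SWaT": 85.13, "PSM": 89.54}
thoc_avg = sum(thoc_f1.values()) / len(thoc_f1)

# Average F1-scores of the ablation variants, copied from Table 3.
avg_f1 = {
    "transformer_recon": 76.62,
    "recon_criterion": 76.20,
    "fixed_prior_max": 79.05,
    "learnable_prior_max": 87.48,
    "final": 94.96,
}

# Absolute improvements discussed in the text.
gain_criterion = avg_f1["final"] - avg_f1["recon_criterion"]
gain_learnable_prior = avg_f1["learnable_prior_max"] - avg_f1["fixed_prior_max"]
gain_minimax = avg_f1["final"] - avg_f1["learnable_prior_max"]
gain_over_transformer = avg_f1["final"] - avg_f1["transformer_recon"]

print(round(thoc_avg, 2), round(gain_criterion, 2), round(gain_learnable_prior, 2),
      round(gain_minimax, 2), round(gain_over_transformer, 2))
```

The rounded values agree with the 88.01, 18.76, 8.43, 7.48 and 18.34 quoted in the ablation paragraph.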
4.2 Model Analysis
To explain how our model works intuitively, we provide the visualization and statistical results for our three key designs: anomaly criterion, learnable prior-association and optimization strategy.
Anomaly criterion visualization
To get more intuitive cases of how the association-based criterion works, we provide some visualization in Figure 5 and further explore the criterion performance under different types of anomalies, where the taxonomy is from Lai et al. (2021). We can find that our proposed association-based criterion is more distinguishable in general. Concretely, the association-based criterion obtains consistently smaller values for the normal part, which is quite contrasting in the point-contextual and pattern-seasonal cases (Figure 5). In contrast, the jitter curves of the reconstruction criterion confuse the detection process and fail in the aforementioned two cases. This visualization verifies that our proposed criterion can highlight the anomalies and provide distinct values for normal and abnormal points, making the detection precise and robust.
Prior-association visualization
We find that the learned $\sigma$ changes to adapt to various data patterns of time series (Figure 6). In particular, the prior-association of anomalies generally has a smaller $\sigma$ than that of normal time points, which matches our adjacent-concentration inductive bias of anomalies.
Optimization strategy analysis
With the reconstruction loss only, the abnormal and normal time points present similar association weights to adjacent time points, corresponding to a contrast value close to 1 (Table 4). Maximizing the association discrepancy will force the series-associations to pay more attention to the non-adjacent area. However, to obtain a better reconstruction, the anomalies have to maintain much larger adjacent association weights than normal time points, corresponding to a larger contrast value. But direct maximization will cause the optimization problem of the Gaussian prior and cannot strongly amplify the difference between normal and abnormal time points as expected (SMD: 1.15→1.27). The minimax strategy optimizes the prior-association to provide a stronger constraint to the series-association. Thus, the minimax strategy obtains more distinguishable contrast values than direct maximization (SMD: 1.27→2.39) and thereby performs better.
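Table 4: Association weights to adjacent time points for abnormal and normal time points under different optimization strategies.

| | SMD Recon | SMD Max | SMD Ours | MSL Recon | MSL Max | MSL Ours | SMAP Recon | SMAP Max | SMAP Ours | SWaT Recon | SWaT Max | SWaT Ours | PSM Recon | PSM Max | PSM Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abnormal (%) | 1.08 | 0.95 | 0.86 | 1.01 | 0.65 | 0.35 | 1.29 | 1.18 | 0.70 | 1.27 | 0.89 | 0.37 | 1.02 | 0.56 | 0.29 |
| Normal (%) | 0.94 | 0.75 | 0.36 | 1.00 | 0.59 | 0.22 | 1.23 | 1.09 | 0.49 | 1.18 | 0.78 | 0.21 | 0.99 | 0.54 | 0.11 |
| Contrast (Abnormal/Normal) | 1.15 | 1.27 | 2.39 | 1.01 | 1.10 | 1.59 | 1.05 | 1.08 | 1.43 | 1.08 | 1.14 | 1.76 | 1.03 | 1.04 | 2.64 |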
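The contrast row of Table 4 appears to be the ratio of the abnormal to the normal adjacent association weight. The check below recomputes it for SMD from the values in the table, treating that ratio as an assumption about how the row was obtained.

```python
# Adjacent association weights (%) on SMD, copied from Table 4.
abnormal = {"Recon": 1.08, "Max": 0.95, "Ours": 0.86}
normal = {"Recon": 0.94, "Max": 0.75, "Ours": 0.36}

# Contrast = abnormal weight divided by normal weight, rounded as in the table.
contrast = {name: round(abnormal[name] / normal[name], 2) for name in abnormal}
print(contrast)
```

The recomputed values match the SMD entries 1.15, 1.27 and 2.39, which are the figures cited for the step from reconstruction only, to direct maximization, to the minimax strategy.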
5 Conclusion
This paper studies the unsupervised time series anomaly detection problem. Unlike previous methods, we try to tackle this problem with more informative association learning. Based on the key observation of association discrepancy, we propose the Anomaly Transformer, including an Anomaly-Attention with the two-branch structure to embody the association discrepancy. A minimax strategy is adopted to further amplify the difference between normal and abnormal time points. By introducing the association discrepancy, we propose the association-based criterion, which makes the reconstruction performance and association discrepancy collaborate. Anomaly Transformer achieves state-of-the-art results on extensive benchmarks. Comprehensive ablations and insightful analyses are included to verify the effectiveness of our design and elaborate on how the model works.
References
 Practical approach to asynchronous multivariate time series anomaly detection and localization. International Conference on Knowledge Discovery & Data Mining. Cited by: §4.
 LOF: identifying densitybased local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Cited by: §1, §2.1, §4.
 Language models are fewshot learners. In Neural Information Processing Systems, Cited by: §1, §2.2.
 Learning graph structures with transformer for multivariate time series anomaly detection in iot. ArXiv abs/2104.03466. Cited by: §2.2.
 BERT: pretraining of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §1, §2.2.
 An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §2.2.
 Generative adversarial nets. In Neural Information Processing Systems, Cited by: §2.1.
 Music transformer. In International Conference on Learning Representations, Cited by: §2.2.
 Adam: A method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.
 Reformer: the efficient transformer. In International Conference on Learning Representations, Cited by: §2.2.
 Revisiting time series outlier detection: definitions and benchmarks. In NeurIPS Dataset and Benchmark Track, Cited by: Figure 5, §4, §4.1, §4.2.
 MADgan: multivariate anomaly detection for time series data with generative adversarial networks. In ICANN, Cited by: §2.1.
 Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Neural Information Processing Systems, Cited by: §2.2.
 Multivariate time series anomaly detection and interpretation using hierarchical intermetric and temporal embedding. International Conference on Knowledge Discovery & Data Mining. Cited by: §1, §2.1, §4.
 Isolation forest. International Conference on Data Mining. Cited by: §4.
 Swin transformer: hierarchical vision transformer using shifted windows. ArXiv abs/2103.14030. Cited by: §2.2.
 SWaT: a water treatment testbed for research and training on ICS security. In International Workshop on Cyberphysical Systems for Smart Water Networks, Cited by: §4.
 Pattern recognition and machine learning. Technometrics. Cited by: §3.1, §3.2.
 A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robotics and Automation Letters. Cited by: §2.1, §4.
 PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems, Cited by: §4.
 Deep one-class classification. In International Conference on Machine Learning, Cited by: §2.1, §4, §4.1.
 f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis. Cited by: §2.1.
 Estimating the support of a high-dimensional distribution. Neural Computation. Cited by: §1.
 Timeseries anomaly detection using temporal hierarchical one-class network. In Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §E.2, Table 7, §1, §2.1, §4, §4.
 Robust anomaly detection for multivariate time series through stochastic recurrent neural network. International Conference on Knowledge Discovery & Data Mining. Cited by: §4.
 Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In International Conference on Knowledge Discovery & Data Mining, A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis (Eds.), Cited by: §1, §2.1, §4, §4.
 Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery & Data Mining, Cited by: §2.1.
 Support vector data description. Machine Learning. Cited by: §1, §2.1, §4.
 Attention is all you need. In Neural Information Processing Systems, Cited by: §1, §2.2, §3.1, §3.1.
 Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. ArXiv abs/2106.13008. Cited by: §2.2.
 Unsupervised anomaly detection via variational autoencoder for seasonal kpis in web applications. Proceedings of the World Wide Web Conference. Cited by: §4.
 BeatGAN: anomalous rhythm detection using adversarially generated time series. In International Joint Conference on Artificial Intelligence, Cited by: §2.1, §4.
 Informer: beyond efficient transformer for long sequence time-series forecasting. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.
 Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, Cited by: §2.1, §4, §4.1.
Appendix A Parameter Sensitivity
Figure 7 shows the model performance under different choices of two hyperparameters: the window size and the loss weight $\lambda$. Our model is robust to the window size across datasets (Figure 7 left). Note that a larger window size implies a larger memory cost and a smaller number of sliding windows. We therefore set the window size to 100 throughout the main text, which balances performance, memory, and computation efficiency. Considering performance alone, the best window size depends on the data pattern; for example, our model performs better with a window size of 50 on the SMD dataset. Besides, we adopt the loss weight $\lambda$ in Equation 5 to trade off the reconstruction loss and the association part. We find that $\lambda$ is robust and easy to tune in the range of 2 to 4, and we set $\lambda$ to 3 for all experiments. These results verify the robustness of our model, which is essential for real-world applications.
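The trade-off between window size, number of windows, and memory can be illustrated with a small sketch. It is not taken from our implementation: the helper names `sliding_windows` and `association_entries` are hypothetical, the default stride equal to the window size is an assumption, and the memory cost is approximated only by the number of entries in one window-by-window association map.

```python
def sliding_windows(series, window_size, stride=None):
    """Split a series into fixed-length windows.

    `stride` defaults to `window_size` (non-overlapping windows); that
    default is an assumption made for this illustration.
    """
    if stride is None:
        stride = window_size
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(series) - window_size
    return [series[s:s + window_size] for s in range(0, last_start + 1, stride)]


def association_entries(window_size):
    """Number of entries in one window_size x window_size association map."""
    return window_size * window_size


series = list(range(1000))  # toy series with 1000 time points
for window_size in (50, 100, 200):
    windows = sliding_windows(series, window_size)
    # Larger windows give fewer windows but quadratically larger maps.
    print(window_size, len(windows), association_entries(window_size))
```

Under these assumptions, doubling the window size halves the number of windows while quadrupling the size of each association map.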
Appendix B Implementation Details
We present the pseudocode of AnomalyAttention in Algorithm 1.
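For readers without access to the algorithm listing, the two associations can be sketched as follows. This is a simplified, single-head illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes the prior-association is a Gaussian kernel on the relative temporal distance with a per-point scale, rescaled so each row sums to one, and that the series-association is standard scaled dot-product attention. Learned projections, multiple heads, and the value path are omitted, and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prior_association(sigmas):
    """Row i: Gaussian kernel over |j - i| with scale sigmas[i], rescaled to sum to 1."""
    n = len(sigmas)
    rows = []
    for i, sigma in enumerate(sigmas):
        kernel = [
            math.exp(-((j - i) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
            for j in range(n)
        ]
        total = sum(kernel)
        rows.append([k / total for k in kernel])
    return rows


def series_association(queries, keys):
    """Row i: softmax over j of (q_i . k_j) / sqrt(d)."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


# Toy window with four time points and two feature dimensions.
queries = keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prior = prior_association([0.5, 0.5, 2.0, 2.0])  # small sigma: weight stays on neighbors
series = series_association(queries, keys)
print(prior[0])
print(series[0])
```

A small scale concentrates the prior-association on adjacent time points, while the series-association is free to attend to any position in the window.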
Appendix C More Showcases
To provide an intuitive comparison for the main results (Table 2), we visualize the criterion of various baselines. Anomaly Transformer presents the most distinguishable criterion (Figure 8). On the real-world datasets, Anomaly Transformer also detects the anomalies correctly. In particular, on the SWaT dataset (Figure 9(d)), our model detects the anomalies at an early stage, which is meaningful for real-world applications such as the early warning of malfunctions.
Appendix D Ablation of Association Discrepancy
D.1 Ablation of Multi-Level Quantification
We average the association discrepancy from multiple layers for the final results (Equation 6). We further investigate the model performance when the discrepancy is taken from a single layer. As shown in Table 5, the multiple-layer design achieves the best F1 score on all five datasets, which verifies the effectiveness of multi-level quantification.
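Table 5: Precision (P), recall (R), and F1 score (in %) when the association discrepancy is taken from a single layer or averaged over multiple layers.

| Layer | SMD P | SMD R | SMD F1 | MSL P | MSL R | MSL F1 | SMAP P | SMAP R | SMAP F1 | SWaT P | SWaT R | SWaT F1 | PSM P | PSM R | PSM F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| layer 1 | 87.15 | 92.87 | 89.92 | 90.36 | 94.11 | 92.19 | 93.65 | 99.03 | 96.26 | 92.61 | 91.92 | 92.27 | 97.20 | 97.50 | 97.35 |
| layer 2 | 87.22 | 95.17 | 91.02 | 90.82 | 92.41 | 91.60 | 93.69 | 98.75 | 96.15 | 92.48 | 92.50 | 92.49 | 96.12 | 98.62 | 97.35 |
| layer 3 | 87.27 | 93.89 | 90.46 | 91.61 | 88.81 | 90.19 | 93.40 | 98.83 | 96.04 | 88.75 | 91.22 | 89.96 | 77.25 | 94.53 | 85.02 |
| Multiple-layer | 89.40 | 95.45 | 92.33 | 92.09 | 95.15 | 93.59 | 94.13 | 99.40 | 96.69 | 91.55 | 96.73 | 94.07 | 96.91 | 98.90 | 97.89 |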
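The single-layer and multiple-layer variants compared in this ablation can be sketched as below. The sketch assumes that the per-point discrepancy at each layer is the symmetrized KL divergence between the prior-association and series-association rows, and that layers are combined by a plain mean; the function names and the toy distributions are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetrized_kl(p, q):
    return kl(p, q) + kl(q, p)


def association_discrepancy(prior_layers, series_layers, layers=None):
    """Average the per-point symmetrized KL over the selected layers.

    prior_layers[l][i] and series_layers[l][i] are the two association
    distributions of time point i at layer l; layers=None uses every layer.
    """
    if layers is None:
        layers = range(len(prior_layers))
    layers = list(layers)
    n_points = len(prior_layers[0])
    return [
        sum(symmetrized_kl(prior_layers[l][i], series_layers[l][i]) for l in layers)
        / len(layers)
        for i in range(n_points)
    ]


# Two layers, two time points, toy association rows.
prior = [[[0.7, 0.3], [0.3, 0.7]], [[0.6, 0.4], [0.4, 0.6]]]
series = [[[0.5, 0.5], [0.3, 0.7]], [[0.9, 0.1], [0.4, 0.6]]]

multi_layer = association_discrepancy(prior, series)            # all layers
layer_1_only = association_discrepancy(prior, series, layers=[0])
layer_2_only = association_discrepancy(prior, series, layers=[1])
print(multi_layer, layer_1_only, layer_2_only)
```

Passing a single index in `layers` corresponds to the "layer k" rows of the table, and the default corresponds to the "Multiple-layer" row.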
D.2 Ablation of Statistical Distance
We select the following widely used statistical distances to calculate the association discrepancy:

Symmetrized Kullback–Leibler Divergence (Ours).

Jensen–Shannon Divergence (JSD).

Wasserstein Distance (Wasserstein).

Cross-Entropy (CE).

L2 Distance (L2).
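
Table 6: Precision (P), recall (R), and F1 score (in %) under different statistical distances for the association discrepancy.

| Distance | SMD P | SMD R | SMD F1 | MSL P | MSL R | MSL F1 | SMAP P | SMAP R | SMAP F1 | SWaT P | SWaT R | SWaT F1 | PSM P | PSM R | PSM F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L2 | 85.26 | 74.80 | 79.69 | 85.58 | 81.30 | 83.39 | 91.25 | 56.77 | 70.00 | 79.90 | 87.45 | 83.51 | 70.24 | 96.34 | 81.24 |
| CE | 88.23 | 81.85 | 84.92 | 90.07 | 86.44 | 88.22 | 92.37 | 64.08 | 75.67 | 62.78 | 81.50 | 70.93 | 70.71 | 94.68 | 80.96 |
| Wasserstein | 78.80 | 71.86 | 75.17 | 60.77 | 36.47 | 45.58 | 90.46 | 57.62 | 70.40 | 92.00 | 71.63 | 80.55 | 68.25 | 92.18 | 78.43 |
| JSD | 85.33 | 90.09 | 87.64 | 91.19 | 92.42 | 91.80 | 94.83 | 95.14 | 94.98 | 83.75 | 96.75 | 89.78 | 95.33 | 98.58 | 96.93 |
| Ours | 89.40 | 95.45 | 92.33 | 92.09 | 95.15 | 93.59 | 94.13 | 99.40 | 96.69 | 91.55 | 96.73 | 94.07 | 96.91 | 98.90 | 97.89 |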
As shown in Table 6, our proposed definition of association discrepancy achieves the best F1 score on all five datasets. Both CE and JSD provide fairly good results; they are close to our definition in principle and can be used to represent the information gain. The L2 distance is not suitable for the discrepancy because it overlooks the properties of discrete distributions. The Wasserstein distance also fails on some datasets. The reason is that the prior-association and series-association are exactly matched in their position indexes, whereas the Wasserstein distance is not calculated point by point and takes the distribution offset into account, which may bring noise to the optimization and detection.
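The difference between point-by-point distances and the Wasserstein distance can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, the direction of the cross-entropy is an assumption, and the Wasserstein distance is computed in its one-dimensional form over position indexes with unit spacing.

```python
import math


def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetrized_kl(p, q):
    return kl(p, q) + kl(q, p)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cross_entropy(p, q, eps=1e-12):
    # Direction (p as target, q as prediction) is an assumption.
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def l2(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def wasserstein(p, q):
    """1-D Wasserstein-1 distance on positions 0..N-1 (sum of absolute CDF gaps)."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


p = [0.9, 0.1, 0.0, 0.0]
near = [0.1, 0.9, 0.0, 0.0]  # mass moved by one position
far = [0.1, 0.0, 0.0, 0.9]   # mass moved by three positions

for name, dist in [("SymKL", symmetrized_kl), ("JSD", jsd), ("CE", cross_entropy),
                   ("L2", l2), ("Wasserstein", wasserstein)]:
    print(name, dist(p, near), dist(p, far))
```

In this toy case the L2 distance barely distinguishes `near` from `far`, since it compares mass index by index, whereas the Wasserstein distance depends strongly on how far the mass is moved along the position axis.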
Appendix E Ablation of Association-based Criterion
E.1 Calculation
We present the pseudocode of the association-based criterion in Algorithm 3.
E.2 Ablation of Criterion Definition
We explore the model performance under different definitions of anomaly criterion, including the pure association discrepancy, pure reconstruction performance and different combination methods for association discrepancy and reconstruction performance: addition and multiplication.
(7) 
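Table 7: Precision (P), recall (R), and F1 score (in %) under different definitions of the anomaly criterion, with the average F1 score over the five datasets.

| Criterion | SMD P | SMD R | SMD F1 | MSL P | MSL R | MSL F1 | SMAP P | SMAP R | SMAP F1 | SWaT P | SWaT R | SWaT F1 | PSM P | PSM R | PSM F1 | Avg F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| THOC | 79.76 | 90.95 | 84.99 | 88.45 | 90.97 | 89.69 | 92.06 | 89.34 | 90.68 | 83.94 | 86.36 | 85.13 | 88.14 | 90.99 | 89.54 | 88.01 |
| Recon | 78.63 | 65.29 | 71.35 | 79.15 | 78.07 | 78.61 | 89.38 | 56.35 | 69.12 | 76.81 | 86.89 | 81.53 | 69.84 | 94.73 | 80.40 | 76.20 |
| AssDis | 86.74 | 88.42 | 87.57 | 91.20 | 89.81 | 90.50 | 91.56 | 90.41 | 90.98 | 97.27 | 89.48 | 93.21 | 97.80 | 93.25 | 95.47 | 91.55 |
| Addition | 77.16 | 70.58 | 73.73 | 88.08 | 87.37 | 87.72 | 91.28 | 55.97 | 69.39 | 84.34 | 81.98 | 83.14 | 97.60 | 97.61 | 97.61 | 82.32 |
| Ours | 89.40 | 95.45 | 92.33 | 92.09 | 95.15 | 93.59 | 94.13 | 99.40 | 96.69 | 91.55 | 96.73 | 94.07 | 96.91 | 98.90 | 97.89 | 94.96 |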
From Table 7, we find that directly using our proposed association discrepancy also achieves good performance, consistently surpassing the competitive baseline THOC (Shen et al., 2020) in F1 score. Besides, the multiplication combination that we use in Equation 6 performs the best, as it brings a better collaboration between the reconstruction performance and the association discrepancy.
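The four criterion definitions compared here can be sketched as follows. This is a hedged illustration rather than our exact implementation: it assumes that the association term is a softmax over the window of the negated association discrepancy, so that points with a small discrepancy receive a large weight, and that the reconstruction term is a per-point squared error. The function names and toy values are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def criteria(recon_error, ass_dis):
    """Per-point anomaly scores under four criterion definitions.

    recon_error[i]: reconstruction error of time point i.
    ass_dis[i]: association discrepancy of time point i (small for anomalies).
    """
    weights = softmax([-d for d in ass_dis])  # assumed form of the association term
    return {
        "Recon": list(recon_error),
        "AssDis": weights,
        "Addition": [w + r for w, r in zip(weights, recon_error)],
        "Multiplication": [w * r for w, r in zip(weights, recon_error)],
    }


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy window: point 2 has a small discrepancy (anomaly-like),
# point 1 is a normal point that happens to be poorly reconstructed.
recon_error = [0.1, 1.5, 1.0, 0.1]
ass_dis = [5.0, 5.0, 1.0, 5.0]

scores = criteria(recon_error, ass_dis)
for name, values in scores.items():
    print(name, argmax(values), values)
```

In this toy window, the reconstruction error alone ranks the poorly reconstructed normal point highest, while the multiplicative combination re-weights the errors by the association term and ranks the low-discrepancy point highest.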