
STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong,
Lingyu Zhang, Renhe Jiang, and Xuan Song
Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong, and Lingyu Zhang are with (1) SUSTech-UTokyo Joint Research Center on Super Smart City, Department of Computer Science and Engineering and (2) Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), Shenzhen, China. E-mail: wanghj2020,11811810@mail.sustech.edu.cn, pant@sustech.edu.cn, zhengdong00@outlook.com, and zhanglingyu@didiglobal.com. Xuan Song is with (1) School of Artificial Intelligence, Jilin University and (2) Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), Shenzhen, China. Email: songxuan@jlu.edu.cn. R. Jiang is with Center for Spatial Information Science, University of Tokyo, Tokyo, Japan. Email: jiangrh@csis.u-tokyo.ac.jp. Corresponding author: Xuan Song.
Abstract

Traffic forecasting is a cornerstone of smart city management, enabling efficient resource allocation and transportation planning. Deep learning, with its ability to capture complex nonlinear patterns in spatiotemporal (ST) data, has emerged as a powerful tool for traffic forecasting. While graph convolutional networks (GCNs) and Transformer-based models have shown promise, their computational demands often hinder their application to real-world road networks, particularly those with large-scale spatiotemporal interactions. To address these challenges, we propose a novel spatiotemporal graph transformer (STGformer) architecture. STGformer effectively balances the strengths of GCNs and Transformers, enabling efficient modeling of both global and local traffic patterns while maintaining a manageable computational footprint. Unlike traditional approaches that require multiple attention layers, the STG attention block captures high-order spatiotemporal interactions in a single layer, significantly reducing computational cost. In particular, STGformer achieves a 100x speedup and a 99.8% reduction in GPU memory usage compared to STAEformer during batch inference on a California road graph with 8,600 sensors. We evaluate STGformer on the LargeST benchmark and demonstrate its superiority over state-of-the-art Transformer-based methods such as PDFormer and STAEformer. These results underline STGformer's potential to revolutionize traffic forecasting by overcoming the computational and memory limitations of existing approaches, making it a promising foundation for future spatiotemporal modeling tasks. Code is available on GitHub.

Index Terms:
Traffic Forecasting, Urban Computing, Long-tailed Distribution

I Introduction

Spatiotemporal graph neural networks [1, 2, 3, 4] have shown exceptional potential and have become a preferred method for making precise traffic predictions, leveraging graph neural networks to capture spatial relationships between sensors and sequential models to learn temporal patterns. Recently, the emergence of Transformer-based architectures [5, 6, 7] has greatly challenged the dominance of GCNs, becoming the state-of-the-art architecture in traffic forecasting. However, recent work [8] has shown that most spatiotemporal models, including Transformers, fail on large-scale graphs, because the traffic datasets currently used for benchmarking are small relative to the complexity and scale of actual traffic networks. For instance, popular benchmark datasets like the PEMS series [9], METR-LA, and PEMS-BAY [1] consist of only a few hundred nodes and edges. In contrast, real-world traffic systems, such as the Caltrans Performance Measurement System in California, USA [10], incorporate nearly 20,000 active sensors. Consequently, because traffic forecasting models are predominantly developed on these limited datasets, they often disregard computational overhead and fail to scale up to larger sensor networks, presenting a significant challenge in the field.

Figure 1: Unlike STAEformer, which relies on stacked 2L layers, STGformer integrates graph and spatiotemporal attention mechanisms, achieving superior performance with only a single layer.
Figure 2: MAE vs. FLOPs on traffic flow prediction on the San Diego dataset [8]. The bubble size indicates the number of model parameters. Standard graph-convolution-based methods such as STGCN achieve high computational efficiency but inferior performance, whereas Transformer-based methods achieve state-of-the-art accuracy at a much larger computational cost than GCN-based methods.

As depicted in Figure 2, we conducted a comparative analysis of leading methods on the San Diego dataset within the LargeST framework [8]. Our evaluation included performance, model parameters, and computational costs. While Transformer-based models, such as STAEformer [5], demonstrated superior performance, they exhibited significantly higher computational demands compared to GCN-based approaches like STGCN [11]. Based on these findings, we posit that the success of GCN and Transformer methods can be attributed to their respective strengths in capturing global spatial interactions and input-adaptive long-range dependencies [12]. Standard graph convolution operations [13] effectively aggregate local neighbor features with high-order interactions [11]. However, their reliance on local information limits their ability to consider global context in traffic scenarios, potentially hindering performance. In contrast, self-attention mechanisms, while disregarding explicit graph structures [5], implicitly capture global spatial interactions through successive matrix multiplications. This, however, introduces substantial computational overhead, making it challenging to scale to large-scale real-world traffic networks.

Previously, the overhead associated with global attention mechanisms remained within an acceptable range due to the use of only small-scale datasets for validation. However, the recent introduction of LargeST has exacerbated this issue. The global attention mechanism typically exhibits quadratic time and space complexity as the number of nodes increases, while the computational graph experiences exponential growth with an increasing number of layers. A potential compromise is to employ advanced techniques to partition interconnected nodes into smaller mini-batches, thereby reducing computational overhead [14, 15, 16, 17]. Nonetheless, this strategy results in longer training times due to the smaller mini-batches. In this paper, we propose a more efficient model, which demonstrates exceptional competitiveness across seven traffic benchmarks, ranging from small-scale PEMS [9] datasets to large-scale datasets like LargeST [8], utilizing only a single layer with linear spatiotemporal global attention. Despite the model’s simplicity, it retains the full expressive capability necessary to capture all interactions among graph convolutions and attention mechanisms. Furthermore, our findings indicate that using fewer parameters can enhance generalization capabilities. Our research demonstrates that STGformer maintains strong performance even when tested on data from LargeST one year later.

In Figure 1, we specifically examine the differences between STGformer and STAEformer. Firstly, STAEformer predominantly relies on stacked 2L layers and utilizes a spatiotemporal separable attention mechanism to achieve higher-order interactions by deepening the model. As previously noted, this approach incurs substantial computational overhead. In contrast, we introduce a more efficient attention module that integrates both graph and spatiotemporal attention mechanisms. Specifically, we conceptualize the temporal and spatial dimensions as a unified entity, employing the same query, key, and value in the attention mechanism to facilitate efficient spatiotemporal attention computation, which markedly reduces computational overhead compared to the separate treatment of these dimensions. STGformer surpasses current state-of-the-art models by leveraging graph information, allowing for efficient computation using only a single layer of the attention module. Furthermore, we adopt linear attention [18, 19, 20], which replaces the softmax operation of the standard attention mechanism with decomposed inner products, thereby reducing the computational complexity from quadratic to linear. This significantly alleviates memory consumption and computational burden when handling large-scale spatiotemporal datasets. The main contributions of this paper are summarized as follows:

  • We propose a novel STG-attention that efficiently captures high-order spatiotemporal interactions for both global and local patterns in a single layer, unlike previous methods requiring multiple stacked layers.

  • STGformer combines the advantages of GCNs and Transformers while maintaining low computational and memory costs, significantly improving efficiency in processing large-scale traffic graphs compared to existing methods.

  • STGformer outperforms state-of-the-art Transformer-based methods on the LargeST benchmark, demonstrating remarkable efficiency by being 100× faster and using 99.8% less GPU memory than STAEformer during batch inference on California's traffic network.

Figure 3: Illustration of the main idea of STGformer. We illustrate typical spatial modeling in graphs, highlighting various levels of interactions and receptive fields. (a) Conventional graph convolution [11, 9] explicitly handles arbitrary-order spatial interactions but within a limited receptive area. (b) Spatiotemporal Transformers [5, 6] perform interactions up to two orders through two consecutive matrix multiplications, covering a broad receptive area. (c) Our STGformer achieves high-order global spatial interactions with structural information by integrating graph convolutional networks with the Transformer architecture.

II Related Work

II-A Traffic Forecasting

Deep neural networks have become the predominant approach for traffic forecasting [21, 22, 23, 24, 25], typically combining graph neural networks (GNNs) with either Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs) to capture complex spatio-temporal patterns in traffic data. RNN-based models, such as DCRNN [21], incorporate diffusion convolution with GRU layers to model spatial and temporal dependencies. Extensions of this approach include ST-MetaNet [26], which employs meta-learning with graph attention networks, and AGCRN [27], which introduces node-specific adaptive parameters in graph convolution. To improve computational efficiency, TCN-based models like STGCN [11] and GWNet [2] have adopted dilated causal convolutions for temporal modeling. These architectures demonstrate faster training times and competitive performance on various benchmarks. Attention mechanisms have been integrated into models such as ASTGCN [9] and STAEformer [5] to better capture long-range dependencies and complex spatio-temporal interactions. These approaches have shown improved performance in handling diverse traffic patterns. Recent research directions include the integration of GNNs with neural ordinary differential equations for continuous modeling of spatio-temporal dependencies [28, 29], and the development of dynamic adjacency matrices to reflect changing relationships over time [30, 31]. These emerging approaches aim to address limitations of previous models and enhance the adaptability and interpretability of traffic forecasting systems.

II-B Graph Transformer

Graph Transformers have emerged as a powerful class of models for learning on graph-structured data, with several surveys reviewing different aspects of these models. The incorporation of graph structure into Transformer architectures has been explored through various graph inductive biases, as discussed by Dwivedi et al. [32] and Rampášek et al. [33], who provided comprehensive overviews of node positional encodings, edge structural encodings, and attention bias. In terms of graph attention mechanisms, Velickovic et al. [34] introduced graph attention networks (GAT) leveraging multi-head attention for node classification, while subsequent works like GATv2 [35] addressed limitations in GAT's expressiveness. The literature has explored various types of graph Transformers, including shallow models like GAT and GTN [36], deep architectures stacking multiple attention layers [37, 38], scalable versions addressing efficiency challenges for large graphs [33, 39], and pre-trained models leveraging self-supervised learning on large graph datasets [40, 41]. Graph Transformers have demonstrated promising results across various domains, including protein structure prediction in bioinformatics [42, 43], entity resolution in data management [44, 45], and anomaly detection in temporal data [46, 47]. Despite their success, recent surveys [48, 33] have highlighted ongoing challenges in scalability, generalization, interpretability, and handling dynamic graphs, indicating that addressing these issues remains an active area of research in the graph learning community. However, Transformers that exploit spatiotemporal graph correlations remain largely unexplored in the field of traffic forecasting.

III Preliminaries

III-A Problem Statement

In this paper, we formalize the representation of a graph as $\mathcal{G}=(\mathcal{V},\mathcal{E},A)$, where $\mathcal{V}$ denotes the set of nodes with cardinality $N=|\mathcal{V}|$, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ defines the set of edges, and $A\in\mathbb{R}^{N\times N}$ represents the adjacency matrix. The dynamic nature of the graph is captured by a time-dependent feature matrix $\mathbf{X}_{t}\in\mathbb{R}^{|\mathcal{V}|\times\mathcal{C}}$ at each discrete time step $t$, where $\mathcal{C}$ denotes the dimensionality of node features (e.g., traffic flow, vehicle speed, and road occupancy). Traffic forecasting can be formally expressed as a function:

$$f:\left[\mathbf{X}_{(t-T):t},\mathcal{G}\right]\mapsto\mathbf{X}_{(t+1):(t+S)},\qquad(1)$$

where $T$ and $S$ represent the input and output sequence lengths, respectively. The formulation encapsulates temporal dependency by considering a historical window of $T$ time steps and predicting future states for $S$ time steps ahead.
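As a minimal illustration of this formulation, the sketch below sets up tensors with hypothetical shapes (batch size $B$, $T=S=12$, one feature channel) and a placeholder forecaster $f$; none of these names come from the released code.

```python
import torch

# Hypothetical sizes: B batches, T input steps, S output steps,
# N sensors, C features per sensor (e.g., flow, speed, occupancy).
B, T, S, N, C = 4, 12, 12, 716, 1

x_hist = torch.randn(B, T, N, C)        # X_{(t-T):t}
adj = torch.rand(N, N)                  # adjacency matrix A of the graph G

def f(x, adj):
    """Placeholder forecaster: repeats the last observed frame S times."""
    last = x[:, -1:]                    # (B, 1, N, C)
    return last.expand(-1, S, -1, -1)   # X_{(t+1):(t+S)}, shape (B, S, N, C)

y_pred = f(x_hist, adj)
print(y_pred.shape)                     # torch.Size([4, 12, 716, 1])
```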

III-B Spatiotemporal Graph Convolution

Graph neural networks (GNNs) aggregate node neighborhood contexts to generate spatial representations, and were among the earliest methods introduced for spatiotemporal graph modeling [3, 11, 21]. Formally, let $*_{\mathcal{G}}$ denote the graph convolution operator [13], which can be reformulated as

$$\Theta*_{\mathcal{G}}h=\Theta(\mathcal{L})h\approx\sum_{k=0}^{K-1}\theta_{k}T_{k}(\tilde{\mathcal{L}})h,\qquad(2)$$

where $\theta=[\theta_{0},\ldots,\theta_{K-1}]\in\mathbb{R}^{K}$ is a vector of polynomial coefficients, $K$ denotes the kernel size of the graph convolution, $T_{k}(\tilde{\mathcal{L}})\in\mathbb{R}^{N\times N}$ represents the $k$-th order Chebyshev polynomial evaluated at the rescaled Laplacian $\tilde{\mathcal{L}}=2\mathcal{L}/\lambda_{\max}-I_{N}$, $\mathcal{L}=D^{-1/2}(D-A)D^{-1/2}$ is the normalized graph Laplacian, $D$ is the degree matrix of the graph, $\lambda_{\max}$ is the largest eigenvalue of $\mathcal{L}$, and $I_{N}$ is the $N\times N$ identity matrix. This approach enables the efficient computation of $K$-localized convolutions by leveraging polynomial approximation, effectively capturing the local structure of the graph within a $K$-hop neighborhood.
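The following is a minimal PyTorch sketch of the $K$-localized Chebyshev convolution in Eq. (2), computing the rescaled Laplacian and the recurrence $T_{k}(\tilde{\mathcal{L}})=2\tilde{\mathcal{L}}T_{k-1}(\tilde{\mathcal{L}})-T_{k-2}(\tilde{\mathcal{L}})$ explicitly; function and variable names are illustrative, not taken from any released implementation.

```python
import torch

def cheb_graph_conv(h, A, theta):
    """K-localized Chebyshev graph convolution (sketch of Eq. (2)).

    h:     (N, C) node features
    A:     (N, N) symmetric adjacency matrix
    theta: (K,)   polynomial coefficients theta_0 ... theta_{K-1}
    """
    N = A.size(0)
    d = A.sum(dim=1)
    d_inv_sqrt = torch.where(d > 0, d.pow(-0.5), torch.zeros_like(d))
    # normalized Laplacian L = D^{-1/2} (D - A) D^{-1/2}
    L = torch.diag(d_inv_sqrt) @ (torch.diag(d) - A) @ torch.diag(d_inv_sqrt)
    lam_max = torch.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - torch.eye(N)        # rescaled Laplacian

    T_prev, T_curr = torch.eye(N), L_tilde            # T_0 and T_1
    out = theta[0] * (T_prev @ h)
    if len(theta) > 1:
        out = out + theta[1] * (T_curr @ h)
    for k in range(2, len(theta)):                    # Chebyshev recurrence
        T_next = 2.0 * L_tilde @ T_curr - T_prev
        out = out + theta[k] * (T_next @ h)
        T_prev, T_curr = T_curr, T_next
    return out

A = (torch.rand(716, 716) < 0.01).float()
A = torch.maximum(A, A.T)                             # keep the adjacency symmetric
print(cheb_graph_conv(torch.randn(716, 32), A, torch.randn(3)).shape)  # torch.Size([716, 32])
```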

III-C Spatiotemporal Self-Attention Layer

Instead of GCNs [11, 2, 9], which aggregate neighboring features using static convolution kernels, Transformers [6, 5] employ multi-head self-attention to dynamically generate weights that mix spatial and temporal signals. Formally, given a hidden spatiotemporal representation $h\in\mathbb{R}^{T\times N\times C}$, where $T$ is the number of time frames, $N=|\mathcal{V}|$ is the number of graph nodes, and $C$ denotes the channel dimension, we can formulate the spatiotemporal self-attention mechanism as follows. For spatial self-attention, the query, key, and value matrices are derived as $Q_{s}=hW_{Q}^{s}$, $K_{s}=hW_{K}^{s}$, and $V_{s}=hW_{V}^{s}$, where $W_{Q}^{s},W_{K}^{s},W_{V}^{s}\in\mathbb{R}^{C\times C}$ are learnable parameter matrices. The spatial self-attention scores are then computed as $A^{s}=\text{Softmax}(Q_{s}K_{s}^{T}/\sqrt{C})$, where $A^{s}\in\mathbb{R}^{T\times N\times N}$ captures spatial dependencies across nodes.
Similarly, for temporal self-attention, we have $Q_{t}=hW_{Q}^{t}$, $K_{t}=hW_{K}^{t}$, and $V_{t}=hW_{V}^{t}$, with $W_{Q}^{t},W_{K}^{t},W_{V}^{t}\in\mathbb{R}^{C\times C}$ being distinct learnable parameters. The temporal self-attention scores are calculated as $A^{t}=\text{Softmax}(Q_{t}^{T}K_{t}/\sqrt{C})$, where $A^{t}\in\mathbb{R}^{N\times T\times T}$ captures temporal dependencies across different time horizons. This formulation allows for the dynamic generation of attention weights that simultaneously consider both spatial and temporal contexts, enabling the model to adapt to varying spatiotemporal patterns in the input data.
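A minimal single-head sketch of this separable spatiotemporal self-attention is shown below: $A^{s}$ attends over the node axis within each frame and $A^{t}$ over the time axis within each node. Shapes follow the notation above; the final combination (a simple sum) and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def factorized_st_attention(h, Wq_s, Wk_s, Wv_s, Wq_t, Wk_t, Wv_t):
    """h: (T, N, C) hidden spatiotemporal representation; single head."""
    T, N, C = h.shape

    # Spatial attention: mixes the N nodes within each time frame.
    Qs, Ks, Vs = h @ Wq_s, h @ Wk_s, h @ Wv_s                      # (T, N, C)
    A_s = F.softmax(Qs @ Ks.transpose(-1, -2) / C ** 0.5, dim=-1)  # (T, N, N)
    h_s = A_s @ Vs

    # Temporal attention: mixes the T frames within each node.
    ht = h.transpose(0, 1)                                         # (N, T, C)
    Qt, Kt, Vt = ht @ Wq_t, ht @ Wk_t, ht @ Wv_t
    A_t = F.softmax(Qt @ Kt.transpose(-1, -2) / C ** 0.5, dim=-1)  # (N, T, T)
    h_t = (A_t @ Vt).transpose(0, 1)                               # (T, N, C)

    return h_s + h_t                                               # illustrative combination

C = 32
h = torch.randn(12, 716, C)
weights = [torch.randn(C, C) / C ** 0.5 for _ in range(6)]
print(factorized_st_attention(h, *weights).shape)                  # torch.Size([12, 716, 32])
```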

III-D Analysis of GCN and Transformer Flops

In analyzing the computational complexity of spatiotemporal graph modeling techniques, we observe distinct characteristics between graph convolution and self-attention mechanisms. The spatiotemporal graph convolution, utilizing Chebyshev polynomials, exhibits a computational complexity of $\mathcal{O}(K|\mathcal{E}|C)$, where $K$ represents the kernel size, $|\mathcal{E}|$ the number of edges, and $C$ the number of channels. This complexity arises primarily from the polynomial approximation of the graph Laplacian. In contrast, the spatiotemporal self-attention layer demonstrates a more intricate computational profile, with a complexity of $\mathcal{O}(TN^{2}C+NT^{2}C)$, where $T$ denotes the number of frames. This increased complexity stems from the dynamic weight generation in multi-head self-attention, encompassing operations such as query-key interactions, softmax computations, and attention-weighted aggregations across both spatial and temporal dimensions. The self-attention approach, while more computationally intensive, offers enhanced flexibility in capturing complex spatiotemporal dependencies, particularly when dealing with lengthy sequences or high-dimensional feature spaces. The choice between these methods thus presents a trade-off between computational efficiency and model expressiveness, contingent upon the specific requirements of the spatiotemporal modeling task at hand.

Figure 4: The overall architecture of STGformer. STGformer differs from the traditional Transformer in that we replace the attention mechanism with a spatiotemporal graph attention module, while simplifying the interaction process to require only a single layer rather than stacking multiple layers. Specifically, node features are first propagated through the graph propagation module for $k$ iterations. Then, the information from each propagation step is fed into the spatiotemporal attention module, where each module shares the same query, key, and values. Finally, the attention outputs are sequentially processed from lower to higher order for interaction.

IV Methodology

IV-A Overview

The overall architecture of our proposed STGformer is illustrated in Figure 4. The key feature of our model is its efficiency, as it achieves joint spatiotemporal graph attention using only a single attention module. Our model is divided into two branches: the graph propagation module and the attention module, with specific details shown in Figure 3. First, the spatiotemporal data undergoes graph propagation and is then fed into the attention module separately. Subsequently, a 1x1 convolution is applied to interact with the outputs of different-order attentions, which are finally aggregated.

IV-B Data Embedding Layer

To transform the input data into a high-dimensional representation, we adopt a data embedding layer consistent with STAEformer. Specifically, the raw input $\mathbf{X}$ is first projected into $X_{data}\in\mathbb{R}^{T\times N\times d}$ through a fully connected layer, where $d$ is the embedding dimension. Recognizing the inherent periodicity of urban traffic flow influenced by human commuting patterns and lifestyles, such as rush hours, we introduce two embeddings to capture weekly and daily cycles, denoted as $t_{w(t)},t_{d(t)}\in\mathbb{R}^{d}$, respectively. Here, $w(t)$ and $d(t)$ are functions that map time $t$ to the corresponding day-of-week index (1 to 7) and minute-of-day index (1 to 288, with a 5-minute interval). The temporal cycle embeddings $X_{w},X_{d}\in\mathbb{R}^{T\times d}$ are obtained by concatenating the embeddings of all $T$ time steps. Following [5], we also incorporate a spatiotemporal positional encoding $X_{ste}\in\mathbb{R}^{N\times T\times d}$ to introduce positional information into the input sequence. Finally, the output of the data embedding layer is obtained by simply concatenating the aforementioned embedding vectors: $X_{emb}=X_{data}\,||\,X_{w}\,||\,X_{d}\,||\,X_{ste}$.
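A minimal sketch of such an embedding layer is given below, with an illustrative module layout (names such as `day_of_week_emb` and the uniform initialization of the positional encoding are our own assumptions); the output concatenates the projected data, the two periodic embeddings, and a learnable spatiotemporal positional encoding along the channel dimension.

```python
import torch
import torch.nn as nn

class DataEmbedding(nn.Module):
    """Sketch of the embedding layer: X_emb = X_data || X_w || X_d || X_ste."""

    def __init__(self, num_nodes, in_dim=1, d=24, steps_per_day=288, in_steps=12):
        super().__init__()
        self.proj = nn.Linear(in_dim, d)                        # X_data projection
        self.day_of_week_emb = nn.Embedding(7, d)               # t_{w(t)}
        self.time_of_day_emb = nn.Embedding(steps_per_day, d)   # t_{d(t)}
        self.st_pos_emb = nn.Parameter(                         # X_ste
            torch.empty(in_steps, num_nodes, d).uniform_(-0.02, 0.02))

    def forward(self, x, dow_idx, tod_idx):
        # x: (B, T, N, in_dim); dow_idx, tod_idx: (B, T) integer indices
        B, T, N, _ = x.shape
        x_data = self.proj(x)                                                   # (B, T, N, d)
        x_w = self.day_of_week_emb(dow_idx).unsqueeze(2).expand(-1, -1, N, -1)  # (B, T, N, d)
        x_d = self.time_of_day_emb(tod_idx).unsqueeze(2).expand(-1, -1, N, -1)  # (B, T, N, d)
        x_ste = self.st_pos_emb.unsqueeze(0).expand(B, -1, -1, -1)              # (B, T, N, d)
        return torch.cat([x_data, x_w, x_d, x_ste], dim=-1)                     # (B, T, N, 4d)

emb = DataEmbedding(num_nodes=716)
x = torch.randn(4, 12, 716, 1)
dow = torch.randint(0, 7, (4, 12))
tod = torch.randint(0, 288, (4, 12))
print(emb(x, dow, tod).shape)            # torch.Size([4, 12, 716, 96])
```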

IV-C Spatiotemporal Graph Transformer

As previously elucidated, GCNs excel at modeling local, high-order interactions, whereas Transformers are adept at capturing global but limited-order interactions. Although Transformer-based methodologies have achieved state-of-the-art performance in traffic forecasting, their quadratic computational complexity with respect to graph size significantly impedes their application, particularly on real-world road networks characterized by high sensor density [10, 8]. In this study, we propose a more efficient and effective approach to spatiotemporal interaction modeling by synergistically combining graph propagation and Transformer architectures. We adopt a simplified variant of the GCN formulation presented in Equation 2, focusing exclusively on graph propagation. For the traffic signals, denoted as $X_{emb}$, we omit the feature parameter $W$, resulting in the following formulation:

$$\left[X_{0}\;|\;X_{1}\;|\;X_{2}\;|\;\dots\;|\;X_{k}\right]=\texttt{GraphPropagation}(X_{emb}),$$

where $X_{0}=X$ and $X_{k}=\mathcal{L}_{k}X$. The GraphPropagation operation is analogous to the simplification introduced by the simplified graph convolution (SGC) [49], which streamlines graph convolutional networks by eliminating nonlinearities and collapsing weight matrices across consecutive layers. However, in contrast to SGC, our approach retains the different orders of $X_{k}$ to facilitate further interactions.
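Under the assumption that $\mathcal{L}_{k}$ denotes $k$ applications of a fixed, precomputed propagation matrix, the parameter-free GraphPropagation step can be sketched as follows; unlike SGC, every intermediate order $X_{0},\dots,X_{k}$ is returned.

```python
import torch

def graph_propagation(x_emb, L, k):
    """Parameter-free k-step propagation that keeps every order (sketch).

    x_emb: (B, T, N, C) embedded traffic signals
    L:     (N, N) precomputed propagation matrix (e.g., the normalized Laplacian)
    Returns [X_0, X_1, ..., X_k], each of shape (B, T, N, C).
    """
    outs = [x_emb]
    x = x_emb
    for _ in range(k):
        # apply the propagation matrix along the node dimension of every frame
        x = torch.einsum('nm,btmc->btnc', L, x)
        outs.append(x)
    return outs

xs = graph_propagation(torch.randn(4, 12, 716, 32), torch.rand(716, 716), k=3)
print(len(xs), xs[-1].shape)             # 4 torch.Size([4, 12, 716, 32])
```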

We then propose a recursive attention module to introduce higher-order spatiotemporal interactions, further enhancing the model’s capability. The recursive attention module first takes graph-propagated information as input, then recursively applies gated convolutions:

$$p_{n+1}=a_{n}(q_{n})\odot g_{n}(p_{n}),\qquad(3)$$

where $n=0,1,\ldots,k-1$, with $a_{n}$ representing the spatiotemporal attention module and $g_{n}$ used for dimensional matching:

$$g_{n}=\begin{cases}\text{Identity},&n=0\\ \text{Linear}\left(C_{n-1},C_{n}\right),&1\leq n\leq k-1\end{cases}\qquad(4)$$
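The recursion of Eqs. (3)-(4) can be sketched as below. We assume, for illustration, that the gate inputs $q_{n}$ are the graph-propagated features $X_{n+1}$ from the previous step and that all orders share the same channel width, so the `Linear` maps only match dimensions; these are assumptions, not details from the released code.

```python
import torch
import torch.nn as nn

class RecursiveGatedInteraction(nn.Module):
    """Sketch of Eqs. (3)-(4): p_{n+1} = a_n(q_n) ⊙ g_n(p_n)."""

    def __init__(self, dim, k, attn):
        super().__init__()
        self.k = k
        self.attn = attn                      # shared spatiotemporal attention a_n
        # g_0 is the identity; g_n (n >= 1) are Linear maps for dimension matching
        # (all orders share the same width here, an illustrative simplification)
        self.g = nn.ModuleList(
            [nn.Identity()] + [nn.Linear(dim, dim) for _ in range(k - 1)])

    def forward(self, xs):
        # xs: [X_0, ..., X_k] from GraphPropagation, each of shape (B, T, N, C)
        p = xs[0]
        for n in range(self.k):
            q = xs[n + 1]                     # assumed gate input: next propagation order
            p = self.attn(q) * self.g[n](p)   # gated, order-(n+1) interaction
        return p

# Usage with an identity "attention" just to check shapes.
block = RecursiveGatedInteraction(dim=32, k=3, attn=nn.Identity())
xs = [torch.randn(4, 12, 716, 32) for _ in range(4)]
print(block(xs).shape)                        # torch.Size([4, 12, 716, 32])
```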

As mentioned earlier in Sec. III-C, traditional spatiotemporal attention mechanisms mainly capture spatiotemporal patterns through a separable approach and achieve higher-order interactions by stacking layers. In contrast, as shown in Figure 4, we treat space and time as a unified entity, employing a single projection to generate the query, key, and value vectors, and using a simple transposition to compute the attention mechanism: $Q=hw_{Q}$, $K=hw_{K}$, $V=hw_{V}$, where $w_{Q},w_{K},w_{V}\in\mathbb{R}^{C\times C}$ are trainable parameters. Subsequently, the spatial and temporal self-attention scores are defined as:

$$A^{s}=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{C}}\right),\quad A^{t}=\text{Softmax}\left(\frac{Q^{T}K}{\sqrt{C}}\right),\qquad(5)$$

where $A^{s}\in\mathbb{R}^{T\times N\times N}$ captures spatial relations across different nodes, and $A^{t}\in\mathbb{R}^{N\times T\times T}$ captures temporal relations within individual nodes. However, despite our integration of spatiotemporal attention, which partially reduces the computational overhead, the inherent quadratic computational complexity still results in considerable computational costs. Therefore, we next introduce a linear attention mechanism to further alleviate this issue.
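A sketch of this unified attention with a single shared projection is shown below, computing $A^{s}$ over the node axis and $A^{t}$ over the time axis from the same $Q$, $K$, $V$ via transposition (the softmax form of Eq. (5)); combining the two branches by summation and all variable names are illustrative choices.

```python
import torch
import torch.nn.functional as F

def unified_st_attention(h, w_q, w_k, w_v):
    """h: (T, N, C); one shared projection serves both spatial and temporal attention."""
    T, N, C = h.shape
    Q, K, V = h @ w_q, h @ w_k, h @ w_v                            # (T, N, C)

    # Spatial scores A^s: (T, N, N), mixing nodes within each frame.
    A_s = F.softmax(Q @ K.transpose(-1, -2) / C ** 0.5, dim=-1)
    out_s = A_s @ V

    # Temporal scores A^t: (N, T, T), from the same Q, K, V after transposition.
    Qt, Kt, Vt = (x.transpose(0, 1) for x in (Q, K, V))            # (N, T, C)
    A_t = F.softmax(Qt @ Kt.transpose(-1, -2) / C ** 0.5, dim=-1)
    out_t = (A_t @ Vt).transpose(0, 1)                             # (T, N, C)

    return out_s + out_t                                           # illustrative combination

C = 32
h = torch.randn(12, 716, C)
w = [torch.randn(C, C) / C ** 0.5 for _ in range(3)]
print(unified_st_attention(h, *w).shape)                           # torch.Size([12, 716, 32])
```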

Spatiotemporal Linearized Attention. Recent work [8] has pointed out that most existing graph neural network (GCN) and Transformer-based methods fail to adapt to real road graphs because of the high computational cost of capturing global and local traffic patterns. We consider the major issue to be that this cost grows dramatically with real-world graph size. To address this challenge, we adopt an efficient attention mechanism [18, 19, 20] to alleviate the significant resource demands of traditional dot-product attention, whose quadratic memory and computational complexities make it impractical for real-world traffic graphs. Efficient attention maintains mathematical equivalence to dot-product attention but achieves substantial improvements in speed and memory efficiency. This is done by interpreting the keys differently: instead of viewing them as $N$ node feature vectors in $\mathbb{R}^{d}$, they are treated as $d$ single-channel feature maps. Each of these maps acts as a weighting over all positions, aggregating value features through weighted summation to form a global context vector. The core equation for efficient attention is:

$$\bm{E}(Q,K,V)=w_{q}(X)\left(w_{k}(X)^{\mathsf{T}}w_{v}(X)\right),$$

where $w_{q}$ and $w_{k}$ are normalization functions. For scaling normalization, both $w_{q}$ and $w_{k}$ divide the matrices by $\sqrt{n}$. The equivalence between dot-product and efficient attention under scaling normalization can be verified directly:

$$\bm{E}(Q,K,V)=\frac{Q}{\sqrt{n}}\left(\frac{K^{\mathsf{T}}}{\sqrt{n}}V\right)=\frac{1}{n}(QK^{\mathsf{T}})V.\qquad(6)$$

Therefore, we substitute Eq. (5) into Eq. (6) and get

$$A^{s}=\frac{1}{n}(QK^{\mathsf{T}})V,\quad A^{t}=\frac{1}{n}(Q^{\mathsf{T}}K)V.\qquad(7)$$
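The linearized form of Eq. (7) can be sketched as follows: the key-value product is computed first, so no $N\times N$ or $T\times T$ attention matrix is ever materialized. Interpreting $n$ as $N$ for the spatial branch and $T$ for the temporal branch is our assumption about the notation.

```python
import torch

def linear_st_attention(Q, K, V):
    """Q, K, V: (T, N, C). Linearized spatial and temporal attention (sketch of Eq. (7))."""
    T, N, C = Q.shape

    # Spatial branch: K^T V is only (T, C, C), so the cost is linear in N.
    ctx_s = K.transpose(-1, -2) @ V                        # (T, C, C)
    out_s = (Q @ ctx_s) / N                                # (T, N, C)

    # Temporal branch: transpose so the attention runs over the T axis.
    Qt, Kt, Vt = (x.transpose(0, 1) for x in (Q, K, V))    # (N, T, C)
    ctx_t = Kt.transpose(-1, -2) @ Vt                      # (N, C, C)
    out_t = (Qt @ ctx_t).transpose(0, 1) / T               # (T, N, C)

    return out_s + out_t

Q, K, V = (torch.randn(12, 716, 32) for _ in range(3))
print(linear_st_attention(Q, K, V).shape)                  # torch.Size([12, 716, 32])
```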

IV-D Computational Cost Analysis.

From Eq. (5), it is evident that the computational cost of the spatiotemporal softmax attention mechanism scales with $\mathcal{O}\left(TN^{2}+NT^{2}\right)$. The memory requirements similarly increase, as the complete attention matrix must be stored to compute the gradients for the queries, keys, and values. In contrast, the linear attention we adopt in Eq. (7) has both time and memory complexities of $\mathcal{O}(N+T)$.

We divide the computation of our model into three parts and calculate the FLOPs for each part.

  • Graph Propagation. Graph propagation, employing Chebyshev polynomials, has a computational complexity of $\mathcal{O}(K|\mathcal{E}|C)$, where $K$ denotes the order, $|\mathcal{E}|$ represents the number of edges, and $C$ signifies the number of channels. This complexity is predominantly attributed to the polynomial approximation of the graph Laplacian.

  • Spatiotemporal Linear Attention. As previously mentioned, the time complexity of spatiotemporal linear attention is $\mathcal{O}(N+T)$, where $N$ is the number of spatial nodes and $T$ is the temporal length. Since the process needs to be performed $K$ times, the overall computational complexity becomes $\mathcal{O}(K(N+T))$.

  • Recursive Interaction. We consider the FLOPs of the element-wise multiplication with the 1×1 convolution. Therefore, the computational cost is $KNTC^{2}$.

Therefore, the total FLOPs with spatiotemporal attention are:

$$\operatorname{FLOPs}(\text{STGformer})=KC(|\mathcal{E}|+N+T+NTC).$$

Because STAEformer performs spatial and temporal self-attention in $L$ stacked layers each, we can easily calculate its FLOPs as:

$$\operatorname{FLOPs}(\text{STAEformer})=L\left(TN^{2}C+NT^{2}C\right).$$

For instance, assuming an input length of 12, the California graph with 8,600 nodes, a hidden dimension of 32, an interaction order $K$ of 3, $|\mathcal{E}|=201{,}363$, and $L=3$, the ratio of FLOPs between STGformer and STAEformer can be calculated as follows:

$$\frac{\operatorname{FLOPs}(\text{STGformer})}{\operatorname{FLOPs}(\text{STAEformer})}\approx 0.00131,$$

i.e., STGformer reduces the computational burden by 99.869% compared to STAEformer.

TABLE I: Performance comparisons are presented, with the best-performing baseline results highlighted in bold. "Param" denotes the number of learnable parameters, where K represents thousands ($10^{3}$) and M represents millions ($10^{6}$).
Data Method Param Horizon 3 Horizon 6 Horizon 12 Average
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
San Diego (N=716) HA 33.61 50.97 20.77% 57.80 84.92 37.73% 101.74 140.14 76.84% 60.79 87.40 41.88%
LSTM 98K 19.03 30.53 11.81% 25.84 40.87 16.44% 37.63 59.07 25.45% 26.44 41.73 17.20%
DCRNN 373K 17.14 27.47 11.12% 20.99 33.29 13.95% 26.99 42.86 18.67% 21.03 33.37 14.13%
AGCRN 761K 15.71 27.85 11.48% 18.06 31.51 13.06% 21.86 39.44 16.52% 18.09 32.01 13.28%
STGCN 508K 17.45 29.99 12.42% 19.55 33.69 13.68% 23.21 41.23 16.32% 19.67 34.14 13.86%
GWNET 311K 15.24 25.13 9.86% 17.74 29.51 11.70% 21.56 36.82 15.13% 17.74 29.62 11.88%
ASTGCN 2.2M 19.56 31.33 12.18% 24.13 37.95 15.38% 30.96 49.17 21.98% 23.70 37.63 15.65%
STGODE 729K 16.75 28.04 11.00% 19.71 33.56 13.16% 23.67 42.12 16.58% 19.55 33.57 13.22%
DSTAGNN 3.9M 18.13 28.96 11.38% 21.71 34.44 13.93% 27.51 43.95 19.34% 21.82 34.68 14.40%
DGCRN 243K 15.34 25.35 10.01% 18.05 30.06 11.90% 22.06 37.51 15.27% 18.02 30.09 12.07%
D2STGNN 406K 14.92 24.95 9.56% 17.52 29.24 11.36% 22.62 37.14 14.86% 17.85 29.51 11.54%
STID 258K 15.08 25.20 9.88% 17.79 30.15 11.97% 21.68 38.59 15.15% 17.82 30.98 11.96%
STAEformer 1.7M 15.37 25.66 10.15% 18.03 30.46 12.11% 22.21 37.79 15.49% 18.01 30.38 12.03%
STGformer 256K 14.97 24.96 9.41% 17.44 29.26 11.12% 20.94 35.93 14.08% 17.36 29.52 11.22%
Bay Area (N=2,352) HA 32.57 48.42 22.78% 53.79 77.08 43.01% 92.64 126.22 92.85% 56.44 79.82 48.87%
LSTM 98K 20.38 33.34 15.47% 27.56 43.57 23.52% 39.03 60.59 37.48% 27.96 44.21 24.48%
DCRNN 373K 18.71 30.36 14.72% 23.06 36.16 20.45% 29.85 46.06 29.93% 23.13 36.35 20.84%
AGCRN 777K 18.31 30.24 14.27% 21.27 34.72 16.89% 24.85 40.18 20.80% 21.01 34.25 16.90%
STGCN 1.3M 21.05 34.51 16.42% 23.63 38.92 18.35% 26.87 44.45 21.92% 23.42 38.57 18.46%
GWNET 344K 17.85 29.12 13.92% 21.11 33.69 17.79% 25.58 40.19 23.48% 20.91 33.41 17.66%
ASTGCN 22.3M 21.46 33.86 17.24% 26.96 41.38 24.22% 34.29 52.44 32.53% 26.47 40.99 23.65%
STTN 218K 18.25 29.64 14.05% 21.06 33.87 17.03% 25.29 40.58 21.20% 20.97 33.78 16.84%
STGODE 788K 18.84 30.51 15.43% 22.04 35.61 18.42% 26.22 42.90 22.83% 21.79 35.37 18.26%
DSTAGNN 26.9M 19.73 31.39 15.42% 24.21 37.70 20.99% 30.12 46.40 28.16% 23.82 37.29 20.16%
DGCRN 374K 18.02 29.49 14.13% 21.08 34.03 16.94% 25.25 40.63 21.15% 20.91 33.83 16.88%
STID 711K 17.25 29.18 13.42% 20.31 34.20 16.13% 24.29 41.29 20.16% 20.14 34.39 16.07%
D2STGNN 446K 17.54 28.94 12.12% 20.92 33.92 14.89% 25.48 40.99 19.83% 20.71 33.65 15.04%
STAEformer 3.3M 17.55 29.25 13.00% 20.55 33.87 15.45% 24.75 41.00 19.75% 20.39 34.21 15.55%
STGformer 491K 17.13 28.63 12.72% 20.11 33.20 15.12% 24.22 40.16 19.35% 19.98 33.50 15.22%
Los Angeles (N=3,834) HA 33.66 50.91 19.16% 56.88 83.54 34.85% 98.45 137.52 71.14% 59.58 86.19 38.76%
LSTM 98K 20.02 32.41 11.36% 27.73 44.05 16.49% 39.55 61.65 25.68% 28.05 44.38 17.23%
DCRNN 373K 18.41 29.23 10.94% 23.16 36.15 14.14% 30.26 46.85 19.68% 23.17 36.19 14.40%
AGCRN 792K 17.27 29.70 10.78% 20.38 34.82 12.70% 24.59 42.59 16.03% 20.25 34.84 12.87%
STGCN 2.1M 19.86 34.10 12.40% 22.75 38.91 14.11% 26.70 45.78 17.00% 22.64 38.81 14.17%
GWNET 374K 17.28 27.68 10.18% 21.31 33.70 13.02% 26.99 42.51 17.64% 21.20 33.58 13.18%
ASTGCN 59.1M 21.89 34.17 13.29% 29.54 45.01 19.36% 39.02 58.81 29.23% 28.99 44.33 19.62%
STGODE 841K 18.10 30.02 11.18% 21.71 36.46 13.64% 26.45 45.09 17.60% 21.49 36.14 13.72%
DSTAGNN 66.3M 19.49 31.08 11.50% 24.27 38.43 15.24% 30.92 48.52 20.45% 24.13 38.15 15.07%
STID 901K 16.43 27.40 9.89% 19.77 33.43 12.26% 24.23 42.02 15.88% 19.66 33.99 12.31%
STAEformer 4.7M 16.72 27.50 9.77% 20.10 33.05 11.89% 24.69 41.42 15.47% 19.97 33.53 12.01%
STGformer 705K 16.39 26.95 9.58% 19.70 32.39 11.66% 24.19 40.59 15.16% 19.58 32.88 11.78%

V Experiment

Datasets. We experimented with LargeST [8], which aggregates traffic readings from 5-minute intervals into 15-minute windows, aiming to predict future 12-step outcomes based on historical 12-step data [50]. LargeST comprises three California sub-datasets constructed from three representative regions within the state. The first is Los Angeles, encompassing 3,834 sensors installed across five counties in the Los Angeles region: Los Angeles, Orange, Riverside, San Bernardino, and Ventura. The second sub-dataset, the Bay Area, includes 2,352 sensors located in 11 counties: Alameda, Contra Costa, Marin, Napa, San Benito, San Francisco, San Mateo, Santa Clara, Santa Cruz, Solano, and Sonoma. The smallest sub-dataset, San Diego, contains 716 sensors. To further verify the performance of our method, we also conducted experiments on the widely used PEMS-series benchmarks, i.e., PEMS03, PEMS04, PEMS07, and PEMS08 [9]. The PEMS-series datasets, representing four major districts in California, are aggregated into 5-minute intervals, resulting in 12 data points per hour and 288 data points per day.

Implementation Details. Our experiments ran on a GPU server with eight GeForce RTX 3090 graphics cards, employing the PyTorch 2.0.3 framework. Raw data were standardized using z-score normalization [51]. Training was stopped early if the validation error stabilized for 15-20 epochs, or after at most 200 epochs, preserving the best model based on validation data [52]. We maintained fidelity to each original paper's model parameters and settings, while also conducting multiple parameter-tuning iterations to improve experimental outcomes. Data were partitioned chronologically into training, validation, and test sets at a 6:2:2 ratio across all sub-datasets. In our experiments, we assess model performance using the masked Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) as metrics, wherein zero values (indicating noisy data) are disregarded.
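A minimal sketch of such masked metrics is shown below, treating entries equal to a null value (here zero, our assumption) as missing and excluding them from the averages.

```python
import torch

def masked_metrics(pred, target, null_val=0.0):
    """Masked MAE, RMSE, and MAPE; entries equal to `null_val` are ignored."""
    mask = (target != null_val).float()
    mask = mask / mask.mean()                      # re-weight so the masked mean is unbiased
    safe_target = torch.where(target != null_val, target, torch.ones_like(target))

    err = pred - target
    mae = (err.abs() * mask).mean()
    rmse = ((err ** 2) * mask).mean().sqrt()
    mape = ((err / safe_target).abs() * mask).mean()
    return mae.item(), rmse.item(), mape.item()

pred, target = torch.rand(4, 12, 716, 1) * 100, torch.rand(4, 12, 716, 1) * 100
print(masked_metrics(pred, target))
```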

TABLE II: Performance comparisons on cross-year scenarios, with the best-performing baseline results highlighted in bold.
Data Method Horizon 3 Horizon 6 Horizon 12 Average
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
San Diego 2019 → 2020 AGCRN 21.36 34.17 26.36% 27.37 42.02 32.45% 35.14 52.60 42.91% 27.18 41.90 32.95%
ASTGCN 20.44 32.75 17.91% 31.33 51.44 28.05% 40.38 65.34 41.34% 30.08 48.98 27.10%
D2STGNN 19.52 30.84 20.86% 25.32 38.85 27.68% 33.19 48.55 34.87% 25.33 38.40 27.81%
DGCRN 18.17 27.89 18.20% 24.43 37.42 27.77% 39.34 58.15 35.98% 25.63 38.88 27.50%
DSTAGNN 20.89 33.13 18.06% 30.26 46.25 28.82% 41.15 60.90 41.99% 29.65 45.17 28.17%
GWNET 18.16 29.01 17.38% 24.52 38.55 27.75% 32.57 47.92 37.63% 24.58 37.61 27.89%
STGCN 28.48 41.99 36.03% 33.44 48.43 39.78% 38.55 56.53 40.97% 32.90 48.11 38.78%
STGODE 20.29 31.67 21.60% 27.14 41.31 29.82% 33.52 49.51 38.64% 26.29 39.69 29.28%
STID 18.38 29.12 17.90% 25.00 38.13 27.75% 32.65 48.33 38.08% 24.52 38.29 26.88%
STAEformer 21.33 31.55 29.07% 26.66 39.20 33.41% 32.98 47.53 38.65% 26.22 38.29 32.74%
STGformer 17.91 27.09 17.20% 24.19 37.09 27.64% 31.94 47.26 37.53% 23.92 37.30 27.54%
Bay Area 2019 → 2020 AGCRN 23.46 35.28 19.93% 31.44 46.63 25.30% 39.87 57.83 32.48% 30.71 45.33 25.12%
ASTGCN 22.54 33.86 18.48% 33.43 49.05 26.90% 42.38 61.57 34.91% 32.18 47.41 26.23%
D2STGNN 21.62 31.95 17.43% 27.42 41.46 23.53% 35.19 51.78 28.44% 27.43 40.83 22.84%
DGCRN 20.27 30.00 16.77% 26.53 40.03 23.62% 41.34 55.38 29.98% 27.73 40.81 23.55%
DSTAGNN 22.99 34.24 18.63% 32.36 48.86 24.67% 43.15 57.13 35.49% 31.75 46.60 26.22%
GWNET 20.26 30.12 16.95% 26.62 41.16 23.60% 34.57 52.15 31.63% 26.58 40.04 23.89%
STGCN 30.36 43.13 26.55% 35.47 50.06 29.31% 40.40 56.84 32.35% 34.80 49.18 29.07%
STGODE 22.39 32.78 18.60% 29.24 43.92 25.67% 35.52 53.74 32.64% 28.29 42.12 25.30%
STID 19.40 29.69 14.86% 27.09 40.15 22.17% 36.08 52.10 30.51% 26.55 39.30 21.68%
STAEformer 23.43 32.66 23.07% 28.76 41.81 29.26% 34.98 51.76 32.65% 28.22 40.72 27.74%
STGformer 19.01 28.20 14.20% 26.29 39.70 23.49% 33.94 51.49 31.53% 25.92 39.73 22.54%
Los Angeles 2019 → 2020 AGCRN 22.85 36.59 22.69% 31.66 48.33 30.95% 41.21 61.71 40.04% 31.13 47.80 30.38%
ASTGCN 21.93 35.17 19.24% 33.65 50.75 29.55% 43.72 65.45 42.47% 32.60 49.88 29.49%
D2STGNN 21.01 33.26 21.19% 27.64 43.16 28.18% 36.53 55.66 35.97% 27.85 43.30 28.31%
DGCRN 19.66 31.31 18.53% 26.75 41.73 28.27% 42.68 59.26 37.08% 28.15 43.28 28.00%
DSTAGNN 22.38 35.55 18.39% 32.58 49.56 29.32% 44.49 61.01 43.09% 32.17 47.97 28.67%
GWNET 19.20 29.69 15.20% 26.94 40.31 22.14% 38.34 56.90 32.52% 27.20 40.85 22.51%
STGCN 30.42 46.30 34.14% 36.51 54.42 38.00% 42.62 63.34 42.18% 35.85 53.74 37.58%
STGODE 21.78 34.09 21.93% 29.46 45.62 30.32% 36.86 57.62 39.74% 28.71 45.05 29.78%
STID 20.32 31.52 17.23% 29.07 43.62 27.97% 39.67 58.29 40.73% 28.71 43.10 27.69%
STAEformer 22.82 33.97 29.40% 28.98 43.51 33.91% 36.32 55.64 39.75% 28.64 43.65 33.24%
STGformer 19.40 29.52 17.53% 26.51 41.40 28.14% 35.28 55.37 38.63% 26.34 41.76 27.04%
TABLE III: Performance on PEMS03, PEMS04, PEMS07, and PEMS08 benchmarks w/o using incident information.
Dataset Metric HA STGCN DCRNN GWNet AGCRN GTS STID PDFormer STAEformer STGformer
PEMS03 Average MAE 32.62 15.83 15.54 14.59 15.24 15.41 15.33 14.94 15.35 14.47
RMSE 49.89 27.51 27.18 25.24 26.65 26.15 27.40 25.39 27.55 25.08
MAPE 30.60% 16.13% 15.62% 15.52% 15.89% 15.39% 16.40% 15.82% 15.18% 14.41%
PEMS04 Average MAE 42.35 19.57 19.63 18.53 19.38 20.96 18.38 18.36 18.22 17.89
RMSE 61.66 31.38 31.26 29.92 31.25 32.95 29.95 30.03 30.18 29.79
MAPE 29.92% 13.44% 13.59% 12.89% 13.40% 14.66% 12.04% 12.00% 11.98% 11.83%
PEMS07 Average MAE 49.03 21.74 21.16 20.47 20.57 22.15 19.61 19.97 19.14 19.04
RMSE 71.18 35.27 34.14 33.47 34.40 35.10 32.79 32.95 32.60 32.50
MAPE 22.75% 9.24% 9.02% 8.61% 8.74% 9.38% 8.30% 8.55% 8.01% 7.89%
PEMS08 Average MAE 36.66 16.08 15.22 14.40 15.32 16.49 14.21 13.58 13.46 13.41
RMSE 50.45 25.39 24.17 23.39 24.41 26.08 23.28 23.41 23.25 23.21
MAPE 21.63% 10.60% 10.21% 9.21% 10.03% 10.54% 9.27% 9.05% 8.88% 8.77%
Figure 5: Ablation study on San Diego and Bay Area.

Baselines. In this study, we conduct a comprehensive evaluation of traffic forecasting methodologies, encompassing a diverse array of baselines with publicly available implementations. These baselines span traditional approaches, contemporary deep learning techniques, and state-of-the-art models, providing a thorough representation of the field’s progression.

  • HA (Historical Average): Conceptualizes traffic flows as periodic processes, utilizing weighted averages from antecedent periods for future predictions.

  • LSTM [53]: Long Short-Term Memory, a type of recurrent neural network architecture that is particularly effective at learning and remembering long-term dependencies in sequential data.

  • DCRNN [21]: Diffusion Convolutional Recurrent Neural Network, which models traffic flow as a diffusion process, innovatively replacing the fully connected layer in Gated Recurrent Units (GRU) [54] with a diffusion convolutional layer.

  • Graph WaveNet [2]: Strategically stacks Gated Temporal Convolutional Networks (TCN) and Graph Convolutional Networks (GCN) to concurrently capture spatial and temporal dependencies.

  • ASTGCN [9]: Attention-based Spatial-Temporal Graph Convolutional Network, which synergistically combines spatial-temporal attention mechanisms to simultaneously capture dynamic spatial-temporal characteristics of traffic data.

  • MTGNN [55]: a graph neural network for multivariate time-series forecasting, which extends Graph WaveNet through the integration of mix-hop propagation layers in the spatial module, dilated inception layers in the temporal module, and a more sophisticated graph learning layer.

  • DGCRN [30]: Dynamic Graph Convolutional Recurrent Network, which uses hyper-networks to model dynamic spatial relationships and employs an efficient training strategy for improved traffic prediction performance.

  • GTS [56]: Graph Structure Learning for Time Series, a method that simultaneously learns the graph structure and performs forecasting for multiple time series.

  • STGCN [11]: Spatial-Temporal Graph Convolutional Networks, which utilizes graph convolutions to model spatial dependencies and 1D convolutions for temporal modeling in traffic forecasting.

  • STTN [7]: Spatial-Temporal Transformer Network, which applies the Transformer architecture to capture both spatial and temporal dependencies in traffic data.

  • STGODE [28]: Spatial-Temporal Graph Ordinary Differential Equation, which uses neural ordinary differential equations to model continuous changes in traffic signals over time and space.

  • DSTAGNN [57]: Dynamic Spatial-Temporal Attention Graph Neural Network, which incorporates dynamic graph learning and attention mechanisms to capture evolving spatial-temporal dependencies in traffic networks.

  • D2STGNN [31]: Decoupled Dynamic Spatial-Temporal Graph Neural Network, which separates the modeling of spatial and temporal dependencies while addressing dynamic correlations among sensors in traffic networks.

  • STID [58]: Spatial-Temporal Identity, a model that combines spatial and temporal embeddings to capture the unique characteristics of each node and time step in traffic forecasting.

  • PDFormer [6]: Propagation Delay-aware Dynamic Long-Range Transformer for traffic flow prediction. It innovatively captures dynamic spatial dependencies, models both short- and long-range spatial information, and explicitly accounts for the time delay in traffic condition propagation between locations.

  • STAEformer [5]: Spatio-Temporal Adaptive Embedding transformer, enhancing vanilla transformers with adaptive embeddings to capture complex spatio-temporal traffic patterns effectively.

Figure 6: Model prediction analysis over one day on the San Diego and Los Angeles datasets: (a) Node 24 in San Diego, (b) Node 64 in San Diego, (c) Node 93 in Los Angeles, (d) Node 208 in Los Angeles.

V-A Performance Comparisons

Performance on LargeST. We further evaluated the performance of STGformer on the LargeST datasets. Experimental results in Table I demonstrate that STGformer consistently outperforms STAEformer across all evaluated datasets, exhibiting significant performance improvements. Specifically, in the San Diego dataset, STGformer achieved the most remarkable advancements in average metrics, with improvements of 3.61%, 2.83%, and 6.73% in MAE, RMSE, and MAPE, respectively. While the improvement margins were relatively smaller for the Bay Area and Los Angeles datasets, STGformer maintained a consistent advantage, with average metric improvements ranging from 1.92% to 2.12%. Notably, STGformer not only excels in prediction accuracy but also demonstrates superior model efficiency. Taking the Los Angeles dataset as an example, STGformer achieved performance superior to STAEformer, which has 4.7M parameters, while utilizing only 705K parameters, highlighting its significant advantage in parameter efficiency. These results collectively indicate that STGformer can effectively enhance the accuracy of spatiotemporal sequence forecasting while maintaining a lower computational complexity, providing a valuable new direction for research in this field.
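For reference, the error metrics reported in Table I are assumed to follow their conventional definitions, where $y_i$ denotes the ground-truth flow, $\hat{y}_i$ the prediction, and $n$ the number of evaluated node-time pairs:
\[
\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad
\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i-\hat{y}_i\bigr)^{2}},\qquad
\mathrm{MAPE}=\frac{100\%}{n}\sum_{i=1}^{n}\frac{\lvert y_i-\hat{y}_i\rvert}{\lvert y_i\rvert}.
\]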

Performance on Cross-Year Scenario. Table II presents a comparative analysis of STGformer against other spatiotemporal baselines in cross-year scenarios covering three major urban regions: San Diego, the Bay Area, and Los Angeles. To evaluate generalization ability, models were trained on 2019 data and tested on 2020 data. On the San Diego dataset, STGformer achieved a significant 14.14% reduction in RMSE for 3-hour predictions, from 31.55 to 27.09. Similarly, on the Bay Area dataset, the RMSE for the same forecast horizon decreased from 32.66 to 28.20, a substantial 13.66% improvement. The Los Angeles dataset exhibited the same trend, with a notable 13.10% reduction in 3-hour RMSE, from 33.97 to 29.52. These results demonstrate that the STG attention block, as opposed to separately stacked attention mechanisms, offers greater robustness and adaptability when handling diverse urban traffic patterns and cross-year data shifts, providing strong support for more accurate urban traffic flow forecasting.
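The relative reductions quoted above are simple percentage decreases of the error; for example, the 14.14% figure for San Diego at the 3-hour horizon follows from
\[
\frac{31.55-27.09}{31.55}\times 100\% \approx 14.14\%.
\]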

Performance on PEMS-series Benchmark. We validated the effectiveness of STGformer on the conventional PEMS-series datasets. As demonstrated in Table III, the proposed STGformer model consistently outperforms STAEformer across all four PEMS datasets, showcasing its superior performance in traffic flow forecasting tasks. Notably, STGformer achieves the most significant improvement on the PEMS03 dataset, with a remarkable 8.97% reduction in the RMSE metric (from 27.55 to 25.08). This result not only highlights STGformer’s advantages in handling complex spatiotemporal data but also indicates its notable effectiveness in reducing prediction errors and enhancing model stability. Furthermore, STGformer achieves consistent performance improvements across all datasets while incurring only 0.2% of the computational cost of STAEformer, further validating its adaptability and robustness in various traffic network environments.

V-B Ablation Study

To comprehensively evaluate the significance of the components within STGformer, we conducted an ablation study on the LargeST dataset. Four variants were considered in Figure 5: without self-attention (W/o SA), without temporal self-attention (W/o TSA), without spatial self-attention (W/o SSA), and without graph high-order interaction (W/o Graph). Our findings indicate that removing any component led to a substantial decline in performance. Notably, removing self-attention entirely (W/o SA) caused the most significant degradation across all metrics (MAE, RMSE, and MAPE) and datasets, as it reduces the model to a feedforward module. Higher-order graph interactions (W/o Graph) also played a crucial role, with their absence having a larger impact than removing only the spatial or temporal self-attention. Furthermore, spatial self-attention (W/o SSA) and temporal self-attention (W/o TSA) each contributed significantly, as their removal led to clear performance drops. These results underscore the importance of every component, particularly global spatiotemporal attention and higher-order interactions, for achieving optimal performance with STGformer.

V-C Case Study

Figure 6 illustrates the traffic flow prediction results of STGformer, STAEformer, AGCRN, D2STGNN, and STGCN, comparing model performance across different nodes. (a) Node 24 in San Diego: STGformer closely tracks the ground-truth traffic flow, showing superior accuracy particularly during transitions between peak and off-peak hours, whereas the other models deviate from the actual values at certain intervals. (b) Node 64 in San Diego: STGformer's prediction curve aligns closely with the ground truth and notably outperforms the other models in the afternoon period; alternatives such as STAEformer and D2STGNN display greater fluctuations or discrepancies. (c) Node 93 in Los Angeles: STGformer is particularly strong during the morning and evening peak hours, with predictions nearly coinciding with the ground truth, while the other models over- or underestimate to varying degrees during these critical periods. (d) Node 208 in Los Angeles: Although all models exhibit some bias during the afternoon traffic peak, STGformer provides the closest approximation to the actual flow trend, especially after 18:00, where it significantly outperforms the other models.

VI Conclusion and Limitation

In conclusion, this study introduces STGformer, a novel spatiotemporal graph Transformer that addresses the computational challenges existing GCN- and Transformer-based methods face when scaling to real-world road networks. Using only a single layer with linear spatiotemporal global attention, STGformer demonstrates superior performance across traffic benchmarks ranging from the small-scale PEMS datasets to the large-scale LargeST dataset. The STG attention block enables efficient high-order spatiotemporal interactions over both global and local patterns, significantly reducing computational cost and memory usage compared to state-of-the-art methods such as STAEformer. Furthermore, STGformer exhibits remarkable generalization capabilities, maintaining robust performance even when tested on data collected one year later. These results position STGformer as a promising backbone for spatiotemporal models in large-scale traffic forecasting applications.

References

  • [1] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” arXiv preprint arXiv:1707.01926, 2017.
  • [2] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, “Graph wavenet for deep spatial-temporal graph modeling,” in Proceedings of International Joint Conference on Artificial Intelligence, 2019, pp. 1907–1913.
  • [3] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Thirty-second AAAI conference on artificial intelligence, 2018.
  • [4] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in Proceedings of the ieee conference on computer vision and pattern recognition, 2016, pp. 5308–5317.
  • [5] H. Liu, Z. Dong, R. Jiang, J. Deng, J. Deng, Q. Chen, and X. Song, “Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting,” in Proceedings of the 32nd ACM international conference on information and knowledge management, 2023, pp. 4125–4129.
  • [6] J. Jiang, C. Han, W. X. Zhao, and J. Wang, “Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 4, 2023, pp. 4365–4373.
  • [7] M. Xu, W. Dai, C. Liu, X. Gao, W. Lin, G.-J. Qi, and H. Xiong, “Spatial-temporal transformer networks for traffic flow forecasting,” arXiv preprint arXiv:2001.02908, 2020.
  • [8] X. Liu, Y. Xia, Y. Liang, J. Hu, Y. Wang, L. Bai, C. Huang, Z. Liu, B. Hooi, and R. Zimmermann, “Largest: A benchmark dataset for large-scale traffic forecasting,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [9] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, “Attention based spatial-temporal graph convolutional networks for traffic flow forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 922–929.
  • [10] C. Chen, K. Petty, A. Skabardonis, P. Varaiya, and Z. Jia, “Freeway performance measurement system: mining loop detector data,” Transportation research record, pp. 96–102, 2001.
  • [11] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting,” in Proceedings of International Joint Conference on Artificial Intelligence, 2018, pp. 3634–3640.
  • [12] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, “Focal attention for long-range interactions in vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 30 008–30 022, 2021.
  • [13] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
  • [14] Q. Wu, W. Zhao, Z. Li, D. Wipf, and J. Yan, “Nodeformer: A scalable graph structure learning transformer for node classification,” in Advances in Neural Information Processing Systems, 2022.
  • [15] Q. Wu, C. Yang, W. Zhao, Y. He, D. Wipf, and J. Yan, “DIFFormer: Scalable (graph) transformers induced by energy constrained diffusion,” in International Conference on Learning Representations, 2023.
  • [16] E. Min, R. Chen, Y. Bian, T. Xu, K. Zhao, W. Huang, P. Zhao, J. Huang, S. Ananiadou, and Y. Rong, “Transformer for graphs: An overview from architecture perspective,” CoRR, vol. abs/2202.08455, 2022.
  • [17] J. Chen, K. Gao, G. Li, and K. He, “NAGphormer: A tokenized graph transformer for node classification in large graphs,” CoRR, vol. abs/2206.04910, 2022.
  • [18] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International conference on machine learning.   PMLR, 2020, pp. 5156–5165.
  • [19] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.
  • [20] Q. Wu, W. Zhao, C. Yang, H. Zhang, F. Nie, H. Jiang, Y. Bian, and J. Yan, “Simplifying and empowering transformers for large-graph representations,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [21] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in International Conference on Learning Representations, 2018.
  • [22] D. Cao, Y. Wang, J. Duan, C. Zhang, X. Zhu, C. Huang, Y. Tong, B. Xu, J. Bai, J. Tong et al., “Spectral temporal graph neural network for multivariate time-series forecasting,” in Proceedings of Advances in Neural Information Processing Systems, 2020, pp. 17 766–17 778.
  • [23] X. Liu, Y. Liang, C. Huang, Y. Zheng, B. Hooi, and R. Zimmermann, “When do contrastive learning signals help spatio-temporal graph forecasting?” in Proceedings of the 30th International Conference on Advances in Geographic Information Systems, 2022, pp. 1–12.
  • [24] X. Liu, Y. Liang, C. Huang, H. Hu, Y. Cao, B. Hooi, and R. Zimmermann, “Do we really need graph neural networks for traffic forecasting?” arXiv preprint arXiv:2301.12603, 2023.
  • [25] Y. Fang, Y. Qin, H. Luo, F. Zhao, B. Xu, L. Zeng, and C. Wang, “When spatio-temporal meet wavelets: Disentangled traffic forecasting via efficient spectral graph attention networks,” in 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023, pp. 517–529.
  • [26] Z. Pan, Y. Liang, W. Wang, Y. Yu, Y. Zheng, and J. Zhang, “Urban traffic prediction from spatio-temporal data using deep meta learning,” in Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2019, pp. 1720–1730.
  • [27] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang, “Adaptive graph convolutional recurrent network for traffic forecasting,” in Proceedings of Advances in Neural Information Processing Systems, 2020, pp. 17 804–17 815.
  • [28] Z. Fang, Q. Long, G. Song, and K. Xie, “Spatial-temporal graph ode networks for traffic flow forecasting,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 364–373.
  • [29] J. Choi, H. Choi, J. Hwang, and N. Park, “Graph neural controlled differential equations for traffic forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 6367–6374.
  • [30] F. Li, J. Feng, H. Yan, G. Jin, F. Yang, F. Sun, D. Jin, and Y. Li, “Dynamic graph convolutional recurrent network for traffic prediction: Benchmark and solution,” ACM Transactions on Knowledge Discovery from Data, pp. 1–21, 2021.
  • [31] Z. Shao, Z. Zhang, W. Wei, F. Wang, Y. Xu, X. Cao, and C. S. Jensen, “Decoupled dynamic spatial-temporal graph neural network for traffic forecasting,” in Proceedings of the VLDB Endowment, 2022, pp. 2733–2746.
  • [32] V. P. Dwivedi and X. Bresson, “A generalization of transformer networks to graphs,” AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
  • [33] L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini, “Recipe for a general, powerful, scalable graph transformer,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 501–14 515, 2022.
  • [34] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [35] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.   OpenReview.net, 2022.
  • [36] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [37] J. Gong, Z. Teng, Q. Teng, H. Zhang, L. Du, S. Chen, M. Z. A. Bhuiyan, J. Li, M. Liu, and H. Ma, “Hierarchical graph transformer-based deep learning model for large-scale multi-label text classification,” IEEE Access, vol. 8, pp. 30 885–30 896, 2020.
  • [38] Y. Yang and D. P. Wipf, “Transformers from an optimization perspective,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 958–36 971, 2022.
  • [39] Q. Wu, W. Zhao, Z. Li, D. P. Wipf, and J. Yan, “Nodeformer: A scalable graph structure learning transformer for node classification,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 387–27 401, 2022.
  • [40] J. Li, B. Hui, R. Cheng, B. Qin, C. Ma, N. Huo, F. Huang, W. Du, L. Si, and Y. Li, “Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 076–13 084.
  • [41] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 559–12 571, 2020.
  • [42] Z. Gu, X. Luo, J. Chen, M. Deng, and L. Lai, “Hierarchical graph transformer with contrastive learning for protein function prediction,” Bioinformatics, vol. 39, no. 7, p. btad410, 2023.
  • [43] A. Pepe, J. Lasenby, and P. Chacón, “Using a Graph Transformer Network to Predict 3D Coordinates of Proteins via Geometric Algebra Modelling,” in Empowering Novel Geometric Algebra for Graphics and Engineering, E. Hitzer, G. Papagiannakis, and P. Vasik, Eds.   Cham: Springer Nature Switzerland, 2023, vol. 13862, pp. 83–95.
  • [44] D. Yao, Y. Gu, G. Cong, H. Jin, and X. Lv, “Entity Resolution with Hierarchical Graph Attention Networks,” in Proceedings of the 2022 International Conference on Management of Data.   Philadelphia PA USA: ACM, Jun. 2022, pp. 429–442.
  • [45] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu, “Do transformers really perform badly for graph representation?” Advances in Neural Information Processing Systems, vol. 34, pp. 28 877–28 888, 2021.
  • [46] Y. Liu, S. Pan, Y. G. Wang, F. Xiong, L. Wang, Q. Chen, and V. C. Lee, “Anomaly detection in dynamic graphs via transformer,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [47] J. Xu, H. Wu, J. Wang, and M. Long, “Anomaly transformer: Time series anomaly detection with association discrepancy,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.   OpenReview.net, 2022.
  • [48] L. Müller, M. Galkin, C. Morris, and L. Rampášek, “Attending to graph transformers,” Transactions on Machine Learning Research, 2024.
  • [49] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.
  • [50] R. Jiang, D. Yin, Z. Wang, Y. Wang, J. Deng, H. Liu, Z. Cai, J. Deng, X. Song, and R. Shibasaki, “Dl-traff: Survey and benchmark of deep learning models for urban traffic prediction,” in Proceedings of the 30th ACM international conference on information & knowledge management, 2021, pp. 4515–4525.
  • [51] C. Cheadle, M. P. Vawter, W. J. Freed, and K. G. Becker, “Analysis of microarray data using z score transformation,” The Journal of molecular diagnostics, vol. 5, no. 2, pp. 73–81, 2003.
  • [52] X. Luo, C. Zhu, D. Zhang, and Q. Li, “Dynamic graph convolution network with spatio-temporal attention fusion for traffic flow prediction,” arXiv preprint arXiv:2302.12598, 2023.
  • [53] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [54] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [55] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, “Connecting the dots: Multivariate time series forecasting with graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 753–763.
  • [56] C. Shang, J. Chen, and J. Bi, “Discrete graph structure learning for forecasting multiple time series,” arXiv preprint arXiv:2101.06861, 2021.
  • [57] S. Lan, Y. Ma, W. Huang, W. Wang, H. Yang, and P. Li, “Dstagnn: Dynamic spatial-temporal aware graph neural network for traffic flow forecasting,” in International conference on machine learning.   PMLR, 2022, pp. 11 906–11 917.
  • [58] Z. Shao, Z. Zhang, F. Wang, W. Wei, and Y. Xu, “Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4454–4458.
Hongjun Wang is working toward the PhD degree in the Department of Mechano-Informatics at The University of Tokyo. He received his M.S. degree in computer science and technology from Southern University of Science and Technology, China. He received his B.E. degree from the Nanjing University of Posts and Telecommunications, China, in 2019. His research interests are broadly in machine learning, with a focus on urban computing, explainable AI, data mining, and data visualization.
Jiyuan Chen is working towards his PhD degree at The Hong Kong Polytechnic University. He received his B.S. degree in Computer Science and Technology from Southern University of Science and Technology, China. His major research fields include artificial intelligence, deep learning, urban computing, and data mining.
Tong Pan received a B.S. degree in Physics from East China Normal University, China, in 2019, and a Ph.D. degree in Physics from the Chinese University of Hong Kong, China, in 2024. Since 2024, she has been a postdoctoral researcher at Southern University of Science and Technology, China. Her research interests include data analysis, machine learning, and AI for science.
Zheng Dong received his B.E. degree in computer science and technology from Southern University of Science and Technology (SUSTech) in 2022. He is currently pursuing an M.S. degree in the Department of Computer Science and Engineering, SUSTech. His research interests include deep learning and spatio-temporal data mining.
Lingyu Zhang joined Baidu in 2012 as a search strategy algorithm research and development engineer. He joined Didi in 2013, where he has served as a senior algorithm engineer, technical director for the taxi strategy algorithm direction, and technical expert in the strategy model department. He is currently a researcher at Didi AI Labs. At Didi, he has used machine learning and big data technology to design and lead the implementation of multiple company-level intelligent system engines, including an order distribution system based on combinatorial optimization, a capacity scheduling engine based on density clustering and global optimization, traffic guidance and personalized recommendation engines, and the “Guess Where You Are Going” personalized destination recommendation system. He has participated in the research, development, and application of dozens of the company’s international and domestic core technology patents, and specializes in applying mathematical modeling, business model abstraction, and machine learning to practical business problems. He has received honors including the Beijing Invention and Innovation Patent Gold Award and the QCon Star Lecturer title, and his research results have been published at top international conferences in artificial intelligence and data mining such as KDD, SIGIR, AAAI, and CIKM.
Renhe Jiang received a B.S. degree in software engineering from the Dalian University of Technology, China, in 2012, an M.S. degree in information science from Nagoya University, Japan, in 2015, and a Ph.D. degree in civil engineering from The University of Tokyo, Japan, in 2019. Since 2019, he has been an Assistant Professor at the Information Technology Center, The University of Tokyo. His research interests include ubiquitous computing, deep learning, and spatio-temporal data analysis.
Prof. Xuan Song received the Ph.D. degree in signal and information processing from Peking University in 2010. In 2017, he was selected as an Excellent Young Researcher by Japan’s MEXT. Over the past ten years, he has led or participated in many important projects in Japan as a principal investigator or primary contributor, such as the DIAS/GRENE Grant of MEXT, Japan; the Japan/US Big Data and Disaster Project of JST, Japan; the Young Scientists Grant and Scientific Research Grant of MEXT, Japan; the Research Grant of MLIT, Japan; the CORE Project of Microsoft; and grants from JR EAST Company and Hitachi Company, Japan. He has served as an associate editor, guest editor, area chair, program committee member, or reviewer for numerous journals and top-tier conferences, including IMWUT, IEEE Transactions on Multimedia, WWW Journal, Big Data Journal, ISTC, MIPR, ACM TIST, IEEE TKDE, UbiComp, ICCV, CVPR, and ICRA.