1 Introduction

Visual object tracking is a fundamental task in computer vision that involves continuously tracking a specific object within a video sequence, starting from its initial state. This task has wide-ranging applications in areas such as video surveillance, autonomous driving, and robotic vision. In recent years, advancements in deep neural networks Krizhevsky et al. (2012); He et al. (2016); Vaswani et al. (2017) have significantly accelerated progress in visual tracking Li et al. (2018). In particular, the introduction of transformers Vaswani et al. (2017) has played a pivotal role in enabling high-performance trackers Chen et al. (2021); Yan et al. (2021); Wang et al. (2021); Cui et al. (2022); Ye et al. (2022); Chen et al. (2023); Wei et al. (2023). Despite these advancements, much of the research Zhao et al. (2022); Li et al. (2018); Bhat et al. (2019); Chen et al. (2021); Zhao et al. (2022); Dai et al. (2019); Liu et al. (2024) has primarily focused on improving tracking accuracy, often overlooking the importance of speed. While many of these trackers can achieve real-time performance on high-end GPUs, their practical use is limited on resource-constrained devices. For instance, the high-performance tracker TransT Chen et al. (2021) operates at only 5 frames per second (fps) on an Intel Core i9-9900K CPU and 13 fps on the NVIDIA Jetson AGX. This highlights the pressing need for a tracker that not only delivers high accuracy but also operates efficiently on devices with limited computational power.

Fig. 1

Comparison of our methods and other trackers. In (a), the reported speed refers to the inference speed of the tracker, excluding data pre-processing. Adhering to the VOT real-time setting Kristan et al. (2020), we set the real-time line at 20 fps in (a). For (b), we evaluate on the Nvidia GeForce RTX 2080 Ti GPU using a single thread on the GOT-10k test set and then submit the test results to the GOT-10k evaluation server Huang et al. (2021) to obtain the speed.

The one-stream structure has gained significant popularity in tracking applications Ye et al. (2022); Chen et al. (2022); Xie et al. (2022); Cui et al. (2022). This architecture integrates feature extraction and feature fusion into a unified process, fully leveraging the potential of backbone networks Dosovitskiy et al. (2021) pre-trained for image classification. Following this trend, our study adopts the one-stream architecture, employing a lightweight transformer backbone network pre-trained for classification tasks. However, a critical gap remains between the requirements of tracking and the design of networks optimized for image classification. Lightweight networks Graham et al. (2021); Mehta and Rastegari (2022); Wu et al. (2022) in image classification typically adopt hierarchical architectures with high-stride downsampling to reduce computational costs. While this approach is effective in classification, the use of large-stride downsampling in tracking often leads to a loss of critical fine-grained information necessary for precise object localization. This discrepancy raises a question: How can we balance the need for detailed spatial information in tracking with the computational efficiency of high-stride downsampling in hierarchical backbone networks?

Fig. 2

Examples of “easy” and “hard” cases.

To address this challenge, we propose the Bridge Module, designed to integrate features from different stages of the hierarchical backbone. By merging deep semantic information with shallow, fine-grained details, the Bridge Module mitigates the information loss caused by large-stride downsampling. Incorporating this module into the lightweight hierarchical backbone LeViT Graham et al. (2021), we develop HiT, a novel family of efficient tracking models. In addition, we introduce a new relative position encoding approach, termed dual-image position encoding, to further enhance the representation of positional information. This method encodes the positional information of both the template and the search region simultaneously, promoting more effective interaction between the two and improving tracking performance.

Building upon the foundation of the HiT model family, we further propose an efficient dynamic tracking framework in this study. As illustrated in Fig. 2, tracking scenarios can range from simple to complex. Simple scenarios typically demand minimal computational resources and can be effectively handled by smaller models. In contrast, complex scenarios require larger models with greater computational capacity. However, conventional trackers rely on static models, applying the same processing framework to all scenarios. This results in inefficient resource utilization: over-provisioning for simple scenarios and inadequate performance for complex ones. Previous studies Zhu et al. (2024); Huang et al. (2017) have explored methods to partition tracking scenarios by incorporating dedicated judgment modules and sophisticated decision-making mechanisms. These methods activate different models based on scenario complexity, aiming to balance accuracy and efficiency. However, such approaches often introduce substantial latency due to their intricate judgment processes, making them unsuitable for efficient tracking. For example, the fastest version of the previous dynamic tracker Zhu et al. (2024) achieves only 37 fps on the AGX, which limits its practical applications. Consequently, developing a lightweight and efficient mechanism to evaluate tracking scenarios and implement a divide-and-conquer strategy remains an urgent challenge.

To address this issue, we propose an efficient feature-driven dynamic routing architecture, extending our HiT model to implement DyHiT. The core of this architecture lies in the incorporation of an early exit strategy into the framework. During forward propagation, the intermediate feature map of the search area is fed into a lightweight router to assess whether the current feature is sufficient to predict the tracking result accurately. If so, the forward propagation halts, and the intermediate feature is directly used for prediction. Otherwise, the propagation continues to extract more refined features. This approach enables efficient adaptation to varying scenario complexities. In simple scenarios, only a shallow sub-network is activated, significantly reducing inference time. Conversely, in complex scenarios, deeper layers of the network are utilized to ensure precise predictions. Unlike previous methods, our router is designed to be both simple and efficient, comprising only a few linear layers. This avoids the substantial latency typically introduced by the complex judgment modules and decision mechanisms used in earlier approaches. By leveraging feature-driven classification, our router performs straightforward assessments based on features extracted by the backbone, eliminating the need for intricate scene complexity evaluations. As a result, DyHiT offers superior efficiency and practicality compared to prior dynamic trackers, achieving an optimal balance between speed and accuracy in resource-constrained environments.

Our comprehensive experiments demonstrate the effectiveness and efficiency of both HiT and DyHiT. As shown in Fig. 1a, HiT-Base achieves an 11.1% higher AUC score on the LaSOT benchmark compared to the high-speed tracker FEAR Borsuk et al. (2022), while operating at 1.6 times faster speed on the Nvidia Jetson AGX Xavier. Furthermore, when compared to the high-performance tracker STARK-ST50 Yan et al. (2021), HiT-Base delivers comparable accuracy but with an impressive 4.7 times faster speed on the AGX, marking a substantial improvement over prior real-time trackers. The red line in Fig. 1a illustrates the speed-accuracy trade-off curve of DyHiT, which achieves a broad spectrum of trade-offs using a single model. Notably, DyHiT outperforms all previous efficient trackers in both speed and precision. The fastest version of DyHiT operates at 111 fps on the Nvidia Jetson AGX while maintaining an AUC of 62.4% on LaSOT. This surpasses the recent MixformerV2-S Cui et al. (2024) by 1.8% in AUC, while achieving a speed that is 1.6 times faster on the AGX.

Moreover, building on the dynamic routing architecture of DyHiT, we introduce a training-free acceleration method for existing high-performance trackers. This approach significantly enhances tracking speed while maintaining accuracy. We integrate the fastest route of DyHiT (DyHiT-Route1) and our efficient feature-driven router into various high-performance trackers to construct their corresponding fast variants, namely DyTracker. During inference, DyHiT-Route1 first extracts image features, which are then evaluated by the router for reliability. If the features are deemed reliable, DyHiT-Route1 directly predicts the tracking results. Otherwise, the high-performance tracker is activated to ensure higher accuracy. This dynamic mechanism enables DyTracker to efficiently handle simple scenarios using DyHiT-Route1, conserving computational resources, while employing the high-performance tracker for complex scenarios to achieve precise predictions. As illustrated in Fig. 1b, this approach delivers a substantial speed-up for existing high-performance trackers without compromising accuracy. For instance, state-of-the-art trackers OSTrack-256 Ye et al. (2022) and SeqTrack-B256 Chen et al. (2023) achieve speed improvements of \(1.6\times \) and \(2.7\times \), respectively, when augmented with our method.

Our main contributions can be summarized as:

  • We introduce a new family of efficient tracking models, HiT. The proposed Bridge Module integrates high-level semantic information with shallow, fine-grained details, enabling the use of large-stride downsampling backbones in tracking. To improve positional accuracy, we introduce a dual-image position encoding approach that jointly encodes positional information from both the template and search region. HiT achieves superior performance and exceptional speed compared to previous efficient tracking methods.

  • We propose an efficient feature-driven dynamic routing architecture to extend HiT, resulting in DyHiT. DyHiT efficiently evaluates tracking scenarios and applies different routes based on their demands. A divide-and-conquer strategy is employed to maximize computational resource utilization. This approach achieves a wide range of speed-accuracy trade-offs.

  • Expanding upon DyHiT, we further develop a training-free acceleration method for high-performance trackers. Experiments on seven tracking models demonstrate that our method significantly improves the speed of high-performance trackers without compromising accuracy.

This study builds on our conference paper Kang et al. (2023), published at ICCV 2023, and significantly extends it in several aspects. First, we extend HiT by incorporating the proposed efficient feature-driven dynamic routing architecture, implementing DyHiT. DyHiT achieves a wide range of speed-precision trade-offs, outperforming all previous efficient trackers. Second, we introduce a training-free acceleration method for existing high-performance trackers. This method significantly improves the speed of high-performance trackers without compromising their accuracy. Third, we conduct a more comprehensive comparison of HiT and DyHiT with a variety of trackers on a broader range of datasets.

2 Related Work

2.1 Visual Tracking

Siamese-based methods Bertinetto et al. (2016); Tao et al. (2016); Li et al. (2018); Wang et al. (2019, 2021); Xu et al. (2020); Guo et al. (2020); Chen et al. (2020); Zhang and Peng (2019) have gained popularity in tracking. Typically, the Siamese-based framework employs two backbone networks with shared parameters to extract features from template and search region images. It utilizes a correlation-based network for feature interaction and head networks for final prediction. The introduction of transformers Vaswani et al. (2017) in works like TransT Chen et al. (2021), TMT Wang et al. (2021), and their subsequent iterations Xu et al. (2024); Liu et al. (2021); Mayer et al. (2022); Song et al. (2022); Gao et al. (2022) further enhances tracking performance through advanced feature interaction. Recently, a one-stream framework has established new state-of-the-art performance in tracking, exemplified by methods like MixFormer Cui et al. (2022), SBT Xie et al. (2022), SimTrack Chen et al. (2022), OSTrack Ye et al. (2022), and SeqTrack Chen et al. (2023). This one-stream framework jointly performs feature extraction and feature fusion with the backbone network, proving simple yet effective by leveraging the capabilities of a pre-trained backbone. However, these methods are designed for powerful GPUs, and their speeds on edge devices are suboptimal, limiting their applicability. In this work, we focus on enhancing the efficiency of the one-stream framework, thereby broadening its applicability to a wider range of real-world scenarios.

2.2 Efficient Tracking Network

Practical applications necessitate efficient trackers capable of achieving both high performance and fast speed on edge devices. Early methods such as ECO Danelljan et al. (2017) and ATOM Danelljan et al. (2019) achieve real-time speed on edge devices, but their performance lags behind current state-of-the-art trackers. Recently, some efficient trackers have emerged. LightTrack Yan et al. (2021) employs NAS to search networks, resulting in low computational requirements and relatively high performance. FEAR Borsuk et al. (2022) achieves a family of efficient and accurate trackers by employing a dual-template representation and a pixel-wise fusion block. Despite these advances, a significant performance gap remains between these efficient trackers and high-performance trackers Chen et al. (2021); Ye et al. (2022). In this work, we propose HiT and its extension, DyHiT. Both HiT and DyHiT achieve high speeds on edge devices while delivering competitive performance compared to high-performance trackers.

2.3 Vision Transformer

ViT Dosovitskiy et al. (2021) introduces the transformer to image classification and demonstrates impressive performance. Subsequently, numerous vision transformer networks Touvron et al. (2021); Yuan et al. (2021); Wu et al. (2021); Wang et al. (2021); Liu et al. (2021); Zhang et al. (2024) have been developed. While transformers are renowned for their superior modeling capabilities, their speed is a limitation. Hence, lightweight vision transformers Mehta and Rastegari (2022); Graham et al. (2021); Wu et al. (2022) have emerged, significantly accelerating the speed of transformer-based networks. These lightweight transformers deviate from classical vision transformers by adopting a hierarchical architecture with high-stride downsampling, reducing computational overhead. In this work, we integrate a lightweight hierarchical vision transformer into the one-stream tracking framework, using LeViT Graham et al. (2021) as the default backbone. However, our approach differs fundamentally from LeViT in several key aspects: i) While LeViT makes predictions based on heavily downsampled features, our HiT framework incorporates a Bridge Module to fuse features from multiple stages, enabling predictions on fused high-resolution features. Additionally, we modify the transformer module to process both the search region and the template simultaneously. Our dynamic framework further enhances model efficiency. ii) LeViT is designed specifically for image classification, focusing more on high-level semantic information, while shallow details are less important. In tracking tasks, however, shallow details are crucial for accurate localization. Our model combines both high-level semantic information and shallow details, making it more suitable for tracking tasks. iii) LeViT employs position encoding for individual images. In comparison, we introduce dual-image position encoding to jointly represent positional information for both the template and the search region, enhancing the model’s ability to capture fine details.

2.4 Dynamic Network

In image classification, dynamic networks can be broadly categorized into two types: instance-wise Huang et al. (2018); Li et al. (2021); Wang et al. (2018, 2021); Yang et al. (2020) and spatial-wise Figurnov et al. (2017); Han et al. (2022); Rao et al. (2021); Zheng et al. (2023). Spatial-wise dynamic networks, such as SACT Figurnov et al. (2017) and the dynamic token sparsity framework Rao et al. (2021), enhance efficiency by dynamically adjusting the execution layers of the network or pruning redundant tokens based on input characteristics, thereby accelerating processing speed. However, the performance of spatial-wise dynamic networks often depends on the co-design of software and hardware for optimized execution. In contrast, instance-wise dynamic networks are more adaptable to traditional CPUs and GPUs as they do not require sparse computation. A notable example is MSDNet Huang et al. (2018), which employs an early-exit strategy by training multiple classifiers with varying resource demands and integrating them into a unified neural network. During inference, these classifiers are adaptively utilized based on prediction confidence, enabling efficient acceleration. Dynamic networks have also been applied in visual tracking. For instance, EAST Huang et al. (2017) formulates the adaptive tracking problem as a decision-making process using reinforcement learning and achieves dynamic tracking through the design of complex decision schemes. Similarly, Zhu et al. Zhu et al. (2024) introduce a dedicated and complex module for scene assessment, enabling the selection of appropriate inference paths for different inputs to achieve dynamic inference. However, the reliance on intricate judgment modules and decision schemes in previous works introduces significant time overhead for scene assessment, limiting their practical applications. In this work, we propose an efficient feature-driven dynamic routing architecture to extend our HiT, resulting in DyHiT. DyHiT enables rapid scene assessment and implements a divide-and-conquer strategy, ensuring both efficiency and practicality.

3 Method

This section presents a comprehensive overview of our model. We begin with a brief introduction to the HiT framework, followed by a detailed explanation of its architecture. This includes a lightweight hierarchical backbone with dual-image position encoding, a Bridge Module, and a tracking head. Next, we introduce DyHiT, our dynamic tracking extension, along with the acceleration strategy for high-performance trackers. Finally, we describe the training pipelines.

Fig. 3

Architecture of the proposed HiT framework. The HiT framework contains three components: a lightweight hierarchical vision transformer, a Bridge Module, and a prediction head.

3.1 Overview

As depicted in Fig. 3, HiT is a one-stream tracking framework comprising three key components: the lightweight hierarchical transformer, the proposed Bridge Module, and the head network. The input comprises an image pair, including the search region and template images, which is processed by the lightweight hierarchical transformer for simultaneous feature extraction and fusion. The core elements of the hierarchical vision transformer include Multi-Head Attention (MHA), Shrink Attention (SA), and dual-image position encoding. MHA extracts and fuses features from the search and template images, SA reduces the feature resolution for computational efficiency, and dual-image position encoding jointly encodes positional information for both images. Each stage of the transformer generates a sequence of features at different resolutions, culminating in a global vector obtained by averaging the final output features from the last stage. Subsequently, the feature sequence enters the Bridge Module, where features are fused to acquire enhanced features. Finally, the global vector and the enhanced features are fed into the prediction head to produce the tracking result.

3.2 Lightweight Hierarchical ViT

Hierarchical Backbone. We employ LeViT Graham et al. (2021) as the backbone of HiT, and adapt it into our tracking framework. Specifically, the backbone takes as input the template image \(\textbf{Z} \in {\mathbb {R}}^{3 \times {H_{z}} \times {W_{z}}}\) and the search region image \(\textbf{X} \in {\mathbb {R}}^{3 \times {H_{x}} \times {W_{x}}}\). Initially, we downsample the image pair by a factor of 16 through patch embedding, resulting in \(\mathbf {Z_{p}} \in {\mathbb {R}}^{C \times {\frac{H_{z}}{16}} \times {\frac{W_{z}}{16}}}\) and \(\mathbf {X_{p}} \in {\mathbb {R}}^{C \times {\frac{H_{x}}{16}} \times {\frac{W_{x}}{16}}}\). Subsequently, we flatten and concatenate \(\mathbf {Z_{p}}\) and \(\mathbf {X_{p}}\) in the spatial dimension before feeding them into the hierarchical transformer. The hierarchical transformer comprises three stages, with the \(i\)-th stage having \(L_i\) blocks (\(L_1 = L_2 = L_3 = 4\) by default). Each block includes a Multi-Head Attention and an MLP in the residual form. Shrink Attention modules connect each stage and downsample features by a factor of 4 in the spatial dimension. For the output features of each stage, we extract the partial features corresponding to the search image. In the final stage, we average its output features to obtain a global vector \(\textbf{G}\). After the transformer backbone, we obtain a global vector \(\textbf{G} \in {\mathbb {R}}^{1 \times C_{min}}\) and a feature sequence with three feature maps of different sizes: \(\mathbf {S_{max}} \in {\mathbb {R}}^{{H_{max}} \times W_{max} \times C_{max}}\), \(\mathbf {S_{mid}} \in {\mathbb {R}}^{{H_{mid}} \times W_{mid} \times C_{mid}}\), and \(\mathbf {S_{min}} \in {\mathbb {R}}^{{H_{min}} \times W_{min} \times C_{min}}\).
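
To make the token flow and tensor shapes above concrete, the following sketch traces the shapes through the backbone. It is a minimal illustration assuming the HiT-Base input sizes (a 256 × 256 search region and a 128 × 128 template) and placeholder channel widths; the actual stage widths of LeViT-384 differ.

```python
import torch

# Minimal shape trace of the backbone (illustrative channel widths, not LeViT-384's).
B, C = 1, 128                          # batch size, channels after patch embedding
Hz = Wz = 128 // 16                    # template tokens: 8 x 8 after stride-16 embedding
Hx = Wx = 256 // 16                    # search tokens: 16 x 16

Z_p = torch.randn(B, Hz * Wz, C)       # flattened template tokens
X_p = torch.randn(B, Hx * Wx, C)       # flattened search tokens
tokens = torch.cat([Z_p, X_p], dim=1)  # joint sequence fed to the three stages

# Shrink Attention halves each spatial side between stages, so the search-region
# features extracted from the three stage outputs have the following shapes:
S_max = torch.randn(B, 16, 16, C)      # stage 1 (highest resolution)
S_mid = torch.randn(B, 8, 8, 2 * C)    # stage 2
S_min = torch.randn(B, 4, 4, 4 * C)    # stage 3 (lowest resolution, C_min = 4C here)

# Global vector: mean of the final-stage output features.
G = S_min.flatten(1, 2).mean(dim=1)    # (B, C_min)
print(tokens.shape, S_max.shape, S_mid.shape, S_min.shape, G.shape)
```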

Multi-Head Attention (MHA). The structure of MHA is illustrated in Fig. 4a. The number of channels of Q and K is half that of V to reduce computation. Following LeViT, we use attention bias as a relative position encoding rather than absolute position encoding. We generate the attention bias using our dual-image position encoding, the details of which will be introduced later. The mechanism of MHA can be summarized as:

$$\begin{aligned} \begin{aligned} \mathrm{Attn}(\textbf{Q},\textbf{K},\textbf{V},\textbf{B}_i)&= \mathrm{softmax}\Big (\frac{\textbf{Q}\textbf{K}^\top }{\sqrt{d_k}}+\textbf{B}_i\Big )\textbf{V}, \\ \textbf{H}_i&=\mathrm{Hardswish}\big (\mathrm{Attn}(\textbf{X}\textbf{W}_i^Q,\textbf{X}\textbf{W}_i^K,\textbf{X}\textbf{W}_i^V,\textbf{B}_i)\big ),\\ \mathrm{MultiHead}(\textbf{X})&= \mathrm{Concat}(\textbf{H}_1,\ldots ,\textbf{H}_{N})\textbf{W}^O, \end{aligned} \end{aligned}$$
(1)

where \(\textbf{X}\in \mathbb {R}^{HW \times C}\) is the input, \(\textbf{B}_i\in \mathbb {R}^{HW \times HW}\) is the attention bias, and \(\textbf{W}_i^Q, \textbf{W}_i^K \in \mathbb {R}^{C \times D}\), \(\textbf{W}_i^V \in \mathbb {R}^{C \times 2D}\), and \(\textbf{W}^O \in \mathbb {R}^{2ND \times C}\) are parameter matrices.
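A minimal PyTorch sketch of Eq. (1) is given below. The head count, per-head key dimension, and token count are illustrative assumptions, and the position-dependent bias lookup is replaced by a free-form learnable bias per token pair; it is not the exact LeViT/HiT implementation.

```python
import torch
import torch.nn as nn

class BiasedMHA(nn.Module):
    """Multi-head attention with an additive learned attention bias (Eq. 1 sketch)."""

    def __init__(self, dim, num_heads, key_dim, num_tokens):
        super().__init__()
        self.num_heads, self.key_dim = num_heads, key_dim
        # Q and K use key_dim channels per head; V uses 2 * key_dim (half-width Q/K).
        self.to_q = nn.Linear(dim, num_heads * key_dim)
        self.to_k = nn.Linear(dim, num_heads * key_dim)
        self.to_v = nn.Linear(dim, num_heads * 2 * key_dim)
        self.proj = nn.Linear(num_heads * 2 * key_dim, dim)
        self.act = nn.Hardswish()
        # One learnable bias per head and token pair (the indexed PE lookup is omitted).
        self.attn_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):                       # x: (B, N, dim)
        B, N, _ = x.shape
        q = self.to_q(x).view(B, N, self.num_heads, self.key_dim).transpose(1, 2)
        k = self.to_k(x).view(B, N, self.num_heads, self.key_dim).transpose(1, 2)
        v = self.to_v(x).view(B, N, self.num_heads, 2 * self.key_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.key_dim ** 0.5 + self.attn_bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(self.act(out))         # Hardswish is elementwise, so acting on
                                                # the concatenated heads equals per-head use

x = torch.randn(2, 320, 128)                    # 64 template + 256 search tokens
print(BiasedMHA(dim=128, num_heads=4, key_dim=16, num_tokens=320)(x).shape)
```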

Shrink Attention (SA). The structure of SA is illustrated in Fig. 4b. SA plays a crucial role in connecting the stages of the hierarchical transformer and downsampling the features. The architecture of SA mirrors that of MHA with some notable modifications: 1) The 2D input features are split into template features (T) and search region features (S) based on their spatial positions. These features are then reshaped into 3D, subsampled by a factor of 2 in each spatial direction, and re-flattened before being concatenated along the spatial dimension. This process effectively reduces the size of Q by a factor of 4 overall, resulting in downsampled SA output. 2) To address potential information loss due to downsampling, the number of channels in V is doubled. Additionally, the output features are configured with an increased number of channels to enhance their representational capacity. These modifications enable SA to efficiently downsample features while preserving critical information, ensuring effective integration within the hierarchical transformer framework.
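
Only the query-downsampling step 1) of SA is sketched below; the token counts match the HiT-Base input sizes used above, and the widened V and output channels of step 2) are omitted.

```python
import torch

def shrink_queries(tokens, hz, wz, hx, wx):
    """Subsample the query tokens by 2 in each spatial direction, separately for the
    template and search parts, then re-concatenate (SA step 1, a simplified sketch)."""
    B, N, C = tokens.shape
    t, s = tokens.split([hz * wz, hx * wx], dim=1)            # split template / search
    t = t.view(B, hz, wz, C)[:, ::2, ::2].reshape(B, -1, C)   # (B, hz*wz/4, C)
    s = s.view(B, hx, wx, C)[:, ::2, ::2].reshape(B, -1, C)   # (B, hx*wx/4, C)
    return torch.cat([t, s], dim=1)    # queries with 4x fewer tokens overall

tokens = torch.randn(2, 8 * 8 + 16 * 16, 128)      # 64 template + 256 search tokens
print(shrink_queries(tokens, 8, 8, 16, 16).shape)  # torch.Size([2, 80, 128])
```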

Fig. 4

Detailed architectures of MHA and SA.

Dual-image Position Encoding. In line with LeViT, we utilize attention bias to incorporate relative position information into attention maps. To more effectively encode the joint position information of both the template and the search region, we propose the dual-image position encoding method. Specifically, attention bias is represented as a set of learnable parameters. The process involves computing the relative positions between every pair of pixels, utilizing these relative positions as indices to retrieve the corresponding learned parameters, and subsequently adding them to the attention map. This method effectively introduces crucial position information into the attention mechanism. It is calculated as

$$\begin{aligned} \mathrm{Bias}^h = \textbf{B}^{h}\big (|{x}-{x}^{\prime } |, |{y}-{y}^{\prime } |\big ), \end{aligned}$$
(2)

where \((x, y)\) and \(({x}^{\prime },{y}^{\prime })\) \(\in [H] \times [W]\) are two pixels on the feature map, \(\textbf{B}^{h}\) denotes the learned parameters, and \(\mathrm{{Bias}}^{h}\) is the indexed learned parameter added to the attention map. As illustrated in Fig. 5a, the previous position encoding approach encodes the template and the search region separately. However, the positions of the two images partially overlap, leading to information confusion. Specifically, the position of the template aligns with the upper-left portion of the search region. To mitigate this issue, our dual-image position encoding adopts a diagonal arrangement for the template and the search region, encoding their position information jointly, as depicted in Fig. 5b. This diagonal arrangement ensures the encoding of unique horizontal and vertical coordinates for each pixel in the template and search region, thereby avoiding the confusion of detailed position information.
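
The sketch below illustrates the dual-image indexing behind Eq. (2): the template grid is placed at a diagonal offset from the search grid so that no template pixel shares a coordinate with a search pixel, and the absolute coordinate differences index a learnable bias table. The specific offset and table size are assumptions for illustration.

```python
import torch
import torch.nn as nn

def dual_image_coords(hz, wz, hx, wx):
    """Assign each token a 2D coordinate. The search region starts at (0, 0); the
    template is placed diagonally at offset (hx, wx) so that template and search
    coordinates never overlap (cf. Fig. 5b; the exact offset is an assumption)."""
    yt, xt = torch.meshgrid(torch.arange(hz), torch.arange(wz), indexing="ij")
    template = torch.stack([yt.flatten() + hx, xt.flatten() + wx], dim=1)
    ys, xs = torch.meshgrid(torch.arange(hx), torch.arange(wx), indexing="ij")
    search = torch.stack([ys.flatten(), xs.flatten()], dim=1)
    return torch.cat([template, search], dim=0)                 # (N, 2)

coords = dual_image_coords(hz=8, wz=8, hx=16, wx=16)            # N = 320 tokens
rel = (coords[:, None, :] - coords[None, :, :]).abs()           # |y - y'|, |x - x'|
num_heads, max_off = 4, 24                                      # 24 = 16 + 8 offsets
bias_table = nn.Parameter(torch.zeros(num_heads, max_off, max_off))   # learned B^h
attn_bias = bias_table[:, rel[..., 0], rel[..., 1]]             # (heads, N, N), Eq. (2)
print(attn_bias.shape)
```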

Fig. 5

Comparison of our dual-image position encoding and the previous position encoding.

3.3 Bridge Module and Head

Bridge Module. The Bridge Module serves to fuse features from different stages of the hierarchical transformer, producing an enhanced feature that combines both detailed and semantic information. It acts as a bridge between the lightweight hierarchical transformer and the tracking framework. To ensure model efficiency, we aim for the Bridge Module to have a minimal yet effective architecture. The simplicity of its design leads to compelling results. In Fig. 3, the red box illustrates the outputs of the transformer: three 2D features with distinct sizes. We reshape these 2D features into 3D feature maps denoted as \(\mathbf {S_{min}}\), \(\mathbf {S_{mid}}\), and \(\mathbf {S_{max}}\). The Bridge Module follows a simple procedure: first, \(\mathbf {S_{min}}\) is upsampled and added to \(\mathbf {S_{mid}}\). Next, the resulting feature is upsampled and combined with \(\mathbf {S_{max}}\), yielding the final enhanced feature. For all upsampling operations, a transpose convolutional layer with a stride of 2 is used. The mechanism of the Bridge Module can be summarized as

$$\begin{aligned} \begin{aligned} {\mathbf {O_\text {s}}} = {\mathbf {S_\text {max}}}+\mathrm{{Upsample}}({\mathbf {S_\text {mid}}}+\mathrm{{Upsample}}({\mathbf {S_\text {min}}})), \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {O_{s}} \in {\mathbb {R}}^{{H_{max}} \times W_{max} \times C_{max}}\) is the output of the Bridge Module; \(\mathbf {S_{max}} \in {\mathbb {R}}^{{H_{max}} \times W_{max} \times C_{max}}\), \(\mathbf {S_{mid}} \in {\mathbb {R}}^{{H_{mid}} \times W_{mid} \times C_{mid}}\) and \(\mathbf {S_{min}} \in {\mathbb {R}}^{{H_{min}} \times W_{min} \times C_{min}}\) are feature maps output by the lightweight hierarchical transformer. The Bridge Module effectively integrates deep semantic information with shallow detail, mitigating information loss caused by large-stride downsampling. It introduces only 327M FLOPs and 2.6M parameters, accounting for merely 7.5% and 6.3% of the total network, respectively. Despite its minimalist design, the module consistently produces compelling results while maintaining high efficiency.
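
A minimal PyTorch sketch of Eq. (3), assuming placeholder channel widths; following the description above, both upsampling steps use stride-2 transposed convolutions, which here also align the channel widths between stages.

```python
import torch
import torch.nn as nn

class BridgeModule(nn.Module):
    """Fuse the three stage outputs into one high-resolution feature map (Eq. 3)."""

    def __init__(self, c_min, c_mid, c_max):
        super().__init__()
        # Stride-2 transposed convolutions upsample and match channel widths.
        self.up_min = nn.ConvTranspose2d(c_min, c_mid, kernel_size=2, stride=2)
        self.up_mid = nn.ConvTranspose2d(c_mid, c_max, kernel_size=2, stride=2)

    def forward(self, s_max, s_mid, s_min):     # (B, C, H, W) layout
        s_mid = s_mid + self.up_min(s_min)      # inject deepest semantics into stage 2
        return s_max + self.up_mid(s_mid)       # O_s at the stage-1 resolution

# Illustrative shapes only (not the exact HiT-Base channel widths).
s_max = torch.randn(1, 128, 16, 16)
s_mid = torch.randn(1, 256, 8, 8)
s_min = torch.randn(1, 512, 4, 4)
print(BridgeModule(512, 256, 128)(s_max, s_mid, s_min).shape)  # (1, 128, 16, 16)
```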

Head. We use the corner head Yan et al. (2021) for prediction. First, the attention map between \({\textbf{G}}\) and \({\mathbf {O_{s}}}\) is computed. Next, \({\mathbf {O_{s}}}\) is re-weighted using the attention map, allowing local features to be enhanced or suppressed based on global information. Finally, the re-weighted \({\mathbf {O_{s}}}\) is passed through a fully-convolutional network to produce the target coordinates.
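
The re-weighting step can be sketched as below (the fully-convolutional corner predictor itself is omitted); the softmax-normalized similarity is an assumption rather than the exact formulation of the corner head in Yan et al. (2021).

```python
import torch

def reweight_with_global(o_s, g):
    """Re-weight the fused feature map O_s by its attention to the global vector G,
    so that local features are enhanced or suppressed by global information."""
    B, C, H, W = o_s.shape
    feat = o_s.flatten(2).transpose(1, 2)                            # (B, HW, C)
    attn = torch.softmax(feat @ g.unsqueeze(-1) / C ** 0.5, dim=1)   # (B, HW, 1)
    return (feat * attn).transpose(1, 2).view(B, C, H, W)

o_s, g = torch.randn(2, 128, 16, 16), torch.randn(2, 128)
print(reweight_with_global(o_s, g).shape)
```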

Fig. 6

Framework of the proposed DyHiT. DyHiT consists of three components: a Router for assessing the complexity of scenes, Route1 for simple scenarios, and Route2 for complex scenarios.

3.4 Dynamic Routing Mechanism for HiT

To further improve the efficiency of HiT, we develop DyHiT by introducing an efficient feature-driven dynamic routing architecture to HiT-Base. DyHiT effectively classifies tracking scenarios and flexibly invokes different routes based on scene complexity, optimizing computational resource usage. This enables a wide range of speed-accuracy trade-offs with a single tracker. As shown in Fig. 6, DyHiT and HiT share the same backbone network. The input images undergo patch embedding for downsampling, and the concatenated tokens after downsampling are input into the subsequent backbone network for processing. We divide the backbone network of HiT into two paths: one for handling simple scenes, referred to as Route1 in Fig. 6, and another for handling complex scenes, referred to as Route2. From the features \(\mathbf {S_{1}}\) output by the first stage of the backbone, we extract the search region features (\(\mathbf {S_{max}}\)) and input them into the router. The router generates a score (\(\textbf{F}\)), which is compared to a threshold (\(\textbf{T}\)). If \(\textbf{F}\) exceeds \(\textbf{T}\), we classify the scene as simple, and the features from the first stage are sufficient for accurate prediction. In this case, we terminate the inference process and activate Route1 for prediction. The global vector \(\textbf{G}\) is obtained by computing the mean of \(\mathbf {S_{1}}\), then \(\mathbf {S_{max}}\) and \(\textbf{G}\) are fed into Head1 for prediction. If \(\textbf{F}\) is less than \(\textbf{T}\), we consider the scene to be complex and activate Route2. The backbone continues through stages 2 and 3, and the features from all three stages are fused via the Bridge Module. The fused features are then input into Head2 for prediction, resulting in the final tracking output.

Fig. 7

Framework of the Router in DyHiT. The blue regions indicate the foreground, while the gray regions represent the background.

This divide-and-conquer strategy allows DyHiT to use shallow networks for fast predictions in simple scenarios, while invoking deeper networks for more precise predictions in complex scenarios. This optimizes computational resource utilization. Both Head1 and Head2 utilize the corner head Yan et al. (2021), consistent with the head in HiT. Notably, in our efficient feature-driven approach, the router only relies on the search region features extracted by the backbone, without additional feature extraction or integration. As a result, the router design is extremely simple, as shown in Fig. 7, consisting of just three linear layers, which are sufficient for making accurate decisions based on the existing features. The Router introduces only 11M FLOPs and 0.05M parameters, accounting for just 0.2% and 0.1% of the entire network, respectively, which is negligible. This simplicity ensures the router’s high efficiency and avoids the substantial time costs typically associated with scene complexity assessments in previous methods. In summary, DyHiT can be characterized as:

$$\begin{aligned} \begin{gathered} \textbf{F} = \textrm{Mean}(\textrm{R}(\mathbf {S_\text {max}})), \\ \textbf{y} = {\left\{ \begin{array}{ll} \textrm{Head1}(\mathbf {S_\text {max}}, \textbf{G}), & \text {if } \textbf{F} > \textbf{T}, \\ \textrm{Head2}(\textrm{B}(\mathbf {S_\text {max}}, \mathbf {S_\text {mid}}, \mathbf {S_\text {min}}), \textbf{G}), & \text {if } \textbf{F} < \textbf{T}. \end{array}\right. } \end{gathered} \end{aligned}$$
(4)

In the equation, \(\textrm{R}(\cdot )\) represents the router, consisting of three linear layers. \(\textrm{Mean}(\cdot )\) denotes the calculation of the mean, \(\textbf{F}\) signifies the difficulty score computed by the router, \(\textbf{T}\) is the threshold, \(\textbf{G}\) stands for the global vector, and \(\mathbf {S_{max}}\), \(\mathbf {S_{mid}}\), \(\mathbf {S_{min}}\) represent the three feature maps output by the backbone network. \(\textrm{B}(\cdot )\) denotes the bridge module, and \(\textbf{y}\) represents the final prediction result. By adjusting \(\textbf{T}\), DyHiT can achieve a wide range of speed-accuracy trade-offs.
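
The routing logic of Eq. (4) can be sketched as the early-exit control flow below. The router's hidden width and the placeholder heads and stages are assumptions; only the structure (a three-linear-layer router followed by a threshold test) mirrors DyHiT.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Three linear layers producing a per-token reliability score in [0, 1]."""

    def __init__(self, dim, hidden=64):           # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s_max):                      # s_max: (B, N, C) search tokens
        return self.net(s_max).squeeze(-1)         # (B, N) score map

def dyhit_forward(s_max, router, head1, head2, deeper_stages, T=0.5):
    """Eq. (4): exit early through Route1 when the router score F exceeds T,
    otherwise continue through stages 2-3 and the Bridge Module (Route2)."""
    f = router(s_max).mean()                       # scene difficulty score F
    g = s_max.mean(dim=1)                          # global vector from stage-1 tokens
    if f > T:
        return head1(s_max, g)                     # simple scene: Route1
    s_mid, s_min, g = deeper_stages(s_max)         # complex scene: Route2
    return head2(s_max, s_mid, s_min, g)

# Toy usage with placeholder heads/stages (shapes are illustrative).
s_max = torch.randn(1, 256, 128)
out = dyhit_forward(s_max, Router(dim=128),
                    head1=lambda s, g: ("route1", s.shape),
                    head2=lambda s, m, n, g: ("route2", s.shape),
                    deeper_stages=lambda s: (None, None, s.mean(dim=1)))
print(out)
```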

3.5 Training-Free Acceleration Method

Based on the efficient dynamic routing architecture of DyHiT, we develop a training-free acceleration method to speed up high-performance trackers. As shown in Fig. 8a, existing high-performance trackers typically adopt a static structure, offering high accuracy but low speed. These trackers handle all tracking scenarios with a single model, which leads to inefficiencies since simple scenarios can often be accurately predicted using a lightweight model with minimal computational cost. This results in wasted computational resources when processing simple tracking scenarios. To address this issue, we propose a plug-and-play module, which can be flexibly integrated into existing high-performance base trackers without an additional training process. By combining the plug-and-play module with different high-performance trackers, we can obtain various DyTrackers. As shown in Fig. 8b, the plug-and-play module consists of an efficient tracker and a router. The efficient tracker utilizes the fastest route from DyHiT (Route1 in DyHiT, denoted as DyHiT-Route1), while the router design is identical to the one in DyHiT, ensuring efficient and consistent decision-making across different scenarios. First, we input the image pairs into DyHiT-Route1 for feature extraction. Subsequently, the extracted features are fed into the router for assessment. Similar to DyHiT, the router outputs a prediction score, and based on this score, we determine the difficulty level of the current scene for DyHiT-Route1. If it is deemed easy, we terminate the inference and directly use DyHiT-Route1 for prediction. Conversely, if it is considered challenging, we input the image pairs into the high-performance tracker for a re-prediction. This design enables a divide-and-conquer strategy: DyHiT-Route1 performs fast predictions in simple scenarios, while the high-performance tracker is activated for precise predictions in complex scenarios. This approach eliminates the computational resource waste typically associated with handling simple scenarios using a large model, while maintaining high accuracy in complex scenarios. As a result, DyTracker enables the acceleration of high-performance trackers without sacrificing their accuracy.
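
At the interface level, the plug-and-play module amounts to the wrapper sketched below. The method names (extract, predict, track) and the default threshold are hypothetical placeholders for illustration, not the actual implementation.

```python
class DyTracker:
    """Training-free acceleration: run the fast route first and fall back to the
    high-performance base tracker only when the router deems the scene hard."""

    def __init__(self, fast_route, router, base_tracker, threshold=0.6):
        self.fast_route = fast_route        # DyHiT-Route1: features + light head
        self.router = router                # same three-linear-layer router as in DyHiT
        self.base_tracker = base_tracker    # e.g. OSTrack-256, SeqTrack-B256 (unchanged)
        self.threshold = threshold          # scene-splitting threshold (assumed value)

    def track(self, template, search):
        feats = self.fast_route.extract(template, search)      # hypothetical interface
        score = self.router(feats).mean().item()
        if score > self.threshold:                              # easy scene: keep fast result
            return self.fast_route.predict(feats)
        return self.base_tracker.track(template, search)        # hard scene: re-predict
```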

Fig. 8

Comparison of tracking frameworks. By combining the plug-and-play module with different trackers, we can obtain various DyTrackers.

3.6 Training Objective

HiT. For HiT, we combine the \(\ell _1\) loss and the generalized IoU (GIoU) loss Rezatofighi et al. (2019) as the training objective. The loss function is formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}=\lambda _\text {G}\mathcal {L}_\text {GIoU}(b_i,{\hat{b}}_i)+\lambda _\text {l}\mathcal {L}_\text {l}(b_i,\hat{b}_i), \end{aligned} \end{aligned}$$
(5)

where \(b_i\) represents the ground-truth box, and \({\hat{b}}_i\) represents the predicted box. \(\lambda _{G}\) and \(\lambda _{l}\) are weights; in our experiments, we set \(\lambda _{G}=2\) and \(\lambda _{l}=5\).
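
A minimal sketch of Eq. (5), assuming boxes in normalized (x1, y1, x2, y2) format and using torchvision's generalized IoU utility; the exact reduction over box coordinates is an assumption.

```python
import torch
from torchvision.ops import generalized_box_iou

def hit_loss(pred_boxes, gt_boxes, lambda_g=2.0, lambda_l=5.0):
    """Eq. (5): weighted GIoU + l1 loss over (x1, y1, x2, y2) boxes in [0, 1]."""
    giou = torch.diagonal(generalized_box_iou(pred_boxes, gt_boxes))  # per-pair GIoU
    loss_giou = (1.0 - giou).mean()
    loss_l1 = (pred_boxes - gt_boxes).abs().sum(dim=-1).mean()
    return lambda_g * loss_giou + lambda_l * loss_l1

pred = torch.tensor([[0.10, 0.10, 0.60, 0.60]])
gt = torch.tensor([[0.15, 0.12, 0.65, 0.58]])
print(hit_loss(pred, gt))
```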

DyHiT. The training process of DyHiT consists of two stages. The objective of the first stage aligns with that of HiT, as illustrated in Equation 5. In the second stage, our aim is to enhance the router’s ability to classify scenes. The feature map \(\textbf{F} \in {\mathbb {R}}^{{H_{x}} \times W_{x}\times C_{x}}\) is input into the router, producing the score map \(\textbf{S} \in {\mathbb {R}}^{{H_{x}} \times W_{x}}\) as output. We divide the score map \(\textbf{S}\) into positive and negative samples. Specifically, based on the ground truth bounding box, the score points located inside the ground truth bounding box in the score map \(\textbf{S}\) are classified as positive samples, while the other score points are classified as negative samples. All samples contribute to the overall loss. The loss in the second stage comprises three components: \(\ell _1\) loss, GIoU loss, and MSE loss. The overall loss can be summarized as:

$$\begin{aligned} \begin{gathered} \mathcal {L} = \lambda _\text {G}\mathcal {L}_\text {GIoU}(b_i,{\hat{b}}_i)+\lambda _\text {l}\mathcal {L}_\text {l}(b_i,\hat{b}_i)+\lambda _\text {R}\mathcal {L}_\text {MSE}(y_i,{\hat{y}}_i),\\ \mathcal {L}_\text {MSE}(y_i,{\hat{y}}_i) = \frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2, \end{gathered} \end{aligned}$$
(6)

where \(b_i\) represents the ground-truth box, and \({\hat{b}}_i\) represents the predicted box. \(y_i\) denotes the ground-truth label of the \(i\)-th sample: when a sample belongs to the positive class, \(y_i\) equals the intersection over union (IoU) between the corresponding \(b_i\) and \({\hat{b}}_i\); when a sample is a negative example, \(y_i = 0\). \({\hat{y}}_i\) represents the score predicted by the router for the \(i\)-th sample. \(\lambda _{G}\), \(\lambda _{l}\) and \(\lambda _{R}\) are weights; in our experiments, we set \(\lambda _{G}=\lambda _{l}=1\) and \(\lambda _{R}=5\).
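
The construction of the labels \(y_i\) for the MSE term can be sketched as follows; the normalized box format and the grid-centre convention are assumptions for illustration.

```python
import torch
from torchvision.ops import box_iou

def router_targets(score_map_hw, gt_box, pred_box):
    """Build y_i for the MSE term in Eq. (6): grid points inside the ground-truth box
    are positives labelled with IoU(pred, gt); all other points are negatives (0).
    Boxes are (x1, y1, x2, y2) in normalized [0, 1] coordinates (an assumption)."""
    H, W = score_map_hw
    ys = (torch.arange(H).float() + 0.5) / H              # grid-point centres
    xs = (torch.arange(W).float() + 0.5) / W
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    x1, y1, x2, y2 = gt_box
    inside = (xx >= x1) & (xx <= x2) & (yy >= y1) & (yy <= y2)
    iou = box_iou(pred_box.unsqueeze(0), gt_box.unsqueeze(0)).item()
    return torch.where(inside, torch.full_like(xx, iou), torch.zeros_like(xx))

targets = router_targets((16, 16),
                         gt_box=torch.tensor([0.30, 0.30, 0.70, 0.70]),
                         pred_box=torch.tensor([0.28, 0.32, 0.72, 0.68]))
print(targets.shape, targets.max())
# The router's predicted score map is then regressed to these targets with MSE.
```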

4 Experiments

4.1 Implementation Details

Model. We develop three HiT variants with different lightweight transformers, as shown in Table 1. We adopt LeViT-384 Graham et al. (2021), LeViT-128, and LeViT-128S for HiT-Base, HiT-Small, and HiT-Tiny, respectively. Table 1 also reports model parameters, MACs, and inference speed on multiple devices. All models are implemented with Python 3.8 and PyTorch 1.11.0.

Table 1 Details of our HiT model variants.

Training for HiT. The training datasets of HiT encompass the train-splits of popular datasets including TrackingNet Muller et al. (2018), GOT-10k Huang et al. (2021), LaSOT Fan et al. (2019), and COCO2017 Lin et al. (2014). The network processes image pairs, each comprising a template image and a search image. In the case of video datasets, we randomly select image pairs from video sequences. For the COCO image dataset, a random image is sampled, and then data augmentations, such as scaling, translation, and jittering, are applied to create an image pair. To define the search region and template, we expand the target box by factors of 4 and 2, respectively. Subsequently, the search and template images are resized to \(256 \times 256\) and \(128 \times 128\), respectively. The transformer backbone is initialized with the pretrained LeViT Graham et al. (2021) from ImageNet Russakovsky et al. (2015). The remaining model parameters are initialized randomly. We employ the AdamW optimizer Loshchilov and Hutter (2019) with a weight decay of 1e-4 and an initial learning rate of 5e-4 for training. Our model is trained for 1500 epochs on 4 NVidia RTX 3090 GPUs, with a batch size of 128 and each epoch comprising 60,000 sampling pairs. Notably, the learning rate undergoes a \(10\times \) reduction at epoch 1200.

Training for DyHiT. The training process for DyHiT consists of two distinct stages. Initially, we focus on training Route1 of DyHiT, and subsequently, we train the router responsible for evaluating scene difficulty. The dataset and preprocessing strategy used for training DyHiT are the same as those employed for HiT. During Route1 training, we initialize the backbone and Head2 of DyHiT with parameters from the pre-trained HiT model, while Head1 is initialized randomly. To prevent interference with Route2 predictions, we freeze the backbone network and parameters of Head2, and only update Head1. This ensures that Head1 makes accurate predictions based on the features obtained in the first stage. For this stage, we use the AdamW optimizer with a weight decay of 1e-4 and an initial learning rate of 1e-4. Training runs for 90 epochs on a single NVIDIA A100 GPU with a batch size of 128. Each epoch consists of 60,000 sample pairs, and the learning rate is reduced by a factor of 10 at epoch 70. At the end of this stage, accurate predictions from both Route1 and Route2 are obtained. In the second stage, we freeze other parameters and train the router. This stage lasts 60 epochs, using a single NVIDIA A100 GPU, with a batch size of 128 and 60,000 sample pairs per epoch. As in the first stage, the learning rate is reduced by a factor of 10 at epoch 48, while the other parameters remain the same as in the initial stage. For DyTracker, no additional training is required. We simply combine Route1 and the router from DyHiT with existing high-performance trackers to construct DyTrackers for inference.

Table 2 State-of-the-art comparison on TrackingNet Muller et al. (2018), LaSOT Fan et al. (2019), and GOT-10k Huang et al. (2021) benchmarks. The best real-time results are shown in bold, and the best non-real-time results are underlined. We use \(^*\) to indicate that the results of the corresponding models on GOT-10k are obtained by training only with the GOT-10k training set, while others are trained with additional datasets. The reported speed refers to the inference speed of the tracker, excluding data pre-processing.
Table 3 Comparison with state-of-the-art methods on additional benchmarks in AUC score.

Inference. During inference, we begin by initializing the template in the first frame of a video sequence. For each subsequent frame, the search region is cropped based on the bounding box of the target from the previous frame. HiT operates as an end-to-end framework, where both the template and search images are input into the tracker, and the output of the model provides the final result. No additional post-processing techniques, such as window penalty or scale penalty Li et al. (2018), are employed. For DyHiT, before determining the prediction path, the model first employs a router to assess the difficulty level of each frame. After the first stage, features related to the search region are input into the router, which outputs 256 scores ranging from 0 to 1. We set a foreground-background segmentation threshold, with a default value of 0.6. Scores above the threshold are selected, and the average score is calculated to derive the output of the router. This score classifies the scene as either easy or difficult. If classified as easy, inference by the backbone network is terminated, and the prediction from Route1 is used; otherwise, the prediction from Route2 is used. DyTracker follows a similar approach. Initially, features from the efficient tracker are input into the router to obtain a difficulty score, which determines the complexity of the scene. For easy scenes, the efficient tracker handles predictions; for more complex scenes, the high-performance tracker is activated.
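
The router-score aggregation described above can be sketched as below; the fallback used when no score clears the foreground threshold is an assumption, since the paper does not specify this corner case.

```python
import torch

def scene_difficulty(scores, fg_threshold=0.6):
    """Aggregate the 256 router scores: average only the scores above the
    foreground-background threshold; fall back to the overall mean when no
    score clears it (the fallback is an assumption)."""
    fg = scores[scores > fg_threshold]
    return fg.mean() if fg.numel() > 0 else scores.mean()

scores = torch.rand(256)                 # per-token scores in [0, 1] from the router
T = 0.6                                  # scene-splitting threshold (adjustable)
f = scene_difficulty(scores)
print("easy scene (Route1)" if f > T else "hard scene (Route2)", float(f))
```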

4.2 State-of-the-Art Comparisons

Based on the speed on the Nvidia Jetson AGX Xavier edge device, we categorize trackers into real-time and non-real-time trackers. Consistent with the VOT real-time setting Kristan et al. (2020), we define the real-time threshold at 20 fps. The evaluation compares HiT and DyHiT with state-of-the-art trackers across both real-time and non-real-time categories, using seven tracking benchmarks. The speed assessments are conducted on three platforms: Nvidia GeForce RTX 2080 Ti GPU, Intel Core i9-9900K @ 3.60GHz CPU, and the Nvidia Jetson AGX Xavier edge device. Results are presented in Tables 2 and 3. The results of DyHiT reported in Table 2 and Table 3 are based solely on Route1, without employing dynamic routing. For the performance with dynamic routing enabled, please refer to the results presented in Fig. 11.

Speed. Table 2 presents the speeds of various trackers. On the GPU, HiT-Base, HiT-Small, HiT-Tiny, and DyHiT operate at 175 fps, 192 fps, 204 fps, and 299 fps, respectively, showcasing speeds 1.66\(\times \), 1.82\(\times \), 1.94\(\times \), and 2.85\(\times \) faster than FEAR Borsuk et al. (2022). On the AGX edge device, HiT-Base, HiT-Small, HiT-Tiny, and DyHiT achieve speeds of 61 fps, 68 fps, 77 fps, and 111 fps, respectively, surpassing FEAR by 1.61\(\times \), 1.79\(\times \), 2.03\(\times \), and 2.92\(\times \). On the CPU, HiT-Base, HiT-Small, HiT-Tiny, and DyHiT reach speeds of 33 fps, 72 fps, 76 fps, and 63 fps. While only HiT-Base is slightly slower than FEAR, it still maintains a real-time speed. In addition, we also evaluate the speed of HiT on the NVIDIA Jetson Xavier NX. As shown in Table 1, HiT-Base, HiT-Small, and HiT-Tiny achieve speeds of 32 fps, 34 fps, and 39 fps, respectively. Overall, HiT and DyHiT exhibit impressive speeds across multiple devices, suggesting their suitability for various tracking applications.

Fig. 9

AUC scores of different attributes on LaSOT. The numbers below each attribute represent the maximum and minimum values of that attribute among the evaluated trackers.

TrackingNet. TrackingNet Muller et al. (2018) is a comprehensive dataset, encompassing diverse situations in natural scenes and featuring multiple categories. Its test set comprises 511 video sequences. As outlined in Table 2, both HiT and DyHiT exhibit competitive performance when compared with previous real-time trackers. Notably, HiT-Base and DyHiT achieve the top two AUC scores of 80.0% and 77.9%, surpassing the leading real-time tracker HCAT Chen et al. (2022) by 3.4% and 1.3%, respectively. In comparison to the non-real-time tracker STARK-ST50 Yan et al. (2021), HiT-Base demonstrates comparable AUC performance (80.0 vs. 81.3). However, it achieves this with impressive speed gains: being \(3.5 \times \) faster on the GPU, \(4.7 \times \) faster on the CPU, and \(4.7 \times \) faster on the AGX. This emphasizes the efficiency of HiT-Base in delivering competitive tracking results with significantly improved processing speed.

LaSOT. LaSOT Fan et al. (2019) is a large-scale, long-term dataset encompassing 1400 video sequences, with 1120 training videos and 280 test videos. The performance results on LaSOT are detailed in Table 2. HiT-Base stands out by achieving top-tier real-time results with AUC, \(\hbox {P}_{Norm}\), and P scores of 64.6%, 73.3%, and 68.1%, respectively. Additionally, DyHiT secures the second-best AUC score. In comparison with the recent efficient tracker MixformerV2-S Cui et al. (2024), HiT-Base and DyHiT outperform it by 4.0% and 1.8%, respectively. In comparison with the non-real-time tracker TransT Chen et al. (2021), HiT-Base exhibits slightly lower performance by 0.3% in AUC but compensates with significantly faster processing speed. Fig. 9 shows the attribute-wise analysis of HiT, the current state-of-the-art high-speed trackers, and some non-real-time trackers. Our HiT surpasses the current state-of-the-art high-speed trackers and some non-real-time trackers (PrDiMP Danelljan et al. (2020), DiMP Bhat et al. (2019)) in various attributes. Notably, HiT excels in effectively managing fast motion and viewpoint changes, showcasing superior performance in these aspects.

GOT-10k. GOT-10k Huang et al. (2021) is a large-scale and challenging dataset, comprising 10,000 training sequences and 180 test sequences. The tracking results, detailed in Table 2, showcase HiT-Base achieving the second-best real-time performance with an AO score of 64.0%. Simultaneously, DyHiT secures the third-best AO score, reaching 62.9%. Notably, HiT-Base outperforms the recent efficient tracker FEAR Borsuk et al. (2022) and MixformerV2-S Cui et al. (2024) by a significant margin of 2.1% and 1.9%, respectively.

NFS. NFS Kiani Galoogahi et al. (2017) is a challenging dataset renowned for its fast-moving objects, comprising 100 video sequences. As depicted in Table 3, HiT-Base and HiT-Small attain the top and third positions, respectively, in terms of real-time performance. Notably, HiT-Base outperforms FEAR by a margin of 2.2%.

UAV123. UAV123 Mueller et al. (2016) is specifically designed for low-altitude UAVs, comprising 123 video clips. As shown in Table 3, HiT-Base and DyHiT attain the second and third positions among real-time trackers, with AUC scores of 65.6% and 64.9%, respectively.

LaSOT\(_{ext}\). \(\hbox {LaSOT}_{ext}\) Fan et al. (2021) is a recently introduced tracking dataset comprising 150 videos, serving as an extension to LaSOT. The performance of HiT and DyHiT on \(\hbox {LaSOT}_{ext}\) is detailed in Table 3. HiT-Base, HiT-Small, and DyHiT exhibit competitive results with AUC scores of 44.1%, 40.4%, and 42.1%, respectively. In comparison to the non-real-time tracker TransT Chen et al. (2021), HiT-Base demonstrates only a 0.3% decrease in performance while running 4.7\(\times \) faster on the AGX.

TNL2K. TNL2K Wang et al. (2021) is a comprehensive benchmark for tracking, featuring 1300 training sequences and 700 test sequences. As shown in Table 3, HiT-Base and DyHiT secure the top two positions for real-time performance, surpassing the leading real-time tracker MixFormerV2-S Cui et al. (2024) by 2.3% and 0.8%, respectively. Compared to the non-real-time tracker TransT Chen et al. (2021), HiT-Base demonstrates nearly identical AUC performance (50.6% vs. 50.7%). However, HiT-Base achieves this with significant speed advantages: it is \(2.8 \times \) faster on GPU, \(6.6 \times \) faster on CPU, and \(4.7 \times \) faster on AGX.

Table 4 Real-time experiment on VOT2021. The evaluation is conducted using box evaluation. For real-time trackers, the results are obtained on the Nvidia Jetson AGX, while for non-real-time trackers, the evaluation is performed on a 2080 Ti GPU.
Table 5 Speed comparison under low-power mode (10W) on Nvidia Jetson AGX.

VOT. The VOT competition is regarded as one of the most challenging competitions in visual tracking. We also conduct VOT real-time experiments on the VOT2021 benchmark Kristan et al. (2021). The results are shown in Table 4. In DyHiT, the subscript indicates the scene-splitting threshold: when the predicted score is higher than this threshold, the scene is classified as easy; otherwise, it is considered challenging. When the subscript is set to 1, only Route2 (the full pipeline) is used; when set to 0, only Route1 is used. As shown, \(\hbox {DyHiT}_{0.75}\) achieves the best real-time performance with an EAO score of 0.253. HiT-Base, \(\hbox {DyHiT}_{1}\), and \(\hbox {DyHiT}_{0.65}\) achieve the second-best real-time performance with an EAO score of 0.252. \(\hbox {DyHiT}_{0}\) achieves the third-best result with an EAO score of 0.250.

Analysis of VOT Real-Time Experiments. In VOT, the real-time threshold is defined as 20 fps. When a model fails to reach this threshold, its performance becomes significantly constrained by speed. However, once the model’s speed exceeds 20 fps, its tracking performance is determined solely by its inherent capabilities; faster speed does not necessarily translate to better performance. The results from the VOT real-time experiments reveal a notable performance gap between real-time and non-real-time trackers. This disparity arises from the experimental setup: to simulate real-world deployment scenarios, real-time trackers were evaluated on the Nvidia Jetson AGX platform. Due to their large computational requirements and slower inference speeds, non-real-time trackers cannot run effectively on AGX. As a result, we conducted their evaluations on a desktop-class Nvidia 2080Ti GPU. Since most non-real-time trackers achieve frame rates exceeding 20 fps on the 2080Ti, their performance is not limited by speed. Moreover, their larger model sizes and higher parameter counts generally lead to better performance. However, in practical applications, tracking algorithms are often deployed on low-power, resource-constrained devices where non-real-time trackers are not feasible. In contrast, real-time trackers can be effectively deployed in such environments. Additionally, in real-world perception systems, tracking is typically integrated with other tasks such as object detection and segmentation. This means the computing resources allocated to the tracker are further limited. Therefore, achieving real-time performance under low computational budgets becomes a critical advantage. Our proposed DyHiT and HiT models are specifically designed for such scenarios. To further validate their practical utility, we conducted additional experiments under a more constrained setting by configuring the Jetson AGX to low-power mode (10W power limit). As shown in Table 5, only DyHiT and HiT-Small were able to maintain frame rates above the 20 fps threshold in this setting. This result underscores the high practical value of our models for real-time visual tracking in low-power, real-world deployments.

Table 6 Evaluation of DyTrackers on TrackingNet Muller et al. (2018), LaSOT Fan et al. (2019), and GOT-10k Huang et al. (2021) benchmarks. \(\Delta _A\) denotes the performance (AUC or AO) change (averaged over benchmarks) compared with the baseline. \(\Delta _S\) denotes the speed change compared with the baseline. The results on GOT-10k are obtained from models trained with additional datasets. We test on the Nvidia GeForce RTX 2080 Ti GPU using a single thread on the GOT-10k test set and then submit the test results to the GOT-10k evaluation server Huang et al. (2021) to obtain the speed.

4.3 Evaluation of DyTrackers

Our acceleration method exhibits excellent generalization ability, allowing it to seamlessly integrate with various high-performance trackers to create different DyTrackers. We apply our acceleration scheme to various high-performance trackers, including OSTrack-256 Ye et al. (2022), SeqTrack-B256 Chen et al. (2023), MixFormer-22k Cui et al. (2022), Sim-B/16 Chen et al. (2022), STARK-S50 Yan et al. (2021), STARK-ST101 Yan et al. (2021), and TransT Chen et al. (2021), resulting in a series of DyTrackers, namely DyOSTrack-256, DySeqTrack-B256, DyMixFormer-22k, DySim-B/16, DySTARK-S50, DySTARK-ST101, and DyTransT. We conduct a comprehensive comparison of DyTrackers with their base trackers across seven datasets, evaluating both speed and accuracy. The evaluation results are presented in Tables 6 and 7.

Table 7 Evaluation of DyTrackers on additional benchmarks in AUC score. \(\Delta _A\) denotes the performance (AUC) change (averaged over benchmarks) compared to the corresponding base tracker.

It can be observed that after applying our acceleration method, there is a significant improvement in the speed of these high-performance trackers. For instance, the well-known one-stream tracker OSTrack-256 achieves a speed of 110 fps, which is \(1.57 \times \) faster than before. Moreover, there is an improvement in accuracy, reaching 69.5% AUC on LaSOT, which is 0.4% higher than before, 65.8% AUC on NFS, an increase of 1.1%, 70.8% AUC on UAV123, a gain of 2.5%, and 55.5% AUC on TNL2K, a rise of 1.2%. In summary, these mainstream high-performance trackers, when accelerated using our method, experience significant speed improvements with almost no loss in accuracy. This demonstrates the effectiveness and generalization ability of our acceleration method.

4.4 Ablation and Analysis

In this section, we present a series of ablation experiments to thoroughly analyze our approach. These experiments primarily include: ablation analysis of the Bridge Module, ablation analysis of the dual-image position encoding, a study on the generalization of the HiT framework, an investigation of the speed-accuracy trade-off in DyHiT, ablation analysis of the efficient dynamic routing structure in DyHiT, and a discussion on the training methods of DyHiT. It is essential to highlight that, for the ablation study concerning HiT, we employ HiT-Base as the baseline model. All HiT models in the ablation experiments are trained for 500 epochs.

Different combinations of features. To verify the effectiveness of the Bridge Module and explore the importance of different features, we compare various feature combinations within the Bridge Module. Table 8 presents the results, where Max, Mid, and Min denote the features of the first, second, and third stages of the transformer, respectively. To ensure a fair comparison, the features are upsampled to the same resolution. The first row (#1) represents our default setting. Initially, without utilizing the Bridge Module, we make predictions based on independent Max, Mid, and Min features. Table 8 (#2, #3, and #4) indicates that these methods result in inferior performance, underscoring the effectiveness of feature fusion achieved by our Bridge Module. Subsequently, Table 8 (#5, #6, and #7) presents the results of alternative combinations, and our default method stands out as the most effective. In our default approach, incorporating all three features provides a richer blend of semantic and detailed information, contributing to superior results. To gain a deeper understanding of the Bridge Module, we visualize the attention map in the corner head for different feature combination methods, as depicted in Fig. 10. In the visualization results, we make two key observations: (1) Collapse Phenomenon: Methods that exclude the Max feature exhibit a collapse phenomenon. Taking the Mid method as an example, the final feature originates from the second stage of the transformer and is up-sampled by a factor of 2. Consequently, one pixel on the feature map is up-sampled to four pixel points. In the visualization result, we observe that the attention collapses to a relatively fixed distribution for every four upsampling grids. The Min column and the Mid-Min column show similar behavior to the Mid column. This highlights that even if the deep feature is up-sampled to a larger resolution, it does not inherently provide more detailed information. Therefore, the involvement of shallow, large-resolution features becomes crucial for supplementing information. (2) Improved Accuracy with Min Feature: The attention map of our default method, which includes all three features, demonstrates higher accuracy compared to methods that exclude the Min feature. This finding underscores that leveraging deep features to complement semantic information significantly enhances the discriminative ability of the model.

Table 8 Ablation study on different feature combination methods and the use of Shrink Attention in terms of AUC performance. The default setting is shown in gray. The best results are highlighted in red. Max, Mid, and Min denote the features from the first, second, and third stages of the transformer, respectively. SA denotes Shrink Attention.
Fig. 10
figure 10

Visualization of the attention maps in the corner head for different feature combination strategies. Bridge denotes our default setting; Max-Min combines the Max and Min features; Max-Mid combines the Max and Mid features; Max, Mid, and Min use only the Max, Mid, and Min feature, respectively.

Shrink Attention. In our HiT-Base model, we follow LeViT Graham et al. (2021) and use Shrink Attention to downsample the feature map. Here, we compare different downsampling methods, as shown in Table 8. In the eighth row (#8), we remove Shrink Attention and instead downsample the feature map with a stride-2 convolution. This leads to a significant performance drop: decreases of 5.6%, 2.8%, and 6.2% on LaSOT, TrackingNet, and GOT-10k, respectively.
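For reference, the ablation baseline in row #8 can be sketched as a plain strided convolution in place of the attention-based downsampling; the channel sizes and spatial resolution below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the Table 8 (#8) baseline: downsampling with a stride-2
# convolution instead of Shrink Attention (channel sizes are illustrative).
downsample = nn.Conv2d(in_channels=256, out_channels=384,
                       kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 256, 16, 16)   # toy stage-1 feature map
print(downsample(x).shape)        # torch.Size([1, 384, 8, 8])
```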

Table 9 Comparison of different Position Encoding (PE) in AUC score. DI denotes our dual-image PE. Abs denotes the absolute PE. Sep denotes the relative PE which encodes the template and search region separately. Ver and Hor denote the joint encoding of the template and search images in a vertical and horizontal arrangement, respectively.
Table 10 HiT with different lightweight hierarchical vision transformers.

Different Position Encoding. In previous transformer-based trackers Chen et al. (2021); Yan et al. (2021), the position information of the search image and the template image is encoded separately. In our dual-image position encoding, we assign a unique position to each image and jointly encode their position information. Here, we compare our method with four alternative encoding methods; the results are reported in Table 9. First, we compare our method with absolute position encoding (denoted as Abs) and with relative position encoding that encodes the search and template images separately (denoted as Sep). Table 9 (#1, #2, and #3) shows that these methods perform worse than our dual-image position encoding. Separate encoding fails to model the positional relationship between the search and template images and introduces overlapping positions, leading to inferior performance. Second, within our dual-image position encoding, we explore different arrangements of the template and search region. By default, we arrange the template and the search region diagonally, as shown in Fig. 5b. We compare this with two other arrangements: vertical (denoted as Ver) and horizontal (denoted as Hor). Table 9 (#1, #4, and #5) shows that the default diagonal arrangement performs best. In the vertical and horizontal arrangements, the horizontal or vertical positions of the template and the search region overlap, causing information loss, whereas the diagonal arrangement assigns unique horizontal and vertical positions to both, providing more informative encoding. We therefore adopt the diagonal arrangement.
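As a concrete illustration of the diagonal arrangement, the sketch below assigns 2-D coordinates to the search and template tokens so that their row and column indices never overlap, and then builds a standard relative-position index from pairwise coordinate differences. The grid sizes and the bias-table indexing scheme are illustrative assumptions rather than the exact HiT implementation.

```python
import torch

def diagonal_coordinates(h_s: int, w_s: int, h_t: int, w_t: int) -> torch.Tensor:
    """Assign 2-D coordinates with a diagonal arrangement: the template grid
    is offset in BOTH axes, so no template position overlaps a search position
    horizontally or vertically. Returns a (N, 2) tensor of (row, col)
    coordinates, search tokens first, then template tokens."""
    ys, xs = torch.meshgrid(torch.arange(h_s), torch.arange(w_s), indexing="ij")
    search = torch.stack([ys.flatten(), xs.flatten()], dim=1)

    yt, xt = torch.meshgrid(torch.arange(h_t), torch.arange(w_t), indexing="ij")
    template = torch.stack([yt.flatten() + h_s, xt.flatten() + w_s], dim=1)

    return torch.cat([search, template], dim=0)

def relative_index(coords: torch.Tensor, h_total: int, w_total: int) -> torch.Tensor:
    """Map pairwise coordinate differences to a single bias-table index,
    in the style of common 2-D relative position encodings."""
    diff = coords[:, None, :] - coords[None, :, :]           # (N, N, 2)
    diff[..., 0] += h_total - 1                               # shift to >= 0
    diff[..., 1] += w_total - 1
    return diff[..., 0] * (2 * w_total - 1) + diff[..., 1]    # (N, N)

coords = diagonal_coordinates(h_s=16, w_s=16, h_t=8, w_t=8)
idx = relative_index(coords, h_total=16 + 8, w_total=16 + 8)
print(coords.shape, idx.shape)  # torch.Size([320, 2]) torch.Size([320, 320])
```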

Different Backbones. To assess the generalization of our HiT framework, we instantiate our architecture with another hierarchical vision transformer, PVT Wang et al. (2021). The results are presented in Table 10. We use PVT-Small Wang et al. (2021) as the backbone and keep the other components consistent with HiT-Base. As shown in Table 10, HiT with PVT-Small achieves 63.9% AUC on LaSOT, 78.4% AUC on TrackingNet, and 64.8% AO on GOT-10k, while maintaining real-time speed on all three platforms. This result is competitive with our base model using LeViT-384 and with other efficient trackers, demonstrating the strong generalization ability of our framework.

Different Thresholds. We use the scene complexity threshold \(\textbf{T}\) to classify tracking scenarios. When the router predicts a score \(\textbf{F}\) greater than \(\textbf{T}\), the current scene is considered simple and DyHiT uses Route1 for prediction; otherwise, Route2 is employed. The value of \(\textbf{T}\) therefore determines the choice of route and thus influences the speed and performance of DyHiT. As shown in Fig. 11, adjusting \(\textbf{T}\) allows DyHiT to cover a wide range of speed-accuracy trade-offs. When \(\textbf{T}=0\), only Route1 is used, yielding a speed of 231 fps, \(1.94 \times \) faster than HiT-Base, with an average performance of 62.6% on LaSOT and GOT-10k. When \(\textbf{T}=1\), only Route2 is used, which is equivalent to HiT-Base. As \(\textbf{T}\) increases, performance gradually improves while speed decreases. When \(\textbf{T}=0.77\), DyHiT reaches its best performance of 64.4%, slightly exceeding HiT-Base by 0.1%, while still running at 130 fps. The value of \(\textbf{T}\) can thus be adjusted to the requirements of a specific use case, providing a tailored speed-accuracy balance.
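The routing rule itself is simple; the sketch below shows the thresholded decision in isolation. The two route callables are placeholders, and the assumption that the router score lies in (0, 1) is ours; under that assumption, thresholds of 0 and 1 recover the two extremes described above.

```python
# Minimal sketch of DyHiT-style route selection. `router_score` is the
# scene-complexity score F for the current frame and `threshold` is T;
# `route1_fast` and `route2_full` are placeholder callables for the two routes.

def track_frame(features, router_score: float, threshold: float,
                route1_fast, route2_full):
    """Use the cheap early-exit route for easy scenes, the full pipeline otherwise."""
    if router_score > threshold:
        return route1_fast(features)   # easy scene: fast Route1
    return route2_full(features)       # hard scene: full Route2

# With scores in (0, 1), threshold = 0 always selects Route1 and
# threshold = 1 always selects Route2, matching the two extremes above.
```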

Fig. 11
figure 11

Impact of the scene complexity threshold on the speed and performance of DyHiT. Performance is measured as the average performance on LaSOT and GOT-10k, while speed is evaluated using a single-threaded test on the GOT-10K test set with an Nvidia GeForce RTX 2080 Ti GPU.

Fig. 12
figure 12

Comparison of different routers on GOT-10k in terms of speed (vertical axis) on Nvidia GeForce RTX 2080 Ti GPU and success rate (AO).

Different Routers. To keep the model efficient, as little time as possible should be spent deciding which route to take, so the router must be concise yet effective. As shown in Fig. 12, we explore five routers: our default setting (DyHiT); Self-att, which replaces the linear layer with a self-attention module; Random, which chooses the exit randomly; One-token, which uses a single token instead of the 256 search-region tokens of our default setting; and Only-pos, which trains the router using only positive samples inside the ground-truth bounding box to compute the loss. By varying the scene classification threshold (from 0.6 to 0.8) during inference, we obtain a series of trackers with different speed-accuracy trade-offs. DyHiT achieves the best speed-accuracy trade-off: at the same success rate, it is faster than the other methods. For example, at a 63.2% success rate, DyHiT reaches 212 fps, which is 29 fps faster than Random (183 fps), 23 fps faster than One-token (189 fps), and 44 fps faster than Self-att (168 fps). This demonstrates the simplicity and effectiveness of our default router. Fig. 13 visualizes the scores output by the default router, ordered from simple to challenging scenes, and Fig. 14 shows the router scores over a continuous video sequence, demonstrating the stability of our routing mechanism.
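To make the default design more tangible, the sketch below implements one plausible reading of a linear router over the 256 search-region tokens: a single linear head scores each token and the per-token confidences are aggregated into one scene score. The sigmoid and the mean aggregation are assumptions for illustration, not the exact DyHiT router.

```python
import torch
import torch.nn as nn

class LinearRouter(nn.Module):
    """Sketch of a lightweight linear router over the search-region tokens.

    A single linear head maps each of the 256 search-region tokens to a
    confidence, and the confidences are aggregated into one scene score F.
    The mean aggregation and the sigmoid are illustrative assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, search_tokens: torch.Tensor) -> torch.Tensor:
        # search_tokens: (B, 256, dim) -> per-token confidences (B, 256)
        token_scores = torch.sigmoid(self.head(search_tokens)).squeeze(-1)
        return token_scores.mean(dim=1)  # scene score F in (0, 1)

router = LinearRouter(dim=384)
score = router(torch.randn(2, 256, 384))
print(score.shape)  # torch.Size([2])
```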

Fig. 13
figure 13

Visualization of the Router Score. Zoom in for better visibility.

Different training settings. We explore different training approaches for DyHiT, as shown in Table 11. #1 is our default approach described in Section 4.1, where DyHiT undergoes two-stage training with a frozen backbone. #2 trains in two stages without freezing the backbone, and #3 keeps the backbone frozen but uses a single stage to jointly train Route1 and the router. #4 re-uses the features from the efficient tracker to assist the base tracker's predictions in DyTracker and trains with this setup: when DyTracker predicts in challenging scenes using the base tracker, we blend in the features from the efficient tracker that were originally used to measure scene difficulty, rather than directly using the results from the base tracker as in the default approach. We evaluate these training approaches with three trackers, DyOSTrack, \(\hbox {DyHiT}_{0.6}\), and \(\hbox {DyHiT}_{0.65}\), on the LaSOT dataset. In DyHiT, the subscript denotes the scene-splitting threshold: when the predicted score exceeds this threshold, the scene is classified as easy; otherwise, it is considered challenging. Compared with #2 and #3, our default training approach improves the AUC by an average of 2.5% and 3.7%, respectively, across the three trackers. This shows that the backbone trained in HiT already provides sufficiently accurate target features without additional training; further training may even degrade performance. It also underlines the importance of training Route1 and the router in two stages: training them jointly in a single stage is problematic, because inaccurate Route1 predictions can corrupt router training, weakening the router's discriminative ability and leading to inaccurate assessments of scene difficulty. Compared with #4, where the features from the efficient tracker are re-used and trained, our approach of directly combining the base tracker and DyHiT without additional training is 1.6% higher in AUC. This suggests that the features the efficient tracker extracts in challenging scenes may be inaccurate, introducing noise into the base tracker and degrading performance. We therefore do not re-use the features output by the efficient tracker in complex scenes.
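The default schedule (#1) amounts to freezing different parts of the model in each stage. The sketch below captures only that control flow; the sub-module names (`backbone`, `route1`, `router`), the optimizer, and the learning rate are placeholders and assumptions.

```python
import torch

# Sketch of the default two-stage, frozen-backbone schedule (#1).
# `model.backbone`, `model.route1`, and `model.router` are placeholder names.

def freeze(module: torch.nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

def stage1_optimizer(model) -> torch.optim.Optimizer:
    # Stage 1: keep the pre-trained HiT backbone frozen, train only Route1.
    freeze(model.backbone)
    return torch.optim.AdamW(model.route1.parameters(), lr=1e-4)

def stage2_optimizer(model) -> torch.optim.Optimizer:
    # Stage 2: additionally freeze Route1 and train only the router, so that
    # inaccurate Route1 predictions cannot corrupt the router's supervision.
    freeze(model.backbone)
    freeze(model.route1)
    return torch.optim.AdamW(model.router.parameters(), lr=1e-4)
```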

Fig. 14
figure 14

Visualization of router scores across consecutive frames in a video sequence. The sequence is bottle-14 from the LaSOT dataset. Zoom in for better visibility.

Table 11 Comparison of different training settings on LaSOT (AUC). \(\Delta \) denotes the performance (AUC) change (averaged over three trackers) compared with the baseline. We use gray to denote the baseline setting.
Table 12 Worst-case analysis on DyHiT. The reported speed refers to the inference speed of the tracker, excluding data pre-processing.

Worst-case analysis. The acceleration achieved by DyHiT stems primarily from using a lightweight route that predicts from shallow features in simple scenarios, whereas the full model pipeline is activated in complex or challenging scenarios to ensure robust performance. While effective, this dynamic routing introduces additional computational overhead for scene classification, which can reduce inference speed in the worst case. To quantify this, we conduct worst-case experiments, reported in Table 12. Specifically, \(\hbox {DyHiT}_{1}\), \(\hbox {DyOSTrack}_{1}\), and \(\hbox {DySeqTrack}_{1}\) denote configurations where the full pipeline is always used, simulating the worst case. The results show that our router is highly efficient and adds minimal overhead. For example, \(\hbox {DyHiT}_{1}\) drops by only 4 fps, 1 fps, and 5 fps compared with HiT-Base on the GPU, CPU, and AGX, respectively. Similarly, \(\hbox {DySeqTrack}_{1}\) decreases by only 2 fps and 1 fps relative to SeqTrack on the GPU and AGX, respectively.

Fig. 15
figure 15

Comparison of different scene classification frequencies on GOT-10k with respect to speed (vertical axis, measured on Nvidia GeForce RTX 2080 Ti GPU) and success rate (AO).

Table 13 Statistics of scene occurrence frequency and corresponding speed on GOT-10k and LaSOT datasets. The speed is evaluated using a single-threaded test on the GOT-10K test set with an Nvidia GeForce RTX 2080 Ti GPU.
Table 14 Ablation study on dynamic template. \(\Delta _A\) denotes the performance change (averaged over benchmarks) compared with the baseline. 2T denotes dual templates. The speed is evaluated using a single-threaded test on the GOT-10K test set with an Nvidia GeForce RTX 2080 Ti GPU.

Different scene classification frequencies. To investigate the impact of the scene classification frequency on model performance, we conduct a series of experiments; the results are shown in Fig. 15. DyHiT refers to our default model, where scene classification is performed on every frame. Interval-5, Interval-10, and Interval-20 perform scene classification every 5, 10, and 20 frames, respectively, and First performs classification only once at the beginning of the video. By varying the scene classification threshold (from 0.6 to 0.8) during inference, we obtain a series of trackers with different speed-accuracy trade-offs. As illustrated, DyHiT achieves the best trade-off between speed and accuracy, reaching a maximum speed of 212 fps and its highest accuracy of 64.2% AO while maintaining 140 fps.
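The frequency variants compared above amount to re-querying the router only every N frames and reusing the last decision in between; a minimal sketch of that control flow follows (function and argument names are illustrative, not taken from the paper).

```python
# Sketch of the classification-frequency variants in Fig. 15: the router is
# queried every `interval` frames and the previous decision is reused in
# between (interval = 1 reproduces the default per-frame DyHiT behaviour).

def choose_route(frame_idx: int, interval: int, router_score: float,
                 threshold: float, last_decision: bool) -> bool:
    """Return True to take Route1 (easy scene), False for the full pipeline."""
    if frame_idx % interval == 0:
        return router_score > threshold
    return last_decision
```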

Scene occurrence frequency. To more clearly demonstrate the primary source of DyHiT’s speed improvement, we analyze the occurrence frequency of the two scene types on GOT-10k and LaSOT. The statistics are shown in Table 13. Specifically, Easy-GOT and Hard-GOT represent the frequencies of easy and hard scenes on the GOT-10k dataset, respectively, while Easy-LaSOT and Hard-LaSOT denote the corresponding frequencies on the LaSOT dataset. In DyHiT, the subscript indicates the scene-splitting threshold: when the predicted score is higher than this threshold, the scene is classified as easy; otherwise, it is considered challenging. When the subscript is set to 1, only Route2 (full pipeline) is used; when set to 0, only Route1 is used. As the threshold increases, the proportion of easy scenes gradually decreases, while the proportion of hard scenes increases. At the same time, the model’s speed also gradually decreases. Specifically, \(\hbox {DyHiT}_{0}\) (Route1) achieves the highest speed of 231 fps, while \(\hbox {DyHiT}_{1}\) (Route2) reaches a speed of 119 fps. A scene-splitting threshold of 0.75 is adopted in DyOSTrack, resulting in a proportion of 0.2 for easy scenes and 0.8 for challenging ones.

Dynamic Template. To address challenges such as target deformation and interference from similar objects in video sequences, we introduce a dynamic template mechanism based on the baseline model. We then analyze the impact of this dynamic template on both accuracy and speed. The results are presented in Table 14. It can be observed that the incorporation of the dynamic template brings modest performance gains. The average performance on LaSOT and GOT-10k increases by 0.35% for HiT-Base, \(\hbox {DyHiT}_{0.65}\), and \(\hbox {DyHiT}_{1}\), and by 0.45% for \(\hbox {DyHiT}_{0}\). However, these improvements come with a trade-off in speed due to the additional computational overhead introduced by the dynamic template, resulting in fps drops of 9, 23, 18, and 9 for HiT-Base, \(\hbox {DyHiT}_{0}\), \(\hbox {DyHiT}_{0.65}\), and \(\hbox {DyHiT}_{1}\), respectively.
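The exact update rule of the dynamic template is not spelled out in this paragraph, so the following is a generic confidence-gated template update, a pattern commonly used by trackers with dynamic templates, shown purely for illustration; the interval, confidence gate, and state layout are assumptions and not necessarily the mechanism used here.

```python
# Generic sketch of a confidence-gated dynamic-template update; the interval,
# confidence threshold, and state layout are illustrative assumptions and not
# necessarily the mechanism used by HiT/DyHiT.

def maybe_update_template(frame_idx: int, confidence: float, target_crop,
                          state: dict, interval: int = 100,
                          conf_thresh: float = 0.7) -> dict:
    """Replace the dynamic template with the current target crop when the
    prediction is confident and the update interval has elapsed."""
    if frame_idx % interval == 0 and confidence > conf_thresh:
        state["dynamic_template"] = target_crop
    return state
```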

Different foreground-background segmentation thresholds. During inference, a threshold is used to distinguish foreground from background. We investigate the impact of five thresholds on model performance; the results are presented in Fig. 16. DyHiT corresponds to the default threshold of 0.6, while Threshold-0.5, Threshold-0.55, Threshold-0.65, and Threshold-0.7 use 0.5, 0.55, 0.65, and 0.7 as the segmentation threshold, respectively. The default setting achieves the most favorable accuracy-speed trade-off, reaching a maximum speed of 212 fps and its highest accuracy of 64.2% AO while maintaining 140 fps.
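For clarity, the thresholding step can be sketched as a simple comparison of per-token confidences against the chosen value; how the resulting mask is consumed downstream is not detailed here, so the tensor shape and usage below are assumptions.

```python
import torch

def foreground_mask(token_scores: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """Binarize per-token confidences into a foreground/background mask.

    `token_scores` is assumed to hold one confidence per search-region token,
    e.g. shape (B, 256); 0.6 is the default threshold studied above.
    """
    return token_scores > threshold

mask = foreground_mask(torch.rand(1, 256))
print(mask.shape, mask.dtype)  # torch.Size([1, 256]) torch.bool
```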

Fig. 16
figure 16

Comparison of different foreground-background segmentation thresholds on GOT-10k with respect to speed (vertical axis, measured on Nvidia RTX 2080 Ti GPU) and success rate (AO).

Fig. 17
figure 17

Qualitative comparison among HiT and other efficient trackers. The first three rows display successful tracking cases for HiT, while the following three rows show instances where tracking failed. The performance of HiT tends to degrade in the presence of distractions and cluttered backgrounds.

5 Conclusion

This study introduces HiT, a new family of efficient transformer-based tracking models. HiT bridges the gap between tracking frameworks and lightweight hierarchical transformers through the proposed Bridge Module and dual-image position encoding. Building upon HiT, we further present DyHiT, a tracker that achieves a versatile range of speed-accuracy trade-offs via an efficient feature-driven dynamic routing architecture. Furthermore, we propose a training-free method based on DyHiT to accelerate numerous high-performance trackers without compromising accuracy. Extensive experiments demonstrate that our methods deliver promising performance at high speeds. We hope this work enhances the practical applicability of visual tracking and offers insights into efficient tracker design.

Limitation. One limitation of HiT is its difficulty in handling distractors and background clutter, as illustrated in Fig. 17. Additionally, our research focuses primarily on closing the gap between lightweight hierarchical transformers and tracking frameworks; consequently, we make only minimal adjustments to an existing hierarchical transformer rather than designing a new transformer specifically tailored for tracking.