Abstract
The Kalman Filter, built on the uniform motion assumption, has been a crucial motion estimation module in trackers. However, it has limitations in non-uniform motion modeling and computational efficiency when applied to large-scale object tracking scenarios. To address these issues, we propose a novel Parallel Kalman Filter (PKF), which simplifies the conventional state variables to reduce the computational load and enable effective non-uniform motion modeling. Within PKF, we propose a non-uniform formulation that models non-uniform motion as uniform motion by transforming the time interval \(\Delta t\) from a constant into a variable related to displacement, and we incorporate a deceleration strategy into the control-input model of the formulation to tackle the escape problem in Multi-Object Tracking (MOT). An innovative parallel computation method is also proposed, which transposes the computation graph of PKF from the matrix form to the quadratic form, significantly reducing the computational load and enabling parallel computation across distinct tracklets via CUDA, thus making the time consumption of PKF independent of the input tracklet scale, i.e., O(1). Based on PKF, we introduce Fast, the first fully GPU-based tracker paradigm, which significantly enhances tracking efficiency in large-scale object tracking scenarios, and FastTrack, an MOT system composed of Fast and a general detector that offers both high efficiency and generality. Within FastTrack, Fast requires only bounding boxes with scores and class IDs for a single association per iteration, and it introduces innovative GPU-based tracking modules: an efficient GPU 2D-array data structure for tracklet management, a novel cost matrix implemented in CUDA for automatic association priority determination, a new association metric called HIoU, and the first implementation of the Auction Algorithm in CUDA for the asymmetric assignment problem. Experiments show that the average time per iteration of PKF on a GTX 1080Ti is only 0.2 ms; Fast achieves a real-time efficiency of 250 FPS on a GTX 1080Ti and 42 FPS even on a Jetson AGX Xavier, outperforming conventional CPU-based trackers. Concurrently, FastTrack demonstrates state-of-the-art performance on four public benchmarks, namely MOT17, MOT20, KITTI, and DanceTrack, and attains the highest speed in the large-scale tracking scenarios of MOT20.
Data Availability Statement
The four datasets supporting the findings of this study are publicly available: DanceTrack at github.com/DanceTrack/DanceTrack, KITTI at cvlibs.net/datasets/kitti/eval_tracking.php, and MOT17 and MOT20 at motchallenge.net.
Notes
The term “generic” implies that a tracker can easily be combined with any general object detector for object tracking.
The term “general tracking scene” refers to one of the most common cases of MOT, i.e., scenes containing people or vehicles captured by handheld or fixed cameras. For this study’s convenience, it is equivalent to the set of MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021) in this paper.
robotics.stackexchange.com/questions/15393/the-final-step-in-kalman-filter-to-correct-update-the-covariance-matrix.
LAPJV has replaced the Hungarian Algorithm as the default assignment strategy in current trackers, since scikit-learn (Pedregosa et al., 2011), the only library that supported the Hungarian Algorithm, has deprecated it. All trackers now perform assignment using LAPJV, as implemented in the third-party library lap (https://github.com/gatagat/lap) or in scipy (Virtanen et al., 2020).
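For reference, a minimal sketch of this assignment interface using scipy's linear_sum_assignment (the lap package exposes an equivalent lapjv routine); the toy cost matrix below is purely illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 3x3 cost matrix between tracklets (rows) and detections (columns).
cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.8, 0.9, 0.3]])

# Solves the linear assignment problem, minimizing the total cost;
# returns the matched row and column indices.
rows, cols = linear_sum_assignment(cost)
print(rows, cols)  # [0 1 2] [0 1 2]
```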
Because 1000 is a typical upper limit on the number of objects detected by a detector in a single image (Chen et al., 2019), we choose it to align the upper limit of the tracker with that of the detector.
Considering that the MOT20 test set, which averages 139 objects per frame with a peak of 300 objects in a single frame, contains the densest large-scale object tracking scenarios in reality, we pragmatically take 200 objects as a representative case, striking a balance between the average and peak object counts.
A CPU-based cost matrix calculation method (https://github.com/samson-wang/cython_bbox) used in BYTE.
References
Bertsekas, D. P. (1992). Auction algorithms for network flow problems: A tutorial introduction. Computational Optimization and Applications, 1(1), 7–66.
Bewley, A., Ge, Z., & Ott, L., et al. (2016). Simple online and realtime tracking. In: ICIP (pp. 3464–3468). IEEE.
Bishop, G., Welch, G., et al. (2001). An introduction to the Kalman filter. Proc of SIGGRAPH, Course, 8(27599–23175), 41.
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934.
Brasó, G., & Leal-Taixé, L. (2020) Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6247–6257).
Cao, J., Weng, X., & Khirodkar, R., et al. (2022). Observation-centric sort: Rethinking sort for robust multi-object tracking. arXiv:2203.14360.
Chen, K., Wang, J., & Pang, J., et al. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155.
Choi, W. (2015). Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision (pp. 3029–3037).
Chu, P., Wang, J., & You, Q., et al. (2021) Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv:2104.00194.
Dendorfer, P., Rezatofighi, H., & Milan, A., et al. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003.
Ge, Z., Liu, S., & Wang, F., et al. (2021). Yolox: Exceeding yolo series in 2021. arXiv:2107.08430.
Geiger, A., Lenz, P., Stiller, C., et al. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11), 1231–1237.
Gonzalez, N. F., Ospina, A., & Calvez, P. (2020). Smat: Smart multiple affinity metrics for multiple object tracking. In International conference on image analysis and recognition (pp. 48–62). Springer.
Gustafsson, F., Gunnarsson, F., Bergman, N., et al. (2002). Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50(2), 425–437.
Hu, H. N., Yang, Y. H., Fischer, T., et al. (2022). Monocular quasi-dense 3d object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1992–2008.
Jonker, R., & Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4), 325–340.
Julier, S. J., & Uhlmann, J. K. (1997). New extension of the Kalman filter to nonlinear systems. In Signal processing, sensor fusion, and target recognition VI (pp. 182–193). Spie.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.
Kim, A., Ošep, A., & Leal-Taixé, L. (2021). Eagermot: 3d multi-object tracking via sensor fusion. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 11315–11321). IEEE.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.
Li, W., Xiong, Y., & Yang, S., et al. (2021). Semi-tcl: Semi-supervised track contrastive representation learning. arXiv:2107.02396.
Lin, T. Y., Goyal, P., & Girshick, R., et al. (2017). Focal loss for dense object detection. In ICCV (pp. 2980–2988).
Lu, Z., Rathod, V., & Votel, R., et al. (2020). Retinatrack: Online single stage joint detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14668–14678).
Luiten, J., Osep, A., Dendorfer, P., et al. (2021). Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2), 548–578.
Milan, A., Leal-Taixé, L., & Reid, I., et al. (2016). Mot16: A benchmark for multi-object tracking. arXiv:1603.00831.
Okuta, R., Unno, Y., & Nishino, D., et al. (2017). Cupy: A numpy-compatible library for nvidia gpu calculations. In Proceedings of workshop on machine learning systems (LearningSys) in the thirty-first annual conference on neural information processing systems (NIPS). http://learningsys.org/nips17/assets/papers/paper_16.pdf
Pang, J., Qiu, L., & Li, X., et al. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 164–173).
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Peng, J., Wang, C., & Wan, F., et al. (2020). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European conference on computer vision (pp. 145–161). Springer.
Rangesh, A., Maheshwari, P., & Gebre, M., et al. (2021). Trackmpnn: A message passing graph neural architecture for multi-object tracking. arXiv:2101.04206.
Redmon, J., & Farhadi, A. (2018) Yolov3: An incremental improvement. arXiv:1804.02767.
Ren, S., He, K., Girshick, R., et al. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Smith, G. L., Schmidt, S. F., & McGee, L. A. (1962). Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. National Aeronautics and Space Administration.
Stadler, D., & Beyerer, J. (2022). Modelling ambiguous assignments for multi-person tracking in crowds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 133–142).
Sun, P., Cao, J., & Jiang, Y., et al. (2020). Transtrack: Multiple object tracking with transformer. arXiv:2012.15460.
Sun, P., Cao, J., & Jiang, Y., et al. (2021). Dancetrack: Multi-object tracking in uniform appearance and diverse motion. arXiv:2111.14690.
Tokmakov, P., Li, J., & Burgard, W., et al. (2021). Learning to track with object permanence. arXiv:2103.14258.
Virtanen, P., Gommers, R., Oliphant, T. E., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Wang, G., Gu, R., & Liu, Z., et al. (2021a). Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9876–9886).
Wang, S., Sheng, H., & Zhang, Y., et al. (2021b). A general recurrent tracking framework without real data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 13219–13228).
Wang, Y., Kitani, K., & Weng, X. (2021c). Joint object detection and multi-object tracking with graph neural networks. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 13708–13715). IEEE.
Wang, Z., Zheng, L., & Liu, Y., et al. (2020). Towards real-time multi-object tracking. In Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part XI 16 (pp. 107–122). Springer.
Weng, X., Wang, J., & Held, D., et al. (2020). 3d multi-object tracking: A baseline and new evaluation metrics. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 10359–10366). IEEE.
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP) (pp. 3645–3649). IEEE.
Wu, J., Cao, J., & Song, L., et al. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12352–12361).
Xiang, Y., Alahi, A., & Savarese, S. (2015). Learning to track: Online multi-object tracking by decision making. In ICCV (pp. 4705–4713).
Xu, Y., Ban, Y., & Delorme, G., et al. (2021). Transcenter: Transformers with dense queries for multiple-object tracking. arXiv:2103.15145.
Yang, F., Chang, X., Sakti, S., et al. (2021). Remot: A model-agnostic refinement for multiple object tracking. Image and Vision Computing, 106, 104091.
Yu, E., Li, Z., & Han, S., et al. (2021). Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv:2105.04322.
Zeng, F., Dong, B., & Wang, T., et al. (2021). Motr: End-to-end multiple-object tracking with transformer. arXiv:2105.03247.
Zhang, Y., Sun, P., & Jiang, Y., et al. (2022). Bytetrack: Multi-object tracking by associating every detection box. In ECCV 2022.
Zhang, Y., Wang, C., Wang, X., et al. (2021). Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069–3087.
Zhou, X., Koltun, V., Krähenbühl, P. (2020). Tracking objects as points. In ECCV (pp. 474–490). Springer.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv:1904.07850.
Acknowledgements
We thank Yifu Zhang and Jinkun Cao for providing the code for ByteTrack and OC-SORT. This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61976038 and No. 61932020, and by the Taishan Scholar Program of Shandong Province (tstp20221128).
Additional information
Communicated by Svetlana Lazebnik.
Appendices
Appendix A Naive Auction
Algorithm Algorithm 4 presents the naive Auction Algorithm. For a cost matrix \({\textbf{C}}\) generated by n persons and m items, where m must be greater than or equal to n, we define the following variables to execute the algorithm: a positive scalar \(\epsilon \) equal to \(\frac{1}{n}\); a bids table \({\textbf{b}}\) initialized to a 2D-array of 0 values with shape \((n + 1, m)\), where \({\textbf{b}}[n,:]\) denotes the flag vector; a person2item mapping \(\textbf{p2i}\) initialized to a 1D-array of -1 values with length n; an item2person mapping \(\textbf{i2p}\) initialized to a 1D-array of -1 values with length m; and a price vector \({\textbf{p}}\) initialized to a 1D-array of 0 values with length m. The algorithm runs a bidding-assignment loop to find the best match, and the loop terminates when no value of -1 remains in \(\textbf{p2i}\).
In the bidding phase (lines 7–14), a temporary vector \({\textbf{t}}\) with length m is first calculated via \({\textbf{C}}[i,:]- {\textbf{p}}\) for each i where \(\textbf{p2i}[i] == -1\). For each \({\textbf{t}}\), \(bid(\cdot )\) is employed to obtain the max value \(x_1\) with its index \(x_{idx}\) and the second max value \(x_2\). The bidding increment is then computed and written into \({\textbf{b}}[i, x_{idx}]\) (line 11), and the flag \({\textbf{b}}[n, x_{idx}]\) is set to 1 (line 12).
In the assignment phase (lines 15–25), for each vector \({\textbf{b}}[:n, j]\) where \({\textbf{b}}[n, j] == 1\), \(assign(\cdot )\) is employed to obtain the max value \(x_1\) with its index \(x_{idx}\). Then \(x_1\) is added to \({\textbf{p}}[j]\) (line 18) and a new mapping is established between person \(x_{idx}\) and item j (lines 19–23).
Next, all entries of \({\textbf{b}}\) are reset to 0 for the next iteration (line 26).
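For clarity, the bidding-assignment loop described above can be transcribed serially with NumPy. This is a simplified sketch that follows the variable definitions in the text; it is not the parallel CUDA implementation evaluated in the paper:

```python
import numpy as np

def naive_auction(C):
    """Serial sketch of the naive Auction Algorithm for an n x m matrix
    C (m >= n), following the variables defined in Appendix A."""
    n, m = C.shape
    eps = 1.0 / n                      # positive scalar epsilon
    p2i = np.full(n, -1)               # person -> item mapping
    i2p = np.full(m, -1)               # item -> person mapping
    p = np.zeros(m)                    # price vector
    b = np.zeros((n + 1, m))           # bids table; row n is the flag vector

    while (p2i == -1).any():
        # Bidding phase: every unassigned person bids for its best item.
        for i in np.where(p2i == -1)[0]:
            t = C[i, :] - p                        # net value of each item
            x_idx = int(np.argmax(t))              # best item
            x1 = t[x_idx]                          # max value
            x2 = np.partition(t, -2)[-2]           # second max value
            b[i, x_idx] = x1 - x2 + eps            # bidding increment (line 11)
            b[n, x_idx] = 1                        # flag the item (line 12)
        # Assignment phase: each bid-on item goes to its highest bidder.
        for j in np.where(b[n, :] == 1)[0]:
            x_idx = int(np.argmax(b[:n, j]))       # winning person
            p[j] += b[x_idx, j]                    # raise the price (line 18)
            if i2p[j] != -1:                       # unseat the previous owner
                p2i[i2p[j]] = -1
            p2i[x_idx], i2p[j] = j, x_idx          # new mapping (lines 19-23)
        b[:] = 0                                   # reset the bids table (line 26)
    return p2i

# Example: 3 persons bid for 4 items, maximizing the total value.
C = np.array([[10., 2., 3., 1.],
              [2., 9., 4., 1.],
              [3., 4., 8., 1.]])
print(naive_auction(C))  # [0 1 2]
```

On this example, the three persons bid for distinct items, so the loop converges in a single iteration; with conflicting bids, each winning bid raises the price by at least \(\epsilon \), which guarantees termination.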
Computation In the CUDA implementation, the bidding phase and the assignment phase are established into two separate kernel functions.
In the bidding kernel function, the grid includes n blocks so that the \(x_{1/2/idx}\) of different rows in \({\textbf{C}}\) are obtained simultaneously. Each block contains \(\lceil \frac{m}{2} \rceil \) threads; within each thread, two different values are read from \({\textbf{t}}\), and the CUDA built-in function atomicMax() is employed to obtain \(x_{1/2/idx}\).
In the assignment kernel function, the grid includes m blocks so that the \(x_{1/idx}\) of different columns in \({\textbf{b}}\) are obtained simultaneously. Each block contains \(\lceil \frac{n}{2} \rceil \) threads; within each thread, two different values are read from \({\textbf{b}}[:n, j]\), and atomicMax() is likewise employed to obtain \(x_{1/idx}\).
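The kernel sources themselves are not listed in the paper. As a hypothetical sketch of the per-row parallel pattern (one block per row, an atomic max standing in for atomicMax()), the following numba.cuda kernel computes only the per-row maximum of \({\textbf{t}} = {\textbf{C}}[i,:]- {\textbf{p}}\); the \(x_{idx}\) and \(x_2\) bookkeeping of the actual implementation is omitted:

```python
import numpy as np
from numba import cuda

@cuda.jit
def bidding_max_kernel(C, p, row_max):
    # One block per person (row i); threads stride over the m items,
    # so ceil(m/2) threads each handle about two values, as described above.
    i = cuda.blockIdx.x
    j = cuda.threadIdx.x
    m = C.shape[1]
    while j < m:
        # cuda.atomic.max stands in for the built-in atomicMax().
        cuda.atomic.max(row_max, i, C[i, j] - p[j])
        j += cuda.blockDim.x

n, m = 4, 1000
C = np.random.rand(n, m)
p = np.zeros(m)
row_max = np.full(n, -np.inf)
# Grid of n blocks with ceil(m/2) threads each (requires a CUDA GPU).
bidding_max_kernel[n, (m + 1) // 2](C, p, row_max)
print(np.allclose(row_max, (C - p).max(axis=1)))  # True
```

Recovering \(x_{idx}\) alongside the maximum atomically typically requires packing the value and its index into a single machine word before the atomic update; this is the kind of bookkeeping the real kernels must add on top of the pattern above.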
Appendix B MOT17-val
Illustration of MOT17-02 in the val half. FrameXXX denotes the frame number in the video (from 299 to 600). The solid boxes with the same color in different images represent the ground truth of the same trajectory; the small pedestrians in the dashed boxes are also marked as the ground truth even if they are completely invisible (Color figure online)
MOT17 (Milan et al., 2016) contains a train set with ground truth and a test set without ground truth. For the ablation study, following Zhang et al. (2022) and Zhou et al. (2020), we define the first half of each video in the original MOT17 train set as the train set of the ablation dataset and the second half as its validation set, called MOT17-val. Table 15 shows the basic information about the ablation dataset. It consists of seven videos with 5316 images, and the detector YOLOX achieves both high recall and high precision on all videos except MOT17-02. As illustrated in Fig. 10, the validation video of MOT17-02 in MOT17-val has a total of 300 images (starting from Frame299), and the pedestrians in the green dashed boxes are obscured most of the time from Frame299 to Frame469 (a total of 170 frames) while still being labeled as ground truth even though they are completely invisible. This is why the validation video of MOT17-02 has the lowest recall among the seven videos. MOT17 tends to label pedestrians that are not visible at all as ground truth. This tendency appears in all seven videos, with MOT17-02 being the most severe case, which explains why all recalls in Table 15 are lower than the precisions: the detector cannot detect fully obscured pedestrians who have nevertheless been labeled as ground truth. Fortunately, this situation is not serious in the other six validation videos. For this reason, the metric values obtained on MOT17-02 in the ablation studies are considered less meaningful, and we use only the remaining six videos as MOT17-val in our ablation studies.