Abstract
The Kalman Filter, built on the uniform motion assumption, has been a crucial motion estimation module in trackers. However, it has limitations in non-uniform motion modeling and computational efficiency when applied to large-scale object tracking scenarios. To address these issues, we propose a novel Parallel Kalman Filter (PKF), which simplifies the conventional state variables to reduce the computational load and enable effective non-uniform motion modeling. Within PKF, we propose a non-uniform formulation that models non-uniform motion as uniform motion by transforming the time interval \(\Delta t\) from a constant into a variable related to displacement, and we incorporate a deceleration strategy into the control-input model of the formulation to tackle the escape problem in Multi-Object Tracking (MOT). An innovative parallel computation method is also proposed, which transposes the computation graph of PKF from the matrix form to the quadratic form, significantly reducing the computational load and enabling parallel computation across distinct tracklets via CUDA, thus making the time consumption of PKF independent of the input tracklet scale, i.e., O(1). Based on PKF, we introduce Fast, the first fully GPU-based tracker paradigm, which significantly enhances tracking efficiency in large-scale object tracking scenarios, and FastTrack, an MOT system composed of Fast and a general detector that offers both high efficiency and generality. Within FastTrack, Fast requires only bounding boxes with scores and class IDs for a single association per iteration, and it introduces innovative GPU-based tracking modules: an efficient GPU 2D-array data structure for tracklet management, a novel cost matrix implemented in CUDA for automatic association priority determination, a new association metric called HIoU, and the first implementation of the Auction Algorithm in CUDA for the asymmetric assignment problem. Experiments show that the average time per iteration of PKF on a GTX 1080Ti is only 0.2 ms; Fast achieves a real-time efficiency of 250 FPS on a GTX 1080Ti and 42 FPS even on a Jetson AGX Xavier, outperforming conventional CPU-based trackers. Concurrently, FastTrack demonstrates state-of-the-art performance on four public benchmarks, namely MOT17, MOT20, KITTI, and DanceTrack, and attains the highest speed in the large-scale tracking scenarios of MOT20.
Data Availability Statement
The four datasets supporting the findings of this study are publicly available: DanceTrack at github.com/DanceTrack/DanceTrack, KITTI at cvlibs.net/datasets/kitti/eval_tracking.php, and MOT17 and MOT20 at motchallenge.net.
Notes
The term “generic” implies that a tracker can easily be combined with any general object detector for object tracking.
The term “general tracking scene” refers to one of the most common cases of MOT, i.e., scenes containing people or vehicles captured by handheld or fixed cameras. For this study’s convenience, it is equivalent to the set of MOT17 (Milan et al., 2016), MOT20 (Dendorfer et al., 2020), KITTI (Geiger et al., 2013), and DanceTrack (Sun et al., 2021) in this paper.
robotics.stackexchange.com/questions/15393/the-final-step-in-kalman-filter-to-correct-update-the-covariance-matrix.
LAPJV has replaced the Hungarian Algorithm as the default assignment strategy in current trackers, since scikit-learn (Pedregosa et al., 2011), the only library that supported the Hungarian Algorithm, has deprecated it. All trackers now perform assignment using LAPJV, as implemented in the third-party library lap (https://github.com/gatagat/lap) or in scipy (Virtanen et al., 2020).
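For reference, a minimal sketch of this assignment interface using scipy's linear_sum_assignment (the lap package exposes an equivalent lapjv routine); the toy cost matrix below is purely illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 3x3 cost matrix between tracklets (rows) and detections (columns).
cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.8, 0.9, 0.3]])

# Solves the linear assignment problem, minimizing the total cost;
# returns the matched row and column indices.
rows, cols = linear_sum_assignment(cost)
print(rows, cols)  # [0 1 2] [0 1 2]
```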
Because 1000 is a typical upper limit on the number of objects detected by a detector in a single image (Chen et al., 2019), we choose it to align the upper limit of the tracker with that of the detector.
Considering that the MOT20 test set, which averages 139 objects per frame with a peak of 300 objects in a single frame, contains the densest large-scale object tracking scenarios in reality, we pragmatically take 200 objects as a representative case, striking a balance between the average and peak object counts.
A CPU-based cost matrix calculation method (https://github.com/samson-wang/cython_bbox) used in BYTE.
References
Bertsekas, D. P. (1992). Auction algorithms for network flow problems: A tutorial introduction. Computational Optimization and Applications, 1(1), 7–66.
Bewley, A., Ge, Z., & Ott, L., et al. (2016). Simple online and realtime tracking. In: ICIP (pp. 3464–3468). IEEE.
Bishop, G., Welch, G., et al. (2001). An introduction to the Kalman filter. Proc of SIGGRAPH, Course, 8(27599–23175), 41.
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934.
Brasó, G., & Leal-Taixé, L. (2020) Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6247–6257).
Cao, J., Weng, X., & Khirodkar, R., et al. (2022). Observation-centric sort: Rethinking sort for robust multi-object tracking. arXiv:2203.14360.
Chen, K., Wang, J., & Pang, J., et al. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155.
Choi, W. (2015). Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision (pp. 3029–3037).
Chu, P., Wang, J., & You, Q., et al. (2021) Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv:2104.00194.
Dendorfer, P., Rezatofighi, H., & Milan, A., et al. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003.
Ge, Z., Liu, S., & Wang, F., et al. (2021). Yolox: Exceeding yolo series in 2021. arXiv:2107.08430.
Geiger, A., Lenz, P., Stiller, C., et al. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11), 1231–1237.
Gonzalez, N. F., Ospina, A., & Calvez, P. (2020). Smat: Smart multiple affinity metrics for multiple object tracking. In International conference on image analysis and recognition (pp. 48–62). Springer.
Gustafsson, F., Gunnarsson, F., Bergman, N., et al. (2002). Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50(2), 425–437.
Hu, H. N., Yang, Y. H., Fischer, T., et al. (2022). Monocular quasi-dense 3d object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1992–2008.
Jonker, R., & Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4), 325–340.
Julier, S. J., & Uhlmann, J. K. (1997). New extension of the Kalman filter to nonlinear systems. In Signal processing, sensor fusion, and target recognition VI (pp. 182–193). Spie.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.
Kim, A., Ošep, A., & Leal-Taixé, L. (2021). Eagermot: 3d multi-object tracking via sensor fusion. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 11315–11321). IEEE.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.
Li, W., Xiong, Y., & Yang, S., et al. (2021). Semi-tcl: Semi-supervised track contrastive representation learning. arXiv:2107.02396.
Lin, T. Y., Goyal, P., & Girshick, R., et al. (2017). Focal loss for dense object detection. In ICCV (pp. 2980–2988).
Lu, Z., Rathod, V., & Votel, R., et al. (2020). Retinatrack: Online single stage joint detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14668–14678).
Luiten, J., Osep, A., Dendorfer, P., et al. (2021). Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2), 548–578.
Milan, A., Leal-Taixé, L., & Reid, I., et al. (2016). Mot16: A benchmark for multi-object tracking. arXiv:1603.00831.
Okuta, R., Unno, Y., & Nishino, D., et al. (2017). Cupy: A numpy-compatible library for nvidia gpu calculations. In Proceedings of workshop on machine learning systems (LearningSys) in the thirty-first annual conference on neural information processing systems (NIPS). http://learningsys.org/nips17/assets/papers/paper_16.pdf
Pang, J., Qiu, L., & Li, X., et al. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 164–173).
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Peng, J., Wang, C., & Wan, F., et al. (2020). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European conference on computer vision (pp. 145–161). Springer.
Rangesh, A., Maheshwari, P., & Gebre, M., et al. (2021). Trackmpnn: A message passing graph neural architecture for multi-object tracking. arXiv:2101.04206.
Redmon, J., & Farhadi, A. (2018) Yolov3: An incremental improvement. arXiv:1804.02767.
Ren, S., He, K., Girshick, R., et al. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Smith, G. L., Schmidt, S. F., & McGee, L. A. (1962). Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. National Aeronautics and Space Administration.
Stadler, D., & Beyerer, J. (2022). Modelling ambiguous assignments for multi-person tracking in crowds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 133–142).
Sun, P., Cao, J., & Jiang, Y., et al. (2020). Transtrack: Multiple object tracking with transformer. arXiv:2012.15460.
Sun, P., Cao, J., & Jiang, Y., et al. (2021). Dancetrack: Multi-object tracking in uniform appearance and diverse motion. arXiv:2111.14690.
Tokmakov, P., Li, J., & Burgard, W., et al. (2021). Learning to track with object permanence. arXiv:2103.14258.
Virtanen, P., Gommers, R., Oliphant, T. E., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Wang, G., Gu, R., & Liu, Z., et al. (2021a). Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9876–9886).
Wang, S., Sheng, H., & Zhang, Y., et al. (2021b). A general recurrent tracking framework without real data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 13219–13228).
Wang, Y., Kitani, K., & Weng, X. (2021c). Joint object detection and multi-object tracking with graph neural networks. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 13708–13715). IEEE.
Wang, Z., Zheng, L., & Liu, Y., et al. (2020). Towards real-time multi-object tracking. In Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part XI 16 (pp. 107–122). Springer.
Weng, X., Wang, J., & Held, D., et al. (2020). 3d multi-object tracking: A baseline and new evaluation metrics. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 10359–10366). IEEE.
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP) (pp. 3645–3649). IEEE.
Wu, J., Cao, J., & Song, L., et al. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12352–12361).
Xiang, Y., Alahi, A., & Savarese, S. (2015). Learning to track: Online multi-object tracking by decision making. In ICCV (pp. 4705–4713).
Xu, Y., Ban, Y., & Delorme, G., et al. (2021). Transcenter: Transformers with dense queries for multiple-object tracking. arXiv:2103.15145.
Yang, F., Chang, X., Sakti, S., et al. (2021). Remot: A model-agnostic refinement for multiple object tracking. Image and Vision Computing, 106, 104091.
Yu, E., Li, Z., & Han, S., et al. (2021). Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv:2105.04322.
Zeng, F., Dong, B., & Wang, T., et al. (2021). Motr: End-to-end multiple-object tracking with transformer. arXiv:2105.03247.
Zhang, Y., Sun, P., & Jiang, Y., et al. (2022). Bytetrack: Multi-object tracking by associating every detection box. In ECCV 2022.
Zhang, Y., Wang, C., Wang, X., et al. (2021). Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069–3087.
Zhou, X., Koltun, V., Krähenbühl, P. (2020). Tracking objects as points. In ECCV (pp. 474–490). Springer.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv:1904.07850.
Acknowledgements
We thank Yifu Zhang and Jinkun Cao for providing the code for ByteTrack and OC-SORT. This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61976038 and No. 61932020, and by the Taishan Scholar Program of Shandong Province (tstp20221128).
Additional information
Communicated by Svetlana Lazebnik.
Appendices
Appendix A Naive Auction
Algorithm Algorithm 4 presents the naive Auction Algorithm. For a cost matrix \({\textbf{C}}\) generated by n persons and m items, where m must be greater than or equal to n, we define the following variables to execute the algorithm: a positive scalar \(\epsilon \) equal to \(\frac{1}{n}\); a bids table \({\textbf{b}}\) initialized to a 2D-array of 0 values with shape \((n + 1, m)\), where \({\textbf{b}}[n,:]\) denotes the flag vector; a person2item mapping \(\textbf{p2i}\) initialized to a 1D-array of -1 values with length n; an item2person mapping \(\textbf{i2p}\) initialized to a 1D-array of -1 values with length m; and a price vector \({\textbf{p}}\) initialized to a 1D-array of 0 values with length m. The algorithm runs a bidding-assignment loop to find the best match, and the loop terminates when no value of -1 remains in \(\textbf{p2i}\).
In the bidding phase (lines 7–14), a temporary vector \({\textbf{t}}\) with length m is first calculated via \({\textbf{C}}[i,:]- {\textbf{p}}\) for each i where \(\textbf{p2i}[i] == -1\). For each \({\textbf{t}}\), \(bid(\cdot )\) is employed to obtain the max value \(x_1\) with its index \(x_{idx}\) and the second max value \(x_2\). The bidding increment is then computed and written into \({\textbf{b}}[i, x_{idx}]\) (line 11), and the flag \({\textbf{b}}[n, x_{idx}]\) is set to 1 (line 12).
In the assignment phase (lines 15–25), for each vector \({\textbf{b}}[:n, j]\) where \({\textbf{b}}[n, j] == 1\), \(assign(\cdot )\) is employed to obtain the max value \(x_1\) with its index \(x_{idx}\). Then \(x_1\) is added to \({\textbf{p}}[j]\) (line 18) and a new mapping is established between person \(x_{idx}\) and item j (lines 19–23).
Next, all entries of \({\textbf{b}}\) are reset to 0 for the next iteration (line 26).
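For clarity, the bidding-assignment loop described above can be transcribed serially with NumPy. This is a simplified sketch that follows the variable definitions in the text; it is not the parallel CUDA implementation evaluated in the paper:

```python
import numpy as np

def naive_auction(C):
    """Serial sketch of the naive Auction Algorithm for an n x m matrix
    C (m >= n), following the variables defined in Appendix A."""
    n, m = C.shape
    eps = 1.0 / n                      # positive scalar epsilon
    p2i = np.full(n, -1)               # person -> item mapping
    i2p = np.full(m, -1)               # item -> person mapping
    p = np.zeros(m)                    # price vector
    b = np.zeros((n + 1, m))           # bids table; row n is the flag vector

    while (p2i == -1).any():
        # Bidding phase: every unassigned person bids for its best item.
        for i in np.where(p2i == -1)[0]:
            t = C[i, :] - p                        # net value of each item
            x_idx = int(np.argmax(t))              # best item
            x1 = t[x_idx]                          # max value
            x2 = np.partition(t, -2)[-2]           # second max value
            b[i, x_idx] = x1 - x2 + eps            # bidding increment (line 11)
            b[n, x_idx] = 1                        # flag the item (line 12)
        # Assignment phase: each bid-on item goes to its highest bidder.
        for j in np.where(b[n, :] == 1)[0]:
            x_idx = int(np.argmax(b[:n, j]))       # winning person
            p[j] += b[x_idx, j]                    # raise the price (line 18)
            if i2p[j] != -1:                       # unseat the previous owner
                p2i[i2p[j]] = -1
            p2i[x_idx], i2p[j] = j, x_idx          # new mapping (lines 19-23)
        b[:] = 0                                   # reset the bids table (line 26)
    return p2i

# Example: 3 persons bid for 4 items, maximizing the total value.
C = np.array([[10., 2., 3., 1.],
              [2., 9., 4., 1.],
              [3., 4., 8., 1.]])
print(naive_auction(C))  # [0 1 2]
```

On this example, the three persons bid for distinct items, so the loop converges in a single iteration; with conflicting bids, each winning bid raises the price by at least \(\epsilon \), which guarantees termination.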
Computation In the CUDA implementation, the bidding phase and the assignment phase are established into two separate kernel functions.
In the bidding kernel function, the grid includes n blocks so that the \(x_{1/2/idx}\) of different rows in \({\textbf{C}}\) are obtained simultaneously. Each block contains \(\lceil \frac{m}{2} \rceil \) threads; within each thread, two different values are read from \({\textbf{t}}\), and the CUDA built-in function atomicMax() is employed to obtain \(x_{1/2/idx}\).
In the assignment kernel function, the grid includes m blocks so that the \(x_{1/idx}\) of different columns in \({\textbf{b}}\) are obtained simultaneously. Each block contains \(\lceil \frac{n}{2} \rceil \) threads; within each thread, two different values are read from \({\textbf{b}}[:n, j]\), and atomicMax() is likewise employed to obtain \(x_{1/idx}\).
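The kernel sources themselves are not listed in the paper. As a hypothetical sketch of the per-row parallel pattern (one block per row, an atomic max standing in for atomicMax()), the following numba.cuda kernel computes only the per-row maximum of \({\textbf{t}} = {\textbf{C}}[i,:]- {\textbf{p}}\); the \(x_{idx}\) and \(x_2\) bookkeeping of the actual implementation is omitted:

```python
import numpy as np
from numba import cuda

@cuda.jit
def bidding_max_kernel(C, p, row_max):
    # One block per person (row i); threads stride over the m items,
    # so ceil(m/2) threads each handle about two values, as described above.
    i = cuda.blockIdx.x
    j = cuda.threadIdx.x
    m = C.shape[1]
    while j < m:
        # cuda.atomic.max stands in for the built-in atomicMax().
        cuda.atomic.max(row_max, i, C[i, j] - p[j])
        j += cuda.blockDim.x

n, m = 4, 1000
C = np.random.rand(n, m)
p = np.zeros(m)
row_max = np.full(n, -np.inf)
# Grid of n blocks with ceil(m/2) threads each (requires a CUDA GPU).
bidding_max_kernel[n, (m + 1) // 2](C, p, row_max)
print(np.allclose(row_max, (C - p).max(axis=1)))  # True
```

Recovering \(x_{idx}\) alongside the maximum atomically typically requires packing the value and its index into a single machine word before the atomic update; this is the kind of bookkeeping the real kernels must add on top of the pattern above.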
Appendix B MOT17-val
Illustration of MOT17-02 in the val half. FrameXXX denotes the frame number in the video (from 299 to 600). The solid boxes with the same color in different images represent the ground truth of the same trajectory; the small pedestrians in the dashed boxes are also marked as the ground truth even if they are completely invisible (Color figure online)
MOT17 (Milan et al., 2016) contains a train set with ground truth and a test set without ground truth. For the ablation study, following Zhang et al. (2022) and Zhou et al. (2020), we define the first half of each video in the original MOT17 train set as the train set of the ablation dataset and the second half as its validation set, called MOT17-val. Table 15 shows the basic information about the ablation dataset. It consists of seven videos with 5316 images, and the detector YOLOX achieves both high recall and high precision on all videos except MOT17-02. As illustrated in Fig. 10, the validation video of MOT17-02 in MOT17-val has a total of 300 images (starting from Frame299), and the pedestrians in the green dashed boxes are obscured most of the time from Frame299 to Frame469 (a total of 170 frames) while still being labeled as ground truth even though they are completely invisible. This is why the validation video of MOT17-02 has the lowest recall among the seven videos. MOT17 tends to label pedestrians that are not visible at all as ground truth. This tendency appears in all seven videos, with MOT17-02 being the most severe case, which explains why all recalls in Table 15 are lower than the precisions: the detector cannot detect fully obscured pedestrians who have nevertheless been labeled as ground truth. Fortunately, this situation is not serious in the other six validation videos. For this reason, the metric values obtained on MOT17-02 in the ablation studies are considered less meaningful, and we use only the remaining six videos as MOT17-val in our ablation studies.