Abstract
Multi-object tracking (MOT) in low-frame-rate videos is a promising way to better meet the computing, storage, and transmission-bandwidth constraints of edge devices. Tracking at a low frame rate poses particular challenges in the association stage, as objects in two successive frames typically exhibit much larger variations in location, velocity, appearance, and visibility than at normal frame rates. In this paper, we observe severe performance degeneration of many existing association strategies caused by such variations. Although optical-flow-based methods such as CenterTrack can handle large displacements to some extent owing to their large receptive field, their temporally local nature makes them fail to give reliable displacement estimations for objects that newly appear in the current frame (i.e., are not visible in the previous frame). To overcome this local nature, we propose an online tracking method that extends the CenterTrack architecture with a new head, named APP, to recognize unreliable displacement estimations. Further, to capture the fine-grained unreliability of each individual displacement estimation, we extend the binary APP predictions to displacement uncertainties. To this end, we reformulate the displacement estimation task with Bayesian deep learning tools. With APP predictions, we conduct association in a multi-stage manner, where visual cues or historical motion cues are leveraged in the corresponding stage. By rethinking the commonly used bipartite matching algorithms, we equip the proposed multi-stage association policy with a hybrid matching strategy conditioned on displacement uncertainties. Our method shows robustness in preserving identities in low-frame-rate video sequences. Experimental results on public datasets under various low-frame-rate settings demonstrate the advantages of the proposed method.
Data Availability
The datasets analyzed during the current study are available in the CrowdHuman, MOTChallenge, and KITTI repositories at the following links: http://www.crowdhuman.org/, https://motchallenge.net/data/MOT17/, https://motchallenge.net/data/MOT20/, and https://www.cvlibs.net/download.php?file=data_tracking_image_2.zip.
Notes
All the oracle experiments are conducted on the MOT17 dataset with a 1/10 frame rate.
To eliminate association noise, we provide ground-truth displacements for all detections, even if they were discarded in the previous frame.
DeepSORT rejects matches with \((1 - \cos (\textbf{e}_a, \textbf{e}_b)) > \tau \), where \(\textbf{e}_a\) and \(\textbf{e}_b\) are the appearance embeddings of a pair of detection and track, respectively. Unlike FairMOT (Zhang et al., 2021), where the memorized appearance embedding of a track is updated by new observations with a fixed weight \(\alpha \) at each time step, DeepSORT uses extra memory to store the appearance embeddings of all previous observations for each track. Therefore, the matching threshold \(\tau \) is the only hyperparameter that needs to be tuned for DeepSORT.
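A minimal sketch of this rejection rule is given below (illustrative Python; the function names and the per-track embedding gallery layout are our own, not DeepSORT's actual code):

```python
import numpy as np

def cosine_distance(e_a: np.ndarray, e_b: np.ndarray) -> float:
    """Cosine distance 1 - cos(e_a, e_b) between two appearance embeddings."""
    return 1.0 - float(np.dot(e_a, e_b) /
                       (np.linalg.norm(e_a) * np.linalg.norm(e_b) + 1e-12))

def accept_match(det_embedding, track_gallery, tau=0.1) -> bool:
    """Accept a detection/track pair only if the smallest cosine distance to the
    track's stored embeddings (its gallery of past observations) is within tau."""
    dists = [cosine_distance(det_embedding, e) for e in track_gallery]
    return min(dists) <= tau
```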
References
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., & Savarese, S. (2016). Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971.
Ballas, N., Yao, L., Pal, C., & Courville, A. (2015). Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432
Bergmann, P., Meinhardt, T., & Leal-Taixe, L. (2019). Tracking without bells and whistles. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 941–951.
Bernardin, K., & Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The clear mot metrics. EURASIP Journal on Image and Video Processing, 2008, 1–10.
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pp. 3464–3468. IEEE.
Brasó, G., Cetintas, O., & Leal-Taixé, L. (2022). Multi-object tracking and segmentation via neural message passing. International Journal of Computer Vision, 130(12), 3035–3053.
Brasó, G., & Leal-Taixé, L. (2020). Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6247–6257.
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631.
Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., & Fu, C. (2022). Tctrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14798–14808.
Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., & Fu, C. (2023). Towards real-world visual tracking with temporal contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Cao, J., Pang, J., Weng, X., Khirodkar, R., & Kitani, K. (2023). Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9686–9696.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Springer.
Choi, W. (2015). Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision, pp. 3029–3037.
Chu, P., Wang, J., You, Q., Ling, H., & Liu, Z. (2023). Transmot: Spatial-temporal graph transformer for multiple object tracking. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 4870–4880.
Chuang, M. C., Hwang, J. N., Williams, K., & Towler, R. (2014). Tracking live fish from low-contrast and low-frame-rate stereo videos. IEEE Transactions on Circuits and Systems for Video Technology, 25(1), 167–179.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773.
Dai, P., Weng, R., Choi, W., Zhang, C., He, Z., & Ding, W. (2021). Learning a proposal classifier for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2443–2452.
Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., & Leal-Taixé, L. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003
Evangelidis, G. D., & Psarakis, E. Z. (2008). Parametric image alignment using enhanced correlation coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1858–1865.
Feng, W., Bai, L., Yao, Y., Yu, F., & Ouyang, W. (2024). Towards frame rate agnostic multi-object tracking. International Journal of Computer Vision, 132(5), 1443–1462.
Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361. IEEE.
Gonzalez, N. F., Ospina, A., & Calvez, P. (2020). Smat: Smart multiple affinity metrics for multiple object tracking. In Image analysis and recognition: 17th international conference, ICIAR 2020, Póvoa de Varzim, Portugal, June 24–26, 2020, Proceedings, Part II 17, pp. 48–62. Springer.
Guo, S., Wang, J., Wang, X., & Tao, D. (2021). Online multiple object tracking with cross-task synergy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8136–8145.
He, J., Huang, Z., Wang, N., & Zhang, Z. (2021). Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5299–5309.
Isard, M., & Blake, A. (1998). Condensation-conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1), 5–28.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.
Karunasekera, H., Wang, H., & Zhang, H. (2019). Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access, 7, 104423–104434.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30.
Kim, C., Fuxin, L., Alotaibi, M., & Rehg, J. M. (2021). Discriminative appearance modeling with multi-track pooling for real-time multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9553–9562.
Kim, C., Li, F., Ciptadi, A., & Rehg, J. M. (2015). Multiple hypothesis tracking revisited. In Proceedings of the IEEE international conference on computer vision, pp. 4696–4704.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.
Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), pp. 734–750.
Le, Q. V., Smola, A. J., & Canu, S. (2005). Heteroscedastic Gaussian process regression. In Proceedings of the 22nd international conference on Machine learning, pp. 489–496.
Li, Y., Ai, H., Yamashita, T., Lao, S., & Kawade, M. (2008). Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1728–1740.
Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing, 31, 3182–3196.
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer.
Liu, Y., Wu, J., & Fu, Y. (2023). Collaborative tracking learning for frame-rate-insensitive multi-object tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9964–9973.
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al., (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499
Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129, 548–578.
Luo, W., Stenger, B., Zhao, X., & Kim, T. K. (2018). Trajectories as topics: Multi-object tracking by topic discovery. IEEE Transactions on Image Processing, 28(1), 240–252.
Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., & Kim, T. K. (2021). Multiple object tracking: A literature review. Artificial Intelligence, 293, 103448.
Ma, C., Yang, F., Li, Y., Jia, H., Xie, X., & Gao, W. (2021). Deep human-interaction and association by graph-based learning for multiple object tracking in the wild. International Journal of Computer Vision, 129, 1993–2010.
Meinhardt, T., Kirillov, A., Leal-Taixe, L., & Feichtenhofer, C. (2022). Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8844–8854.
Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831
Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., & Yu, F. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 164–173.
Peng, J., Wang, C., Wan, F., Wu, Y., Wang, Y., Tai, Y., Wang, C., Li, J., Huang, F., & Fu, Y. (2020). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European conference on computer vision, pp. 145–161. Springer.
Qin, Z., Zhou, S., Wang, L., Duan, J., Hua, G., & Tang, W. (2023). Motiontrack: Learning robust short-term and long-term motions for multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17939–17948.
Rangesh, A., Maheshwari, P., Gebre, M., Mhatre, S., Ramezani, V., & Trivedi, M. M. (2021). Trackmpnn: A message passing graph neural architecture for multi-object tracking. arXiv preprint arXiv:2101.04206
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666.
Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Springer.
Saleh, F., Aliakbarian, S., Rezatofighi, H., Salzmann, M., & Gould, S. (2021). Probabilistic tracklet scoring and inpainting for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14329–14339.
Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., & Sun, J. (2018). Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123
Sun, S., Akhtar, N., Song, X., Song, H., Mian, A., & Shah, M. (2020). Simultaneous detection and tracking with motion modelling for multiple object tracking. In European conference on computer vision, pp. 626–643. Springer.
Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., & Luo, P. (2020). Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460
Sun, J., Shen, Z., Wang, Y., Bao, H., & Zhou, X. (2021). Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8922–8931.
Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 402–419. Springer.
Tokmakov, P., Jabri, A., Li, J., & Gaidon, A. (2022). Object permanence emerges in a random walk along memory. In International conference on machine learning, pp. 21506–21519. PMLR.
Tokmakov, P., Li, J., Burgard, W., & Gaidon, A. (2021). Learning to track with object permanence. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10860–10869.
Wang, G., Gu, R., Liu, Z., Hu, W., Song, M., & Hwang, J. N. (2021). Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9876–9886.
Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020). Towards real-time multi-object tracking. In European conference on computer vision, pp. 107–122. Springer.
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pp. 3645–3649. IEEE.
Xu, J., Cao, Y., Zhang, Z., & Hu, H. (2019). Spatial-temporal relation networks for multi-object tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3988–3998.
Yoon, J. H., Lee, C. R., Yang, M. H., & Yoon, K. J. (2019). Structural constraint data association for online multi-object tracking. International Journal of Computer Vision, 127, 1–21.
Yu, E., Li, Z., & Han, S. (2022). Towards discriminative representation: Multi-view trajectory contrastive learning for online multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8834–8843.
Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., & Yan, J. (2016). Poi: Multiple object tracking with high performance detection and appearance feature. In Computer vision–ECCV 2016 workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pp. 36–42. Springer.
Yu, F., Wang, D., Shelhamer, E., & Darrell, T. (2018). Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2403–2412.
Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., & Wei, Y. (2022). Motr: End-to-end multiple-object tracking with transformer. In Computer vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pp. 659–675. Springer.
Zhang, X., Hu, W., Xie, N., Bao, H., & Maybank, S. (2015). A robust tracking system for low frame rate video. International Journal of Computer Vision, 115, 279–304.
Zhang, Y., Sheng, H., Wu, Y., Wang, S., Ke, W., & Xiong, Z. (2020). Multiplex labeling graph for near-online tracking in crowded scenes. IEEE Internet of Things Journal, 7(9), 7892–7902.
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). Bytetrack: Multi-object tracking by associating every detection box. In Computer vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 1–21. Springer.
Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069–3087.
Zhang, Y., Wang, T., & Zhang, X. (2023). Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22056–22065.
Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., & Tian, Q. (2016). Mars: A video benchmark for large-scale person re-identification. In Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14, pp. 868–884. Springer.
Zhou, T., Luo, W., Shi, Z., Chen, J., & Ye, Q. (2022). Apptracker: Improving tracking multiple objects in low-frame-rate videos. In Proceedings of the 30th ACM international conference on multimedia, pp. 6664–6674.
Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In European conference on computer vision, pp. 474–490. Springer.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850
Zhou, Z., Luo, W., Wang, Q., Xing, J., & Hu, W. (2020). Distractor-aware discrimination learning for online multiple object tracking. Pattern Recognition, 107, 107512.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159
Additional information
Communicated by Matej Kristan.
Appendices
Appendix A: Detailed Experimental Setups
This section elaborates on the detailed setups of experiments on MOT17 (Milan et al., 2016) and MOT20 (Dendorfer et al., 2020).
FairMOT (Zhang et al., 2021). It is a ReID-based model that randomly samples frames during training, making its training stage frame-rate-agnostic.
ByteTrack (Zhang et al., 2022). The tracking module of ByteTrack only includes a Kalman filter without learnable parameters. In our experiments, we evaluated ByteTrack with the same detection set obtained from our model.
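For reference, the kind of parameter-free, constant-velocity Kalman filter used by such trackers can be sketched as follows (a minimal sketch in Python; the state parameterization, noise magnitudes, and class name are our assumptions, not ByteTrack's exact implementation):

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over a box state
    [cx, cy, w, h, vcx, vcy, vw, vh]; noise magnitudes are illustrative."""

    def __init__(self, box, q=1e-2, r=1e-1):
        self.x = np.r_[box, np.zeros(4)]          # state: box + zero-initialized velocity
        self.P = np.eye(8)                        # state covariance
        self.F = np.eye(8)                        # transition: position += velocity
        self.F[:4, 4:] = np.eye(4)
        self.H = np.eye(4, 8)                     # only the box is observed
        self.Q = q * np.eye(8)                    # process noise
        self.R = r * np.eye(4)                    # observation noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                         # predicted box used for IoU matching

    def update(self, box):
        y = box - self.H @ self.x                 # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
```

The zero-initialized velocity in the constructor also illustrates the zero-velocity initialization issue discussed in Appendix D.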
Tracktor (Bergmann et al., 2019). The trainable module of Tracktor is an object detector (Ren et al., 2015) and thus is frame-rate-agnostic. In the inference stage, tracking is achieved by propagating the bounding boxes from the previous frame to the next frame as proposals and leveraging the denoising capability of the detector’s regression branch to obtain the refined boxes.
APPTracker/APPTracker+. Our model takes two adjacent frames as input and is trained on 1/10-frame-rate videos. Following CenterTrack (Zhou et al., 2020a), the time interval between the two input frames is uniformly sampled from {-20, -10, 0, 10, 20}, where pairs in temporally reverse order are allowed.
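A minimal sketch of this frame-pair sampling is given below (illustrative Python; function and variable names are our own):

```python
import random

FRAME_OFFSETS = [-20, -10, 0, 10, 20]  # offsets in frames; negative values reverse the pair order

def sample_training_pair(num_frames: int, offsets=FRAME_OFFSETS):
    """Sample a (current, reference) frame-index pair for training.

    The offset is drawn uniformly from `offsets`; negative offsets yield
    temporally reversed pairs, which are allowed as a form of augmentation.
    """
    while True:
        t = random.randrange(num_frames)
        t_ref = t + random.choice(offsets)
        if 0 <= t_ref < num_frames:
            return t, t_ref
```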
CenterTrack (Zhou et al., 2020a). To ensure a fair comparison on the same detection set, CenterTrack is evaluated using the detections and displacements estimated by our model while employing its own association logic. The performance of CenterTrack reproduced in this way slightly exceeds that reported by the original authors, as shown in Table 9. This is due to different policies for ignoring low-visibility objects during training, as discussed in Sect. 3.3.1 of the main text.
Buffer size. Table 10 lists the buffer sizes (above which an inactive tracklet is terminated) we adopted for different frame rate settings.
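For clarity, the role of the buffer size in track management can be sketched as follows (illustrative Python; the attribute names are our own):

```python
def prune_inactive_tracks(tracks, current_frame: int, buffer_size: int):
    """Terminate tracklets that have stayed unmatched for more than `buffer_size` frames.

    `tracks` is assumed to be a list of objects with a `last_matched_frame`
    attribute; both the attribute name and the list layout are illustrative.
    """
    kept, terminated = [], []
    for trk in tracks:
        if current_frame - trk.last_matched_frame > buffer_size:
            terminated.append(trk)
        else:
            kept.append(trk)
    return kept, terminated
```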
For each method, only one model is trained on each dataset and is tested at different frame rates.
Appendix B: Cross-Dataset Evaluation with Multi-Frame-Rate Settings
Figure 10 shows the multi-frame-rate experiments for the cross-dataset evaluation (trained on MOT17 (Milan et al., 2016) and tested on MOT20 (Dendorfer et al., 2020)), where we maintain our original experimental setup and use the same detection set for all methods except FairMOT and FairMOT*. As shown in Fig. 10, the shape of the curves approximately aligns with the previous results on the MOT17 validation set, except for the ReID-based method denoted by FairMOT*. Besides, we note that (1) in high-frame-rate videos, our method falls slightly behind ByteTrack (Zhang et al., 2022) and FairMOT (Zhang et al., 2021), mainly due to the advantages of the Kalman filter (Kalman, 1960) in high-frame-rate scenarios, as analyzed in Table 3 of the main text. This gap can be mitigated on the engineering side by switching the motion model according to the target frame rate; (2) the detection performance of our model is slightly affected by the frame rate, because our detection branch is learned jointly with the displacement estimation branch, which shares the two-frame input. On the engineering side, this influence may be removed by decoupling the detection and displacement estimation modules.
Appendix C: Further Investigations on Re-Identification Models
C.1 FairMOT
In this section, we conduct a comprehensive investigation of FairMOT (Zhang et al., 2021). The conclusions can be summarized in two points: (1) FairMOT's excessive confidence in historical appearance embeddings leads to significant additional failures in low-frame-rate videos; (2) differences in detection entries are one of the main reasons for the performance degradation of ReID-based trackers in low-frame-rate videos, and this is not well addressed by adjusting the matching threshold. We elaborate on the details below. To investigate the ReID performance of FairMOT in isolation, in the following experiments we disable the Kalman filter together with the strict spawning strategy.
Impact of matching threshold. We first study the impact of adjusting the matching threshold \(\tau \). Given two appearance embeddings, \({\varvec{e}}_a\) and \({\varvec{e}}_b\), FairMOT takes the cosine distance \(1-\frac{{\varvec{e}}_a \cdot {\varvec{e}}_b}{\Vert {\varvec{e}}_a \Vert _2 \Vert {\varvec{e}}_b \Vert _2}\), which ranges from 0 to 2, as the matching cost. According to Table 11, adjusting the matching threshold does not yield significant performance improvements. We set the matching threshold \(\tau \) to 0.5 throughout our experiments (the official value for normal-frame-rate cases is 0.4).
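As a reference for how \(\tau \) enters the association step, a minimal sketch of appearance-only matching is given below (illustrative Python; array layouts and function names are our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_by_appearance(track_embs: np.ndarray, det_embs: np.ndarray, tau: float = 0.5):
    """Match tracks to detections with cosine-distance costs and a matching threshold tau.

    track_embs: (M, D) track embeddings; det_embs: (N, D) detection embeddings.
    Returns matched (track, detection) index pairs whose cost is below tau.
    """
    t = track_embs / (np.linalg.norm(track_embs, axis=1, keepdims=True) + 1e-12)
    d = det_embs / (np.linalg.norm(det_embs, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - t @ d.T                      # cosine distance in [0, 2]
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < tau]
```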
Impact of online appearance embedding update. We further investigated FairMOT and found that its excessive confidence in historical appearance embeddings leads to significant additional failures in low-frame-rate videos. Specifically, FairMOT (Zhang et al., 2021) smoothly updates the latest appearance embedding of each tracklet. At a new time step \(t\), the registered appearance embedding \(\varvec{e}^{t}_{\text {smooth}}\) of an associated tracklet is updated by
$$\varvec{e}^{t}_{\text {smooth}} = \alpha \, \varvec{e}^{t-1}_{\text {smooth}} + (1 - \alpha)\, \varvec{e}^{t},$$
where \(\alpha \) is a smoothness factor set to 0.9 by default, indicating low confidence in the new appearance observation \(\varvec{e}^{t}\). We perform a grid search on \(\alpha \) for various frame rate settings in Table 12 and plot the curves for \(\alpha \in \{0.0, 0.3, 0.6, 0.9\}\) in Fig. 11 for ease of observation. We use a fixed \(\alpha \) of 0.6 in the main text to avoid over-tailored hyperparameters.
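The update rule above amounts to an exponential moving average of appearance embeddings; a minimal sketch is given below (illustrative Python; the final re-normalization is a common implementation detail and an assumption here):

```python
import numpy as np

def update_track_embedding(e_smooth: np.ndarray, e_new: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Exponential moving average update of a track's appearance embedding.

    alpha weights the historical embedding: alpha = 0 keeps only the newest
    observation, while alpha = 0.9 lets the embedding change only slowly.
    """
    e = alpha * e_smooth + (1.0 - alpha) * e_new
    return e / (np.linalg.norm(e) + 1e-12)  # re-normalize to unit length
```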
Fig. 12 Typical failure cases of FairMOT* (1/10 frame rate). In the first row of each case, we present cosine similarity heatmaps of appearance embeddings and annotate the query object in the first frame. The second and third rows provide the tracking results of FairMOT* and our method, respectively. The appearance embedding is updated frame by frame with the historical embedding weight \(\alpha = 0\). We highlight the recognized appearing objects in our results by indicating the predicted APP scores at the top right of the corresponding bounding boxes
According to Table 12, the optimal values of \(\alpha \) are roughly distributed along the diagonal. This is reasonable, considering that targets exhibit larger appearance changes as the frame rate goes lower. According to Fig. 8a and b of the main text, while FairMOT* achieves IDF1 scores comparable to CenterTrack (Zhou et al., 2020a), there is still a gap in terms of AssA. This is because the IDF1 score rewards correctly estimating the total number of unique objects in a scene more than good detection or association (Luiten et al., 2021). Without the Kalman filter and the strict spawning strategy, FairMOT* tends to transfer identities between targets rather than spawn new tracks with new identities, a behavior favored by the IDF1 score. Another weakness of smaller values of \(\alpha \) is that they may compromise long-term re-identification across occlusions.
Finally, we once again conduct a grid search over the matching threshold \(\tau \) with \(\alpha =0.5\). As shown in Table 13, using a small value of \(\alpha \) makes the performance more sensitive to the matching threshold \(\tau \). However, compared to the default threshold of 0.4, parameter tuning still brings only minor gains.
Impact of difference in detection entries. Further, we visualize the typical failure cases of the ReID-based tracking in Fig. 12, where we find that the differences in detection entries caused by occlusion are one of the main reasons for the performance degradation of ReID-based trackers in low-frame-rate videos. Considering that identifying the detection variations is one of the main ideas in our work, we believe our work has the potential to enhance the performance of ReID-based trackers in low-frame-rate scenarios.
C.2 DeepSORT
Besides FairMOT (Zhang et al., 2021), we also study the effectiveness of DeepSORT with a pre-trained ReID model and our detection set. Specifically, we follow the same DeepSORT implementation adopted by ByteTrack (Zhang et al., 2022), where the ReID model was trained on a large-scale ReID dataset, MARS (Zheng et al., 2016), containing over 1,100,000 images of 1,261 pedestrians. We provide the results on the MOT17 (Milan et al., 2016) validation set in Fig. 13. To study the tracking performance through ReID alone, we also include a version with the Kalman filter (Kalman, 1960) turned off (denoted as DeepSORT*). Additionally, we experiment with different matching thresholds \(\tau \) (see Notes), but observe no benefit compared to the default setting of \(\tau =0.1\).
For comparison, we also include the results of APPTracker+ (our method) and FairMOT* in Fig. 13; note, however, that APPTracker+ and FairMOT* are trained on half of the MOT17 data, so a direct comparison is potentially unfair. From Fig. 13, the pre-trained ReID model demonstrates better appearance discrimination capability than FairMOT*, especially at high frame rates.
Further, in Fig. 14, we provide results on the MOT20 (Dendorfer et al., 2020) validation set, which is characterized by crowded scenes and poor lighting conditions. Similarly, we include the results of APPTracker+ and FairMOT*, which are trained on half of the MOT17 data. As shown in Fig. 14, under crowded and poorly lit conditions, (1) the pre-trained ReID model also experiences reduced effectiveness, and (2) comparing DeepSORT with DeepSORT*, the motion model consistently brings tracking performance improvements (as shown by the IDF1 score and the association accuracy, AssA) but leads to a large decrease in DetA at low frame rates.
Appendix D: ByteTrack with GIoU
In Table 14, we report the results of adopting GIoU (Rezatofighi et al., 2019) in the Kalman-filter-based tracker ByteTrack (Zhang et al., 2022) with various matching threshold settings, but observe only minor performance gains. The failure of the motion model in low-frame-rate videos is mainly due to (1) the failure of zero-velocity initialization, and (2) rapid changes in target velocity.
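For completeness, GIoU extends IoU with a penalty based on the smallest enclosing box, so that even non-overlapping boxes receive a graded similarity; a minimal sketch is given below (illustrative Python, not the implementation used in our experiments):

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area if c_area > 0 else iou
```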
Appendix E: Computational Cost
We report the FLOPs, MACs, and parameter counts in Table 15. Compared with CenterTrack (Zhou et al., 2020a), our method introduces 7.7% extra FLOPs/MACs and 0.7% extra parameters when running on MOT17 (Milan et al., 2016) and MOT20 (Dendorfer et al., 2020). Since the APP prediction head is not used when running on KITTI (Geiger et al., 2012), our method introduces only negligible additional computational cost there.
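Such statistics can be reproduced with standard tools; a minimal sketch is given below (illustrative Python; the input resolution is an assumption, not the exact one used in our experiments):

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# MACs/FLOPs can be profiled with a tool such as thop (assumed to be installed):
#   from thop import profile
#   macs, params = profile(model, inputs=(torch.randn(1, 3, 544, 960),))
# The input resolution above is illustrative only.
```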