
APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking

Published in: International Journal of Computer Vision

Abstract

Multi-object tracking (MOT) in low-frame-rate videos is a promising solution to better meet the computing, storage, and transmission bandwidth constraints of edge devices. Tracking at a low frame rate poses particular challenges in the association stage, as objects in two successive frames typically exhibit much larger variations in location, velocity, appearance, and visibility than at normal frame rates. In this paper, we observe severe performance degradation of many existing association strategies caused by such variations. Although optical-flow-based methods such as CenterTrack can handle large displacements to some extent owing to their large receptive field, their temporally local nature makes them fail to give reliable displacement estimates for objects that newly appear in the current frame (i.e., are not visible in the previous frame). To overcome this local nature, we propose an online tracking method that extends the CenterTrack architecture with a new head, named APP, to recognize unreliable displacement estimations. Further, to capture the fine-grained and private unreliability of each displacement estimation, we extend the binary APP predictions to displacement uncertainties. To this end, we reformulate the displacement estimation task via Bayesian deep learning tools. With APP predictions, we propose to conduct association in a multi-stage manner, where vision cues or historical motion cues are leveraged in the corresponding stages. By rethinking the commonly used bipartite matching algorithms, we equip the proposed multi-stage association policy with a hybrid matching strategy conditioned on displacement uncertainties. Our method shows robustness in preserving identities in low-frame-rate video sequences. Experimental results on public datasets under various low-frame-rate settings demonstrate the advantages of the proposed method.



Data Availability

The datasets analyzed during the current study are available in the CrowdHuman, MOTChallenge, and KITTI repositories: http://www.crowdhuman.org/, https://motchallenge.net/data/MOT17/, https://motchallenge.net/data/MOT20/, and https://www.cvlibs.net/download.php?file=data_tracking_image_2.zip.

Notes

  1. To evaluate the performance of the re-identification head of FairMOT (Zhang et al., 2021), we disable the Kalman filter inside the original FairMOT method. Details can be found in Sect. 4.6.

  2. All the oracle experiments are conducted on the MOT17 dataset with a 1/10 frame rate.

  3. To eliminate association noise, we provide ground-truth displacements for all detections, even if they were discarded in the previous frame.

  4. DeepSORT rejects matches with \((1 - \cos (\textbf{e}_a, \textbf{e}_b)) > \tau \), where \(\textbf{e}_a\) and \(\textbf{e}_b\) are the appearance embeddings of a pair of detection and track, respectively. Unlike FairMOT (Zhang et al., 2021), where the memorized appearance embedding of a track is updated by new observations with a fixed weight \(\alpha \) at each time step, DeepSORT uses extra memory to store the appearance embeddings of all previous observations for each track. Therefore, the matching threshold \(\tau \) is the only hyperparameter that needs to be tuned for DeepSORT.

References

  • Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., & Savarese, S. (2016). Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971.

  • Ballas, N., Yao, L., Pal, C., & Courville, A. (2015). Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432

  • Bergmann, P., Meinhardt, T., & Leal-Taixe, L. (2019). Tracking without bells and whistles. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 941–951.

  • Bernardin, K., & Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The clear mot metrics. EURASIP Journal on Image and Video Processing, 2008, 1–10.


  • Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pp. 3464–3468. IEEE.

  • Brasó, G., Cetintas, O., & Leal-Taixé, L. (2022). Multi-object tracking and segmentation via neural message passing. International Journal of Computer Vision, 130(12), 3035–3053.


  • Brasó, G., & Leal-Taixé, L. (2020). Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6247–6257.

  • Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631.

  • Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., & Fu, C. (2022). Tctrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14798–14808.

  • Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., & Fu, C. (2023). Towards real-world visual tracking with temporal contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Cao, J., Pang, J., Weng, X., Khirodkar, R., & Kitani, K. (2023). Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9686–9696.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Springer.

  • Choi, W. (2015). Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision, pp. 3029–3037.

  • Chu, P., Wang, J., You, Q., Ling, H., & Liu, Z. (2023). Transmot: Spatial-temporal graph transformer for multiple object tracking. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 4870–4880.

  • Chuang, M. C., Hwang, J. N., Williams, K., & Towler, R. (2014). Tracking live fish from low-contrast and low-frame-rate stereo videos. IEEE Transactions on Circuits and Systems for Video Technology, 25(1), 167–179.


  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773.

  • Dai, P., Weng, R., Choi, W., Zhang, C., He, Z., & Ding, W. (2021). Learning a proposal classifier for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2443–2452.

  • Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., & Leal-Taixé, L. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003

  • Evangelidis, G. D., & Psarakis, E. Z. (2008). Parametric image alignment using enhanced correlation coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1858–1865.


  • Feng, W., Bai, L., Yao, Y., Yu, F., & Ouyang, W. (2024). Towards frame rate agnostic multi-object tracking. International Journal of Computer Vision, 132(5), 1443–1462.


  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361. IEEE.

  • Gonzalez, N. F., Ospina, A., & Calvez, P. (2020). Smat: Smart multiple affinity metrics for multiple object tracking. In Image analysis and recognition: 17th international conference, ICIAR 2020, Póvoa de Varzim, Portugal, June 24–26, 2020, Proceedings, Part II 17, pp. 48–62. Springer.

  • Guo, S., Wang, J., Wang, X., & Tao, D. (2021). Online multiple object tracking with cross-task synergy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8136–8145.

  • He, J., Huang, Z., Wang, N., & Zhang, Z. (2021). Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5299–5309.

  • Isard, M., & Blake, A. (1998). Condensation-conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1), 5–28.


  • Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.

  • Karunasekera, H., Wang, H., & Zhang, H. (2019). Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access, 7, 104423–104434.


  • Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30.

  • Kim, C., Fuxin, L., Alotaibi, M., & Rehg, J. M. (2021). Discriminative appearance modeling with multi-track pooling for real-time multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9553–9562.

  • Kim, C., Li, F., Ciptadi, A., & Rehg, J. M. (2015). Multiple hypothesis tracking revisited. In Proceedings of the IEEE international conference on computer vision, pp. 4696–4704.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.


  • Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), pp. 734–750.

  • Le, Q. V., Smola, A. J., & Canu, S. (2005). Heteroscedastic Gaussian process regression. In Proceedings of the 22nd international conference on Machine learning, pp. 489–496.

  • Li, Y., Ai, H., Yamashita, T., Lao, S., & Kawade, M. (2008). Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1728–1740.


  • Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing, 31, 3182–3196.


  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer.

  • Liu, Y., Wu, J., & Fu, Y. (2023). Collaborative tracking learning for frame-rate-insensitive multi-object tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9964–9973.

  • Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al., (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499

  • Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129, 548–578.


  • Luo, W., Stenger, B., Zhao, X., & Kim, T. K. (2018). Trajectories as topics: Multi-object tracking by topic discovery. IEEE Transactions on Image Processing, 28(1), 240–252.


  • Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., & Kim, T. K. (2021). Multiple object tracking: A literature review. Artificial Intelligence, 293, 103448.


  • Ma, C., Yang, F., Li, Y., Jia, H., Xie, X., & Gao, W. (2021). Deep human-interaction and association by graph-based learning for multiple object tracking in the wild. International Journal of Computer Vision, 129, 1993–2010.


  • Meinhardt, T., Kirillov, A., Leal-Taixe, L., & Feichtenhofer, C. (2022). Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8844–8854.

  • Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831

  • Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., & Yu, F. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 164–173.

  • Peng, J., Wang, C., Wan, F., Wu, Y., Wang, Y., Tai, Y., Wang, C., Li, J., Huang, F., & Fu, Y. (2020) Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European conference on computer vision, pp. 145–161. Springer.

  • Qin, Z., Zhou, S., Wang, L., Duan, J., Hua, G., & Tang, W. (2023). Motiontrack: Learning robust short-term and long-term motions for multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17939–17948.

  • Rangesh, A., Maheshwari, P., Gebre, M., Mhatre, S., Ramezani, V., & Trivedi, M. M. (2021). Trackmpnn: A message passing graph neural architecture for multi-object tracking. arXiv preprint arXiv:2101.04206

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.

  • Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666.

  • Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Springer.

  • Saleh, F., Aliakbarian, S., Rezatofighi, H., Salzmann, M., & Gould, S. (2021). Probabilistic tracklet scoring and inpainting for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14329–14339.

  • Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., & Sun, J. (2018). Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123

  • Sun, S., Akhtar, N., Song, X., Song, H., Mian, A., & Shah, M. (2020). Simultaneous detection and tracking with motion modelling for multiple object tracking. In European conference on computer vision, pp. 626–643. Springer.

  • Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., & Luo, P. (2020). Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460

  • Sun, J., Shen, Z., Wang, Y., Bao, H., & Zhou, X. (2021). Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8922–8931.

  • Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 402–419. Springer.

  • Tokmakov, P., Jabri, A., Li, J., & Gaidon, A. (2022). Object permanence emerges in a random walk along memory. In International conference on machine learning, pp. 21506–21519. PMLR.

  • Tokmakov, P., Li, J., Burgard, W., & Gaidon, A. (2021). Learning to track with object permanence. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10860–10869.

  • Wang, G., Gu, R., Liu, Z., Hu, W., Song, M., & Hwang, J. N. (2021). Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9876–9886.

  • Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020). Towards real-time multi-object tracking. In European conference on computer vision, pp. 107–122. Springer.

  • Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pp. 3645–3649. IEEE.

  • Xu, J., Cao, Y., Zhang, Z., & Hu, H. (2019). Spatial-temporal relation networks for multi-object tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3988–3998.

  • Yoon, J. H., Lee, C. R., Yang, M. H., & Yoon, K. J. (2019). Structural constraint data association for online multi-object tracking. International Journal of Computer Vision, 127, 1–21.


  • Yu, E., Li, Z., & Han, S. (2022). Towards discriminative representation: Multi-view trajectory contrastive learning for online multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8834–8843.

  • Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., & Yan, J. (2016). Poi: Multiple object tracking with high performance detection and appearance feature. In Computer vision–ECCV 2016 workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pp. 36–42. Springer.

  • Yu, F., Wang, D., Shelhamer, E., & Darrell, T. (2018). Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2403–2412.

  • Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., & Wei, Y. (2022). Motr: End-to-end multiple-object tracking with transformer. In Computer vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pp. 659–675. Springer.

  • Zhang, X., Hu, W., Xie, N., Bao, H., & Maybank, S. (2015). A robust tracking system for low frame rate video. International Journal of Computer Vision, 115, 279–304.


  • Zhang, Y., Sheng, H., Wu, Y., Wang, S., Ke, W., & Xiong, Z. (2020). Multiplex labeling graph for near-online tracking in crowded scenes. IEEE Internet of Things Journal, 7(9), 7892–7902.


  • Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). Bytetrack: Multi-object tracking by associating every detection box. In Computer vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 1–21. Springer.

  • Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069–3087.


  • Zhang, Y., Wang, T., & Zhang, X. (2023). Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22056–22065.

  • Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., & Tian, Q. (2016). Mars: A video benchmark for large-scale person re-identification. In Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14, pp. 868–884. Springer.

  • Zhou, T., Luo, W., Shi, Z., Chen, J., & Ye, Q. (2022). Apptracker: Improving tracking multiple objects in low-frame-rate videos. In Proceedings of the 30th ACM international conference on multimedia, pp. 6664–6674.

  • Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850

  • Zhou, Z., Luo, W., Wang, Q., Xing, J., & Hu, W. (2020). Distractor-aware discrimination learning for online multiple object tracking. Pattern Recognition, 107, 107512.


  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159


Author information

Corresponding author

Correspondence to Qi Ye.

Additional information

Communicated by Matej Kristan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Detailed Experimental Setups

This section elaborates on the detailed setups of experiments on MOT17 (Milan et al., 2016) and MOT20 (Dendorfer et al., 2020).

FairMOT (Zhang et al., 2021). FairMOT is a ReID-based model that randomly samples frames during training, making its training stage frame-rate-agnostic.

ByteTrack (Zhang et al., 2022). The tracking module of ByteTrack consists only of a Kalman filter without learnable parameters. In our experiments, we evaluate ByteTrack on the same detection set obtained from our model.
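
For reference, the following Python sketch shows a SORT-style constant-velocity Kalman filter of the kind used by ByteTrack; the state layout, noise values, and class interface are illustrative assumptions rather than ByteTrack's exact implementation.

```python
import numpy as np

# A minimal constant-velocity Kalman filter for a single track, assuming a
# SORT-style state [cx, cy, a, h, vx, vy, va, vh] (box centre, aspect ratio,
# height, and their velocities). Noise values are illustrative only.
class ConstantVelocityKF:
    def __init__(self, box):  # box = (cx, cy, a, h)
        self.x = np.zeros(8)
        self.x[:4] = box                    # zero-velocity initialization
        self.P = np.eye(8)                  # state covariance
        self.F = np.eye(8)
        self.F[:4, 4:] = np.eye(4)          # position_{t+1} = position_t + velocity_t
        self.H = np.eye(4, 8)               # only the box is observed, not the velocity
        self.Q = np.eye(8) * 1e-2           # process noise
        self.R = np.eye(4) * 1e-1           # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                   # predicted box used for IoU-based matching

    def update(self, z):                    # z: matched detection box
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P
```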

Tracktor (Bergmann et al., 2019). The trainable module of Tracktor is an object detector (Ren et al., 2015) and thus is frame-rate-agnostic. In the inference stage, tracking is achieved by propagating the bounding boxes from the previous frame to the next frame as proposals and leveraging the denoising capability of the detector’s regression branch to obtain the refined boxes.

APPTracker/APPTracker+. Our model takes two adjacent frames as input and is trained on 1/10-frame-rate videos. Following CenterTrack (Zhou et al., 2020a), the time interval between the two input frames is uniformly sampled from {-20, -10, 0, 10, 20}, where pairs in temporally reverse order are allowed.
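
As an illustration of this sampling scheme, the sketch below draws the auxiliary frame at an offset from the current frame; the function name is ours, and we assume the offsets are counted in original (full-frame-rate) frame indices.

```python
import random

# Illustrative sketch of the frame-pair sampling described above. Offsets are
# counted in original (full-frame-rate) frame indices, so +/-10 corresponds to
# one 1/10-frame-rate step and +/-20 to two; negative offsets give temporally
# reversed pairs. Names and indexing are our assumptions.
def sample_training_pair(num_frames, step=10):
    t = random.randrange(num_frames)                  # current frame index
    offsets = [-2 * step, -step, 0, step, 2 * step]
    valid = [o for o in offsets if 0 <= t + o < num_frames]
    t_aux = t + random.choice(valid)                  # auxiliary ("previous") frame
    return t_aux, t
```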

CenterTrack (Zhou et al., 2020a). To ensure a fair comparison on the same detection set, CenterTrack is evaluated using detections and displacements estimated by our model, while employing its own association logic. The simulated performance of CenterTrack slightly exceeds that reported by the original authors, as shown in Table 9. This is due to different policies for ignoring low-visibility objects during training, as explained in Sect. 3.3.1 of the main text.

Table 10 Buffer sizes for different settings of \(n_d\)

Buffer size. Table 10 lists the buffer sizes (above which an inactive tracklet is terminated) we adopted for different frame rate settings.
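
The termination rule implied by the buffer size can be sketched as follows; field and function names are illustrative assumptions.

```python
# Illustrative termination rule implied by the buffer sizes in Table 10: an
# unmatched (inactive) track survives for at most `buffer_size` frames after its
# last successful association. Field names are assumptions for this sketch.
def prune_tracks(tracks, current_frame, buffer_size):
    kept = []
    for trk in tracks:
        if current_frame - trk.last_matched_frame <= buffer_size:
            kept.append(trk)   # still inside the buffer; may be re-associated later
        # otherwise the track is terminated and its identity is retired
    return kept
```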

Fig. 10

Cross-dataset evaluation (trained on MOT17 (Milan et al., 2016) and tested on MOT20 (Dendorfer et al., 2020)) with multiple frame-rate downsampling factors \(n_d\) (x-axis). Arrows indicate the direction of better performance

For each method, only one model is trained on each dataset and is tested at different frame rates.

Appendix B: Cross-Dataset Evaluation with Multi-Frame-Rate Settings

Figure 10 shows the multi-frame-rate experiments for the cross-dataset evaluation (trained on MOT17 (Milan et al., 2016) and tested on MOT20 (Dendorfer et al., 2020)), where we maintain our original experimental setup and use the same detection set for all methods except FairMOT and FairMOT*. As shown in Fig. 10, the shapes of the curves approximately align with the previous results on the MOT17 validation set, except for the ReID-based method denoted by FairMOT*. In addition, we note that (1) in high-frame-rate videos, our method falls slightly behind ByteTrack (Zhang et al., 2022) and FairMOT (Zhang et al., 2021), mainly due to the advantages of the Kalman filter (Kalman, 1960) in high-frame-rate scenarios, as analyzed in Table 3 of the main text. This gap can be mitigated on the engineering side by switching the motion model according to the target frame rate; (2) the detection performance of our model is slightly affected by the frame rate. This is because our detection branch is learned jointly with the displacement estimation branch, which shares the two adjacent input frames. On the engineering side, one may address this influence by decoupling the detection and displacement estimation modules.

Appendix C: Further Investigations on Re-Identification Models

C.1 FairMOT

In this section, we conduct a comprehensive investigation of FairMOT (Zhang et al., 2021). The conclusions can be summarized in two points: (1) FairMOT's excessive confidence in historical appearance embeddings leads to significant additional failures in low-frame-rate videos. (2) Differences in detection entries are one of the main reasons for the performance degradation of ReID-based trackers in low-frame-rate videos, and this issue is not well addressed by adjusting the matching threshold. We elaborate on the details below. To investigate the ReID performance of FairMOT, we disable the Kalman filter together with the strict spawning strategy in the following experiments.

Impact of matching threshold. We first study the impact of adjusting the matching threshold \(\tau \). Given two appearance embeddings, \({\varvec{e}}_a\) and \({\varvec{e}}_b\), FairMOT takes the cosine distance \(1-\frac{{\varvec{e}}_a \cdot {\varvec{e}}_b}{\Vert {\varvec{e}}_a \Vert _2 \Vert {\varvec{e}}_b \Vert _2}\) as the matching cost, which ranges from 0 to 2. According to Table 11, adjusting the matching threshold does not yield significant performance improvements. We set the matching threshold \(\tau \) to 0.5 throughout our experiments (the official setting is 0.4 for normal-frame-rate cases).
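
For clarity, the following sketch shows appearance-based association with this cosine-distance cost and a rejection threshold \(\tau \); it is a simplification of FairMOT's full association pipeline, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of appearance-based association with the cosine-distance cost above:
# matched pairs whose cost exceeds tau are rejected. This is a simplification of
# FairMOT's full pipeline; the function name is ours.
def associate_by_appearance(track_embs, det_embs, tau=0.5):
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                       # cosine distance, in [0, 2]
    rows, cols = linear_sum_assignment(cost)   # bipartite matching (Kuhn, 1955)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= tau]
```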

Table 11 IDF1 score vs. matching threshold \(\tau \) on MOT17 validation set (1/10-frame-rate)
Table 12 IDF1 score vs. \(\alpha \) and \(n_d\)
Fig. 11

IDF1 scores vs. \(\alpha \). Results on MOT17 validation set (1/10 frame rate)

Impact of online appearance embedding update. We further investigated FairMOT and found that its excessive confidence in historical appearance embeddings leads to significant additional failures in low-frame-rate videos. Specifically, FairMOT (Zhang et al., 2021) smoothly updates the latest appearance embeddings of tracklets. At a new time step t, the registered appearance embedding \(\varvec{e}^{t}_{\text {smooth}}\) of an associated tracklet is updated by

$$\begin{aligned} \varvec{e}^{t}_{\text {smooth}} = \alpha \varvec{e}^{t-1}_{\text {smooth}} + (1-\alpha )\varvec{e}^{t}, \end{aligned}$$
(10)

where \(\alpha \) is a smoothness factor set to 0.9 by default, indicating low confidence in the new appearance observation \(\varvec{e}^{t}\). We performed a grid search over \(\alpha \) for various frame-rate settings in Table 12 and plot the curves for \(\alpha \in \{0.0, 0.3, 0.6, 0.9\}\) in Fig. 11 for ease of observation. We use a fixed \(\alpha \) of 0.6 in the main text to avoid over-tailored hyperparameters.
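
Equation (10) corresponds to a simple exponential moving average; a minimal sketch is given below, where the final re-normalization is our assumption for cosine-distance matching rather than a detail taken from the official FairMOT code.

```python
import numpy as np

# Direct sketch of Eq. (10): exponential moving average of a track's appearance
# embedding. alpha = 0.9 is FairMOT's default; alpha = 0 keeps only the newest
# observation. The final re-normalization is our assumption for cosine matching.
def update_track_embedding(e_smooth_prev, e_new, alpha=0.9):
    e_smooth = alpha * e_smooth_prev + (1.0 - alpha) * e_new
    return e_smooth / (np.linalg.norm(e_smooth) + 1e-12)
```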

Fig. 12

Typical failure cases of FairMOT* (1/10 frame rate). In the first row of each case, we present cosine-similarity heatmaps of appearance embeddings and annotate the query object in the first frame. The second and third rows provide tracking results from FairMOT* and our method, respectively. The appearance embedding is updated frame by frame with the historical embedding weight \(\alpha = 0\). We highlight the recognized appearing objects in our results by indicating the predicted APP scores at the top right of the corresponding bounding boxes

Table 13 IDF1 score vs. matching threshold \(\tau \) (1/10-frame-rate)
Fig. 13

Results on MOT17 validation set

Fig. 14

Results on MOT20 validation set (APPTracker+ and FairMOT* are trained on MOT17)

According to Table 12, the optimal values of the parameter \(\alpha \) are roughly distributed along the diagonal. This is reasonable, considering that targets exhibit larger appearance changes as the frame rate goes lower. According to Fig. 8a and b of the main text, while FairMOT* achieves IDF1 scores comparable to CenterTrack (Zhou et al., 2020a), there is still a gap in terms of AssA. This is because the IDF1 score rewards correctly estimating the total number of unique objects in a scene more than it rewards good detection or association (Luiten et al., 2021). Without the Kalman filter and the strict spawning strategy, FairMOT* tends to transfer identities between targets rather than spawning new tracks with new identities, a behavior favored by the IDF1 score. Another weakness of using smaller values of \(\alpha \) is that it may compromise long-term re-identification across occlusions.

Finally, we again conducted a grid search over the matching threshold \(\tau \) with \(\alpha =0.5\). As shown in Table 13, using a small value of \(\alpha \) makes the performance more sensitive to the matching threshold \(\tau \). However, compared to the default threshold of 0.4, parameter tuning still brings only minor gains.

Impact of differences in detection entries. Further, we visualize typical failure cases of ReID-based tracking in Fig. 12, where we find that differences in detection entries caused by occlusion are one of the main reasons for the performance degradation of ReID-based trackers in low-frame-rate videos. Considering that identifying such detection variations is one of the main ideas of our work, we believe our method has the potential to enhance the performance of ReID-based trackers in low-frame-rate scenarios.

C.2 DeepSORT

Besides FairMOT (Zhang et al., 2021), we also study the effectiveness of DeepSORT with a pre-trained ReID model and our detection set. Specifically, we follow the same DeepSORT implementation adopted by ByteTrack (Zhang et al., 2022), where the ReID model was trained on a large-scale ReID dataset, MARS (Zheng et al., 2016), containing over 1,100,000 images of 1,261 pedestrians. We provide the results on the MOT17 (Milan et al., 2016) validation set in Fig. 13. To study the tracking performance through ReID alone, we also include a version with the Kalman filter (Kalman, 1960) turned off (denoted as DeepSORT*). Additionally, we experiment with different matching thresholds \(\tau \) (see Note 4), but receive no benefit compared to the default setting of \(\tau =0.1\).
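
As described in Note 4, DeepSORT scores a detection against the whole gallery of a track's past embeddings rather than a single smoothed embedding; a minimal sketch of this cost is given below, with variable names of our own choosing.

```python
import numpy as np

# Sketch of the gallery-based appearance cost of Note 4: DeepSORT keeps every
# past embedding of a track and uses the smallest cosine distance to the new
# detection; costs above tau are treated as infeasible. Variable names are ours.
def deepsort_appearance_cost(track_gallery, det_emb, tau=0.1):
    g = np.stack(track_gallery)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    d = det_emb / np.linalg.norm(det_emb)
    cost = float(np.min(1.0 - g @ d))        # best match over the whole gallery
    return cost if cost <= tau else np.inf   # rejected if above the threshold
```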

For comparison, we attach the results of APPTracker+ (our method) and FairMOT* in Fig. 13, but present them separately to avoid potentially unfair comparisons, as APPTracker+ and FairMOT* are trained on only half of the MOT17 data. As shown in Fig. 13, the pre-trained model demonstrates better appearance discrimination than FairMOT*, especially at high frame rates.

Table 14 IDF1 score vs. matching threshold on MOT17 validation set (1/10-frame-rate)

Further, in Fig. 14, we provide results on the MOT20 (Dendorfer et al., 2020) validation set, which is characterized by crowded scenes and poor lighting conditions. Similarly, we attach the results of APPTracker+ and FairMOT*, which are trained on half of the MOT17 data. As shown in Fig. 14, under crowded and poor lighting conditions, (1) the pre-trained ReID model also experiences reduced effectiveness, and (2) comparing the results of DeepSORT and DeepSORT*, the motion model is found to consistently bring tracking performance improvements (as shown by the IDF1 score and the association accuracy, AssA), but leads to a large decrease in DetA at low frame rates.

Appendix D: ByteTrack with GIoU

In Table 14, we report the results of adopting GIoU (Rezatofighi et al., 2019) in the Kalman-filter-based tracker ByteTrack (Zhang et al., 2022) with various matching-threshold settings, but observe only minor performance gains. The failure of the motion model in low-frame-rate videos is mainly due to (1) the failure of zero-velocity initialization and (2) rapid changes in target velocity.
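
For completeness, the sketch below computes the GIoU of two axis-aligned boxes following Rezatofighi et al. (2019); \(1-\text {GIoU}\) can then be used as the association cost in place of \(1-\text {IoU}\). It is a generic implementation, not the exact code used in our experiments.

```python
# Generic GIoU (Rezatofighi et al., 2019) for two axis-aligned boxes given as
# (x1, y1, x2, y2); 1 - giou(a, b) can serve as an association cost.
def giou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # smallest enclosing box C
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch
    return iou - (c_area - union) / c_area if c_area > 0 else iou
```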

Table 15 Computational cost

Appendix E: Computational Cost

We report the FLOPs, MACs, and parameter counts in Table 15. Compared with CenterTrack (Zhou et al., 2020a), our method introduces 7.7% extra FLOPs/MACs and 0.7% extra parameters when running on MOT17 (Milan et al., 2016) and MOT20 (Dendorfer et al., 2020). Since the APP prediction head is absent when running on KITTI (Geiger et al., 2012), our method introduces only negligible additional computational cost there.
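
A hypothetical way to reproduce such a measurement is sketched below using the thop profiler; the tool choice, input resolution, and the model's forward signature are all assumptions and may differ from our actual setup.

```python
import torch
from thop import profile  # the choice of profiler is our assumption

# Hypothetical way to obtain numbers like those in Table 15: profile the network
# on a dummy pair of frames. `model`, the input resolution, and the forward
# signature are placeholders, not our actual configuration.
def count_cost(model, height=608, width=1088):
    prev_frame = torch.randn(1, 3, height, width)
    curr_frame = torch.randn(1, 3, height, width)
    macs, params = profile(model, inputs=(curr_frame, prev_frame))
    return macs, params  # FLOPs are commonly reported as roughly 2 * MACs
```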

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, T., Ye, Q., Luo, W. et al. APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking. Int J Comput Vis 133, 2044–2069 (2025). https://doi.org/10.1007/s11263-024-02237-x
