Abstract
Tracking dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem because of intraclass variations, including occlusions, pose deformations, and lighting variations. This paper introduces Latent Diffusion Track (LDTrack), a novel deep learning architecture that uses conditional latent diffusion models to track multiple dynamic people under intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to appearance changes of people over time. We incorporate a latent feature encoder network that enables the diffusion process to operate within a high-dimensional latent space, allowing the extraction and spatial–temporal refinement of rich features such as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of LDTrack over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Specifically, the results show that our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision, with statistical significance. Additionally, a comprehensive multi-object tracking comparison study against state-of-the-art methods in urban environments demonstrates the generalizability of LDTrack, and an ablation study validates its design choices.
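The abstract reports results in terms of tracking accuracy and tracking precision. As a minimal sketch, assuming these correspond to the commonly used CLEAR MOT measures MOTA and MOTP (the abstract does not state this), the two quantities can be computed from per-frame counts as shown below. The `FrameStats` container, its field names, and the example numbers are hypothetical and are not taken from the paper; MOTP is taken here as the mean IoU over matched pairs, which is one of several conventions in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameStats:
    """Per-frame tracking counts (hypothetical container for this sketch)."""
    num_gt: int              # ground-truth people present in the frame
    false_negatives: int     # ground-truth people with no matched track
    false_positives: int     # tracks with no matched ground-truth person
    id_switches: int         # matched people whose track ID changed
    match_ious: List[float]  # IoU of each matched (ground truth, track) pair


def mota(frames: List[FrameStats]) -> float:
    """Accuracy: 1 - (FN + FP + ID switches) / total ground-truth objects."""
    total_gt = sum(f.num_gt for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(
        f.false_negatives + f.false_positives + f.id_switches for f in frames
    )
    return 1.0 - errors / total_gt


def motp(frames: List[FrameStats]) -> float:
    """Precision: mean IoU over all matched pairs (assumed convention)."""
    ious = [iou for f in frames for iou in f.match_ious]
    if not ious:
        raise ValueError("MOTP is undefined without any matches")
    return sum(ious) / len(ious)


# Made-up counts for two frames, for illustration only.
frames = [
    FrameStats(4, 1, 0, 0, [0.80, 0.70, 0.90]),
    FrameStats(4, 0, 1, 1, [0.75, 0.85, 0.80, 0.60]),
]
print(f"MOTA = {mota(frames):.3f}, MOTP = {motp(frames):.3f}")
```

Accumulating errors over the whole sequence before dividing, rather than averaging per-frame ratios, keeps frames with few people from dominating the score.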
Acknowledgements
The authors would like to thank Aaron Hao Tan and Haitong Wang for their invaluable discussions and assistance.
Funding
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), AGE-WELL Inc., and the Canada Research Chairs (CRC) program.
Ethics declarations
Conflict of interest
We have no known conflicts of interest.
Additional information
Communicated by Yasushi Yagi.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fung, A., Benhabib, B. & Nejat, G. LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models. Int J Comput Vis 133, 3392–3412 (2025). https://doi.org/10.1007/s11263-024-02336-9
DOI: https://doi.org/10.1007/s11263-024-02336-9