LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models

Fung, Angus; Benhabib, Beno; Nejat, Goldie

doi:10.1007/s11263-024-02336-9

Published: 08 January 2025

Volume 133, pages 3392–3412 (2025)

International Journal of Computer Vision

Abstract

Tracking dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to intraclass variations, including occlusions, pose deformations, and lighting variations. This paper introduces Latent Diffusion Track (LDTrack), a novel deep learning architecture that uses conditional latent diffusion models to track multiple dynamic people under intraclass variations. By using conditional latent diffusion models to capture temporal person embeddings, the architecture can adapt to appearance changes of people over time. A latent feature encoder network enables the diffusion process to operate within a high-dimensional latent space, allowing the extraction and spatial–temporal refinement of rich features such as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of LDTrack over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Specifically, the results show that the method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision, with statistical significance. Additionally, a comprehensive multi-object tracking comparison study against state-of-the-art methods in urban environments demonstrates the generalizability of LDTrack, and an ablation study validates its design choices.
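
To make the idea of a diffusion process operating on latent embeddings more concrete, the following is a minimal sketch of the generic forward (noising) step of a denoising diffusion probabilistic model (Ho et al., 2020), applied to a toy latent vector. It is not the LDTrack implementation: the abstract does not specify the networks, conditioning, or noise schedule, so the 4-dimensional embedding, the function names, and the linear schedule with 1000 steps are all assumptions chosen for illustration. A trained model would predict the noise; here the true noise is reused to show that the clean latent can be recovered from a noisy one when the noise estimate is exact.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, as in Ho et al., 2020)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    zt = [a * z + b * e for z, e in zip(z0, eps)]
    return zt, eps

def predict_z0(zt, eps_hat, t, alpha_bars):
    """Invert the noising equation given a noise estimate eps_hat."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [(z - b * e) / a for z, e in zip(zt, eps_hat)]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

z0 = [0.5, -1.2, 0.3, 2.0]  # hypothetical 4-D latent person embedding
zt, eps = add_noise(z0, 500, alpha_bars, rng)

# Stand-in for a learned denoiser: reuse the true noise as the estimate.
z0_rec = predict_z0(zt, eps, 500, alpha_bars)
```

In a conditional latent diffusion model, the noise estimate would come from a network that also receives conditioning information (for example, features from earlier frames), which is the part this sketch leaves out.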

Availability of Data and Materials

The IOD, KTP, ISRT, MOT17, and MOT20 datasets are available from the following sources: (1) IOD (Mees et al., 2016); (2) KTP (Munaro & Menegatti, 2014); (3) ISRT (Pereira et al., 2022); (4) MOT17 (Dendorfer et al., 2021); (5) MOT20 (Voigtlaender et al., 2019).

References

Agrawal, K., & Lal, R. (2021). Person following mobile robot using multiplexed detection and tracking. In V. R. Kalamkar & K. Monkova (Eds.), Advances in Mechanical Engineering (pp. 815–822). Berlin: Springer.
Bernardin, K., & Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008(1), 1–10.
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP) (pp. 3464–3468).
Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., & Soatto, S. (2022, March 31). MeMOT: multi-object tracking with memory. arXiv. https://arxiv.org/abs/2203.16761
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision—ECCV 2020 (Vol. 12346, pp. 213–229). Berlin: Springer.
Chaabane, M., Zhang, P., Beveridge, J. R., & O’Hara, S. (2021, June 6). DEFT: Detection embeddings for tracking. arXiv. http://arxiv.org/abs/2102.02267
Chen, S., Sun, P., Song, Y., & Luo, P. (2023). DiffusionDet: Diffusion model for object detection. In 2023 IEEE/CVF international conference on computer vision (ICCV) (pp. 19773–19786). Paris: IEEE.
Chuang, Z., Sifa, Z., Haoran, W., Ziqing, G., Wenchao, S., & Lei, Y. (2024, February 1). AttentionTrack: Multiple object tracking in traffic scenarios using features attention. https://ieeexplore.ieee.org/abstract/document/10260285
Dai, Z., Cai, B., Lin, Y., & Chen, J. (2022). UP-DETR: Unsupervised pre-training for object detection with transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2022.3216514
Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., et al. (2021). MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129(4), 845–881.
Dworakowski, D., Fung, A., & Nejat, G. (2023). Robots understanding contextual information in human-centered environments using weakly supervised mask data distillation. International Journal of Computer Vision, 131(2), 407–430.
Fung, A., Benhabib, B., & Nejat, G. (2023). Robots autonomously detecting people: A multimodal deep contrastive learning method robust to intraclass variations. IEEE Robotics and Automation Letters, 8(6), 3550–3557.
Fung, A., Wang, L. Y., Zhang, K., Nejat, G., & Benhabib, B. (2020). Using deep learning to find victims in unknown cluttered urban search and rescue environments. Current Robotics Reports, 1(3), 105–115.
Gao, R., Zhang, Y., & Wang, L. (2024, March 25). Multiple object tracking as ID prediction. arXiv. http://arxiv.org/abs/2403.16848
Gupta, S., Tolani, V., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2020). Cognitive mapping and planning for visual navigation. International Journal of Computer Vision, 128(5), 1311–1330.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2980–2988).
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in neural information processing systems (Vol. 33, pp. 6840–6851). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html
Jiang, L., Wang, Z., Yin, S., Ma, G., Zhang, P., & Wu, B. (2024, August 28). ConsistencyTrack: A robust multi-object tracker with a generation strategy of consistency model. arXiv. https://arxiv.org/abs/2408.15548
Kollmitz, M., Eitel, A., Vasquez, A., & Burgard, W. (2019). Deep 3D perception of people and their mobility aids. Robotics and Autonomous Systems, 114, 29–40.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327.
Liu, T., Sun, J. J., Zhao, L., Zhao, J., Yuan, L., Wang, Y., et al. (2022). View-invariant, occlusion-robust probabilistic embedding for human pose. International Journal of Computer Vision, 130(1), 111–135.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot MultiBox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer vision—ECCV 2016 (pp. 21–37). Berlin: Springer.
Lu, Z., Rathod, V., Votel, R., & Huang, J. (2020). RetinaTrack: Online single stage joint detection and tracking. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14656–14666). Seattle, WA, USA: IEEE.
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). RePaint: Inpainting using denoising diffusion probabilistic models. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11451–11461). New Orleans, LA, USA: IEEE.
Luo, R., Song, Z., Ma, L., Wei, J., Yang, W., & Yang, M. (2023, August 19). DiffusionTrack: Diffusion model for multi-object tracking. arXiv. http://arxiv.org/abs/2308.09905
Mees, O., Eitel, A., & Burgard, W. (2016). Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 151–156). Daejeon, South Korea: IEEE.
Meinhardt, T., Kirillov, A., Leal-Taixe, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-object tracking with transformers. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 8834–8844), New Orleans, LA, USA: IEEE.
Mohamed, S. C., Fung, A., & Nejat, G. (2023). A multirobot person search system for finding multiple dynamic users in human-centered environments. IEEE Transactions on Cybernetics, 53(1), 628–640.
Munaro, M., & Menegatti, E. (2014). Fast RGB-D people tracking for service robots. Autonomous Robots, 37(3), 227–242.
Murray, S. (2017). Real-time multiple object tracking—A study on the importance of speed. arXiv:1709.03572 [cs]
Pang, L., Cao, Z., Yu, J., Guan, P., Chen, X., & Zhang, W. (2020). A robust visual person-following approach for mobile robots in disturbing environments. IEEE Systems Journal, 14(2), 2965–2968.
Pereira, R., Carvalho, G., Garrote, L., & Nunes, U. J. (2022). Sort and deep-SORT based multi-object tracking for mobile robotics: Evaluation with new data association metrics. Applied Sciences, 12(3), 1319.
Pinto, V., Bettencourt, R., & Ventura, R. (2023). People re-identification in service robots. In 2023 IEEE international conference on autonomous robot systems and competitions (ICARSC) (pp. 44–49), Tomar, Portugal: IEEE.
Rebello, J., Fung, A., & Waslander, S. L. (2020). AC/DCC: Accurate calibration of dynamic camera clusters for visual SLAM. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 6035–6041).
Redmon, J., & Farhadi, A. (2018, April 8). YOLOv3: An incremental improvement. arXiv. http://arxiv.org/abs/1804.02767
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 658–666).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10674–10685), New Orleans, LA, USA: IEEE.
Royer, E., Lhuillier, M., Dhome, M., & Lavest, J.-M. (2007). Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3), 237–260.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sanz, D., Ahmad, A., & Lima, P. (2015). Onboard robust person detection and tracking for domestic service robots. In Robot 2015: Second iberian robotics conference (pp. 547–559). Cham: Springer.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd international conference on machine learning (pp. 2256–2265).
Sun, S., Zhao, X., & Tan, M. (2019). Fast and robust RGB-D multiple human tracking based on part model for mobile robots. In 2019 Chinese control conference (CCC) (pp. 4525–4530). Guangzhou, China: IEEE.
Sun, Pei, Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2443–2451).
Sun, Peize, Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., et al. (2021, May 4). TransTrack: Multiple object tracking with transformer. arXiv. http://arxiv.org/abs/2012.15460
Tan, A. H., Narasimhan, S., & Nejat, G. (2024, February 27). 4CNet: A confidence-aware, contrastive, conditional, consistency model for robot map prediction in multi-robot environments. arXiv. https://arxiv.org/abs/2402.17904
Tan, A. H., Bejarano, F. P., Zhu, Y., Ren, R., & Nejat, G. (2023). Deep reinforcement learning for decentralized multi-robot exploration with macro actions. IEEE Robotics and Automation Letters, 8(1), 272–279.
Taylor, A., & Riek, L. D. (2022). REGROUP: A robot-centric group detection and tracking system. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 412–421). Sapporo, Japan: IEEE.
Tokmakov, P., Li, J., Burgard, W., & Gaidon, A. (2021, September 30). Learning to track with object permanence. arXiv. http://arxiv.org/abs/2103.14258
Vasquez, A., Kollmitz, M., Eitel, A., & Burgard, W. (2017). Deep Detection of People and their Mobility Aids for a Hospital Robot. In 2017 European conference on mobile robots (ECMR) (pp. 1–7). Paris: IEEE.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001) (Vol. 1, pp. I-511–I-518). Kauai, HI: IEEE Comput. Soc.
Vo, D. M., Jiang, L., & Zell, A. (2014). Real time person detection and tracking by mobile robots using RGB-D images. In 2014 IEEE international conference on robotics and biomimetics (ROBIO 2014) (pp. 689–694). Bali, Indonesia: IEEE.
Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B. B. G., Geiger, A., & Leibe, B. (2019). MOTS: Multi-object tracking and segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 7934–7943), Long Beach, CA, USA: IEEE.
Wang, Haitong, Tan, A. H., & Nejat, G. (2024, February 9). NavFormer: A transformer architecture for robot target-driven navigation in unknown and dynamic environments. arXiv. http://arxiv.org/abs/2402.06838
Wang, H., Zhu, X., Gong, S., & Xiang, T. (2018). Person re-identification in identity regression space. International Journal of Computer Vision, 126(12), 1288–1310.
Weber, T., Triputen, S., Danner, M., Braun, S., Schreve, K., & Rätsch, M. (2018). Follow me: Real-time in the wild person tracking application for autonomous robotics. In H. Akiyama, O. Obst, C. Sammut, & F. Tonidandel (Eds.), RoboCup 2017: Robot World Cup XXI (Vol. 11175, pp. 156–167). Berlin: Springer.
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP) (pp. 3645–3649).
Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., & Yuan, J. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12352–12361).
Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., & Alameda-Pineda, X. (2022, September 30). TransCenter: Transformers with dense representations for multiple-object tracking. arXiv. http://arxiv.org/abs/2103.15145
Xue, F., Chang, Y., Wang, T., Zhou, Y., & Ming, A. (2024). Indoor obstacle discovery on reflective ground via monocular camera. International Journal of Computer Vision, 132(3), 987–1007.
Yan, Y., Li, J., Qin, J., Zheng, P., Liao, S., & Yang, X. (2023). Efficient person search: An anchor-free approach. International Journal of Computer Vision, 131(7), 1642–1661.
Yuan, Y., Chen, W., Yang, Y., & Wang, Z. (2020). In defense of the triplet loss again: learning robust person re-identification with fast approximated triplet loss and label distillation. In 2020 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1454–1463). Seattle, WA, USA: IEEE.
Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., & Wei, Y. (2022). MOTR: End-to-end multiple-object tracking with transformer. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer vision—ECCV 2022 (Vol. 13687, pp. 659–675). Cham: Springer.
Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision—ECCV 2020 (pp. 474–490). Berlin: Springer.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021, March 17). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv. http://arxiv.org/abs/2010.04159.

Acknowledgements

The authors would like to thank Aaron Hao Tan and Haitong Wang for their invaluable discussions and assistance.

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), AGE-WELL Inc., and the Canada Research Chairs (CRC) program.

Author information

Authors and Affiliations

Autonomous Systems and Biomechatronics Laboratory (ASBLab), Department of Mechanical and Industrial Engineering, University of Toronto, 5 King’s College Road, Toronto, ON, M5S 3G8, Canada
Angus Fung, Beno Benhabib & Goldie Nejat

Corresponding author

Correspondence to Angus Fung.

Ethics declarations

Conflict of interest

We have no known conflicts of interest.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fung, A., Benhabib, B. & Nejat, G. LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models. Int J Comput Vis 133, 3392–3412 (2025). https://doi.org/10.1007/s11263-024-02336-9

Received: 24 February 2024
Accepted: 17 December 2024
Published: 08 January 2025
Version of record: 08 January 2025
Issue date: June 2025
DOI: https://doi.org/10.1007/s11263-024-02336-9
