LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models

International Journal of Computer Vision

Abstract

Tracking dynamic people in cluttered, crowded human-centered environments is a challenging robotics problem due to intraclass variations such as occlusions, pose deformations, and changes in lighting. This paper introduces a novel deep learning architecture, Latent Diffusion Track (LDTrack), which uses conditional latent diffusion models to track multiple dynamic people under such intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to changes in a person's appearance over time. We incorporate a latent feature encoder network that enables the diffusion process to operate within a high-dimensional latent space, allowing the extraction and spatiotemporal refinement of rich features such as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of LDTrack over state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Namely, the results show our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision, with statistical significance. Additionally, a comprehensive multi-object tracking comparison study against state-of-the-art methods in urban environments demonstrates the generalizability of LDTrack, and an ablation study validates its design choices.
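
To make the mechanism described above concrete, below is a minimal, self-contained sketch of how a conditional diffusion model can iteratively refine a latent person embedding from noise, conditioned on current-frame features, following the standard DDPM formulation (Ho et al., 2020). The module names, dimensions, network shapes, and conditioning scheme are illustrative assumptions chosen for exposition, not the authors' LDTrack implementation.

# Minimal sketch: conditional latent diffusion for refining a per-person
# track embedding (DDPM-style reverse process, Ho et al., 2020).
# All names, shapes, and the conditioning scheme below are illustrative
# assumptions -- this is NOT the authors' LDTrack implementation.

import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a noisy latent embedding, conditioned on
    per-frame image features (a hypothetical stand-in architecture)."""

    def __init__(self, latent_dim: int = 256, cond_dim: int = 256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, latent_dim), nn.SiLU())
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2 + cond_dim, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, cond):
        # z_t: (B, D) noisy latent; t: (B, 1) timestep; cond: (B, C) features
        h = torch.cat([z_t, self.time_embed(t.float()), cond], dim=-1)
        return self.net(h)  # predicted noise epsilon_theta


# Linear beta schedule (shortened to T = 50 here; DDPM uses T = 1000).
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)


@torch.no_grad()
def refine_track_embedding(model, cond, latent_dim=256):
    """Reverse diffusion: start from Gaussian noise and iteratively
    denoise a latent track embedding, conditioned on frame features."""
    z = torch.randn(cond.shape[0], latent_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((cond.shape[0], 1), t)
        eps = model(z, t_batch, cond)
        # DDPM posterior mean: (z - beta_t/sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (z - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise
    return z  # refined embedding for downstream location/identity heads


# Usage with dummy inputs: four tracked people, 256-d frame features.
model = ConditionalDenoiser()
frame_features = torch.randn(4, 256)
refined = refine_track_embedding(model, frame_features)
print(refined.shape)  # torch.Size([4, 256])

In such a pipeline, the conditioning vector would plausibly come from a latent feature encoder applied to the current frame, and the refined embedding would be decoded by downstream heads into a person's location and identity; this sketch only illustrates the conditional denoising loop itself.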


Availability of Data and Materials

The IOD, KTP, ISRT, MOT17, and MOT20 datasets are publicly available; see (1) IOD (Mees et al., 2016), (2) KTP (Munaro & Menegatti, 2014), (3) ISRT (Pereira et al., 2022), (4) MOT17 (Dendorfer et al., 2021), and (5) MOT20 (Voigtlaender et al., 2019).

References

  • Agrawal, K., & Lal, R. (2021). Person following mobile robot using multiplexed detection and tracking. In V. R. Kalamkar & K. Monkova (Eds.), Advances in Mechanical Engineering (pp. 815–822). Berlin: Springer.

  • Bernardin, K., & Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008(1), 1–10.

  • Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP) (pp. 3464–3468).

  • Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., & Soatto, S. (2022, March 31). MeMOT: Multi-object tracking with memory. arXiv. https://arxiv.org/abs/2203.16761

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision—ECCV 2020 (Vol. 12346, pp. 213–229). Berlin: Springer.

  • Chaabane, M., Zhang, P., Beveridge, J. R., & O’Hara, S. (2021, June 6). DEFT: Detection embeddings for tracking. arXiv http://arxiv.org/abs/2102.02267.

  • Chen, S., Sun, P., Song, Y., & Luo, P. (2023). DiffusionDet: Diffusion model for object detection. In 2023 IEEE/CVF international conference on computer vision (ICCV) (pp. 19773–19786). Paris: IEEE.

  • Chuang, Z., Sifa, Z., Haoran, W., Ziqing, G., Wenchao, S., & Lei, Y. (2024, February 1). AttentionTrack: Multiple object tracking in traffic scenarios using features attention. https://ieeexplore.ieee.org/abstract/document/10260285

  • Dai, Z., Cai, B., Lin, Y., & Chen, J. (2022). UP-DETR: Unsupervised pre-training for object detection with transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2022.3216514

  • Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., et al. (2021). MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129(4), 845–881.

  • Dworakowski, D., Fung, A., & Nejat, G. (2023). Robots understanding contextual information in human-centered environments using weakly supervised mask data distillation. International Journal of Computer Vision, 131(2), 407–430.

  • Fung, A., Benhabib, B., & Nejat, G. (2023). Robots autonomously detecting people: A multimodal deep contrastive learning method robust to intraclass variations. IEEE Robotics and Automation Letters, 8(6), 3550–3557.

  • Fung, A., Wang, L. Y., Zhang, K., Nejat, G., & Benhabib, B. (2020). Using deep learning to find victims in unknown cluttered urban search and rescue environments. Current Robotics Reports, 1(3), 105–115.

  • Gao, R., Zhang, Y., & Wang, L. (2024, March 25). Multiple object tracking as ID prediction. arXiv http://arxiv.org/abs/2403.16848.

  • Gupta, S., Tolani, V., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2020). Cognitive mapping and planning for visual navigation. International Journal of Computer Vision, 128(5), 1311–1330.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2980–2988).

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in neural information processing systems (Vol. 33, pp. 6840–6851). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

  • Jiang, L., Wang, Z., Yin, S., Ma, G., Zhang, P., & Wu, B. (2024, August 28). ConsistencyTrack: A robust multi-object tracker with a generation strategy of consistency model. arXiv. https://arxiv.org/abs/2408.15548

  • Kollmitz, M., Eitel, A., Vasquez, A., & Burgard, W. (2019). Deep 3D perception of people and their mobility aids. Robotics and Autonomous Systems, 114, 29–40.

  • Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327.

  • Liu, T., Sun, J. J., Zhao, L., Zhao, J., Yuan, L., Wang, Y., et al. (2022). View-invariant, occlusion-robust probabilistic embedding for human pose. International Journal of Computer Vision, 130(1), 111–135.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot MultiBox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer vision—ECCV 2016 (pp. 21–37). Berlin: Springer.

  • Lu, Z., Rathod, V., Votel, R., & Huang, J. (2020). RetinaTrack: Online single stage joint detection and tracking. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14656–14666). Seattle, WA, USA: IEEE.

  • Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). RePaint: Inpainting using denoising diffusion probabilistic models. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11451–11461). New Orleans, LA, USA: IEEE.

  • Luo, R., Song, Z., Ma, L., Wei, J., Yang, W., & Yang, M. (2023, August 19). DiffusionTrack: Diffusion model for multi-object tracking. arXiv http://arxiv.org/abs/2308.09905

  • Mees, O., Eitel, A., & Burgard, W. (2016). Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 151–156). Daejeon, South Korea: IEEE.

  • Meinhardt, T., Kirillov, A., Leal-Taixe, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-object tracking with transformers. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 8834–8844), New Orleans, LA, USA: IEEE.

  • Mohamed, S. C., Fung, A., & Nejat, G. (2023). A multirobot person search system for finding multiple dynamic users in human-centered environments. IEEE Transactions on Cybernetics, 53(1), 628–640.

  • Munaro, M., & Menegatti, E. (2014). Fast RGB-D people tracking for service robots. Autonomous Robots, 37(3), 227–242.

  • Murray, S. (2017). Real-time multiple object tracking—A study on the importance of speed. arXiv:1709.03572 [cs]

  • Pang, L., Cao, Z., Yu, J., Guan, P., Chen, X., & Zhang, W. (2020). A robust visual person-following approach for mobile robots in disturbing environments. IEEE Systems Journal, 14(2), 2965–2968.

  • Pereira, R., Carvalho, G., Garrote, L., & Nunes, U. J. (2022). Sort and deep-SORT based multi-object tracking for mobile robotics: Evaluation with new data association metrics. Applied Sciences, 12(3), 1319.

  • Pinto, V., Bettencourt, R., & Ventura, R. (2023). People re-identification in service robots. In 2023 IEEE international conference on autonomous robot systems and competitions (ICARSC) (pp. 44–49), Tomar, Portugal: IEEE.

  • Rebello, J., Fung, A., & Waslander, S. L. (2020). AC/DCC : Accurate calibration of dynamic camera clusters for visual SLAM. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 6035–6041).

  • Redmon, J., & Farhadi, A. (2018, April 8). YOLOv3: An incremental improvement. arXiv. http://arxiv.org/abs/1804.02767

  • Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 658–666).

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10674–10685), New Orleans, LA, USA: IEEE.

  • Royer, E., Lhuillier, M., Dhome, M., & Lavest, J.-M. (2007). Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3), 237–260.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sanz, D., Ahmad, A., & Lima, P. (2015). Onboard robust person detection and tracking for domestic service robots. In Robot 2015: Second Iberian robotics conference (pp. 547–559). Cham: Springer.

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd international conference on machine learning (pp. 2256–2265).

  • Sun, S., Zhao, X., & Tan, M. (2019). Fast and robust RGB-D multiple human tracking based on part model for mobile robots. In 2019 Chinese control conference (CCC) (pp. 4525–4530). Guangzhou, China: IEEE.

  • Sun, Pei, Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2443–2451).

  • Sun, Peize, Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., et al. (2021, May 4). TransTrack: Multiple object tracking with transformer. arXiv. http://arxiv.org/abs/2012.15460

  • Tan, A. H., Narasimhan, S., & Nejat, G. (2024, February 27). 4CNet: A confidence-aware, contrastive, conditional, consistency model for robot map prediction in multi-robot environments. arXiv. https://arxiv.org/abs/2402.17904

  • Tan, A. H., Bejarano, F. P., Zhu, Y., Ren, R., & Nejat, G. (2023). Deep reinforcement learning for decentralized multi-robot exploration with macro actions. IEEE Robotics and Automation Letters, 8(1), 272–279.

  • Taylor, A., & Riek, L. D. (2022). REGROUP: A robot-centric group detection and tracking system. In 2022 17th ACM/IEEE international conference on human-robot interaction (HRI) (pp. 412–421). Sapporo, Japan: IEEE.

  • Tokmakov, P., Li, J., Burgard, W., & Gaidon, A. (2021, September 30). Learning to track with object permanence. arXiv. http://arxiv.org/abs/2103.14258

  • Vasquez, A., Kollmitz, M., Eitel, A., & Burgard, W. (2017). Deep detection of people and their mobility aids for a hospital robot. In 2017 European conference on mobile robots (ECMR) (pp. 1–7). Paris: IEEE.

  • Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001) (Vol. 1, pp. I-511–I-518). Kauai, HI, USA: IEEE.

  • Vo, D. M., Jiang, L., & Zell, A. (2014). Real time person detection and tracking by mobile robots using RGB-D images. In 2014 IEEE international conference on robotics and biomimetics (ROBIO 2014) (pp. 689–694). Bali, Indonesia: IEEE.

  • Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B. B. G., Geiger, A., & Leibe, B. (2019). MOTS: Multi-object tracking and segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 7934–7943), Long Beach, CA, USA: IEEE.

  • Wang, Haitong, Tan, A. H., & Nejat, G. (2024, February 9). NavFormer: A transformer architecture for robot target-driven navigation in unknown and dynamic environments. arXiv http://arxiv.org/abs/2402.06838.

  • Wang, H., Zhu, X., Gong, S., & Xiang, T. (2018). Person re-identification in identity regression space. International Journal of Computer Vision, 126(12), 1288–1310.

  • Weber, T., Triputen, S., Danner, M., Braun, S., Schreve, K., & Rätsch, M. (2018). Follow me: Real-time in the wild person tracking application for autonomous robotics. In H. Akiyama, O. Obst, C. Sammut, & F. Tonidandel (Eds.), RoboCup 2017: Robot World Cup XXI (Vol. 11175, pp. 156–167). Berlin: Springer.

  • Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP) (pp. 3645–3649).

  • Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., & Yuan, J. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12352–12361).

  • Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., & Alameda-Pineda, X. (2022, September 30). TransCenter: Transformers with dense representations for multiple-object tracking. arXiv http://arxiv.org/abs/2103.15145.

  • Xue, F., Chang, Y., Wang, T., Zhou, Y., & Ming, A. (2024). Indoor obstacle discovery on reflective ground via monocular camera. International Journal of Computer Vision, 132(3), 987–1007.

  • Yan, Y., Li, J., Qin, J., Zheng, P., Liao, S., & Yang, X. (2023). Efficient person search: An anchor-free approach. International Journal of Computer Vision, 131(7), 1642–1661.

  • Yuan, Y., Chen, W., Yang, Y., & Wang, Z. (2020). In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. In 2020 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1454–1463). Seattle, WA, USA: IEEE.

  • Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., & Wei, Y. (2022). MOTR: End-to-end multiple-object tracking with transformer. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer vision—ECCV 2022 (Vol. 13687, pp. 659–675). Cham: Springer.

  • Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision—ECCV 2020 (pp. 474–490). Berlin: Springer.

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021, March 17). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv. http://arxiv.org/abs/2010.04159.

Acknowledgements

The authors would like to thank Aaron Hao Tan and Haitong Wang for their invaluable discussions and assistance.

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), AGE-WELL Inc., and the Canada Research Chairs (CRC) program.

Author information

Corresponding author

Correspondence to Angus Fung.

Ethics declarations

Conflict of interest

We have no known conflicts of interest.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fung, A., Benhabib, B. & Nejat, G. LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models. Int J Comput Vis 133, 3392–3412 (2025). https://doi.org/10.1007/s11263-024-02336-9
