Abstract
Video person re-identification (VReID) aims to recognize individuals across video sequences. Existing methods primarily use Euclidean space for representation learning but struggle to capture complex hierarchical structures, especially in scenarios with occlusions and background clutter. In contrast, hyperbolic space, with its negatively curved geometry, excels at preserving hierarchical relationships and at discriminating between similar appearances. Inspired by these observations, we propose Dual-Space Video Person Re-Identification (DS-VReID), which exploits the strengths of both Euclidean and hyperbolic geometries: it captures visual features while also exploring their intrinsic hierarchical relations, thereby enhancing the discriminative capacity of the learned representations. Specifically, we design the Dynamic Prompt Graph Construction (DPGC) module, which uses a pre-trained CLIP model with learnable dynamic prompts to construct 3D graphs that capture subtle changes and dynamic information in video sequences. Building upon this, we introduce the Hyperbolic Disentangled Aggregation (HDA) module, which addresses long-range dependency modeling by decoupling node distances and integrating adjacency matrices, capturing detailed spatial-temporal hierarchical relationships. Extensive experiments on benchmark datasets demonstrate the superiority of DS-VReID over state-of-the-art methods, showcasing its potential in complex VReID scenarios.
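To make the geometric intuition concrete: hyperbolic representation learning of the kind the abstract refers to is typically built on the Poincaré-ball model, where the geodesic distance between two embeddings grows rapidly as points approach the boundary of the unit ball, which is what lets the space encode tree-like hierarchies. The sketch below is a minimal, illustrative implementation of the standard Poincaré distance (curvature -1); it is not the paper's HDA module, and the function name is our own.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points in the Poincare ball (curvature -1).

    Both points must lie strictly inside the unit ball (||x|| < 1).
    """
    sq_u = float(np.dot(u, u))        # squared norm of u
    sq_v = float(np.dot(v, v))        # squared norm of v
    sq_diff = float(np.dot(u - v, u - v))
    # For valid inputs the arcosh argument is always >= 1.
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return float(np.arccosh(arg))

# The same Euclidean gap is "longer" near the boundary than near the origin,
# which is the property that makes the space suit hierarchical structure:
near_center = poincare_distance(np.array([0.0, 0.0]), np.array([0.1, 0.0]))
near_edge = poincare_distance(np.array([0.85, 0.0]), np.array([0.95, 0.0]))
```

In a hierarchy embedded this way, coarse concepts sit near the origin and fine-grained ones near the boundary, so siblings deep in the tree remain well separated even when their coordinates are close in the Euclidean sense.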
Data Availability
The datasets for this study can be found at the following locations:
MARS: http://zheng-lab.cecs.anu.edu.au/Project/project_mars.html
iLIDS-VID: https://xiatian-zhu.github.io/downloads_qmul_iLIDS-VID_ReID_dataset.html
PRID2011: https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/PRID11/
DukeMTMC-VideoReID: http://vision.cs.duke.edu/DukeMTMC
LS-VID: The dataset generated and/or analysed during the current study is not publicly available due to the LS-VID release agreement.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants No. 62472060, 62441601, U23A20318, 62206035, and 62221005; in part by the Natural Science Foundation of Chongqing under Grants No. CSTB2022NSCQMSX1024, CSTB2024NSCQQCXMX0060, and CSTB2023NSCQLZX0061; in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant No. KJZDK202300604; in part by the Science and Technology Innovation Key R&D Program of Chongqing under Grant No. CSTB2023TIADSTX0016; and in part by the Chongqing Institute for Brain and Intelligence.
Additional information
Communicated by Bumsub Ham.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Leng, J., Kuang, C., Li, S. et al. Dual-Space Video Person Re-identification. Int J Comput Vis 133, 3667–3688 (2025). https://doi.org/10.1007/s11263-025-02350-5