Abstract
End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection sub-task learns to identify all persons as a single category, while the re-identification (re-id) sub-task aims to discriminate persons of different identities, which leads to conflicting optimization objectives. Existing works propose to decouple end-to-end person search to alleviate this conflict. However, these methods remain sub-optimal on the two sub-tasks because their models are only partially decoupled, which limits the overall person search performance. To eliminate the remaining coupled part in decoupled models without sacrificing the efficiency of end-to-end person search, we propose a fully decoupled person search framework in this work. Specifically, we design a task-incremental network that constructs an end-to-end model through a task-incremental learning procedure. Since the detection sub-task is the easier one, we first train a lightweight detection sub-network and then expand it with a re-id sub-network trained in a second stage. On top of the fully decoupled design, we further enable one-stage training of the task-incremental network. The fully decoupled framework also allows an Online Representation Distillation that mitigates the representation gap between the end-to-end model and two-step models for learning robust representations. Without requiring an offline teacher re-id model, the distillation transfers structured representational knowledge learned from cropped person images to the person search model. The learned person representations thus focus on discriminative cues of foreground persons and suppress distracting background information. To assess the effectiveness and efficiency of the proposed method, we conduct comprehensive experiments on two popular person search datasets, PRW and CUHK-SYSU. The results demonstrate that the fully decoupled model outperforms previous decoupled methods, and its inference is efficient compared with recent end-to-end methods. The source code is available at https://github.com/PatrickZad/fdps.
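To make the task-incremental design and the Online Representation Distillation described above concrete, the following is a minimal PyTorch-style sketch under our own assumptions: the module names, the choice of a torchvision Faster R-CNN detector and a ResNet-50 re-id branch, and the similarity-matching form of the distillation loss are all illustrative, and are not taken from the authors' released implementation (see the repository linked above for their code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class TaskIncrementalPersonSearch(nn.Module):
    """Sketch of the fully decoupled, task-incremental idea: a detection
    sub-network is trained first, then frozen and expanded with a re-id
    sub-network in a second stage (hypothetical architecture choices)."""

    def __init__(self, emb_dim=256):
        super().__init__()
        # Stage 1: detection sub-network (here a standard torchvision
        # Faster R-CNN; the paper's actual detector may differ).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        # Stage 2: re-id sub-network added on top of the frozen detector.
        reid = torchvision.models.resnet50(weights=None)
        reid.fc = nn.Linear(2048, emb_dim)
        self.reid_net = reid

    def freeze_detection(self):
        # Called after the detection stage so re-id training cannot
        # disturb the detection objective (no conflicting gradients).
        for p in self.detector.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def detect(self, images):
        # images: list of (3, H, W) tensors; returns per-image boxes/scores.
        self.detector.eval()
        return self.detector(images)

    def embed(self, crops):
        # crops: (N, 3, H, W) person regions taken from detected boxes.
        return F.normalize(self.reid_net(crops), dim=-1)


def online_representation_distillation(scene_emb, crop_emb, tau=0.1):
    """One plausible form of online distillation: match the pairwise
    similarity structure of scene-derived person embeddings (student)
    to that of embeddings computed from cropped person images, without
    any offline teacher model. This exact loss is an assumption, not
    necessarily the paper's formulation."""
    s = scene_emb @ scene_emb.t() / tau                    # student relations
    t = crop_emb.detach() @ crop_emb.detach().t() / tau    # crop-based relations
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1),
                    reduction="batchmean")
```

In this sketch, the detection sub-network would be trained alone first; `freeze_detection()` is then called before the re-id stage, so only the re-id branch receives gradients from the identity and distillation losses, which is how the two objectives are kept from conflicting.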
Data Availability
For the experiments in this work, we employ two popular person search datasets, CUHK-SYSU and PRW, released in Xiao et al. (2017) and Zheng et al. (2017), respectively. Both datasets are available upon request to their authors. For further discussion and exploration of the proposed method, the source code and other data generated in this study are available on request from the corresponding author.
References
Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., & Bejnordi, B. E. (2020). Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3931–3940).
Bergmann, P., Meinhardt, T., & Leal-Taixe, L. (2019). Tracking without bells and whistles. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 941–951).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162).
Cao, J., Pang, Y., Anwer, R. M., Cholakkal, H., Xie, J., Shah, M., & Khan, F. S. (2022). PSTR: End-to-end one-step person search with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., & Sun, X. (2023). Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15050–15061).
Chen, D., Zhang, S., Ouyang, W., Yang, J., & Schiele, B. (2020). Hierarchical online instance matching for person search. In Proceedings of the AAAI conference on artificial intelligence.
Chen, D., Zhang, S., Ouyang, W., Yang, J., & Tai, Y. (2020). Person search by separated modeling and a mask-guided two-stream CNN model. IEEE Transactions on Image Processing, 29, 4669–4682.
Chen, D., Zhang, S., Yang, J., & Schiele, B. (2021). Norm-aware embedding for efficient person search and tracking. International Journal of Computer Vision, 129, 3154–3168.
Deng, J., Dong, W., Socher, R., Li, L. -J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.
Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Bi-directional interaction network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2839–2848).
Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Instance guided proposal network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2585–2594).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Golkar, S., Kagan, M., & Cho, K. (2019). Continual learning via neural pruning. arXiv preprint arXiv:1903.04476
Guo, Q., Wang, X., Wu, Y., Yu, Z., Liang, D., Hu, X., & Luo, P. (2020). Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11020–11029).
Han, C., Ye, J., Zhong, Y., Tan, X., Zhang, C., Gao, C., & Sang, N. (2019). Re-ID driven localization refinement for person search. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9814–9823).
Han, C., Zheng, Z., Gao, C., Sang, N., & Yang, Y. (2021). Decoupled and memory-reinforced networks: Towards effective feature learning for one-step person search. In Proceedings of the AAAI conference on artificial intelligence (pp. 1505–1512).
Han, C., Zheng, Z., Su, K., Yu, D., Yuan, Z., Gao, C., Sang, N., & Yang, Y. (2022). DMRNet++: Learning discriminative features with decoupled networks and enriched pairs for one-step person search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7319–7337.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Hou, S., Zhao, C., Chen, Z., Wu, J., Wei, Z., & Miao, D. (2021). Improved instance discrimination and feature compactness for end-to-end person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(4), 2079–2090.
Hung, C. -Y., Tu, C. -H., Wu, C. -E., Chen, C. -H., Chan, Y. -M., & Chen, C. -S. (2019). Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 13647–13657.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448–456). PMLR.
Jin, Y., Gao, F., Yu, J., Wang, J., & Shuang, F. (2023). Multi-object tracking: Decoupling features to solve the contradictory dilemma of feature requirements. IEEE Transactions on Circuits and Systems for Video Technology, 33(9), 5117–5132.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. Advances in Neural Information Processing Systems, 33, 18661–18673.
Kim, H., Joung, S., Kim, I. -J., & Sohn, K. (2021). Prototype-guided saliency feature learning for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4865–4874).
Lan, X., Zhu, X., & Gong, S. (2018). Person search by multi-scale matching. In Proceedings of the European conference on computer vision (ECCV) (pp. 536–552).
Lee, S., Oh, Y., Baek, D., Lee, J., & Ham, B. (2022). OIMNet++: Prototypical normalization and localization-aware learning for person search. In European conference on computer vision. Springer.
Li, X., Zhou, Y., Wu, T., Socher, R., & Xiong, C. (2019). Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International Conference On Machine Learning (pp. 3925–3934). PMLR.
Li, Z., & Miao, D. (2021). Sequential end-to-end network for efficient person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 2011–2019).
Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and ReID in multiobject tracking. IEEE Transactions on Image Processing, 31, 3182–3196.
Lin, T. -Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Lin, T. -Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V 13 (pp. 740–755). Springer.
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International conference on learning representations. https://openreview.net/forum?id=oMI9PjOb9Jl.
Liu, Z., Mao, H., Wu, C. -Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976–11986).
Luo, H., Gu, Y., Liao, X., Lai, S., & Jiang, W. (2019). Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.
Mallya, A., & Lazebnik, S. (2018). PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7765–7773).
Munjal, B., Amin, S., Tombari, F., & Galasso, F. (2019). Query-guided end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 811–820).
Munjal, B., Flaborea, A., Amin, S., Tombari, F., & Galasso, F. (2023). Query-guided networks for few-shot fine-grained classification and person search. Pattern Recognition, 133, 109049.
Ning, X., Gong, K., Li, W., Zhang, L., Bai, X., & Tian, S. (2021). Feature refinement and filter network for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 31(9), 3391–3402. https://doi.org/10.1109/TCSVT.2020.3043026
Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., & Yu, F. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 164–173).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision (pp. 17–35). Springer.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671
Serra, J., Suris, D., Miron, M., & Karatzoglou, A. (2018). Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning (pp. 4548–4557). PMLR.
Sung, Y. L., Cho, J., & Bansal, M. (2022). LST: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35, 12991–13005.
Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive representation distillation. In International conference on learning representations.
Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).
Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. arXiv preprint arXiv:1904.07734
Wallingford, M., Li, H., Achille, A., Ravichandran, A., Fowlkes, C., Bhotika, R., & Soatto, S. (2022). Task adaptive parameter sharing for multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7561–7570).
Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2020). TCTS: A task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11952–11961).
Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2022). Person search by a bi-directional task-consistent learning model. IEEE Transactions on Multimedia, 25, 1190–1203.
Wang, Y. -X., Ramanan, D., & Hebert, M. (2017). Growing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2471–2480).
Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020). Towards real-time multi-object tracking. In European conference on computer vision (pp. 107–122). Springer.
Wu, Y., Kirillov, A., Massa, F., Lo, W. -Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.
Xiao, T., Li, S., Wang, B., Lin, L., & Wang, X. (2017). Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3415–3424).
Xu, Y., Ma, B., Huang, R., & Lin, L. (2014). Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM international conference on multimedia (pp. 937–940).
Yan, Y., Li, J., Qin, J., Bai, S., Liao, S., Liu, L., Zhu, F., & Shao, L. (2021). Anchor-free person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7690–7699).
Yan, Y., Zhang, Q., Ni, B., Zhang, W., Xu, M., & Yang, X. (2019). Learning context graph for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2158–2167).
Yan, Y., Li, J., Qin, J., Zheng, P., Liao, S., & Yang, X. (2023). Efficient person search: An anchor-free approach. International Journal of Computer Vision, 131, 1642–1661.
Yao, H., & Xu, C. (2020). Joint person objectness and repulsion for person search. IEEE Transactions on Image Processing, 30, 685–696.
Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., & Hoi, S. C. (2021). Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2872–2893.
Yoon, J., Yang, E., Lee, J., & Hwang, S. J. (2018). Lifelong learning with dynamically expandable networks. In International conference on learning representations.
Yu, R., Du, D., LaLonde, R., Davila, D., Funk, C., Hoogs, A., & Clipp, B. (2022). Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7267–7276).
Yu, E., Li, Z., Han, S., & Wang, H. (2022). RelationTrack: Relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia, 25, 2686–2697.
Zhang, J. O., Sax, A., Zamir, A., Guibas, L., & Malik, J. (2020). Side-tuning: A baseline for network adaptation via additive side networks. In European conference on computer vision (pp. 698–714). Springer.
Zhang, P., Bai, X., Zheng, J., & Ning, X. (2023). Towards fully decoupled end-to-end person search. arXiv preprint arXiv:2309.04967
Zhang, X., Wang, X., Bian, J. -W., Shen, C., & You, M. (2021). Diverse knowledge distillation for end-to-end person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 3412–3420).
Zhang, Y., Li, X., & Zhang, Z. (2019). Efficient person search via expert-guided knowledge distillation. IEEE Transactions on Cybernetics, 51(10), 5093–5104.
Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129, 3069–3087.
Zhao, C., Chen, Z., Dou, S., Qu, Z., Yao, J., Wu, J., & Miao, D. (2022). Context-aware feature learning for noise robust person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 7047–7060.
Zhao, Y., Wang, X., Yu, X., Liu, C., & Gao, Y. (2023). Gait-assisted video person retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 33(2), 897–908. https://doi.org/10.1109/TCSVT.2022.3202531
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision (pp. 1116–1124).
Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., & Tian, Q. (2017). Person re-identification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1367–1376).
Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In European conference on computer vision (pp. 474–490). Springer.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. In International conference on learning representations.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants 62276016 and 62372029.
Additional information
Communicated by Jifeng Dai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, P., Yu, X., Bai, X. et al. Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives. Int J Comput Vis 133, 4795–4816 (2025). https://doi.org/10.1007/s11263-025-02407-5