Abstract
End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection sub-task learns to identify all persons as a single category, while the re-identification (re-id) sub-task aims to discriminate persons of different identities, which leads to conflicting optimization objectives. Existing works propose to decouple end-to-end person search to alleviate this conflict. However, these methods remain sub-optimal on the two sub-tasks because their models are only partially decoupled, which limits the overall person search performance. To eliminate the remaining coupled part in decoupled models without sacrificing the efficiency of end-to-end person search, we propose a fully decoupled person search framework in this work. Specifically, we design a task-incremental network that constructs an end-to-end model through a task-incremental learning procedure. Since the detection sub-task is the easier one, we first train a lightweight detection sub-network and then expand it with a re-id sub-network trained in a second stage. On top of the fully decoupled design, we further enable one-stage training of the task-incremental network. The fully decoupled framework also allows an Online Representation Distillation that mitigates the representation gap between the end-to-end model and two-step models for learning robust representations. Without requiring an offline teacher re-id model, the distillation transfers structured representational knowledge learned from cropped person images to the person search model. The learned person representations thus focus on discriminative cues of foreground persons and suppress distracting background information. To assess the effectiveness and efficiency of the proposed method, we conduct comprehensive experiments on two popular person search datasets, PRW and CUHK-SYSU. The results demonstrate that the fully decoupled model outperforms previous decoupled methods, and its inference is efficient compared with recent end-to-end methods. The source code is available at https://github.com/PatrickZad/fdps.
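To make the task-incremental design and the Online Representation Distillation described above concrete, the following is a minimal PyTorch-style sketch under our own assumptions: the module names, the choice of a torchvision Faster R-CNN detector and a ResNet-50 re-id branch, and the similarity-matching form of the distillation loss are all illustrative, and are not taken from the authors' released implementation (see the repository linked above for their code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class TaskIncrementalPersonSearch(nn.Module):
    """Sketch of the fully decoupled, task-incremental idea: a detection
    sub-network is trained first, then frozen and expanded with a re-id
    sub-network in a second stage (hypothetical architecture choices)."""

    def __init__(self, emb_dim=256):
        super().__init__()
        # Stage 1: detection sub-network (here a standard torchvision
        # Faster R-CNN; the paper's actual detector may differ).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        # Stage 2: re-id sub-network added on top of the frozen detector.
        reid = torchvision.models.resnet50(weights=None)
        reid.fc = nn.Linear(2048, emb_dim)
        self.reid_net = reid

    def freeze_detection(self):
        # Called after the detection stage so re-id training cannot
        # disturb the detection objective (no conflicting gradients).
        for p in self.detector.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def detect(self, images):
        # images: list of (3, H, W) tensors; returns per-image boxes/scores.
        self.detector.eval()
        return self.detector(images)

    def embed(self, crops):
        # crops: (N, 3, H, W) person regions taken from detected boxes.
        return F.normalize(self.reid_net(crops), dim=-1)


def online_representation_distillation(scene_emb, crop_emb, tau=0.1):
    """One plausible form of online distillation: match the pairwise
    similarity structure of scene-derived person embeddings (student)
    to that of embeddings computed from cropped person images, without
    any offline teacher model. This exact loss is an assumption, not
    necessarily the paper's formulation."""
    s = scene_emb @ scene_emb.t() / tau                    # student relations
    t = crop_emb.detach() @ crop_emb.detach().t() / tau    # crop-based relations
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1),
                    reduction="batchmean")
```

In this sketch, the detection sub-network would be trained alone first; `freeze_detection()` is then called before the re-id stage, so only the re-id branch receives gradients from the identity and distillation losses, which is how the two objectives are kept from conflicting.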
Data Availability
For the experiments in this work, we employ two popular person search datasets, CUHK-SYSU and PRW, released in Xiao et al. (2017) and Zheng et al. (2017), respectively. Both datasets are available upon request to their authors. For further discussion and exploration of the proposed method, the source code and other data generated in this study are available on request from the corresponding author.
References
Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., & Bejnordi, B. E. (2020). Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3931–3940).
Bergmann, P., Meinhardt, T., & Leal-Taixe, L. (2019). Tracking without bells and whistles. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 941–951).
Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162).
Cao, J., Pang, Y., Anwer, R. M., Cholakkal, H., Xie, J., Shah, M., & Khan, F. S. (2022). PSTR: End-to-end one-step person search with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., & Sun, X. (2023). Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15050–15061).
Chen, D., Zhang, S., Ouyang, W., Yang, J., & Schiele, B. (2020). Hierarchical online instance matching for person search. In Proceedings of the AAAI conference on artificial intelligence.
Chen, D., Zhang, S., Ouyang, W., Yang, J., & Tai, Y. (2020). Person search by separated modeling and a mask-guided two-stream CNN model. IEEE Transactions on Image Processing, 29, 4669–4682.
Chen, D., Zhang, S., Yang, J., & Schiele, B. (2021). Norm-aware embedding for efficient person search and tracking. International Journal of Computer Vision, 129, 3154–3168.
Deng, J., Dong, W., Socher, R., Li, L. -J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.
Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Bi-directional interaction network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2839–2848).
Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Instance guided proposal network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2585–2594).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Golkar, S., Kagan, M., & Cho, K. (2019). Continual learning via neural pruning. arXiv preprint arXiv:1903.04476
Guo, Q., Wang, X., Wu, Y., Yu, Z., Liang, D., Hu, X., & Luo, P. (2020). Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11020–11029).
Han, C., Ye, J., Zhong, Y., Tan, X., Zhang, C., Gao, C., & Sang, N. (2019). Re-ID driven localization refinement for person search. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9814–9823).
Han, C., Zheng, Z., Gao, C., Sang, N., & Yang, Y. (2021). Decoupled and memory-reinforced networks: Towards effective feature learning for one-step person search. In Proceedings of the AAAI conference on artificial intelligence (pp. 1505–1512).
Han, C., Zheng, Z., Su, K., Yu, D., Yuan, Z., Gao, C., Sang, N., & Yang, Y. (2022). DMRNet++: Learning discriminative features with decoupled networks and enriched pairs for one-step person search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7319–7337.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Hou, S., Zhao, C., Chen, Z., Wu, J., Wei, Z., & Miao, D. (2021). Improved instance discrimination and feature compactness for end-to-end person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(4), 2079–2090.
Hung, C. -Y., Tu, C. -H., Wu, C. -E., Chen, C. -H., Chan, Y. -M., & Chen, C. -S. (2019). Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 13647–13657.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448–456). PMLR.
Jin, Y., Gao, F., Yu, J., Wang, J., & Shuang, F. (2023). Multi-object tracking: Decoupling features to solve the contradictory dilemma of feature requirements. IEEE Transactions on Circuits and Systems for Video Technology, 33(9), 5117–5132.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. Advances in Neural Information Processing Systems, 33, 18661–18673.
Kim, H., Joung, S., Kim, I. -J., & Sohn, K. (2021). Prototype-guided saliency feature learning for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4865–4874).
Lan, X., Zhu, X., & Gong, S. (2018). Person search by multi-scale matching. In Proceedings of the European conference on computer vision (ECCV) (pp. 536–552).
Lee, S., Oh, Y., Baek, D., Lee, J., & Ham, B. (2022). OIMNet++: Prototypical normalization and localization-aware learning for person search. In European conference on computer vision. Springer.
Li, X., Zhou, Y., Wu, T., Socher, R., & Xiong, C. (2019). Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International Conference On Machine Learning (pp. 3925–3934). PMLR.
Li, Z., & Miao, D. (2021). Sequential end-to-end network for efficient person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 2011–2019).
Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and ReID in multiobject tracking. IEEE Transactions on Image Processing, 31, 3182–3196.
Lin, T. -Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Lin, T. -Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V 13 (pp. 740–755). Springer.
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International conference on learning representations. https://openreview.net/forum?id=oMI9PjOb9Jl.
Liu, Z., Mao, H., Wu, C. -Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976–11986).
Luo, H., Gu, Y., Liao, X., Lai, S., & Jiang, W. (2019). Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.
Mallya, A., & Lazebnik, S. (2018). PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7765–7773).
Munjal, B., Amin, S., Tombari, F., & Galasso, F. (2019). Query-guided end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 811–820).
Munjal, B., Flaborea, A., Amin, S., Tombari, F., & Galasso, F. (2023). Query-guided networks for few-shot fine-grained classification and person search. Pattern Recognition, 133, 109049.
Ning, X., Gong, K., Li, W., Zhang, L., Bai, X., & Tian, S. (2021). Feature refinement and filter network for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 31(9), 3391–3402. https://doi.org/10.1109/TCSVT.2020.3043026
Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., & Yu, F. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 164–173).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.
Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision (pp. 17–35). Springer.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671
Serra, J., Suris, D., Miron, M., & Karatzoglou, A. (2018). Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning (pp. 4548–4557). PMLR.
Sung, Y. L., Cho, J., & Bansal, M. (2022). LST: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35, 12991–13005.
Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive representation distillation. In International conference on learning representations.
Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).
Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. arXiv preprint arXiv:1904.07734
Wallingford, M., Li, H., Achille, A., Ravichandran, A., Fowlkes, C., Bhotika, R., & Soatto, S. (2022). Task adaptive parameter sharing for multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7561–7570).
Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2020). TCTS: A task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11952–11961).
Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2022). Person search by a bi-directional task-consistent learning model. IEEE Transactions on Multimedia, 25, 1190–1203.
Wang, Y. -X., Ramanan, D., & Hebert, M. (2017). Growing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2471–2480).
Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020). Towards real-time multi-object tracking. In European conference on computer vision (pp. 107–122). Springer.
Wu, Y., Kirillov, A., Massa, F., Lo, W. -Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.
Xiao, T., Li, S., Wang, B., Lin, L., & Wang, X. (2017). Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3415–3424).
Xu, Y., Ma, B., Huang, R., & Lin, L. (2014). Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM international conference on multimedia (pp. 937–940).
Yan, Y., Li, J., Qin, J., Bai, S., Liao, S., Liu, L., Zhu, F., & Shao, L. (2021). Anchor-free person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7690–7699).
Yan, Y., Zhang, Q., Ni, B., Zhang, W., Xu, M., & Yang, X. (2019). Learning context graph for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2158–2167).
Yan, Y., Li, J., Qin, J., Zheng, P., Liao, S., & Yang, X. (2023). Efficient person search: An anchor-free approach. International Journal of Computer Vision, 131, 1642–1661.
Yao, H., & Xu, C. (2020). Joint person objectness and repulsion for person search. IEEE Transactions on Image Processing, 30, 685–696.
Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., & Hoi, S. C. (2021). Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2872–2893.
Yoon, J., Yang, E., Lee, J., & Hwang, S. J. (2018). Lifelong learning with dynamically expandable networks. In International conference on learning representations.
Yu, R., Du, D., LaLonde, R., Davila, D., Funk, C., Hoogs, A., & Clipp, B. (2022). Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7267–7276).
Yu, E., Li, Z., Han, S., & Wang, H. (2022). RelationTrack: Relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia, 25, 2686–2697.
Zhang, J. O., Sax, A., Zamir, A., Guibas, L., & Malik, J. (2020). Side-tuning: A baseline for network adaptation via additive side networks. In European conference on computer vision (pp. 698–714). Springer.
Zhang, P., Bai, X., Zheng, J., & Ning, X. (2023). Towards fully decoupled end-to-end person search. arXiv preprint arXiv:2309.04967
Zhang, X., Wang, X., Bian, J. -W., Shen, C., & You, M. (2021). Diverse knowledge distillation for end-to-end person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 3412–3420).
Zhang, Y., Li, X., & Zhang, Z. (2019). Efficient person search via expert-guided knowledge distillation. IEEE Transactions on Cybernetics, 51(10), 5093–5104.
Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129, 3069–3087.
Zhao, C., Chen, Z., Dou, S., Qu, Z., Yao, J., Wu, J., & Miao, D. (2022). Context-aware feature learning for noise robust person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 7047–7060.
Zhao, Y., Wang, X., Yu, X., Liu, C., & Gao, Y. (2023). Gait-assisted video person retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 33(2), 897–908. https://doi.org/10.1109/TCSVT.2022.3202531
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision (pp. 1116–1124).
Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., & Tian, Q. (2017). Person re-identification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1367–1376).
Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In European conference on computer vision (pp. 474–490). Springer.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. In International conference on learning representations.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants 62276016 and 62372029.
Additional information
Communicated by Jifeng Dai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, P., Yu, X., Bai, X. et al. Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives. Int J Comput Vis 133, 4795–4816 (2025). https://doi.org/10.1007/s11263-025-02407-5