
Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives

Published in: International Journal of Computer Vision

Abstract

End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection sub-task learns to identify all persons as one category, while the re-identification (re-id) sub-task aims to discriminate persons of different identities, resulting in conflicting optimal objectives. Existing works have proposed decoupling end-to-end person search to alleviate this conflict. Yet these methods remain sub-optimal on both sub-tasks because their models are only partially decoupled, which limits overall person search performance. To eliminate the last coupled part in decoupled models without sacrificing the efficiency of end-to-end person search, we propose a fully decoupled person search framework. Specifically, we design a task-incremental network that constructs an end-to-end model through a task-incremental learning procedure. Since the detection sub-task is easier, we first train a lightweight detection sub-network and then expand it with a re-id sub-network trained in a second stage. On top of the fully decoupled design, we also enable one-stage training for the task-incremental network. The fully decoupled framework further allows an Online Representation Distillation that mitigates the representation gap between end-to-end and two-step models when learning robust representations. Without requiring an offline teacher re-id model, it transfers structured representational knowledge learned from cropped images to the person search model. The learned person representations thus focus on discriminative cues of foreground persons and suppress distracting background information. To assess the effectiveness and efficiency of the proposed method, we conduct comprehensive experiments on two popular person search datasets, PRW and CUHK-SYSU. The results demonstrate that the fully decoupled model outperforms previous decoupled methods, and its inference is also efficient among recent end-to-end methods. The source code is available at https://github.com/PatrickZad/fdps.
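The task-incremental design summarized above can be sketched in a few lines. The following is a hypothetical illustration only, not the authors' implementation: the module names, layer sizes, and the relational form of the distillation term are all assumptions made for this sketch. A small detection sub-network is trained first, then frozen and expanded with a re-id side branch, so re-id gradients never touch detection parameters; the Online-Representation-Distillation idea is approximated here as matching the similarity structure of scene-level embeddings to that of embeddings from cropped person images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionSubNet(nn.Module):
    """Stage-1 sub-network: person detection only (hypothetical layer sizes)."""
    def __init__(self, dim=32):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 3, padding=1)
        self.box_head = nn.Conv2d(dim, 4, 1)  # one-class person box regression

    def forward(self, x):
        feat = F.relu(self.backbone(x))
        return feat, self.box_head(feat)

class ReIDSubNet(nn.Module):
    """Stage-2 expansion: identity-embedding branch on frozen detection features."""
    def __init__(self, dim=32, emb_dim=16):
        super().__init__()
        self.branch = nn.Conv2d(dim, emb_dim, 1)

    def forward(self, det_feat):
        emb = self.branch(det_feat).mean(dim=(2, 3))  # global-pooled embedding
        return F.normalize(emb, dim=1)

det = DetectionSubNet()
# Stage 1 would train `det` on detection alone; freezing it afterwards is what
# makes the subsequent re-id training conflict-free (fully decoupled).
for p in det.parameters():
    p.requires_grad = False

reid = ReIDSubNet()
scenes = torch.randn(2, 3, 64, 64)
feat, boxes = det(scenes)
emb = reid(feat)               # scene-level identity embeddings
print(emb.shape)               # torch.Size([2, 16])

# Online-Representation-Distillation-style term (one plausible instantiation):
# match the similarity structure of scene-level embeddings to that of
# embeddings from cropped person images, computed online with no offline teacher.
crops = torch.randn(2, 3, 32, 32)        # stand-ins for cropped persons
crop_emb = reid(det(crops)[0])
ord_loss = F.mse_loss(emb @ emb.t(), (crop_emb @ crop_emb.t()).detach())
```

Because the detection parameters are frozen before the re-id branch is attached, the two objectives never compete for the same weights, which is the sense in which the decoupling is "full" while the model stays end-to-end at inference.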


Data Availability

For the experiments in this work, we employ two popular person search datasets, CUHK-SYSU and PRW, released in Xiao et al. (2017) and Zheng et al. (2017), respectively. Both datasets are made available upon request to their authors. For further discussion and exploration of the proposed method, the source code and other data generated in this study are available from the corresponding author upon request.

References

  • Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., & Bejnordi, B. E. (2020). Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3931–3940).

  • Bergmann, P., Meinhardt, T., & Leal-Taixe, L. (2019). Tracking without bells and whistles. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 941–951).

  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162).

  • Cao, J., Pang, Y., Anwer, R. M., Cholakkal, H., Xie, J., Shah, M., & Khan, F. S. (2022). PSTR: End-to-end one-step person search with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., & Sun, X. (2023). Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15050–15061).

  • Chen, D., Zhang, S., Ouyang, W., Yang, J., & Schiele, B. (2020). Hierarchical online instance matching for person search. In AAAI.

  • Chen, D., Zhang, S., Ouyang, W., Yang, J., & Tai, Y. (2020). Person search by separated modeling and a mask-guided two-stream CNN model. IEEE Transactions on Image Processing, 29, 4669–4682.

  • Chen, D., Zhang, S., Yang, J., & Schiele, B. (2021). Norm-aware embedding for efficient person search and tracking. International Journal of Computer Vision, 129, 3154–3168.

  • Deng, J., Dong, W., Socher, R., Li, L. -J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.

  • Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Bi-directional interaction network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2839–2848).

  • Dong, W., Zhang, Z., Song, C., & Tan, T. (2020). Instance guided proposal network for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2585–2594).

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Golkar, S., Kagan, M., & Cho, K. (2019). Continual learning via neural pruning. arXiv preprint arXiv:1903.04476

  • Guo, Q., Wang, X., Wu, Y., Yu, Z., Liang, D., Hu, X., & Luo, P. (2020). Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11020–11029).

  • Han, C., Ye, J., Zhong, Y., Tan, X., Zhang, C., Gao, C., & Sang, N. (2019). Re-ID driven localization refinement for person search. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9814–9823).

  • Han, C., Zheng, Z., Gao, C., Sang, N., & Yang, Y. (2021). Decoupled and memory-reinforced networks: Towards effective feature learning for one-step person search. In Proceedings of the AAAI conference on artificial intelligence (pp. 1505–1512).

  • Han, C., Zheng, Z., Su, K., Yu, D., Yuan, Z., Gao, C., Sang, N., & Yang, Y. (2022). DMRNet++: Learning discriminative features with decoupled networks and enriched pairs for one-step person search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7319–7337.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  • Hou, S., Zhao, C., Chen, Z., Wu, J., Wei, Z., & Miao, D. (2021). Improved instance discrimination and feature compactness for end-to-end person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(4), 2079–2090.

  • Hung, C. -Y., Tu, C. -H., Wu, C. -E., Chen, C. -H., Chan, Y. -M., & Chen, C. -S. (2019). Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 13647–13657.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448–456). PMLR.

  • Jin, Y., Gao, F., Yu, J., Wang, J., & Shuang, F. (2023). Multi-object tracking: Decoupling features to solve the contradictory dilemma of feature requirements. IEEE Transactions on Circuits and Systems for Video Technology, 33(9), 5117–5132.

  • Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. Advances in Neural Information Processing Systems, 33, 18661–18673.

  • Kim, H., Joung, S., Kim, I. -J., & Sohn, K. (2021). Prototype-guided saliency feature learning for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4865–4874).

  • Lan, X., Zhu, X., & Gong, S. (2018). Person search by multi-scale matching. In Proceedings of the European conference on computer vision (ECCV) (pp. 536–552).

  • Lee, S., Oh, Y., Baek, D., Lee, J., & Ham, B. (2022). OIMNet++: Prototypical normalization and localization-aware learning for person search. In European conference on computer vision. Springer.

  • Li, X., Zhou, Y., Wu, T., Socher, R., & Xiong, C. (2019). Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International Conference On Machine Learning (pp. 3925–3934). PMLR.

  • Li, Z., & Miao, D. (2021). Sequential end-to-end network for efficient person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 2011–2019).

  • Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and ReID in multiobject tracking. IEEE Transactions on Image Processing, 31, 3182–3196.

  • Lin, T. -Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

  • Lin, T. -Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V 13 (pp. 740–755). Springer.

  • Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International conference on learning representations. https://openreview.net/forum?id=oMI9PjOb9Jl.

  • Liu, Z., Mao, H., Wu, C. -Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976–11986).

  • Luo, H., Gu, Y., Liao, X., Lai, S., & Jiang, W. (2019). Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.

  • Mallya, A., & Lazebnik, S. (2018). PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7765–7773).

  • Munjal, B., Amin, S., Tombari, F., & Galasso, F. (2019). Query-guided end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 811–820).

  • Munjal, B., Flaborea, A., Amin, S., Tombari, F., & Galasso, F. (2023). Query-guided networks for few-shot fine-grained classification and person search. Pattern Recognition, 133, 109049.

  • Ning, X., Gong, K., Li, W., Zhang, L., Bai, X., & Tian, S. (2021). Feature refinement and filter network for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 31(9), 3391–3402. https://doi.org/10.1109/TCSVT.2020.3043026

  • Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

  • Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., & Yu, F. (2021). Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 164–173).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 91–99.

  • Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision (pp. 17–35). Springer.

  • Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671

  • Serra, J., Suris, D., Miron, M., & Karatzoglou, A. (2018). Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning (pp. 4548–4557). PMLR.

  • Sung, Y. L., Cho, J., & Bansal, M. (2022). LST: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 13, 12991–13005.

  • Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive representation distillation. In International conference on learning representations.

  • Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).

  • Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. arXiv preprint arXiv:1904.07734

  • Wallingford, M., Li, H., Achille, A., Ravichandran, A., Fowlkes, C., Bhotika, R., & Soatto, S. (2022). Task adaptive parameter sharing for multi-task learning. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (pp. 7561–7570).

  • Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2020). TCTS: A task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11952–11961).

  • Wang, C., Ma, B., Chang, H., Shan, S., & Chen, X. (2022). Person search by a bi-directional task-consistent learning model. IEEE Transactions on Multimedia, 25, 1190–1203.

  • Wang, Y. -X., Ramanan, D., & Hebert, M. (2017). Growing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2471–2480).

  • Wang, Z., Zheng, L., Liu, Y., Li, Y., & Wang, S. (2020). Towards real-time multi-object tracking. In European conference on computer vision (pp. 107–122). Springer.

  • Wu, Y., Kirillov, A., Massa, F., Lo, W. -Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.

  • Xiao, T., Li, S., Wang, B., Lin, L., & Wang, X. (2017). Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3415–3424).

  • Xu, Y., Ma, B., Huang, R., & Lin, L. (2014). Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM international conference on multimedia (pp. 937–940).

  • Yan, Y., Li, J., Qin, J., Bai, S., Liao, S., Liu, L., Zhu, F., & Shao, L. (2021). Anchor-free person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7690–7699).

  • Yan, Y., Zhang, Q., Ni, B., Zhang, W., Xu, M., & Yang, X. (2019). Learning context graph for person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2158–2167).

  • Yan, Y., Li, J., Qin, J., Zheng, P., Liao, S., & Yang, X. (2023). Efficient person search: An anchor-free approach. International Journal of Computer Vision, 131, 1642–1661.

  • Yao, H., & Xu, C. (2020). Joint person objectness and repulsion for person search. IEEE Transactions on Image Processing, 30, 685–696.

  • Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., & Hoi, S. C. (2021). Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2872–2893.

  • Yoon, J., Yang, E., Lee, J., & Hwang, S. J. (2018). Lifelong learning with dynamically expandable networks. In International conference on learning representations.

  • Yu, R., Du, D., LaLonde, R., Davila, D., Funk, C., Hoogs, A., & Clipp, B. (2022). Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7267–7276).

  • Yu, E., Li, Z., Han, S., & Wang, H. (2022). RelationTrack: Relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia, 25, 2686–2697.

  • Zhang, J. O., Sax, A., Zamir, A., Guibas, L., & Malik, J. (2020). Side-tuning: A baseline for network adaptation via additive side networks. In European conference on computer vision (pp. 698–714). Springer.

  • Zhang, P., Bai, X., Zheng, J., & Ning, X. (2023). Towards fully decoupled end-to-end person search. arXiv preprint arXiv:2309.04967

  • Zhang, X., Wang, X., Bian, J. -W., Shen, C., & You, M. (2021). Diverse knowledge distillation for end-to-end person search. In Proceedings of the AAAI conference on artificial intelligence (vol. 35, pp. 3412–3420).

  • Zhang, Y., Li, X., & Zhang, Z. (2019). Efficient person search via expert-guided knowledge distillation. IEEE Transactions on Cybernetics, 51(10), 5093–5104.

  • Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129, 3069–3087.

  • Zhao, C., Chen, Z., Dou, S., Qu, Z., Yao, J., Wu, J., & Miao, D. (2022). Context-aware feature learning for noise robust person search. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 7047–7060.

  • Zhao, Y., Wang, X., Yu, X., Liu, C., & Gao, Y. (2023). Gait-assisted video person retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 33(2), 897–908. https://doi.org/10.1109/TCSVT.2022.3202531

  • Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision (pp. 1116–1124).

  • Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., & Tian, Q. (2017). Person re-identification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1367–1376).

  • Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In European conference on computer vision (pp. 474–490). Springer.

  • Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. In International conference on learning representations.

Acknowledgements

This work is supported by the National Natural Science Foundation of China 62276016 and 62372029.

Author information

Corresponding authors

Correspondence to Xiaohan Yu or Xiao Bai.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, P., Yu, X., Bai, X. et al. Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives. Int J Comput Vis 133, 4795–4816 (2025). https://doi.org/10.1007/s11263-025-02407-5

  • DOI: https://doi.org/10.1007/s11263-025-02407-5
