
An Adaptive Correlation Filtering Method for Text-Based Person Search

International Journal of Computer Vision

Abstract

Text-based person search aims to align person images with natural language descriptions, and has broad applications in video surveillance, such as missing-person search and suspect tracking. In this task, extracting distinct representations and aligning them across identities based on descriptions is a crucial yet challenging problem. Most previous methods rely on additional language parsers or vision techniques to identify and select the relevant regions and words from the inputs. However, these methods suffer from heavy computation costs and error accumulation. Meanwhile, simply segmenting images horizontally to obtain local-level features harms the reliability of models. To address these problems, we first present a novel Simple and Robust Correlation Filtering (SRCF) method, which effectively extracts key clues and aligns discriminative features. Unlike previous works, we design two types of filtering modules (denoising filters and dictionary filters) to extract essential features and establish multi-modal mappings. Although SRCF performs well, it still struggles with semantic ambiguity and uni-modal updating. We therefore further propose the Multi-modal Adaptive Correlation Filtering (MACF) method, which adaptively learns the vital regions and keywords with a shared update strategy. We also introduce a new mutually conditional gate to dynamically control the updating process of the filters. Extensive experiments demonstrate that both proposed methods improve the robustness and reliability of the model and achieve better performance on two text-based person search datasets.
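
The abstract does not say how the filters are applied or updated, so the following is only a generic illustration of the two ideas it names: correlating a filter with a feature map to obtain a response map, and blending an old filter with a candidate through a gate. It is not the SRCF or MACF implementation. The function names, the scalar gate, and all values are assumptions made for this sketch.

```python
import math


def correlate(feature_map, filt):
    """Return a response map: the dot product of `filt` with the
    channel vector stored at each spatial location."""
    return [
        [sum(f * w for f, w in zip(cell, filt)) for cell in row]
        for row in feature_map
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(old_filter, candidate, gate_logit):
    """Blend the old filter with a candidate using a scalar gate in (0, 1).
    Assumption: the paper's mutually conditional gate is more elaborate."""
    g = sigmoid(gate_logit)
    return [g * c + (1.0 - g) * o for o, c in zip(old_filter, candidate)]


# Toy 2x2 feature map with 3 channels per location (hypothetical values).
feature_map = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
# Hypothetical filter, standing in for one derived from a description.
text_filter = [1.0, 1.0, 0.0]

response = correlate(feature_map, text_filter)
print(response)  # [[1.0, 1.0], [0.0, 2.0]]
```

In this toy case the bottom-right location responds most strongly because its channels match the filter best. A gate logit of 0 gives an even blend, so `gated_update([0.0, 2.0], [2.0, 0.0], 0.0)` returns `[1.0, 1.0]`.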

Data Availability

The CUHK-PEDES and ICFG-PEDES datasets are publicly available at https://github.com/ShuangLI59/Person-Search-with-Natural-Language-Description and https://github.com/zifyloo/SSAN, respectively.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U23B2013), the Shaanxi Provincial Key R&D Program (No. 2021KWZ-03), the Natural Science Basic Research Program of Shaanxi (No. 2021JCW-03), the National Natural Science Foundation of China (NSFC) under Grant 62102323, the Innovation Capability Support Program of Shaanxi (Program No. 2023KJXX-142), the National Natural Science Foundation of China (62101451), the Key Research and Development Program of Shaanxi (Program No. 2024GX-YBXM-117), the National Postdoctoral Innovation Talent Support Program (BX20230498), and the Young Talent Fund of Association for Science and Technology in Shaanxi, China (20240150).

Author information

Corresponding author

Correspondence to Peng Wang.

Additional information

Communicated by Bumsub Ham.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, M., Suo, W., Wang, P. et al. An Adaptive Correlation Filtering Method for Text-Based Person Search. Int J Comput Vis 132, 4440–4455 (2024). https://doi.org/10.1007/s11263-024-02094-8

  • DOI: https://doi.org/10.1007/s11263-024-02094-8

Keywords