Deep Learning Technique for Human Parsing: A Survey and Outlook

Yang, Lu; Jia, Wenhe; Li, Shan; Song, Qing

doi:10.1007/s11263-024-02031-9

Published: 09 March 2024

Volume 132, pages 3270–3301, (2024)

International Journal of Computer Vision

Lu Yang¹,
Wenhe Jia¹,
Shan Li¹ &
…
Qing Song ORCID: orcid.org/0000-0003-1936-224X¹

Abstract

Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. Over the last decade, it has attracted significantly increased interest in the computer vision community and has been applied in a broad range of practical applications, from security monitoring to social media to visual special effects, to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.

References

Bao, H., Dong, L., Piao, S., & Wei, F. (2022). Beit: Bert pre-training of image transformers. In Proceedings of the International Conference on Learning Representations.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., & Jiao, Y. (2023). Improving image generation with better captions. OpenAI blog.
Bo, Y., & Fowlkes, C. C. (2011). Shape-based pedestrian parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2265–2272).
Borras, A., Tous, F., Llados, J., & Vanrell, M. (2003). High-level clothes description based on colour-texture and structural features. In Iberian Conference on Pattern Recognition and Image Analysis, (pp. 108–116).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Google Scholar
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, pp. 213–229.
Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision, (pp. 139–156).
Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9650–9660).
Chang, Y., Peng, T., He, R., Hu, X., Liu, J., Zhang, Z., & Jiang, M. (2022). Pf-vton: Toward high-quality parser-free virtual try-on network. In International Conference on Multimedia Modeling, (pp. 28–40).
Chen, H., Xu, Z., Liu, Z., & Zhu, S. C. (2006). Composite templates for cloth modeling and sketching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 943–950).
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Google Scholar
Chen, L. C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3640–3649).
Chen, Q., Ge, T., Xu, Y., Zhang, Z., Yang, X., & Gai, K. (2018). Semantic human matting. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 618–626).
Chen, R., Chen, X., Ni, B., & Ge, Y. (2020). Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, (pp. 2003–2011).
Chen, S., & Wang, J. (2023). Virtual reality human-computer interactive english education experience system based on mobile terminal. International Journal of Human-Computer Interaction. https://doi.org/10.1080/10447318.2023.2190674
Article Google Scholar
Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., & Sun, X. (2023). Beyond appearance: a semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 15050–15061).
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1971–1978).
Chen, Y., Zhu, X., & Gong, S. (2019). Instance-guided context rendering for cross-domain person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 232–242).
Cheng, B., Chen, L. C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T. S., Hwu, W. M., & Shi, H. (2019). Spgnet: Semantic prediction guidance for scene parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5218–5228).
Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., & Schwing, A. G. (2021). Mask2former for video instance segmentation. arXiv preprint arXiv:2112.10764
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Cheng, B., Schwing, A. G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34, 17864–17875.
Google Scholar
Cheng, W., Song, S., Chen, C. Y., Hidayati, S. C., & Liu, J. (2021). Fashion meets computer vision: A survey. ACM Computing Surveys, 54(4), 1–41.
Google Scholar
Ci, Y., Wang, Y., Chen, M., Tang, S., Bai, L., Zhu, F., Zhao, R., Yu, F., Qi, D., & Ouyang, W. (2023). Unihcp: A unified model for human-centric perceptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition., (pp. 17840–17852).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3213–3223).
Dai, Y., Chen, X., Wang, X., Pang, M., Gao, L., & Shen, H. T. (2023). Resparser: Fully convolutional multiple human parsing with representative sets. IEEE Transactions on Multimedia, 26, 1384–1394.
Google Scholar
Devlin, J., Chang, M. W., Lee, K., &Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 4171–4186).
Dong, H., Liang, X., Shen, X., Wang, B., Lai, H., Zhu, J., Hu, Z., & Yin, J. (2019). Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9026–9035).
Dong, J., Chen, Q., Shen, X., Yang, J., & Yan, S. (2014). Towards unified human parsing and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 843–850).
Dong, J., Chen, Q., Xia, W., Huang, Z., & Yan, S. (2013). A deformable mixture parsing model with parselets. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 3408–3415).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Google Scholar
Fang, H. S., Lu, G., Fang, X., Xie, J., Tai, Y. W., & Lu, C. (2018). Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 70–78).
Fang, J., Sun, Y., Zhang, Q., Li, Y., Liu, W., & Wang, X. (2020). Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10628–10637).
Fruhstuck, A., Singh, K. K., Shechtman, E., Mitra Niloy, J., Wonka, P., & Lu, J. (2022). Insetgan for full-body image generation. arXiv preprint arXiv:2203.07293
Fulkerson, B., Vedaldi, A., & Soatto, S. (2009). Class segmentation and object localization with superpixel neighborhoods. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 670–677).
Gao, Y., Lang, C., Liu, F., Cao, Y., Sun, L., & Wei, Y. (2023). Dynamic interaction dilation for interactive human parsing. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2023.3262973
Article Google Scholar
Gao, Y., Liang, L., Lang, C., Feng, S., Li, Y., & Wei, Y. (2022). Clicking matters: Towards interactive human parsing. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2022.3156812
Article Google Scholar
Ge, Y., Zhang, R., Wang, X., Tang, X., & Luo, P. (2019). Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5337–5345).
de Geus, D., Meletis, P., Lu, C., Wen, X., Dubbelman, G. (2021). Part-aware panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5485–5494).
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 15180–15190).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 580–587).
Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., & Lin, L. (2019). Graphonomy: Universal human parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 7450–7459).
Gong, K., Liang, X., Li, Y., Chen, Y., Yang, M., & Lin, L. (2018). Instance-level human parsing via part grouping network. In Proceedings of the European Conference on Computer Vision, (pp. 770–785).
Gong, K., Liang, X., Zhang, D., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 932–940).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
Guan, P., Freifeld, O., & Black, M. J. (2010). A 2d human body model dressed in eigen clothing. In Proceedings of the European Conference on Computer Vision, (pp. 285–298).
Guler, R. A., & Kokkinos, I. (2019). Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10884–10894).
Guler, R. A., Neverova, N., & Kokkinos, I. (2018). Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 7297–7306).
Gupta, A., Wu, J., Deng, J., & Fei-Fei, L. (2023). Siamese masked autoencoders. arXiv preprint arXiv:2305.14344
Han, X., Wu, Z., Wu, Z., Yu, R., & Davis, L. S. (2018). Viton: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 7543–7552).
Hariharan, B., Arbelaez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision, (pp. 297–312).
He, H., Zhang, J., Thuraisingham, B., & Tao, D. (2021). Progressive one-shot human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 1522–1530).
He, H., Zhang, J., Zhang, Q., Tao, D. (2020). Grapy-ml: Graph pyramid mutual learning for cross-dataset human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 10949–10956).
He, H., Zhang, J., Zhuang, B., Cai, J., & Tao, D. (2023). End-to-end one-shot human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9729–9738).
He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 770–778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Google Scholar
Hu, Y., Wang, R., Zhang, K., & Gao, Y. (2022). Semantic-aware fine-grained correspondence. In European Conference on Computer Vision, (pp. 97–115).
Huang, H., Yang, W., Lin, J., Huang, G., Xu, J., Wang, G., Chen, X., & Huang, K. (2020). Improve person re-identification with part awareness learning. IEEE Transactions on Image Processing, 29, 7468–7481.
Google Scholar
Huang, L., Chen, D., Liu, Y., Shen, Y., Zhao, D., & Zhou, J. (2023). Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778
Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., & Huang, T. S. (2023). Ccnet: Criss-cross attention for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(06), 6896–6908.
Google Scholar
Huo, J., Jin, S., Li, W., Wu, J., Lai, Y. K., Shi, Y., & Gao, Y. (2021). Manifold alignment for semantically aligned style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 14861–14869).
Issenhuth, T., Mary, J., Calauzenes, C. (2020). Do not mask what you do not need to mask: a parser-free virtual try-on. In Proceedings of the European Conference on Computer Vision, (pp. 619–635).
Jabri, A. A., Owens, A., & Efros, A. A. (2020). Space-time correspondence as a contrastive random walk. Advances in Neural Information Processing Systems, 33, 19545–19560.
Google Scholar
Jeon, S., Min, D., Kim, S., & Sohn, K. (2021). Mining better samples for contrastive learning of temporal correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1034–1044).
Ji, R., Du, D., Zhang, L., Wen, L., Wu, Y., Zhao, C., Huang, F., & Lyu, S. (2020). Learning semantic neural tree for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 205–221).
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, (pp. 675–678).
Jin, Z., Gong, T., Yu, D., Chu, Q., Wang, J., Wang, C., & Shao, J. (2021). Mining contextual information beyond image for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7231–7241).
Jin, Z., Liu, B., Chu, Q., & Yu, N. (2021). Isnet: Integrate image-level and semantic-level context for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7189–7198).
Kae, A., Sohn, K., Lee, H., & Learned-Miller, E. (2013). Augmenting crfs with boltzmann machine shape priors for image labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2019–2026).
Kalayeh, M. M., Basaran, E., Gokmen, M., Kamasak, M. E., & Shah, M. (2018). Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1062–1071).
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4401–4410).
Khan, K., Khan, R. U., Ahmad, K., Ali, F., & Kwak, K. S. (2020). Face segmentation: A journey from classical to deep learning paradigm, approaches, trends, and directions. IEEE Access, 8, 58683–58699.
Google Scholar
Kiefel, M., & Gehler, P. (2014). Human pose estimation with fields of parts. In Proceedings of the European Conference on Computer Vision, (pp. 331—346).
Kim, B. K., Kim, G., & Lee, S. Y. (2019). Style-controlled synthesis of clothing segments for fashion image manipulation. IEEE Transactions on Multimedia, 22(2), 298–310.
Google Scholar
Kirillov, A., Girshick, R., He, K., & Dollar, P. (2019). Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 6399–6408).
Kirillov, A., He, K., Girshick, R., Rother, C., & Dollar, P. (2019). Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9404–9413).
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W. Y., et al. (2023). Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems.
L2ID: Learning from limited or imperfect data (l2id) workshop. https://l2id.github.io/challenge_localization.html (2021)
Ladicky, L., Torr, P. H., & Zisserman, A. (2013). Human pose estimation using a joint pixel-wise and part-wise formulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3578–3585).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Google Scholar
Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017). Multiple-human parsing in the wild. arXiv preprint arXiv:1705.07206
Li, L., Zhou, T., Wang, W., Li, J., & Yang, Y. (2022). Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1246–1257).
Li, L., Zhou, T., Wang, W., Yang, L., Li, J., & Yang, Y. (2022). Locality-aware inter-and intra-video reconstruction for self-supervised correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Li, P., Xu, Y., Wei, Y., & Yang, Y. (2020). Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 3260–3271.
Google Scholar
Li, Q., Arnab, A., & Torr, P. H. (2017). Holistic, instance-level human parsing. In British Machine Vision Conference.
Li, R., & Liu, D. (2023). Spatial-then-temporal self-supervised learning for video correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2279–2288).
Li, T., Liang, Z., Zhao, S., Gong, J., & Shen, J. (2020). Self-learning with rectification strategy for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9263–9272).
Li, X., Liu, S., Mello, S. D., Wang, X., Kautz, J., & Yang, M. H. (2019). Joint-task self-supervised learning for temporal correspondence. Advances in Neural Information Processing Systems, 32, 318–328.
Google Scholar
Li, Z., Cao, L., Wang, H., & Xu, L. (2023). End-to-end instance-level human parsing by segmenting persons. IEEE Transactions on Multimedia, 26, 41–50.
Google Scholar
Li, Z., Lv, J., Chen, Y., & Yuan, J. (2021). Person re-identification with part prediction alignment. Computer Vision and Image Understanding, 205, 103172.
Google Scholar
Liang, H., Yuan, J., & Thalmann, D. (2014). Parsing the hand in depth images. IEEE Transactions on Multimedia, 16(5), 1241–1253.
Google Scholar
Liang, X., Gong, K., Shen, X., & Lin, L. (2018). Look into person: Joint body parsing pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4), 871–885.
Google Scholar
Liang, X., Lin, L., Shen, X., Feng, J., Yan, S., & Xing, E. P. (2017). Interpretable structure-evolving lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1010–1019).
Liang, X., Lin, L., Yang, W., Luo, P., Huang, J., & Yan, S. (2016). Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia, 18(6), 1175–1186.
Google Scholar
Liang, X., Liu, S., Shen, X., Yang, J., Liu, L., Dong, J., Lin, L., & Yan, S. (2015). Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12), 2402–2414.
Google Scholar
Liang, X., Shen, X., Feng, J., Lin, L., Yan, S. (2016). Semantic object parsing with graph lstm. In Proceedings of the European Conference on Computer Vision, (pp. 125–143).
Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., & Yan, S. (2016). Semantic object parsing with local-global long short-term memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3185–3193).
Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015). Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 1386–1394).
Lin, C., Li, Z., Zhou, S., Hu, S., Zhang, J., Luo, L., Zhang, J., Huang, L., & He, Y. (2022). Rmgn: A regional mask guided network for parser-free virtual try-on. arXiv preprint arXiv:2204.11258
Lin, J., Yang, H., Chen, D., Zeng, M., Wen, F., & Yuan, L. (2019). Face parsing with roi tanh-warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5654–5663).
Lin, L., Zhang, D., & Zuo, W. (2020). Human centric visual analysis with deep learning. Singapore: Springer.
Google Scholar
Lin, T. Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2117–2125).
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, (pp. 740–755).
Liu, G., Song, D., Tong, R., Tang, M. (2021). Toward realistic virtual try-on through landmark-guided shape matching. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 2118–2126).
Liu, J., Yao, Y., Hou, W., Cui, M., Xie, X., Zhang, C., & Hua, X. S. (2020). Boosting semantic human matting with coarse annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8563–8572).
Liu, K., Choi, O., Wang, J., & Hwang, W. (2021). Cdgnet: Class distribution guided network for human parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4473–4482).
Liu, S., Feng, J., Domokos, C., Xu, H., Huang, J., Hu, Z., & Yan, S. (2013). Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia, 16(1), 253–265.
Google Scholar
Liu, S., Liang, X., Liu, L., Lu, K., Lin, L., Cao, X., & Yan, S. (2015). Fashion parsing with video context. IEEE Transactions on Multimedia, 17(8), 1347–1358.
Google Scholar
Liu, S., Liang, X., Liu, L., Shen, X., Yang, J., Xu, C., & Lin, L. (2015). Matching-cnn meets knn: Quasi-parametric human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1419–1427).
Liu, S., Sun, Y., Zhu, D., Ren, G., Chen, Y., Feng, J., Han, J. (2018). Cross-domain human parsing via adversarial feature and label adaptation. In Proceedings of the AAAI Conference On Artificial Intelligence, (pp. 7146–7153).
Liu, S., Zhong, G., Mello, S. D., Gu, J., Jampani, V., Yang, M. H., & Kautz, J. (2018). Switchable temporal propagation network. In Proceedings of the European Conference on Computer Vision, (pp. 87–102).
Liu, X., Zhang, M., Liu, W., Song, J., & Mei, T. (2019). Braidnet: Braiding semantics and details for accurate human parsing. In Proceedings of the 27th ACM International Conference on Multimedia, (pp. 338–346).
Liu, Y., Chen, W., Liu, L., & Lew, M. S. (2019). Swapgan: A multistage generative approach for person-to-person fashion style transfer. IEEE Transactions on Multimedia, 21(9), 2209–2222.
Google Scholar
Liu, Y., Zhang, S., Yang, J., & Yuen, P. (2021). Hierarchical information passing based noise-tolerant hybrid learning for semi-supervised human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 2207–2215).
Liu, Y., Zhao, L., Zhang, S., & Yang, J. (2020). Hybrid resolution network using edge guided region mutual information loss for human parsing. In Proceedings of the 28th ACM International Conference on Multimedia, (pp. 1670–1678).
Liu, Z., Zhu, X., Yang, L., Yan, X., Tang, M., Lei, Z., Zhu, G., Feng, X., Wang, Y., & Wang, J. (2021). Multi-initialization optimization network for accurate 3d human pose and shape estimation. In Proceedings of the 29th ACM International Conference on Multimedia, (pp. 1976–1984).
Loshchilov, I., & Hutter, F. (2018). Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.
Luo, P., Wang, X., & Tang, X. (2013). Pedestrian parsing via deep decompositional network. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 2648–2655).
Luo, X., Su, Z., & Guo, J. (2018). Trusted guidance pyramid network for human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 654–662).
Luo, Y., Zheng, Z., Zheng, L., Guan, T., Yu, J., & Yang, Y. (2018). Macro-micro adversarial network for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 418–434).
Ma, Z., Lin, T., Li, X., Li, F., He, D., Ding, E., Wang, N., & Gao, X. (2022). Dual-affinity style embedding network for semantic-aligned image style transfer. IEEE Transactions on Neural Networks and Learning Systems, 34(10), 7404–7417.
Google Scholar
Mameli, M., Paolanti, M., Pietrini, R., Pazzaglia, G., Frontoni, E., Zingaretti, P. (2021). Deep learning approaches for fashion knowledge extraction from social media: a review. IEEE Access.
Mckee, D., Zhan, Z., Shuai, B., Modolo, D., Tighe, J., & Lazebnik, S. (2022). Transfer of representations to video label propagation: implementation factors matter. arXiv preprint arXiv:2203.05553.
Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2021). Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3523–3542.
Google Scholar
Neuhold, G., Ollmann, T., Bulo, S. R., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 4990–4999).
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
Nie, X., Feng, J., & Yan, S. (2018). Mutual learning to adapt for joint human parsing and pose estimation. In Proceedings of the European Conference on Computer Vision, (pp. 502–517).
Niemeyer, M., & Geiger, A. (2021). Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 11453–11464).
Ntavelis, E., Romero, A., Kastanis, I., Gool, L. V., & Timofte, R. (2020). Sesame: Semantic editing of scenes by adding, manipulating or erasing objects. In Proceedings of the European Conference on Computer Vision, (pp. 394–411).
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., & El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193
Qian, R., Ding, S., Liu, X., & Lin, D. (2023). Semantics meets temporal correspondence: Self-supervised object-centric learning in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 16675–16687).
Qian, X., Wang, W., Zhang, L., Zhu, F., Fu, Y., Tao, X., Jiang, Y. G., & Xue, X. (2020). Long-term cloth-changing person re-identification. In Proceedings of the Asian Conference on Computer Vision, (pp. 71–88).
Qin, H., Hong, W., Hung, W. C., Tsai, Y. H., & Yang, M. H. (2019). A top-down unified framework for instance-level human parsing. In British Machine Vision Conference
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, (pp. 8748–8763).
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2021). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10684–10695).
Ruan, T., Liu, T., Huang, Z., Wei, Y., Wei, S., & Zhao, Y. (2019). Devil in the details: Towards accurate single and multiple human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 4814–4821).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Li, F. F. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Schuemie, M. J., Straaten, P. V. D., Krijn, M., & Mast, C. A. V. D. (2001). Research on presence in virtual reality: A survey. CyberPsychology & Behavior, 4(2), 183–201.
Shelhamer, E., Long, J., & Darrell, T. (2016). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.
Son, J. (2022). Contrastive learning for space-time correspondence via self-cycle consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 14679–14688).
Sun, Y., Zheng, L., Li, Y., Yang, Y., Tian, Q., & Wang, S. (2019). Learning part-based convolutional features for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3), 902–917.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1–9).
Tang, B., Jin, C., Zhang, D., & Zheng, Q. (2021). Motion human parsing: A new benchmark for 3d human parsing. In IEEE International Conference on Big Data, (pp. 3203–3208).
Tang, S., Chen, C., Xie, Q., Chen, M., Wang, Y., Ci, Y., Bai, L., Zhu, F., Yang, H., Yi, L., Zhao, R., & Ouyang, W. (2023). Humanbench: Towards general human-centric perception with projector assisted pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 21970–21982).
Tian, M., Yi, S., Li, H., Li, S., Zhang, X., Shi, J., Yan, J., & Wang, X. (2018). Eliminating background-bias for robust person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 5794–5803).
Tian, Z., Shen, C., Chen, H., & He, T. (2020). Fcos: A simple and strong anchor-free object detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4), 1922–1933.
Tighe, J., & Lazebnik, S. (2010). Superparsing: scalable nonparametric image parsing with superpixels. In Proceedings of the European Conference on Computer Vision, (pp. 352–365).
Tseng, H. Y., Fisher, M., Lu, J., Li, Y., Kim, V., & Yang, M. H. (2020). Modeling artistic workflows for image generation and editing. In Proceedings of the European Conference on Computer Vision, (pp. 158–174).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, (pp. 6000–6010).
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In Proceedings of the European Conference on Computer Vision, (pp. 391–408).
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., & Yang, M. (2018). Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision, (pp. 589–604).
Wang, D., & Zhang, S. (2023). Contextual instance decoupling for instance-level human analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 9520–9533.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., & Xiao, B. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.
Wang, N., Zhou, W., & Li, H. (2021). Contrastive transformation for self-supervised correspondence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 10174–10182).
Wang, W., Zhang, Z., Qi, S., Shen, J., Pang, Y., & Shao, L. (2019). Learning compositional neural information fusion for human parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5703–5713).
Wang, W., Zhou, T., Porikli, F., Crandall, D., & Gool, L. V. (2021). A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153.
Wang, W., Zhou, T., Qi, S., Shen, J., & Zhu, S. C. (2021). Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, W., Zhu, H., Dai, J., Pang, Y., Shen, J., & Shao, L. (2020). Hierarchical human parsing with typed part-relation reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8929–8939).
Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2566–2576).
Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, (pp. 4724–4732).
Wood, E., Baltrusaitis, T., Hewitt, C., Dziadzio, S., Johnson, M., Estellers, V., Cashman, T. J., & Shotton, J. (2021). Fake it till you make it: Face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 3681–3691).
Wu, B., Xie, Z., Liang, X., Xiao, Y., Dong, H., & Lin, L. (2021). Image comes dancing with collaborative parsing-flow video synthesis. IEEE Transactions on Image Processing, 30, 9259–9269.
Wu, D., Yang, Z., Zhang, P., Wang, R., & Yang, B. (2023). Virtual-reality interpromotion technology for metaverse: A survey. IEEE Internet of Things Journal, 10(18), 15788–15809.
Wu, Z., Lin, G., Tao, Q., & Cai, J. (2019). M2e-try on net: Fashion from model to everyone. In Proceedings of the 27th ACM International Conference on Multimedia, (pp. 293–301).
Xia, F., Wang, P., Chen, L. C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In Proceedings of the European Conference on Computer Vision, (pp. 648–663).
Xia, F., Wang, P., Chen, X., & Yuille, A. L. (2017). Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 6769–6778).
Xia, F., Zhu, J., Wang, P., & Yuille, A. L. (2016). Pose-guided human parsing by an and/or graph using pose-context features. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 3632–3640).
Xiao, B., Hu, H., & Wei, Y. (2018). Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, (pp. 466–481).
Xie, Z., Zhang, X., Zhao, F., Dong, H., Kampffmeyer, M., Yan, H., & Liang, X. (2021). Was-vton: Warping architecture search for virtual try-on network. In Proceedings of the 29th ACM International Conference on Multimedia, (pp. 3350–3359).
Xu, J., & Wang, X. (2021). Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 10075–10085).
Yamaguchi, K., Hadi Kiapour, M., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 3519–3526).
Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3570–3577).
Yang, J., Wang, C., Li, Z., Wang, J., & Zhang, R. (2023). Semantic human parsing via scalable semantic transfer over multiple label domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 19424–19433).
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., & Gao, J. (2023). Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441.
Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5188–5197).
Yang, L., Jiang, H., Song, Q., & Guo, J. (2022). A survey on long-tailed visual recognition. International Journal of Computer Vision, 130(7), 1837–1872.
Yang, L., Liu, Z., Zhou, T., & Song, Q. (2022). Part decomposition and refinement network for human parsing. IEEE/CAA Journal of Automatica Sinica, 9(6), 1111–1114.
Yang, L., Song, Q., Wang, Z., Hu, M., & Liu, C. (2020). Hier r-cnn: Instance-level human parts detection and a new benchmark. IEEE Transactions on Image Processing, 30, 39–54.
Yang, L., Song, Q., Wang, Z., Hu, M., Liu, C., Xin, X., Jia, W., & Xu, S. (2020). Renovating parsing r-cnn for accurate multiple human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 421–437).
Yang, L., Song, Q., Wang, Z., & Jiang, M. (2019). Parsing r-cnn for instance-level human analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 364–373).
Yang, L., Song, Q., Wang, Z., Liu, Z., Xu, S., & Li, Z. (2022). Quality-aware network for human parsing. IEEE Transactions on Multimedia, 25, 7128–7138.
Yang, L., Song, Q., & Wu, Y. (2021). Attacks on state-of-the-art face recognition using attentional adversarial attack generative network. Multimedia Tools and Applications, 80(1), 855–875.
Yang, L., Song, Q., Wu, Y., & Hu, M. (2018). Attention inspiring receptive-fields network for learning invariant representations. IEEE Transactions on Neural Networks and Learning Systems, 30(6), 1744–1755.
Yang, W., Huang, H., Zhang, Z., Chen, X., Huang, K., & Zhang, S. (2019). Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1389–1398).
Yang, Y., & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1385–1392).
Yu, C., Zhu, X., Zhang, X., Wang, Z., Zhang, Z., & Lei, Z. (2022). Hp-capsule: Unsupervised face part discovery by hierarchical parsing capsule network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4032–4041).
Yu, R., Wang, X., & Xie, X. (2019). Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 10511–10520).
Yu, S., Li, S., Chen, D., Zhao, R., Yan, J., & Qiao, Y. (2020). Cocas: A large-scale clothes changing person dataset for re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 3400–3409).
Yu, Z., Yoon, J. S., Li, I. K., Venkatesh, P., Park, J., Yu, J., & Park, H. S. (2020). Humbi: A large multiview dataset of human body expressions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2990–3000).
Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. In Proceedings of the European Conference on Computer Vision, (pp. 173–190).
Zeng, D., Huang, Y., Bao, Q., Zhang, J., Su, C., & Liu, W. (2021). Neural architecture search for joint human parsing and pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 11385–11394).
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 3836–3847).
Zhang, S., Cao, X., Qi, G. J., Song, Z., & Zhou, J. (2022). Aiparsing: Anchor-free instance-level human parsing. IEEE Transactions on Image Processing, 31, 5599–5612.
Zhang, X., Chen, Y., Tang, M., Wang, J., Zhu, X., & Lei, Z. (2022). Human parsing with part-aware relation modeling. IEEE Transactions on Multimedia, 25, 2601–2612.
Zhang, X., Chen, Y., Zhu, B., Wang, J., & Tang, M. (2020). Blended grammar network for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 189–205).
Zhang, X., Chen, Y., Zhu, B., Wang, J., & Tang, M. (2020). Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8971–8980).
Zhang, Z., Su, C., Zheng, L., & Xie, X. (2020). Correlating edge, pose with parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8900–8909).
Zhang, Z., Su, C., Zheng, L., Xie, X., & Li, Y. (2021). On the correlation among edge, pose and parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 8492–8507.
Zhao, F., Xie, Z., Kampffmeyer, M., Dong, H., Han, S., Zheng, T., Zhang, T., & Liang, X. (2021). M3d-vton: A monocular-to-3d virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 13239–13249).
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2881–2890).
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 792–800).
Zhao, J., Li, J., Liu, H., Yan, S., & Feng, J. (2020). Fine-grained multi-human parsing. International Journal of Computer Vision, 128(8), 2185–2203.
Zhao, Y., Li, J., Zhang, Y., & Tian, Y. (2019). Multi-class part parsing with joint boundary-semantic awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9177–9186).
Zhao, Y., Li, J., Zhang, Y., & Tian, Y. (2022). From pose to part: Weakly-supervised pose evolution for human part segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3107–3120.
Zhao, Z., Jin, Y., & Heng, P. A. (2021). Modelling neighbor relation in joint space-time graph for video correspondence learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9960–9969).
Zheng, C., Wu, W., Yang, T., Zhu, S., Chen, C., Liu, R., Shen, J., Kehtarnavaz, N., & Shah, M. (2023). Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1), 1–37.
Zheng, S., Yang, F., Kiapour, M. H., & Piramuthu, R. (2018). Modanet: A large-scale street fashion dataset with polygon annotations. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 1670–1678).
Zheng, Z., Yu, T., Wei, Y., Dai, Q., & Liu, Y. (2019). Deephuman: 3d human reconstruction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7739–7749).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 633–641).
Zhou, Q., Liang, X., Gong, K., & Lin, L. (2018). Adaptive temporal encoding network for video instance-level human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 1527–1535).
Zhou, T., Wang, W., Liu, S., Yang, Y., & Gool, L. V. (2021). Differentiable multi-granularity human representation learning for instance-aware human semantic parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1622–1631).
Zhou, T., Yang, Y., & Wang, W. (2023). Differentiable multi-granularity human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 8296–8310.
Zhu, B., Chen, Y., Tang, M., & Wang, J. (2018). Progressive cognitive human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 7607–7614).
Zhu, L., Chen, Y., Lu, Y., Lin, C., & Yuille, A. (2008). Max margin and/or graph learning for parsing the human body. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1–8).
Zhu, T., Karlsson, P., & Bregler, C. (2020). Simpose: Effectively learning densepose and surface normals of people from simulated data. In Proceedings of the European Conference on Computer Vision, (pp. 225–242).
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations.

Acknowledgements

This work was supported by the China National Postdoctoral Program for Innovative Talents (No. BX2021047), China Postdoctoral Science Foundation (No. 2022M710466), and Young Scientists Fund of NSFC (Grant No. 62206025).

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, 100876, China
Lu Yang, Wenhe Jia, Shan Li & Qing Song

Corresponding author

Correspondence to Qing Song.

Additional information

Communicated by Limin Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, L., Jia, W., Li, S. et al. Deep Learning Technique for Human Parsing: A Survey and Outlook. Int J Comput Vis 132, 3270–3301 (2024). https://doi.org/10.1007/s11263-024-02031-9

Received: 14 June 2023
Accepted: 09 February 2024
Published: 09 March 2024
Version of record: 09 March 2024
Issue date: August 2024
DOI: https://doi.org/10.1007/s11263-024-02031-9

Part of a collection:

Survey Papers