Abstract
Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. Over the last decade, it has attracted significantly more interest in the computer vision community and has been used in a broad range of practical applications, including security monitoring, social media, and visual special effects. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework that provides a high-performance baseline for follow-up research through a universal, concise, and extensible solution. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also maintain a regularly updated project page to track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
References
Bao, H., Dong, L., Piao, S., & Wei, F. (2022). Beit: Bert pre-training of image transformers. In Proceedings of the International Conference on Learning Representations.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., & Jiao, Y. (2023). Improving image generation with better captions. OpenAI blog.
Bo, Y., & Fowlkes, C. C. (2011). Shape-based pedestrian parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2265–2272).
Borras, A., Tous, F., Llados, J., & Vanrell, M. (2003). High-level clothes description based on colour-texture and structural features. In Iberian Conference on Pattern Recognition and Image Analysis, (pp. 108–116).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, pp. 213–229.
Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision, (pp. 139–156).
Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9650–9660).
Chang, Y., Peng, T., He, R., Hu, X., Liu, J., Zhang, Z., & Jiang, M. (2022). Pf-vton: Toward high-quality parser-free virtual try-on network. In International Conference on Multimedia Modeling, (pp. 28–40).
Chen, H., Xu, Z., Liu, Z., & Zhu, S. C. (2006). Composite templates for cloth modeling and sketching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 943–950).
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Chen, L. C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3640–3649).
Chen, Q., Ge, T., Xu, Y., Zhang, Z., Yang, X., & Gai, K. (2018). Semantic human matting. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 618–626).
Chen, R., Chen, X., Ni, B., & Ge, Y. (2020). Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, (pp. 2003–2011).
Chen, S., & Wang, J. (2023). Virtual reality human-computer interactive english education experience system based on mobile terminal. International Journal of Human-Computer Interaction. https://doi.org/10.1080/10447318.2023.2190674
Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., & Sun, X. (2023). Beyond appearance: a semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 15050–15061).
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1971–1978).
Chen, Y., Zhu, X., & Gong, S. (2019). Instance-guided context rendering for cross-domain person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 232–242).
Cheng, B., Chen, L. C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T. S., Hwu, W. M., & Shi, H. (2019). Spgnet: Semantic prediction guidance for scene parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5218–5228).
Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., & Schwing, A. G. (2021). Mask2former for video instance segmentation. arXiv preprint arXiv:2112.10764
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Cheng, B., Schwing, A. G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34, 17864–17875.
Cheng, W., Song, S., Chen, C. Y., Hidayati, S. C., & Liu, J. (2021). Fashion meets computer vision: A survey. ACM Computing Surveys, 54(4), 1–41.
Ci, Y., Wang, Y., Chen, M., Tang, S., Bai, L., Zhu, F., Zhao, R., Yu, F., Qi, D., & Ouyang, W. (2023). Unihcp: A unified model for human-centric perceptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition., (pp. 17840–17852).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3213–3223).
Dai, Y., Chen, X., Wang, X., Pang, M., Gao, L., & Shen, H. T. (2023). Resparser: Fully convolutional multiple human parsing with representative sets. IEEE Transactions on Multimedia, 26, 1384–1394.
Devlin, J., Chang, M. W., Lee, K., &Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 4171–4186).
Dong, H., Liang, X., Shen, X., Wang, B., Lai, H., Zhu, J., Hu, Z., & Yin, J. (2019). Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9026–9035).
Dong, J., Chen, Q., Shen, X., Yang, J., & Yan, S. (2014). Towards unified human parsing and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 843–850).
Dong, J., Chen, Q., Xia, W., Huang, Z., & Yan, S. (2013). A deformable mixture parsing model with parselets. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 3408–3415).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fang, H. S., Lu, G., Fang, X., Xie, J., Tai, Y. W., & Lu, C. (2018). Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 70–78).
Fang, J., Sun, Y., Zhang, Q., Li, Y., Liu, W., & Wang, X. (2020). Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10628–10637).
Fruhstuck, A., Singh, K. K., Shechtman, E., Mitra Niloy, J., Wonka, P., & Lu, J. (2022). Insetgan for full-body image generation. arXiv preprint arXiv:2203.07293
Fulkerson, B., Vedaldi, A., & Soatto, S. (2009). Class segmentation and object localization with superpixel neighborhoods. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 670–677).
Gao, Y., Lang, C., Liu, F., Cao, Y., Sun, L., & Wei, Y. (2023). Dynamic interaction dilation for interactive human parsing. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2023.3262973
Gao, Y., Liang, L., Lang, C., Feng, S., Li, Y., & Wei, Y. (2022). Clicking matters: Towards interactive human parsing. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2022.3156812
Ge, Y., Zhang, R., Wang, X., Tang, X., & Luo, P. (2019). Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5337–5345).
de Geus, D., Meletis, P., Lu, C., Wen, X., Dubbelman, G. (2021). Part-aware panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5485–5494).
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 15180–15190).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 580–587).
Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., & Lin, L. (2019). Graphonomy: Universal human parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 7450–7459).
Gong, K., Liang, X., Li, Y., Chen, Y., Yang, M., & Lin, L. (2018). Instance-level human parsing via part grouping network. In Proceedings of the European Conference on Computer Vision, (pp. 770–785).
Gong, K., Liang, X., Zhang, D., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 932–940).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
Guan, P., Freifeld, O., & Black, M. J. (2010). A 2d human body model dressed in eigen clothing. In Proceedings of the European Conference on Computer Vision, (pp. 285–298).
Guler, R. A., & Kokkinos, I. (2019). Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10884–10894).
Guler, R. A., Neverova, N., & Kokkinos, I. (2018). Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 7297–7306).
Gupta, A., Wu, J., Deng, J., & Fei-Fei, L. (2023). Siamese masked autoencoders. arXiv preprint arXiv:2305.14344
Han, X., Wu, Z., Wu, Z., Yu, R., & Davis, L. S. (2018). Viton: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 7543–7552).
Hariharan, B., Arbelaez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision, (pp. 297–312).
He, H., Zhang, J., Thuraisingham, B., & Tao, D. (2021). Progressive one-shot human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 1522–1530).
He, H., Zhang, J., Zhang, Q., Tao, D. (2020). Grapy-ml: Graph pyramid mutual learning for cross-dataset human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 10949–10956).
He, H., Zhang, J., Zhuang, B., Cai, J., & Tao, D. (2023). End-to-end one-shot human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9729–9738).
He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 770–778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hu, Y., Wang, R., Zhang, K., & Gao, Y. (2022). Semantic-aware fine-grained correspondence. In European Conference on Computer Vision, (pp. 97–115).
Huang, H., Yang, W., Lin, J., Huang, G., Xu, J., Wang, G., Chen, X., & Huang, K. (2020). Improve person re-identification with part awareness learning. IEEE Transactions on Image Processing, 29, 7468–7481.
Huang, L., Chen, D., Liu, Y., Shen, Y., Zhao, D., & Zhou, J. (2023). Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778
Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., & Huang, T. S. (2023). Ccnet: Criss-cross attention for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(06), 6896–6908.
Huo, J., Jin, S., Li, W., Wu, J., Lai, Y. K., Shi, Y., & Gao, Y. (2021). Manifold alignment for semantically aligned style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 14861–14869).
Issenhuth, T., Mary, J., Calauzenes, C. (2020). Do not mask what you do not need to mask: a parser-free virtual try-on. In Proceedings of the European Conference on Computer Vision, (pp. 619–635).
Jabri, A. A., Owens, A., & Efros, A. A. (2020). Space-time correspondence as a contrastive random walk. Advances in Neural Information Processing Systems, 33, 19545–19560.
Jeon, S., Min, D., Kim, S., & Sohn, K. (2021). Mining better samples for contrastive learning of temporal correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1034–1044).
Ji, R., Du, D., Zhang, L., Wen, L., Wu, Y., Zhao, C., Huang, F., & Lyu, S. (2020). Learning semantic neural tree for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 205–221).
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, (pp. 675–678).
Jin, Z., Gong, T., Yu, D., Chu, Q., Wang, J., Wang, C., & Shao, J. (2021). Mining contextual information beyond image for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7231–7241).
Jin, Z., Liu, B., Chu, Q., & Yu, N. (2021). Isnet: Integrate image-level and semantic-level context for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7189–7198).
Kae, A., Sohn, K., Lee, H., & Learned-Miller, E. (2013). Augmenting crfs with boltzmann machine shape priors for image labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2019–2026).
Kalayeh, M. M., Basaran, E., Gokmen, M., Kamasak, M. E., & Shah, M. (2018). Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1062–1071).
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4401–4410).
Khan, K., Khan, R. U., Ahmad, K., Ali, F., & Kwak, K. S. (2020). Face segmentation: A journey from classical to deep learning paradigm, approaches, trends, and directions. IEEE Access, 8, 58683–58699.
Kiefel, M., & Gehler, P. (2014). Human pose estimation with fields of parts. In Proceedings of the European Conference on Computer Vision, (pp. 331—346).
Kim, B. K., Kim, G., & Lee, S. Y. (2019). Style-controlled synthesis of clothing segments for fashion image manipulation. IEEE Transactions on Multimedia, 22(2), 298–310.
Kirillov, A., Girshick, R., He, K., & Dollar, P. (2019). Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 6399–6408).
Kirillov, A., He, K., Girshick, R., Rother, C., & Dollar, P. (2019). Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9404–9413).
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W. Y., et al. (2023). Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems.
L2ID: Learning from limited or imperfect data (l2id) workshop. https://l2id.github.io/challenge_localization.html (2021)
Ladicky, L., Torr, P. H., & Zisserman, A. (2013). Human pose estimation using a joint pixel-wise and part-wise formulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3578–3585).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017). Multiple-human parsing in the wild. arXiv preprint arXiv:1705.07206
Li, L., Zhou, T., Wang, W., Li, J., & Yang, Y. (2022). Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1246–1257).
Li, L., Zhou, T., Wang, W., Yang, L., Li, J., & Yang, Y. (2022). Locality-aware inter-and intra-video reconstruction for self-supervised correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Li, P., Xu, Y., Wei, Y., & Yang, Y. (2020). Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 3260–3271.
Li, Q., Arnab, A., & Torr, P. H. (2017). Holistic, instance-level human parsing. In British Machine Vision Conference.
Li, R., & Liu, D. (2023). Spatial-then-temporal self-supervised learning for video correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2279–2288).
Li, T., Liang, Z., Zhao, S., Gong, J., & Shen, J. (2020). Self-learning with rectification strategy for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 9263–9272).
Li, X., Liu, S., Mello, S. D., Wang, X., Kautz, J., & Yang, M. H. (2019). Joint-task self-supervised learning for temporal correspondence. Advances in Neural Information Processing Systems, 32, 318–328.
Li, Z., Cao, L., Wang, H., & Xu, L. (2023). End-to-end instance-level human parsing by segmenting persons. IEEE Transactions on Multimedia, 26, 41–50.
Li, Z., Lv, J., Chen, Y., & Yuan, J. (2021). Person re-identification with part prediction alignment. Computer Vision and Image Understanding, 205, 103172.
Liang, H., Yuan, J., & Thalmann, D. (2014). Parsing the hand in depth images. IEEE Transactions on Multimedia, 16(5), 1241–1253.
Liang, X., Gong, K., Shen, X., & Lin, L. (2018). Look into person: Joint body parsing pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4), 871–885.
Liang, X., Lin, L., Shen, X., Feng, J., Yan, S., & Xing, E. P. (2017). Interpretable structure-evolving lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1010–1019).
Liang, X., Lin, L., Yang, W., Luo, P., Huang, J., & Yan, S. (2016). Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia, 18(6), 1175–1186.
Liang, X., Liu, S., Shen, X., Yang, J., Liu, L., Dong, J., Lin, L., & Yan, S. (2015). Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12), 2402–2414.
Liang, X., Shen, X., Feng, J., Lin, L., Yan, S. (2016). Semantic object parsing with graph lstm. In Proceedings of the European Conference on Computer Vision, (pp. 125–143).
Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., & Yan, S. (2016). Semantic object parsing with local-global long short-term memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3185–3193).
Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015). Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 1386–1394).
Lin, C., Li, Z., Zhou, S., Hu, S., Zhang, J., Luo, L., Zhang, J., Huang, L., & He, Y. (2022). Rmgn: A regional mask guided network for parser-free virtual try-on. arXiv preprint arXiv:2204.11258
Lin, J., Yang, H., Chen, D., Zeng, M., Wen, F., & Yuan, L. (2019). Face parsing with roi tanh-warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 5654–5663).
Lin, L., Zhang, D., & Zuo, W. (2020). Human centric visual analysis with deep learning. Singapore: Springer.
Lin, T. Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2117–2125).
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, (pp. 740–755).
Liu, G., Song, D., Tong, R., Tang, M. (2021). Toward realistic virtual try-on through landmark-guided shape matching. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 2118–2126).
Liu, J., Yao, Y., Hou, W., Cui, M., Xie, X., Zhang, C., & Hua, X. S. (2020). Boosting semantic human matting with coarse annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8563–8572).
Liu, K., Choi, O., Wang, J., & Hwang, W. (2021). Cdgnet: Class distribution guided network for human parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4473–4482).
Liu, S., Feng, J., Domokos, C., Xu, H., Huang, J., Hu, Z., & Yan, S. (2013). Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia, 16(1), 253–265.
Liu, S., Liang, X., Liu, L., Lu, K., Lin, L., Cao, X., & Yan, S. (2015). Fashion parsing with video context. IEEE Transactions on Multimedia, 17(8), 1347–1358.
Liu, S., Liang, X., Liu, L., Shen, X., Yang, J., Xu, C., & Lin, L. (2015). Matching-cnn meets knn: Quasi-parametric human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1419–1427).
Liu, S., Sun, Y., Zhu, D., Ren, G., Chen, Y., Feng, J., Han, J. (2018). Cross-domain human parsing via adversarial feature and label adaptation. In Proceedings of the AAAI Conference On Artificial Intelligence, (pp. 7146–7153).
Liu, S., Zhong, G., Mello, S. D., Gu, J., Jampani, V., Yang, M. H., & Kautz, J. (2018). Switchable temporal propagation network. In Proceedings of the European Conference on Computer Vision, (pp. 87–102).
Liu, X., Zhang, M., Liu, W., Song, J., & Mei, T. (2019). Braidnet: Braiding semantics and details for accurate human parsing. In Proceedings of the 27th ACM International Conference on Multimedia, (pp. 338–346).
Liu, Y., Chen, W., Liu, L., & Lew, M. S. (2019). Swapgan: A multistage generative approach for person-to-person fashion style transfer. IEEE Transactions on Multimedia, 21(9), 2209–2222.
Liu, Y., Zhang, S., Yang, J., & Yuen, P. (2021). Hierarchical information passing based noise-tolerant hybrid learning for semi-supervised human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 2207–2215).
Liu, Y., Zhao, L., Zhang, S., & Yang, J. (2020). Hybrid resolution network using edge guided region mutual information loss for human parsing. In Proceedings of the 28th ACM International Conference on Multimedia, (pp. 1670–1678).
Liu, Z., Zhu, X., Yang, L., Yan, X., Tang, M., Lei, Z., Zhu, G., Feng, X., Wang, Y., & Wang, J. (2021). Multi-initialization optimization network for accurate 3d human pose and shape estimation. In Proceedings of the 29th ACM International Conference on Multimedia, (pp. 1976–1984).
Loshchilov, I., & Hutter, F. (2018). Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.
Luo, P., Wang, X., & Tang, X. (2013). Pedestrian parsing via deep decompositional network. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 2648–2655).
Luo, X., Su, Z., & Guo, J. (2018). Trusted guidance pyramid network for human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 654–662).
Luo, Y., Zheng, Z., Zheng, L., Guan, T., Yu, J., & Yang, Y. (2018). Macro-micro adversarial network for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 418–434).
Ma, Z., Lin, T., Li, X., Li, F., He, D., Ding, E., Wang, N., & Gao, X. (2022). Dual-affinity style embedding network for semantic-aligned image style transfer. IEEE Transactions on Neural Networks and Learning Systems, 34(10), 7404–7417.
Mameli, M., Paolanti, M., Pietrini, R., Pazzaglia, G., Frontoni, E., Zingaretti, P. (2021). Deep learning approaches for fashion knowledge extraction from social media: a review. IEEE Access.
Mckee, D., Zhan, Z., Shuai, B., Modolo, D., Tighe, J., & Lazebnik, S. (2022). Transfer of representations to video label propagation: implementation factors matter. arXiv preprint arXiv:2203.05553.
Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2021). Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3523–3542.
Neuhold, G., Ollmann, T., Bulo, S. R., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 4990–4999).
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
Nie, X., Feng, J., & Yan, S. (2018). Mutual learning to adapt for joint human parsing and pose estimation. In Proceedings of the European Conference on Computer Vision, (pp. 502–517).
Niemeyer, M., & Geiger, A. (2021). Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 11453–11464).
Ntavelis, E., Romero, A., Kastanis, I., Gool, L. V., & Timofte, R. (2020). Sesame: Semantic editing of scenes by adding, manipulating or erasing objects. In Proceedings of the European Conference on Computer Vision, (pp. 394–411).
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., & El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193
Qian, R., Ding, S., Liu, X., & Lin, D. (2023). Semantics meets temporal correspondence: Self-supervised object-centric learning in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 16675–16687).
Qian, X., Wang, W., Zhang, L., Zhu, F., Fu, Y., Tao, X., Jiang, Y. G., & Xue, X. (2020). Long-term cloth-changing person re-identification. In Proceedings of the Asian Conference on Computer Vision, (pp. 71–88).
Qin, H., Hong, W., Hung, W. C., Tsai, Y. H., & Yang, M. H. (2019). A top-down unified framework for instance-level human parsing. In British Machine Vision Conference
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, (pp. 8748–8763).
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2021). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 10684–10695).
Ruan, T., Liu, T., Huang, Z., Wei, Y., Wei, S., Zhao, Y. (2019). Devil in the details: Towards accurate single and multiple human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 4814–4821).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Li, F. F. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Schuemie, M. J., Straaten, P. V. D., Krijn, M., & Mast, C. A. V. D. (2001). Research on presence in virtual reality: A survey. Cyberpsychology behavior, 4(2), 183–201.
Shelhamer, E., Long, J., & Darrell, T. (2016). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.
Son, J. (2022). Contrastive learning for space-time correspondence via self-cycle consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 14679–14688).
Sun, Y., Zheng, L., Li, Y., Yang, Y., Tian, Q., & Wang, S. (2019). Learning part-based convolutional features for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3), 902–917.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1–9).
Tang, B., Jin, C., Zhang, D., & Zheng, Q. (2021). Motion human parsing: A new benchmark for 3d human parsing. In IEEE International Conference on Big Data, (pp. 3203–3208).
Tang, S., Chen, C., Xie, Q., Chen, M., Wang, Y., Ci, Y., Bai, L., Zhu, F., Yang, H., Yi, L., Zhao, R., & Ouyang, W. (2023). Humanbench: Towards general human-centric perception with projector assisted pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 21970–21982).
Tian, M., Yi, S., Li, H., Li, S., Zhang, X., Shi, J., Yan, J., & Wang, X. (2018). Eliminating background-bias for robust person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 5794–5803).
Tian, Z., Shen, C., Chen, H., & He, T. (2020). Fcos: A simple and strong anchor-free object detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4), 1922–1933.
Tighe, J., & Lazebnik, S. (2010). Superparsing: scalable nonparametric image parsing with superpixels. In Proceedings of the European Conference on Computer Vision, (pp. 352–365).
Tseng, H. Y., Fisher, M., Lu, J., Li, Y., Kim, V., & Yang, M. H. (2020). Modeling artistic workflows for image generation and editing. In Proceedings of the European Conference on Computer Vision, (pp. 158–174).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, (pp. 6000–6010).
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In Proceedings of the European Conference on Computer Vision, (pp. 391–408).
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., & Yang, M. (2018). Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision, (pp. 589–604).
Wang, D., & Zhang, S. (2023). Contextual instance decoupling for instance-level human analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 9520–9533.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., & Xiao, B. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.
Wang, N., Zhou, W., & Li, H. (2021). Contrastive transformation for self-supervised correspondence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 10174–10182).
Wang, W., Zhang, Z., Qi, S., Shen, J., Pang, Y., & Shao, L. (2019). Learning compositional neural information fusion for human parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5703–5713).
Wang, W., Zhou, T., Porikli, F., Crandall, D., & Gool, L. V. (2021). A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153.
Wang, W., Zhou, T., Qi, S., Shen, J., & Zhu, S. C. (2021). Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, W., Zhu, H., Dai, J., Pang, Y., Shen, J., & Shao, L. (2020). Hierarchical human parsing with typed part-relation reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8929–8939).
Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2566–2576).
Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 4724–4732).
Wood, E., Baltrusaitis, T., Hewitt, C., Dziadzio, S., Johnson, M., Estellers, V., Cashman, T. J., & Shotton, J. (2021). Fake it till you make it: Face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 3681–3691).
Wu, B., Xie, Z., Liang, X., Xiao, Y., Dong, H., & Lin, L. (2021). Image comes dancing with collaborative parsing-flow video synthesis. IEEE Transactions on Image Processing, 30, 9259–9269.
Wu, D., Yang, Z., Zhang, P., Wang, R., & Yang, B. (2023). Virtual-reality interpromotion technology for metaverse: A survey. IEEE Internet of Things Journal, 10(18), 15788–15809.
Wu, Z., Lin, G., Tao, Q., & Cai, J. (2019). M2e-try on net: Fashion from model to everyone. In Proceedings of the 27th ACM International Conference on Multimedia, (pp. 293–301).
Xia, F., Wang, P., Chen, L. C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In Proceedings of the European Conference on Computer Vision, (pp. 648–663).
Xia, F., Wang, P., Chen, X., & Yuille, A. L. (2017). Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 6769–6778).
Xia, F., Zhu, J., Wang, P., & Yuille, A. L. (2016). Pose-guided human parsing by an and/or graph using pose-context features. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 3632–3640).
Xiao, B., Hu, H., & Wei, Y. (2018). Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, (pp. 466–481).
Xie, Z., Zhang, X., Zhao, F., Dong, H., Kampffmeyer, M., Yan, H., & Liang, X. (2021). Was-vton: Warping architecture search for virtual try-on network. In Proceedings of the 29th ACM International Conference on Multimedia, (pp. 3350–3359).
Xu, J., & Wang, X. (2021). Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 10075–10085).
Yamaguchi, K., Hadi Kiapour, M., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 3519–3526).
Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3570–3577).
Yang, J., Wang, C., Li, Z., Wang, J., & Zhang, R. (2023). Semantic human parsing via scalable semantic transfer over multiple label domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 19424–19433).
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., & Gao, J. (2023). Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441.
Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 5188–5197).
Yang, L., Jiang, H., Song, Q., & Guo, J. (2022). A survey on long-tailed visual recognition. International Journal of Computer Vision, 130(7), 1837–1872.
Yang, L., Liu, Z., Zhou, T., & Song, Q. (2022). Part decomposition and refinement network for human parsing. IEEE/CAA Journal of Automatica Sinica, 9(6), 1111–1114.
Yang, L., Song, Q., Wang, Z., Hu, M., & Liu, C. (2020). Hier r-cnn: Instance-level human parts detection and a new benchmark. IEEE Transactions on Image Processing, 30, 39–54.
Yang, L., Song, Q., Wang, Z., Hu, M., Liu, C., Xin, X., Jia, W., & Xu, S. (2020). Renovating parsing r-cnn for accurate multiple human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 421–437).
Yang, L., Song, Q., Wang, Z., & Jiang, M. (2019). Parsing r-cnn for instance-level human analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 364–373).
Yang, L., Song, Q., Wang, Z., Liu, Z., Xu, S., & Li, Z. (2022). Quality-aware network for human parsing. IEEE Transactions on Multimedia, 25, 7128–7138.
Yang, L., Song, Q., & Wu, Y. (2021). Attacks on state-of-the-art face recognition using attentional adversarial attack generative network. Multimedia Tools and Applications, 80(1), 855–875.
Yang, L., Song, Q., Wu, Y., & Hu, M. (2018). Attention inspiring receptive-fields network for learning invariant representations. IEEE Transactions on Neural Networks and Learning Systems, 30(6), 1744–1755.
Yang, W., Huang, H., Zhang, Z., Chen, X., Huang, K., & Zhang, S. (2019). Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1389–1398).
Yang, Y., & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1385–1392).
Yu, C., Zhu, X., Zhang, X., Wang, Z., Zhang, Z., & Lei, Z. (2022). Hp-capsule: Unsupervised face part discovery by hierarchical parsing capsule network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 4032–4041).
Yu, R., Wang, X., & Xie, X. (2019). Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 10511–10520).
Yu, S., Li, S., Chen, D., Zhao, R., Yan, J., & Qiao, Y. (2020). Cocas: A large-scale clothes changing person dataset for re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 3400–3409).
Yu, Z., Yoon, J. S., Li, I. K., Venkatesh, P., Park, J., Yu, J., & Park, H. S. (2020). Humbi: A large multiview dataset of human body expressions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2990–3000).
Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. In Proceedings of the European Conference on Computer Vision, (pp. 173–190).
Zeng, D., Huang, Y., Bao, Q., Zhang, J., Su, C., & Liu, W. (2021). Neural architecture search for joint human parsing and pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 11385–11394).
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 3836–3847).
Zhang, S., Cao, X., Qi, G. J., Song, Z., & Zhou, J. (2022). Aiparsing: Anchor-free instance-level human parsing. IEEE Transactions on Image Processing, 31, 5599–5612.
Zhang, X., Chen, Y., Tang, M., Wang, J., Zhu, X., & Lei, Z. (2022). Human parsing with part-aware relation modeling. IEEE Transactions on Multimedia, 25, 2601–2612.
Zhang, X., Chen, Y., Zhu, B., Wang, J., & Tang, M. (2020). Blended grammar network for human parsing. In Proceedings of the European Conference on Computer Vision, (pp. 189–205).
Zhang, X., Chen, Y., Zhu, B., Wang, J., & Tang, M. (2020). Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8971–8980).
Zhang, Z., Su, C., Zheng, L., & Xie, X. (2020). Correlating edge, pose with parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 8900–8909).
Zhang, Z., Su, C., Zheng, L., Xie, X., & Li, Y. (2021). On the correlation among edge, pose and parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 8492–8507.
Zhao, F., Xie, Z., Kampffmeyer, M., Dong, H., Han, S., Zheng, T., Zhang, T., & Liang, X. (2021). M3d-vton: A monocular-to-3d virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 13239–13249).
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2881–2890).
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 792–800).
Zhao, J., Li, J., Liu, H., Yan, S., & Feng, J. (2020). Fine-grained multi-human parsing. International Journal of Computer Vision, 128(8), 2185–2203.
Zhao, Y., Li, J., Zhang, Y., & Tian, Y. (2019). Multi-class part parsing with joint boundary-semantic awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9177–9186).
Zhao, Y., Li, J., Zhang, Y., & Tian, Y. (2022). From pose to part: Weakly-supervised pose evolution for human part segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3107–3120.
Zhao, Z., Jin, Y., & Heng, P. A. (2021). Modelling neighbor relation in joint space-time graph for video correspondence learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 9960–9969).
Zheng, C., Wu, W., Yang, T., Zhu, S., Chen, C., Liu, R., Shen, J., Kehtarnavaz, N., & Shah, M. (2023). Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1), 1–37.
Zheng, S., Yang, F., Kiapour, M. H., & Piramuthu, R. (2018). Modanet: A large-scale street fashion dataset with polygon annotations. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 1670–1678).
Zheng, Z., Yu, T., Wei, Y., Dai, Q., & Liu, Y. (2019). Deephuman: 3d human reconstruction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 7739–7749).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 633–641).
Zhou, Q., Liang, X., Gong, K., & Lin, L. (2018). Adaptive temporal encoding network for video instance-level human parsing. In Proceedings of the 26th ACM International Conference on Multimedia, (pp. 1527–1535).
Zhou, T., Wang, W., Liu, S., Yang, Y., & Gool, L. V. (2021). Differentiable multi-granularity human representation learning for instance-aware human semantic parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 1622–1631).
Zhou, T., Yang, Y., & Wang, W. (2023). Differentiable multi-granularity human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 8296–8310.
Zhu, B., Chen, Y., Tang, M., & Wang, J. (2018). Progressive cognitive human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, (pp. 7607–7614).
Zhu, L., Chen, Y., Lu, Y., Lin, C., & Yuille, A. (2008). Max margin and/or graph learning for parsing the human body. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1–8).
Zhu, T., Karlsson, P., & Bregler, C. (2020). Simpose: Effectively learning densepose and surface normals of people from simulated data. In Proceedings of the European Conference on Computer Vision, (pp. 225–242).
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations.
Acknowledgements
This work was supported by the China National Postdoctoral Program for Innovative Talents (No. BX2021047), China Postdoctoral Science Foundation (No. 2022M710466), and Young Scientists Fund of NSFC (Grant No. 62206025).
Additional information
Communicated by Limin Wang.
About this article
Cite this article
Yang, L., Jia, W., Li, S. et al. Deep Learning Technique for Human Parsing: A Survey and Outlook. Int J Comput Vis 132, 3270–3301 (2024). https://doi.org/10.1007/s11263-024-02031-9
DOI: https://doi.org/10.1007/s11263-024-02031-9