Abstract
With the development of modern smart cities, human-centric video analysis faces the challenge of parsing diverse and complex events in real scenes. A complex event may involve dense crowds, anomalous individuals, or collective behaviors. However, limited by the scale and coverage of existing video datasets, few human analysis approaches have reported their performance on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events, or Human-centric video analysis in complex Events (HiEve), for understanding human motions, poses, and actions in a variety of realistic events, especially crowded and complex scenes. It contains a record number of poses (>1M), the largest number of action instances under complex events (>56k), and some of the longest-lasting trajectories (average trajectory length >480 frames). Based on these diverse annotations, we present two simple baselines for action recognition and pose estimation, respectively. Both leverage cross-label information during training to enhance feature learning for the corresponding visual task. Experiments show that they boost the performance of existing action recognition and pose estimation pipelines and, more importantly, that the wide-ranging annotations in HiEve benefit a variety of video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, demonstrating that HiEve is a challenging dataset for human-centric video analysis. We expect the dataset to advance the development of cutting-edge techniques for human-centric analysis and the understanding of complex events. The dataset is available at http://humaninevents.org.
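To make the cross-label idea concrete, the sketch below shows one generic way pose and action labels from a single dataset could jointly supervise a shared backbone in PyTorch. It is an illustrative assumption only: the module names, dimensions, and loss weighting are hypothetical and do not reproduce the baselines described in the paper.

```python
# Minimal multi-task sketch (hypothetical, not the HiEve baselines):
# pose heatmaps and action labels both back-propagate through a shared backbone,
# so each task's supervision can enhance the features used by the other.
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, feat_dim=256, num_actions=14, num_joints=14):
        super().__init__()
        # Tiny stand-in backbone; a real system would use e.g. a ResNet or HRNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Action head: global average pooling followed by a linear classifier.
        self.action_head = nn.Linear(feat_dim, num_actions)
        # Pose head: 1x1 convolution predicting one heatmap per joint.
        self.pose_head = nn.Conv2d(feat_dim, num_joints, kernel_size=1)

    def forward(self, images):
        feats = self.backbone(images)                       # (B, C, H, W)
        action_logits = self.action_head(feats.mean(dim=(2, 3)))
        pose_heatmaps = self.pose_head(feats)               # (B, J, H, W)
        return action_logits, pose_heatmaps

model = SharedBackboneMultiTask()
images = torch.randn(2, 3, 64, 64)
action_labels = torch.randint(0, 14, (2,))
target_heatmaps = torch.rand(2, 14, 32, 32)

action_logits, pose_heatmaps = model(images)
# Joint loss: both terms update the shared backbone (weight 0.5 is arbitrary).
loss = nn.functional.cross_entropy(action_logits, action_labels) \
     + 0.5 * nn.functional.mse_loss(pose_heatmaps, target_heatmaps)
loss.backward()
```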
Data Availability
The datasets analyzed during the current study are all publicly available. Please refer to http://humaninevents.org for further details.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest. All videos in this paper were either collected with the prior informed consent of the human participants for data publication, or obtained from online repositories with the publishing approval of the video authors; in all cases, human identity information is properly hidden or blurred.
Intended use of HiEve
The authors do not condone AI systems developed for malicious or unethical surveillance and tracking. Any use of the proposed video dataset must adhere to all relevant laws and regulations, including those related to data protection, privacy, and ethical considerations. The dataset is not to be used for any purpose that violates individual privacy or other legal or ethical standards. The authors are committed to ensuring that the dataset is used in ways that benefit society and do not cause harm.
Additional information
Communicated by Dima Damen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, W., Liu, H., Liu, S. et al. HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events. Int J Comput Vis 131, 2994–3018 (2023). https://doi.org/10.1007/s11263-023-01842-6