Abstract
With the development of modern smart cities, human-centric video analysis faces the challenge of parsing diverse and complex events in real scenes. A complex event may involve dense crowds, anomalous individuals, or collective behaviors. However, limited by the scale and coverage of existing video datasets, few human analysis approaches have reported their performance on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events, or Human-centric video analysis in complex Events (HiEve), for understanding human motions, poses, and actions in a variety of realistic events, especially crowded and complex scenes. It contains a record number of poses (>1M), the largest number of action instances under complex events (>56k), and some of the longest-lasting trajectories (average trajectory length >480 frames). Based on these diverse annotations, we present two simple baselines for action recognition and pose estimation, respectively. Both leverage cross-label information during training to enhance feature learning for the corresponding visual task. Experiments show that they boost the performance of existing action recognition and pose estimation pipelines and, more importantly, that the wide-ranging annotations in HiEve benefit a variety of video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, demonstrating that HiEve is a challenging dataset for human-centric video analysis. We expect the dataset to advance the development of cutting-edge techniques for human-centric analysis and the understanding of complex events. The dataset is available at http://humaninevents.org.
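To make the cross-label idea concrete, the sketch below shows one generic way pose and action labels from a single dataset could jointly supervise a shared backbone in PyTorch. It is an illustrative assumption only: the module names, dimensions, and loss weighting are hypothetical and do not reproduce the baselines described in the paper.

```python
# Minimal multi-task sketch (hypothetical, not the HiEve baselines):
# pose heatmaps and action labels both back-propagate through a shared backbone,
# so each task's supervision can enhance the features used by the other.
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, feat_dim=256, num_actions=14, num_joints=14):
        super().__init__()
        # Tiny stand-in backbone; a real system would use e.g. a ResNet or HRNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Action head: global average pooling followed by a linear classifier.
        self.action_head = nn.Linear(feat_dim, num_actions)
        # Pose head: 1x1 convolution predicting one heatmap per joint.
        self.pose_head = nn.Conv2d(feat_dim, num_joints, kernel_size=1)

    def forward(self, images):
        feats = self.backbone(images)                       # (B, C, H, W)
        action_logits = self.action_head(feats.mean(dim=(2, 3)))
        pose_heatmaps = self.pose_head(feats)               # (B, J, H, W)
        return action_logits, pose_heatmaps

model = SharedBackboneMultiTask()
images = torch.randn(2, 3, 64, 64)
action_labels = torch.randint(0, 14, (2,))
target_heatmaps = torch.rand(2, 14, 32, 32)

action_logits, pose_heatmaps = model(images)
# Joint loss: both terms update the shared backbone (weight 0.5 is arbitrary).
loss = nn.functional.cross_entropy(action_logits, action_labels) \
     + 0.5 * nn.functional.mse_loss(pose_heatmaps, target_heatmaps)
loss.backward()
```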
Data Availability
The datasets analyzed during the current study are all publicly available. Please refer to http://humaninevents.org for further details.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest. All videos in this paper were either collected with the prior informed consent of the human participants for data publication, or obtained from online repositories with the publishing approval of the video authors; in all cases, human identity information is properly hidden or blurred.
Intended use of HiEve
The authors do not condone AI systems developed for malicious or unethical surveillance and tracking. Any use of the proposed video dataset must adhere to all relevant laws and regulations, including those related to data protection, privacy, and ethical considerations. The dataset is not to be used for any purpose that violates individual privacy or other legal or ethical standards. The authors are committed to ensuring that the dataset is used in ways that benefit society and do not cause harm.
Additional information
Communicated by Dima Damen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, W., Liu, H., Liu, S. et al. HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events. Int J Comput Vis 131, 2994–3018 (2023). https://doi.org/10.1007/s11263-023-01842-6