Abstract
Generic event boundary detection aims to localize generic, taxonomy-free event boundaries that segment a video into chunks. Existing methods typically require video frames to be fully decoded before being fed into the network, which introduces significant spatio-temporal redundancy and demands considerable computational power and storage. To remedy these issues, we propose a novel, fully end-to-end compressed video representation learning method for event boundary detection that leverages the rich information available in the compressed domain, i.e., RGB images, motion vectors, residuals, and the internal group-of-pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in each GOP, and design a spatial-channel attention module (SCAM) to refine the P-frame representations from the compressed information with bidirectional information flow. To learn a representation suited to boundary detection, we construct a local frame bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships, then compute frame differences via group similarities in the temporal domain. This module is applied only within a local window, which is critical for event boundary detection. Finally, a simple classifier determines the event boundaries of the video sequence from the learned representation. To address annotation ambiguity and speed up training, we preprocess the ground-truth event boundaries with a Gaussian kernel. Extensive experiments on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements over the previous end-to-end approach while running at the same speed. The code is available at https://github.com/GX77/LCVSL.
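To make the Gaussian preprocessing of ground-truth boundaries mentioned above concrete, the following is a minimal sketch (not the authors' released code): a hypothetical helper, gaussian_soft_labels, turns annotated boundary frame indices into soft per-frame targets by placing a Gaussian bump at each boundary; the kernel width sigma and the max-merging of overlapping bumps are assumptions for illustration.

```python
import numpy as np

def gaussian_soft_labels(boundary_indices, num_frames, sigma=1.0):
    """Convert hard boundary annotations into soft per-frame targets.

    Each annotated boundary contributes a Gaussian bump centered at its
    frame index; overlapping bumps are merged by taking the element-wise
    maximum, so the result stays in [0, 1] and can be used with a
    binary cross-entropy loss.
    """
    t = np.arange(num_frames, dtype=np.float32)
    labels = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_indices:
        bump = np.exp(-0.5 * ((t - b) / sigma) ** 2)
        labels = np.maximum(labels, bump)
    return labels

# Example: a 20-frame clip with boundaries annotated at frames 5 and 13.
print(np.round(gaussian_soft_labels([5, 13], 20, sigma=1.0), 2))
```

Such soft targets spread supervision over frames adjacent to an annotated boundary, which is one way to absorb the annotation ambiguity the abstract refers to.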
Data Availability Statement
The data that support the findings of this study are openly available in the "GEBD" repository at https://github.com/StanLei52/GEBD and are described in the benchmark paper cited in this article (Shou et al., 2021).
Notes
https://www.meltycone.com/blog/video-marketing-statistics-for-2023.
References
Alwassel, H., Heilbron, F. C., & Ghanem, B. (2018). Action search: Spotting actions in videos and its application to temporal action localization. In: ECCV.
Arnab, A., Dehghani, M., Heigold, G., et al. (2021). Vivit: A video vision transformer. In: ICCV.
Caba Heilbron, F., Barrios, W., Escorcia, V., et al. (2017). SCC: Semantic context cascade for efficient action detection. In: CVPR.
Carreira, J., & Zisserman, A. (2017). Quo Vadis, action recognition? A new model and the kinetics dataset. In: CVPR.
Chao, Y. W., Vijayanarasimhan, S., Seybold, B., et al. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In: CVPR.
Chen, Y., Kalantidis, Y., Li, J., et al. (2018). Multi-fiber networks for video recognition. In: ECCV.
Deng, J., Dong, W., Socher, R., et al. (2009). ImageNet: A large-scale hierarchical image database. In: CVPR.
Ding, L., & Xu, C. (2018). Weakly-supervised action segmentation with iterative soft boundary assignment. In: CVPR.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR.
Fan, H., Xiong, B., Mangalam, K., et al. (2021). Multiscale vision transformers. In: ICCV.
Fan, L., Huang, W., Gan, C., et al. (2018). End-to-end learning of motion representation for video understanding. In: CVPR.
Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In: CVPR.
Feichtenhofer, C., Fan, H., Malik, J., et al. (2019). SlowFast networks for video recognition. In: ICCV.
Gall, D. L. (1991). MPEG: A video compression standard for multimedia applications. Communications of the ACM, 34(4), 46–58.
Geirhos, R., Jacobsen, J., Michaelis, C., et al. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.
He, K., Zhang, X., Ren, S., et al. (2016). Deep residual learning for image recognition. In: CVPR.
Hong, D., Li, C., Wen, L., et al. (2021). Generic event boundary detection challenge at CVPR 2021 technical report: Cascaded temporal attention network (CASTANET). arXiv.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In: CVPR.
Huang, D., Fei-Fei, L., Niebles, J. C. (2016). Connectionist temporal modeling for weakly supervised action labeling. In: ECCV.
Huang, L., Liu, Y., Wang, B., et al. (2021). Self-supervised video representation learning by context and motion decoupling. In: CVPR.
Ji, S., Xu, W., Yang, M., et al. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.
Kang, H., Kim, J., Kim, K., et al. (2021). Winning the CVPR’2021 kinetics-GEBD challenge: Contrastive learning approach. arXiv.
Kuehne, H., Jhuang, H., Garrote, E., et al. (2011). HMDB: A large video database for human motion recognition. In: ICCV.
Lea, C., Reiter, A., Vidal, R., et al. (2016). Segmental spatiotemporal CNNs for fine-grained action segmentation. In: ECCV.
Lea, C., Flynn, M. D., Vidal, R., et al. (2017). Temporal convolutional networks for action segmentation and detection. In: CVPR.
Li, C., Wang, X., Wen, L., et al. (2022). End-to-end compressed video representation learning for generic event boundary detection. In: CVPR.
Li, J., Wei, P., Zhang, Y., et al. (2020). A Slow-I-Fast-P architecture for compressed video action recognition. In: ACM MM.
Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In: ACM MM.
Lin, T., Liu, X., Li, X., et al. (2019). BMN: Boundary-matching network for temporal action proposal generation. In: ICCV.
Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV.
Liu, Z., Ning, J., Cao, Y., et al. (2022). Video Swin transformer. In: CVPR.
Long, F., Yao, T., Qiu, Z., et al. (2019). Gaussian temporal awareness networks for action localization. In: CVPR.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: CVPR.
Ma, S., Sigal, L., & Sclaroff, S. (2016). Learning activity progression in lstms for activity detection and early detection. In: CVPR.
Ng, J. Y., Choi, J., Neumann, J., et al. (2018). ActionFlowNet: Learning motion representation for action recognition. In: WACV.
Ni, B., Yang, X., & Gao, S. (2016). Progressively parsing interactional objects for fine grained action detection. In: CVPR.
Paszke, A., Gross, S., Massa, F., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In: NeurIPS.
Rai, A. K., Krishna, T., Dietlmeier, J., et al. (2021). Discerning generic event boundaries in long-form wild videos. arXiv.
Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In: CVPR.
Shao, D., Zhao, Y., Dai, B., et al. (2020). Intra- and inter-action understanding via temporal action parsing. In: CVPR.
Shou, M. Z., Lei, S. W., Wang, W., et al. (2021). Generic event boundary detection: A benchmark for event segmentation. In: ICCV.
Shou, Z., Lin, X., Kalantidis, Y., et al. (2019). DMC-Net: Generating discriminative motion cues for fast compressed video action recognition. In: CVPR.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In: NIPS.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv.
Sun, D., Yang, X., Liu, M., et al. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR.
Tang, J., Liu, Z., Qian, C., et al. (2022). Progressive attention on multi-level dense difference maps for generic event boundary detection. In: CVPR.
Taylor, G. W., Fergus, R., LeCun, Y., et al. (2010). Convolutional learning of spatio-temporal features. In: ECCV.
Tran, D., Bourdev, L. D., Fergus, R., et al. (2015). Learning spatiotemporal features with 3D convolutional networks. In: ICCV.
Tran, D., Ray, J., Shou, Z., et al. (2017). ConvNet architecture search for spatiotemporal feature learning. arXiv.
Tran, D., Wang, H., Feiszli, M., et al. (2019). Video classification with channel-separated convolutional networks. In: ICCV.
Varol, G., Laptev, I., & Schmid, C. (2018). Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1510–1517.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In: NIPS.
Wang, L., Li, W., Li, W., et al. (2018a). Appearance-and-relation networks for video classification. In: CVPR.
Wang, S., Lu, H., & Deng, Z. (2019). Fast object detection in compressed video. In: ICCV.
Wang, X., Girshick, R. B., Gupta, A., et al. (2018b). Non-local neural networks. In: CVPR.
Woo, S., Park, J., Lee, J., et al. (2018). CBAM: Convolutional block attention module. In: ECCV.
Wu, C., Zaheer, M., Hu, H., et al. (2018). Compressed video action recognition. In: CVPR.
Xie, S., Sun, C., Huang, J., et al. (2017). Rethinking spatiotemporal feature learning for video understanding. arXiv.
Yu, Y., Lee, S., Kim, G., et al. (2021). Self-supervised learning of compressed video representations. In: ICLR.
Yuan, Z., Stroud, J. C., Lu, T., et al. (2017). Temporal action localization by structured maximal sums. In: CVPR.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: ECCV.
Zhang, B., Wang, L., Wang, Z., et al. (2016). Real-time action recognition with enhanced motion vector CNNs. In: CVPR.
Zhang, B., Wang, L., Wang, Z., et al. (2018). Real-time action recognition with deeply transferred motion vector CNNs. IEEE Transactions on Image Processing, 27(5), 2326–2339.
Zhang, H., Hao, Y., & Ngo, C. (2021). Token shift transformer for video classification. In: ACM MM.
Zhao, P., Xie, L., Ju, C., et al. (2020). Bottom-up temporal action localization with mutual regularization. In: ECCV.
Zhao, Y., Xiong, Y., Wang, L., et al. (2017). Temporal action detection with structured segment networks. In: ICCV.
Acknowledgements
Libo Zhang was supported by the Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-JSC038), the High-end Research Institutions Innovation Special Funds introduced by the Zhongshan Science and Technology Bureau (No. 2020AG011), and the Youth Innovation Promotion Association, CAS (2020111). Heng Fan and his employer received no financial support for the research, authorship, and/or publication of this article.
Additional information
Communicated by Dima Damen.
About this article
Cite this article
Zhang, L., Gu, X., Li, C. et al. Local Compressed Video Stream Learning for Generic Event Boundary Detection. Int J Comput Vis 132, 1187–1204 (2024). https://doi.org/10.1007/s11263-023-01921-8