
Procedure-Aware Action Quality Assessment: Datasets and Performance Evaluation

International Journal of Computer Vision

Abstract

In this paper, we investigate the problem of procedure-aware action quality assessment, which analyzes the action quality by delving into the semantic and spatial-temporal relationships among various composed steps of the action. Most existing action quality assessment methods regress on deep features of entire videos to learn diverse scores, which ignore the relationships among different fine-grained steps in actions and result in limitations in visual interpretability and generalization ability. To address these issues, we construct a fine-grained competitive sports video dataset called FineDiving with detailed semantic and temporal annotations, which helps understand the internal structures of each action. We also propose a new approach (i.e., spatial-temporal segmentation attention, STSA) that introduces procedure segmentation to parse an action into consecutive steps, learns powerful representations from these steps by constructing spatial motion attention and procedure-aware cross-attention, and designs a fine-grained contrastive regression to achieve an interpretable scoring mechanism. In addition, we build a benchmark on the FineDiving dataset to evaluate the performance of representative action quality assessment methods. Then, we expand FineDiving to FineDiving+ and construct three new benchmarks to investigate the transferable abilities between different diving competitions, between synchronized and individual dives, and between springboard and platform dives to demonstrate the generalization abilities of our STSA in unknown scenarios, scoring rules, action types, and difficulty degrees. Extensive experiments demonstrate that our approach, designed for procedure-aware action quality assessment, achieves substantial improvements. Our dataset and code are available at https://github.com/xujinglin/FineDiving.
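As a rough illustration of the scoring mechanism mentioned in the abstract, the sketch below shows the contrastive (exemplar-relative) regression idea in PyTorch: instead of regressing an absolute score, a model predicts the score difference between a query video and an exemplar whose judge score is known, then anchors the prediction to that exemplar. The feature dimension, MLP head, and tensors are illustrative assumptions, not the actual STSA architecture or FineDiving data.

```python
import torch
import torch.nn as nn

class ContrastiveRegressor(nn.Module):
    """Minimal sketch of contrastive (exemplar-relative) score regression.

    The model predicts the quality difference between a query video and an
    exemplar video, and adds it to the exemplar's known judge score.
    Dimensions and head design are illustrative assumptions only.
    """

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.delta_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, query_feat, exemplar_feat, exemplar_score):
        # Predict the relative quality difference between the two videos...
        delta = self.delta_head(torch.cat([query_feat, exemplar_feat], dim=-1))
        # ...and anchor it to the exemplar's known judge score.
        return exemplar_score + delta.squeeze(-1)

# Hypothetical usage with random per-video features pooled over steps
query = torch.randn(2, 1024)
exemplar = torch.randn(2, 1024)
exemplar_score = torch.tensor([85.5, 72.0])
print(ContrastiveRegressor()(query, exemplar, exemplar_score))
```

The intuition behind this design is that judging relative quality against a structurally similar exemplar is an easier and more interpretable target than predicting an absolute score directly.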



Notes

  1. https://resources.fina.org/


Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (62125603 and 62373043) and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001). The authors thank Xumin Yu and Guangyi Chen for their valuable suggestions.

Author information

Corresponding author

Correspondence to Jiwen Lu.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 On Annotator Disagreement

During the annotation process, we measure inter-annotator agreement with Fleiss’ kappa \(\kappa \), which helps identify and resolve inconsistencies between annotators. \(\kappa \) is calculated as:

$$\begin{aligned} \kappa = \frac{P_A - P_e}{1 - P_e} \qquad (19) \end{aligned}$$

where \(P_A\) is the mean observed agreement among annotators and \(P_e\) is the agreement expected by chance; thus \(1-P_e\) gives the degree of agreement attainable above chance, and \(P_A-P_e\) gives the degree of agreement actually achieved above chance. If the annotators are in complete agreement, then \(\kappa \!=\!1\); if there is no agreement beyond what would be expected by chance, then \(\kappa \!\le \!0\). When \(\kappa \) falls below 0.6, we regard it as inter-annotator disagreement, and additional clarification or higher-level judgments are required to resolve the ambiguity. For instance, the guideline states that the somersault pike or somersault tuck in the flight is considered over when the torso opens at an angle of more than 90 degrees. If guidelines are well-defined and leave little room for interpretation, annotators are more likely to agree on the annotations. In this work, we provide clear and detailed annotation guidelines according to the official FINA diving rules (Footnote 1), including various examples and extreme cases, minimizing ambiguity and enhancing the reliability of the annotated data. Take defining where the sub-action “Entry” starts and ends as an example: “Entry” starts when 1) the torso opens at an angle of more than 90 degrees, 2) the head, torso, and legs form roughly a straight line, and 3) the arms extend straight above the head with the hands close together; “Entry” ends when the hands enter the water first, followed by the head, shoulders, and the rest of the body.
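As a concrete illustration of Eq. (19), here is a minimal sketch in Python/NumPy that computes Fleiss’ kappa from an item-by-category count matrix. The number of clips, annotators, and candidate step labels in the toy example are hypothetical, not statistics from FineDiving.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (N items x k categories) count matrix.

    ratings[i, j] is the number of annotators who assigned item i to
    category j; every row sums to the same number of annotators n.
    """
    N, k = ratings.shape
    n = ratings[0].sum()  # annotators per item
    # P_A: mean per-item agreement (fraction of annotator pairs that agree)
    P_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    P_A = P_i.mean()
    # P_e: chance agreement from the marginal category proportions
    p_j = ratings.sum(axis=0) / (N * n)
    P_e = np.sum(p_j ** 2)
    return (P_A - P_e) / (1 - P_e)

# Hypothetical example: 5 clips, 4 annotators, 3 candidate step labels
ratings = np.array([
    [4, 0, 0],
    [3, 1, 0],
    [0, 4, 0],
    [1, 1, 2],
    [0, 0, 4],
])
print(f"kappa = {fleiss_kappa(ratings):.3f}")
```

In this toy setting, each row corresponds to one labeling decision; rows with split votes (e.g., the fourth clip) pull \(\kappa \) down and would trigger the higher-level review described above.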

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Xu, J., Rao, Y., Zhou, J. et al. Procedure-Aware Action Quality Assessment: Datasets and Performance Evaluation. Int J Comput Vis 132, 6069–6090 (2024). https://doi.org/10.1007/s11263-024-02146-z
