
Procedure-Aware Action Quality Assessment: Datasets and Performance Evaluation

International Journal of Computer Vision

Abstract

In this paper, we investigate the problem of procedure-aware action quality assessment, which analyzes the action quality by delving into the semantic and spatial-temporal relationships among various composed steps of the action. Most existing action quality assessment methods regress on deep features of entire videos to learn diverse scores, which ignore the relationships among different fine-grained steps in actions and result in limitations in visual interpretability and generalization ability. To address these issues, we construct a fine-grained competitive sports video dataset called FineDiving with detailed semantic and temporal annotations, which helps understand the internal structures of each action. We also propose a new approach (i.e., spatial-temporal segmentation attention, STSA) that introduces procedure segmentation to parse an action into consecutive steps, learns powerful representations from these steps by constructing spatial motion attention and procedure-aware cross-attention, and designs a fine-grained contrastive regression to achieve an interpretable scoring mechanism. In addition, we build a benchmark on the FineDiving dataset to evaluate the performance of representative action quality assessment methods. Then, we expand FineDiving to FineDiving+ and construct three new benchmarks to investigate the transferable abilities between different diving competitions, between synchronized and individual dives, and between springboard and platform dives to demonstrate the generalization abilities of our STSA in unknown scenarios, scoring rules, action types, and difficulty degrees. Extensive experiments demonstrate that our approach, designed for procedure-aware action quality assessment, achieves substantial improvements. Our dataset and code are available at https://github.com/xujinglin/FineDiving.
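As a rough illustration of the scoring mechanism mentioned in the abstract, the sketch below shows the contrastive (exemplar-relative) regression idea in PyTorch: instead of regressing an absolute score, a model predicts the score difference between a query video and an exemplar whose judge score is known, then anchors the prediction to that exemplar. The feature dimension, MLP head, and tensors are illustrative assumptions, not the actual STSA architecture or FineDiving data.

```python
import torch
import torch.nn as nn

class ContrastiveRegressor(nn.Module):
    """Minimal sketch of contrastive (exemplar-relative) score regression.

    The model predicts the quality difference between a query video and an
    exemplar video, and adds it to the exemplar's known judge score.
    Dimensions and head design are illustrative assumptions only.
    """

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.delta_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, query_feat, exemplar_feat, exemplar_score):
        # Predict the relative quality difference between the two videos...
        delta = self.delta_head(torch.cat([query_feat, exemplar_feat], dim=-1))
        # ...and anchor it to the exemplar's known judge score.
        return exemplar_score + delta.squeeze(-1)

# Hypothetical usage with random per-video features pooled over steps
query = torch.randn(2, 1024)
exemplar = torch.randn(2, 1024)
exemplar_score = torch.tensor([85.5, 72.0])
print(ContrastiveRegressor()(query, exemplar, exemplar_score))
```

The intuition behind this design is that judging relative quality against a structurally similar exemplar is an easier and more interpretable target than predicting an absolute score directly.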



Notes

  1. https://resources.fina.org/


Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (62125603 and 62373043) and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001). The authors thank Xumin Yu and Guangyi Chen for their valuable suggestions.

Author information

Corresponding author

Correspondence to Jiwen Lu.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 On Annotator Disagreement

During the annotation process, we measure inter-annotator agreement with Fleiss’ kappa \(\kappa \), which helps identify and resolve inconsistencies between annotators. \(\kappa \) is calculated as:

$$\begin{aligned} \kappa = \frac{P_A - P_e}{1 - P_e} \qquad (19) \end{aligned}$$

where \(P_A\) is the mean observed agreement among annotators and \(P_e\) is the agreement expected by chance; thus \(1-P_e\) gives the degree of agreement attainable above chance, and \(P_A-P_e\) gives the degree of agreement actually achieved above chance. If the annotators are in complete agreement, then \(\kappa \!=\!1\); if there is no agreement beyond what would be expected by chance, then \(\kappa \!\le \!0\). When \(\kappa \) falls below 0.6, we regard it as inter-annotator disagreement, and additional clarification or higher-level judgments are required to resolve the ambiguity. For instance, the guideline states that the somersault pike or somersault tuck in the flight is considered over when the torso opens at an angle of more than 90 degrees. If guidelines are well-defined and leave little room for interpretation, annotators are more likely to agree on the annotations. In this work, we provide clear and detailed annotation guidelines according to the official FINA diving rules (Footnote 1), including various examples and extreme cases, minimizing ambiguity and enhancing the reliability of the annotated data. Take defining where the sub-action “Entry” starts and ends as an example: “Entry” starts when 1) the torso opens at an angle of more than 90 degrees, 2) the head, torso, and legs form roughly a straight line, and 3) the arms extend straight above the head with the hands close together; “Entry” ends when the hands enter the water first, followed by the head, shoulders, and the rest of the body.
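As a concrete illustration of Eq. (19), here is a minimal sketch in Python/NumPy that computes Fleiss’ kappa from an item-by-category count matrix. The number of clips, annotators, and candidate step labels in the toy example are hypothetical, not statistics from FineDiving.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (N items x k categories) count matrix.

    ratings[i, j] is the number of annotators who assigned item i to
    category j; every row sums to the same number of annotators n.
    """
    N, k = ratings.shape
    n = ratings[0].sum()  # annotators per item
    # P_A: mean per-item agreement (fraction of annotator pairs that agree)
    P_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    P_A = P_i.mean()
    # P_e: chance agreement from the marginal category proportions
    p_j = ratings.sum(axis=0) / (N * n)
    P_e = np.sum(p_j ** 2)
    return (P_A - P_e) / (1 - P_e)

# Hypothetical example: 5 clips, 4 annotators, 3 candidate step labels
ratings = np.array([
    [4, 0, 0],
    [3, 1, 0],
    [0, 4, 0],
    [1, 1, 2],
    [0, 0, 4],
])
print(f"kappa = {fleiss_kappa(ratings):.3f}")
```

In this toy setting, each row corresponds to one labeling decision; rows with split votes (e.g., the fourth clip) pull \(\kappa \) down and would trigger the higher-level review described above.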

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Xu, J., Rao, Y., Zhou, J. et al. Procedure-Aware Action Quality Assessment: Datasets and Performance Evaluation. Int J Comput Vis 132, 6069–6090 (2024). https://doi.org/10.1007/s11263-024-02146-z
