
State-aware video procedural captioning


Abstract

Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task, generating a procedural text from the clip sequence of an instructional video and material set. In this task, the state of materials is sequentially changed by manipulations, yielding their state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after manipulations to better associate the cross-modal relations. To achieve this, we propose a novel VPC method, which modifies an existing textual simulator for tracking material states as a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embedding of materials to demonstrate that the simulators capture their state transition.


Data Availability Statement

The datasets generated during and/or analysed during the current study are available in our repository (Footnote 7): https://github.com/misogil0116/svpc_pp

Notes

  1. We perform an experiment on the full prediction setting in Section 4.7.

  2. We employ pre-trained 300D word embeddings, which can be downloaded from http://nlp.stanford.edu/data/glove.6B.zip

  3. To handle materials that consist of multiple words, we divide the probability by the number of words.

  4. We will release annotated ingredients and the dataset split.

  5. The attention weight in the material selector was higher than 0.5.

  6. The raw and updated ingredients correspond to an embedding \(\boldsymbol {\mathcal {E}}^{0}\) by the material encoder and an embedding \(\boldsymbol {\mathcal {E}}^{n}\) updated from \(\boldsymbol {\mathcal {E}}^{0}\) by the visual simulator, respectively.

  7. Currently, the repository is in private mode. After our manuscript has been accepted, we will release the code and dataset.

  8. \(\boldsymbol {w}_{p}^{j}\) represents the j-th value of wp; thus, Eq. (6) indicates the normalization of wp

  9. https://github.com/flairNLP/flair

  10. In the tag definitions of the E-rFG corpus, we display food entities as the estimated ingredients. These entities cannot be directly used for our dataset because the definition of food differs slightly from the definition of ingredients in this paper (for example, “it” and “salad” are recognized as food in the E-rFG corpus). Therefore, we asked annotators to delete or rewrite ingredients if they were not appropriate.

References

  1. Akbik A, Blythe D, Vollgraf R (2018) Contextual string embeddings for sequence labeling. In: Proc COLING, pp 1638–1649

  2. Alayrac J-B, Bojanowski P, Agrawal N, Sivic J, Laptev I, Lacoste-Julien S (2016) Unsupervised learning from narrated instruction videos. In: Proc CVPR, pp 4575–4583

  3. Alayrac J-B, Sivic J, Laptev I, Lacoste-Julien S (2017) Joint discovery of object states and manipulation actions. In: Proc ICCV, pp 2127–2136

  4. Amac MS, Yagcioglu S, Erdem A, Erdem E (2019) Procedural reasoning networks for understanding multimodal procedures. In: Proc CoNLL, pp 441–451

  5. Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proc ACL workshop IEEMMTS, pp 65–72

  6. Bosselut A, Levy O, Holtzman A, Ennis C, Fox D, Choi Y (2018) Simulating action dynamics with neural process networks. In: Proc ICLR

  7. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Proc ECCV, pp 213–229

  8. Chen J, Ngo C-W (2016) Deep-based ingredient recognition for cooking recipe retrieval. In: Proc ACMMM, pp 32–41

  9. Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. In: Proc ACL, pp 2978–2988

  10. Dalvi B, Huang L, Tandon N, Yih W-t, Clark P (2018) Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension. In: Proc NAACL, pp 1595–1604

  11. Damen D, Doughty H, Farinella GM, Fidler S, Furnari A, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W, Wray M (2018) Scaling egocentric vision: The EPIC-KITCHENS dataset. In: Proc ECCV, pp 720–736

  12. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proc NAACL, pp 4171–4186

  13. Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proc CVPR, pp 2625–2634

  14. Escorcia V, Heilbron FC, Niebles JC, Ghanem B (2016) DAPs: deep action proposals for action understanding. In: Proc ECCV, pp 768–784

  15. Gupta A, Durrett G (2019) Tracking discrete and continuous entity state for process understanding. In: Proc NAACL workshop SPNLP, pp 7–12

  16. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proc CVPR, pp 770–778

  17. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proc ICML, pp 448–456

  18. Jang E, Gu S, Poole B (2017) Categorical reparameterization with Gumbel-Softmax. In: Proc ICLR

  19. Jermsurawong J, Habash N (2015) Predicting the structure of cooking recipes. In: Proc EMNLP, pp 781–786

  20. Kiddon C, Ponnuraj GT, Zettlemoyer L, Choi Y (2015) Mise en Place: unsupervised interpretation of instructional recipes. In: Proc EMNLP, pp 982–992

  21. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proc ICLR, USA

  22. Lei J, Wang L, Shen Y, Yu D, Berg T, Bansal M (2020) MART: memory-augmented recurrent transformer for coherent video paragraph captioning. In: Proc ACL, pp 2603–2614

  23. Lin C-Y, Och FJ (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proc ACL, pp 605–612

  24. Maeta H, Sasada T, Mori S (2015) A framework for procedural text understanding. In: Proc IWPT, pp 50–60

  25. Miech A, Alayrac J-B, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proc CVPR, pp 9879–9889

  26. Miech A, Zhukov D, Alayrac J-B, Tapaswi M, Laptev I, Sivic J (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: Proc ICCV, pp 2630–2640

  27. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proc NeurIPS, pp 3111–3119

  28. Mintz M, Bills S, Snow R, Jurafsky D (2009) Distant supervision for relation extraction without labeled data. In: Proc ACL-IJCNLP, pp 1003–1011

  29. Nishimura T, Hashimoto A, Ushiku Y, Kameko H, Mori S (2021) State-aware video procedural captioning. In: Proc ACMMM

  30. Nishimura T, Hashimoto A, Ushiku Y, Kameko H, Yamakata Y, Mori S (2020) Structure-aware procedural text generation from an image sequence. IEEE Access 9:2125–2141


  31. Nishimura T, Sakoda K, Hashimoto A, Ushiku Y, Tanaka N, Ono F, Kameko H, Mori S (2021) Egocentric biochemical video-and-language dataset. In: Proc CLVL, pp 3129–3133

  32. Pan L, Chen J, Wu J, Liu S, Ngo C-W, Kan M-Y, Jiang Y-G, Chua T-S (2020) Multi-modal cooking workflow construction for food recipes. In: Proc ACMMM, pp 1132–1141

  33. Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proc ACL, pp 311–318

  34. Park JS, Rohrbach M, Darrell T, Rohrbach A (2019) Adversarial inference for multi-sentence video description. In: Proc CVPR, pp 6598–6608

  35. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proc EMNLP, pp 1532–1543

  36. Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434

  37. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Proc NeurIPS, pp 91–99

  38. Salvador A, Hynes N, Aytar Y, Marin J, Ofli F, Weber I, Torralba A (2017) Learning cross-modal embeddings for cooking recipes and food images. In: Proc CVPR, pp 3020–3028

  39. Santoro A, Faulkner R, Raposo D, Rae J, Chrzanowski M, Weber T, Wierstra D, Vinyals O, Pascanu R, Lillicrap T (2019) Relational recurrent neural networks. In: Proc NeurIPS, pp 7299–7310

  40. See A, Liu PJ, Manning CD (2017) Get to the point: summarization with pointer-generator networks. In: Proc ACL, pp 1073–1083

  41. Shi B, Ji L, Liang Y, Duan N, Chen P, Niu Z, Zhou M (2019) Dense procedure captioning in narrated instructional videos. In: Proc ACL, pp 6382–6391

  42. Shi B, Ji L, Niu Z, Duan N, Zhou M, Chen X (2020) Learning semantic concepts and temporal alignment for narrated video procedural captioning. In: Proc ACMMM, pp 4355–4363

  43. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019) VideoBERT: a joint model for video and language representation learning. In: Proc ICCV, pp 7464–7473

  44. Tan G, Liu D, Wang M, Zha Z-J (2020) Learning to discretely compose reasoning module networks for video captioning. In: Proc IJCAI, pp 745–752

  45. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  46. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proc NeurIPS, pp 5998–6008

  47. Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proc CVPR, pp 4566–4575

  48. Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1:270–280


  49. Xiong Y, Dai B, Lin D (2018) Move forward and tell: a progressive generator of video descriptions. In: Proc ECCV, pp 489–505

  50. Yamakata Y, Mori S, Carroll J (2020) English recipe flow graph corpus. In: Proc LREC, pp 5187–5194

  51. Zamir N, Noy A, Friedman I, Protter M, Zelnik-Manor L (2020) Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119

  52. Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: Proc CVPR, pp 6578–6587

  53. Zhou L, Xu C, Corso JJ (2018) Towards automatic learning of procedures from web instructional videos. In: Proc AAAI, pp 7590–7598

  54. Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018) End-to-end dense video captioning with masked transformer. In: Proc CVPR, pp 8739–8748


Funding

This work was supported by JSPS KAKENHI Grant Numbers JP21J20250 and JP20H04210, and partially supported by JP21H04910, JP17H06100, JST-Mirai Program Grant Number JPMJMI21G2, and JST ACT-I Grant Number JPMJPR17U5.

Author information


Corresponding author

Correspondence to Taichi Nishimura.

Ethics declarations

Competing interests

All of the above are research grants from the Japanese government.

Conflict of Interests

All authors state that no financial/non-financial support has been received from any organization that may have an interest in this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Details of simulator

In this section, we describe the details of the visual simulator for the reproducibility of the proposed method. Given the encoded vectors of the clip sequence and material list \((\boldsymbol {{\mathscr{H}}}, \boldsymbol {\mathcal {E}}^{0})\), the visual simulator, shown in Fig. 3 in the main paper, recurrently reasons about the state transitions of materials at each step. Specifically, at the n-th step, given the n-th clip hn and the (n − 1)-th material list \(\boldsymbol {\mathcal {E}}^{n-1}\), the visual simulator predicts the executed actions and involved materials with (1) the action selector and (2) the material selector, and then updates the state of the materials with (3) the updater. After the n-th reasoning step, it outputs a state-aware step vector \(\boldsymbol {u}_{n} \in \mathbb {R}^{3 \times d}\), which concatenates the n-th clip hn, the selected action vector \(\bar {\boldsymbol {f}}_{n}\), and the selected material vector \(\bar {\boldsymbol {e}}_{n}\) (d denotes the dimension of these vectors). The visual simulator repeats this process recurrently until it has processed the last element of the clip sequence. For clarity, we explain the simulation process of the visual simulator at the n-th step.
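The following is a minimal PyTorch-style sketch of this recurrent loop. The submodule interfaces (action_selector, material_selector, updater), the tensor shapes, and the module name VisualSimulator are illustrative assumptions that match the component sketches in the subsections below; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VisualSimulator(nn.Module):
    """Recurrent reasoning over a clip sequence and a material list (sketch)."""
    def __init__(self, action_selector, material_selector, updater):
        super().__init__()
        self.action_selector = action_selector      # (1) which actions are executed
        self.material_selector = material_selector  # (2) which materials are involved
        self.updater = updater                      # (3) state update of the materials

    def forward(self, H, E0):
        # H: (N, d) encoded clip vectors, E0: (M, d) encoded raw materials
        E_prev, a_prev = E0, None
        step_vectors = []
        for n in range(H.size(0)):
            h_n = H[n]
            f_bar, w_p = self.action_selector(h_n)                         # action selector
            e_bar, a_n = self.material_selector(h_n, w_p, E_prev, a_prev)  # material selector
            E_prev = self.updater(f_bar, e_bar, a_n, E_prev)               # updater
            a_prev = a_n
            # state-aware step vector u_n = [h_n; f_bar; e_bar], a 3d-dimensional vector
            step_vectors.append(torch.cat([h_n, f_bar, e_bar], dim=-1))
        return torch.stack(step_vectors), E_prev
```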

Action selector

Given a clip vector hn, the action selector outputs the selected action vector \(\bar {\boldsymbol {f}}_{n}\) by choosing the actions executed in the clip from the predefined action embedding \(\boldsymbol {\mathcal {F}}\). For example, in Fig. 3 in the main paper, the actions “crack” and “stir” are executed in the clip, so both fcrack and fstir should be selected. To consider multiple actions, the action selector computes a soft selection wp as an action probability for each action in \(\boldsymbol {\mathcal {F}}\). It then outputs the selected action vector \(\boldsymbol {\bar {f}}_{n}\) as the sum of the action embedding \(\boldsymbol {\mathcal {F}}\) weighted by the action probability wp:

$$ \begin{array}{@{}rcl@{}} \boldsymbol{w}_{p} &=& \text{MLP}(\boldsymbol{h}_{n}) \end{array} $$
(A1)
$$ \begin{array}{@{}rcl@{}} \bar{\boldsymbol{w}}_{p} &=& \frac{\boldsymbol{w}_{p}}{{\sum}_{j}\boldsymbol{w}_{p}^{j}} \end{array} $$
(A2)
$$ \begin{array}{@{}rcl@{}} \bar{\boldsymbol{f}}_{n} &=& \bar{\boldsymbol{w}}_{p}^{T} \boldsymbol{\mathcal{F}}, \end{array} $$
(A3)

where MLP(⋅) represents a two-layer MLP with the sigmoid function, and \(\boldsymbol {w}_{p} \in \mathbb {R}^{\|\boldsymbol {\mathcal {F}}\|}\) is the attention distribution over the \(\|\boldsymbol {\mathcal {F}}\|\) possible actions (Footnote 8).
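A minimal sketch of (A1)–(A3) follows. The hidden size of the MLP and the ReLU between its two layers are assumptions (the paper specifies only a two-layer MLP with a sigmoid), and the module name ActionSelector is ours.

```python
import torch
import torch.nn as nn

class ActionSelector(nn.Module):
    """Soft selection of executed actions, (A1)-(A3) (sketch)."""
    def __init__(self, d, num_actions):
        super().__init__()
        # predefined action embedding F: one d-dimensional vector per action
        self.F = nn.Parameter(torch.randn(num_actions, d))
        # two-layer MLP with a sigmoid, as in (A1); hidden size d is an assumption
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, num_actions), nn.Sigmoid())

    def forward(self, h_n):
        w_p = self.mlp(h_n)                          # (A1) per-action probabilities
        w_bar = w_p / w_p.sum(dim=-1, keepdim=True)  # (A2) normalization
        f_bar = w_bar @ self.F                       # (A3) weighted sum over actions
        return f_bar, w_p
```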

Material selector

Based on the action probability wp and the clip vector hn, the material selector outputs the selected material vector \(\boldsymbol {\bar {e}}_{n}\) by choosing the materials involved in the clip from the material list \(\boldsymbol {\mathcal {E}}^{n-1}\). For example, in Fig. 3 in the main paper, the raw “cheese” and the manipulated “eggs” and “butter” should be selected. To handle such a combination of raw and manipulated materials, the material selector has two attention modules: (1) clip attention and (2) recurrent attention. A minimal code sketch of the full selector is given after the equations below.

  1. The clip attention chooses relevant materials from the clip vector hn and the action probability wp:

    $$ \begin{array}{@{}rcl@{}} \hat{\boldsymbol{\textit{h}}}_{n} &=& \text{ReLU}(\boldsymbol{W}_{1} \boldsymbol{\textit{h}}_{n} + \boldsymbol{b}_{1}) \end{array} $$
    (A4)
    $$ \begin{array}{@{}rcl@{}} \boldsymbol{d}_{m} &=& \sigma((\boldsymbol{\textit{e}}_{m}^{n-1})^{\textsf{T}} \boldsymbol{W}_{2} [\hat{\boldsymbol{\textit{h}}}_{n};\boldsymbol{w}_{p}]) \end{array} $$
    (A5)

    where W1 and W2 are linear and bilinear mappings, b1 and b2 are bias terms, and \(\boldsymbol {\textit {e}}_{m}^{n-1}\) and dm denote the m-th material vector and its attention weight, respectively.

  2. Recurrent attention selects materials based on information from both the current and previous clips. Using the result of the clip attention, it computes a soft selection an as a material probability for each material in the material list:

    $$ \begin{array}{@{}rcl@{}} \boldsymbol{c} &=& \text{softmax}(\boldsymbol{W}_{3} \hat{\boldsymbol{\textit{h}}}_{n} + \boldsymbol{b}_{3}) \end{array} $$
    (A6)
    $$ \begin{array}{@{}rcl@{}} \boldsymbol{a}_{m}^{n} &=& \boldsymbol{c}_{1}\boldsymbol{d}_{m} + \boldsymbol{c}_{2}\boldsymbol{a}_{m}^{n-1} + \boldsymbol{c}_{3} \boldsymbol{0} \end{array} $$
    (A7)

    where W3 is a linear mapping, \(\boldsymbol {c} \in \mathbb {R}^{3}\) is the choice distribution, \(\boldsymbol {a}_{m}^{n-1}\) is the attention weight of the previous clip for each material, \(\boldsymbol {a}_{m}^{n}\) is the final distribution for each material, and 0 is a vector of zeros (providing the option not to select any materials). Finally, using the calculated attention weights, the selected material vector \(\boldsymbol {\bar {e}}_{n}\) is computed as the normalized weighted sum of the selected materials.

    $$ \begin{array}{@{}rcl@{}} \boldsymbol{\alpha}_{m}^{n} &=& \frac{\boldsymbol{a}_{m}^{n}}{{\sum}_{j}\boldsymbol{a}_{j}^{n}} \end{array} $$
    (A8)
    $$ \begin{array}{@{}rcl@{}} \bar{\boldsymbol{\textit{e}}}_{n} &=& \sum\limits_{m}\boldsymbol{\alpha}_{m}^{n}\boldsymbol{e}_{m}^{n-1}. \end{array} $$
    (A9)
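The sketch below puts (A4)–(A9) together. The use of nn.Bilinear for W2, the zero initialization of the previous attention at the first step, and the module name MaterialSelector are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MaterialSelector(nn.Module):
    """Clip attention (A4)-(A5) followed by recurrent attention (A6)-(A9) (sketch)."""
    def __init__(self, d, num_actions):
        super().__init__()
        self.proj = nn.Linear(d, d)                         # W1, b1 in (A4)
        self.bilinear = nn.Bilinear(d, d + num_actions, 1)  # W2 in (A5)
        self.choice = nn.Linear(d, 3)                       # W3, b3 in (A6)

    def forward(self, h_n, w_p, E_prev, a_prev):
        # h_n: (d,), w_p: (num_actions,), E_prev: (M, d), a_prev: (M,) or None
        M = E_prev.size(0)
        h_hat = torch.relu(self.proj(h_n))                               # (A4)
        ctx = torch.cat([h_hat, w_p], dim=-1).unsqueeze(0).expand(M, -1)
        d_m = torch.sigmoid(self.bilinear(E_prev, ctx)).squeeze(-1)      # (A5)
        c = torch.softmax(self.choice(h_hat), dim=-1)                    # (A6)
        if a_prev is None:                  # first step: no previous attention
            a_prev = torch.zeros_like(d_m)
        a_n = c[0] * d_m + c[1] * a_prev + c[2] * 0.0                    # (A7)
        alpha = a_n / a_n.sum()                                          # (A8)
        e_bar = alpha @ E_prev                                           # (A9)
        return e_bar, a_n
```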

Updater

Based on the selected actions and materials, the updater represents the state transition of materials by computing a new material vector \(\hat {\boldsymbol {e}}_{m}\). To this end, it first calculates an action-aware proposal vector ln of materials with a bilinear transformation of the selected action and material vectors \((\boldsymbol {\bar {f}}_{n},\boldsymbol {\bar {e}}_{n})\):

$$ \boldsymbol{l}_{n} = \text{ReLU}(\bar{\boldsymbol{f}}_{n}\boldsymbol{W}_{4}\bar{\boldsymbol{\textit{e}}}_{n} + \boldsymbol{b}_{4}), $$
(A10)

where W4 is a bilinear mapping.

Then, based on the material probability \(\boldsymbol {a}_{m}^{n}\), it computes the new material vector \(\hat {\boldsymbol {e}}_{m}\) by interpolating between the action-aware proposal vector ln and the current material vector \(\boldsymbol {e}_{m}^{n-1}\):

$$ \hat{\boldsymbol{e}}_{m} = \boldsymbol{a}_{m}^{n}\boldsymbol{l}_{n} + (1 - \boldsymbol{a}_{m}^{n}) \boldsymbol{e}_{m}^{n-1}. $$
(A11)

The new m-th material vector \(\hat {\boldsymbol {e}}_{m}\) is assigned to \(\boldsymbol {\mathcal {E}}_{m}^{n}\), which is forwarded to the (n + 1)-th step.
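A minimal sketch of the updater, (A10)–(A11), follows. The output dimension of the bilinear layer (d) and the module name Updater are assumptions.

```python
import torch
import torch.nn as nn

class Updater(nn.Module):
    """Action-aware state update of the material list, (A10)-(A11) (sketch)."""
    def __init__(self, d):
        super().__init__()
        self.bilinear = nn.Bilinear(d, d, d)   # W4, b4 in (A10)

    def forward(self, f_bar, e_bar, a_n, E_prev):
        # f_bar, e_bar: (d,); a_n: (M,); E_prev: (M, d)
        l_n = torch.relu(self.bilinear(f_bar, e_bar))   # (A10) action-aware proposal
        a = a_n.unsqueeze(-1)                           # per-material gate a_m^n
        return a * l_n + (1.0 - a) * E_prev             # (A11) interpolation
```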

Appendix B: Detailed annotation process

We additionally annotated ingredients for the remaining 126 recipes of the YouCook2-ingredient dataset and built the YouCook2-ingredient+ dataset.

We increased the dataset size by obtaining the missing videos from the YouCook2 authors and hiring one annotator to annotate ingredients for these additional videos using the web tool shown in Fig. 10. This annotation tool presents a recipe, the corresponding video, and text boxes for writing ingredients. In this paper, ingredients are defined as raw materials that are necessary to complete the dish. For example, “tomato” and “cucumber” should be written as ingredients, whereas “salad” should not be written because it represents a mixture of ingredients.

Fig. 10 A screen of our browser-based web annotation tool. Annotators write the ingredients that appear in the recipes. To ease annotation, we preliminarily estimated ingredients using the NER method [1] pre-trained on the English recipe flow graph corpus [50] and set them as the default values of the inputs

To ease the annotation, “jump” buttons, which allow annotators to see the clip corresponding to a step, are implemented based on the start/end timestamps from the original YouCook2 dataset. Moreover, to help annotators write ingredients, the tool displays estimated ingredients using the named entity recognition (NER) model flair (Footnote 9) [1] pre-trained on the English recipe flow graph corpus (E-rFG corpus) [50] (Footnote 10). If the estimated words are not appropriate as ingredients, annotators can delete or rewrite them.

Appendix C: Baseline implementation details

As comparative models, we employed two state-of-the-art transformer-based video captioning models: Transformer-XL [9] and MART [22]. These models originally take no ingredient set as input and have no copy mechanism in their decoder; thus, for a fair comparison, we prepare additional +ingredient (-I) baseline models, which incorporate the material encoder (Section 3.2) and the copy mechanism into the baselines.

These models are based on the transformer, which encodes sequential inputs and decodes a sentence by attending to all of the elements in the input sequence. To fit this characteristic, we concatenate the encoded ingredient and video vectors and input them to the model, as shown in Fig. 11. When decoding, based on the decoder output ok and the ingredient vectors \(\boldsymbol {\mathcal {E}}^{0}\), the copy mechanism calculates a copying gate to make a soft choice between copying an ingredient from the ingredient set and generating a word from the vocabulary.

Fig. 11 An overview of our baseline +ingredient (-I) implementation. These models incorporate the material encoder described in Section 3.2 and the copy mechanism into the baselines
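The sketch below illustrates one way such a copying gate can be realized. The gate parameterization (a sigmoid over a linear projection of the decoder output ok), the dot-product attention over the ingredient vectors, and the module name CopyMechanism are plausible assumptions for illustration, not the exact baseline implementation.

```python
import torch
import torch.nn as nn

class CopyMechanism(nn.Module):
    """Soft choice between copying an ingredient and generating a vocabulary word (sketch)."""
    def __init__(self, d, vocab_size):
        super().__init__()
        self.gate = nn.Linear(d, 1)                 # copying gate computed from o_k
        self.vocab_proj = nn.Linear(d, vocab_size)  # generation head

    def forward(self, o_k, E0, ingr_to_vocab):
        # o_k: (d,) decoder output at step k; E0: (M, d) encoded ingredient vectors
        # ingr_to_vocab: (M,) vocabulary indices of the M ingredients
        p_copy = torch.sigmoid(self.gate(o_k))               # gate in (0, 1)
        p_gen = torch.softmax(self.vocab_proj(o_k), dim=-1)  # generation distribution
        copy_attn = torch.softmax(E0 @ o_k, dim=-1)          # attention over ingredients
        p_word = (1.0 - p_copy) * p_gen
        # add the copying probability mass onto the ingredients' vocabulary entries
        p_word = p_word.index_add(0, ingr_to_vocab, p_copy * copy_attn)
        return p_word
```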

Appendix D: Implementation and training details on full prediction settings

Here, we discuss the implementation and training details of the full prediction setting, where the material set is not given but is predicted from the video clips in advance. To address this, as described in Section 4.7, we added an ingredient decoder, a multi-label classifier, and trained the entire model in a multi-task learning manner.

Figure 12 shows how the ingredient decoder is integrated into the model. The ingredient decoder consists of a two-layer MLP with a sigmoid function and converts \(\hat {\boldsymbol {h}}\), a max-pooled vector of the clip vectors, into a q-dimensional probability vector over materials, where q indicates the number of unique ingredients appearing more than three times in the training set (we obtained q = 668 in the experiment). During training, we compute the ingredient decoder loss \({\mathscr{L}}_{ingr}\), an asymmetric loss [51] for the multi-label classification setting, and add it to the total loss defined in (4). Note that we adopt teacher forcing [48] to stabilize the training; while the models use the ground-truth ingredients for the downstream process in the training phase, they generate a recipe based on the predicted ingredients in the inference phase (we take the top k = 15 ingredients from the predicted probabilities). Another modification to the model is the removal of the copy mechanism, because we find that it degrades the captioning performance in this setting.

Fig. 12 An overview of how to integrate the ingredient decoder into the model
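A minimal sketch of the ingredient decoder and the top-k inference step follows. The hidden size of the MLP and the plain binary cross-entropy shown in the usage comments (standing in for the asymmetric loss [51]) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class IngredientDecoder(nn.Module):
    """Two-layer MLP with a sigmoid mapping a max-pooled clip vector to
    per-ingredient probabilities (sketch)."""
    def __init__(self, d, q=668):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, q), nn.Sigmoid())

    def forward(self, H):
        # H: (N, d) clip vectors; max-pool over the clip axis to obtain \hat{h}
        h_hat = H.max(dim=0).values
        return self.mlp(h_hat)  # (q,) ingredient probabilities

# Training (hypothetical shapes): multi-label target y in {0, 1}^q
# decoder = IngredientDecoder(d=768)
# probs = decoder(H)
# loss_ingr = nn.functional.binary_cross_entropy(probs, y)  # stand-in for the asymmetric loss
# Inference: take the top k = 15 ingredients from the predicted probabilities
# topk_ingredients = probs.topk(15).indices
```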

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nishimura, T., Hashimoto, A., Ushiku, Y. et al. State-aware video procedural captioning. Multimed Tools Appl 82, 37273–37301 (2023). https://doi.org/10.1007/s11042-023-14774-7

