
Grounded Affordance from Exocentric View

International Journal of Computer Vision

Abstract

Affordance grounding aims to locate the regions of objects that afford "action possibilities", an essential step toward embodied intelligence. Because interactive affordance is diverse, i.e., different individuals' habits lead to different interactions with the same object, it is difficult to establish an explicit link between object parts and affordance labels. Humans, however, can transform varied exocentric interactions into invariant egocentric affordance, countering the impact of this interactive diversity. To empower an agent with such an ability, this paper proposes the task of affordance grounding from the exocentric view: given exocentric human-object interaction images and an egocentric object image, learn the affordance knowledge of the object and transfer it to the egocentric image using only the affordance label as supervision. However, there is an "interaction bias" between individuals, mainly regarding interaction regions and viewpoints. To this end, we devise a cross-view affordance knowledge transfer framework that extracts affordance-specific features from exocentric interactions and transfers them to the egocentric view. Furthermore, the perception of affordance regions is enhanced by preserving affordance co-relations. In addition, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images covering 36 affordance categories. Experimental results demonstrate that our method outperforms representative models in both objective metrics and visual quality. The code is available at github.com/lhc1224/Cross-View-AG.
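To make the weakly supervised setup concrete, the sketch below shows one way such a cross-view pipeline could be wired up: a shared backbone classifies both exocentric and egocentric images from the affordance label alone, an auxiliary term pulls egocentric features toward aggregated exocentric interaction features, and class activation maps provide the grounded regions at test time. This is a minimal, hypothetical illustration under stated assumptions, not the authors' Cross-View-AG architecture; the ResNet-18 backbone, the cosine alignment term, and all hyperparameters are assumptions for the sake of the example.

```python
# Minimal, hypothetical sketch of weakly supervised cross-view affordance
# grounding. This is NOT the authors' Cross-View-AG implementation: the
# ResNet-18 backbone, the cosine alignment term, and the loss weighting are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class CrossViewAffordanceNet(nn.Module):
    def __init__(self, num_affordances: int = 36):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep the convolutional trunk only; drop average pooling and the fc head.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # A 1x1 conv acts as a per-affordance classifier over spatial locations,
        # so its outputs double as class activation maps (CAMs).
        self.classifier = nn.Conv2d(512, num_affordances, kernel_size=1)

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)        # (B, 512, H/32, W/32)
        cams = self.classifier(feats)       # (B, num_affordances, H/32, W/32)
        logits = cams.mean(dim=(2, 3))      # global average pooling -> image-level logits
        return logits, cams, feats


def training_step(model, exo_imgs, ego_imgs, labels, align_weight=0.5):
    """One step using only affordance labels as supervision.

    exo_imgs: (B, N, 3, H, W) exocentric human-object interaction images
    ego_imgs: (B, 3, H, W)    egocentric object images
    labels:   (B,)            affordance class indices
    """
    b, n = exo_imgs.shape[:2]
    exo_logits, _, exo_feats = model(exo_imgs.flatten(0, 1))
    ego_logits, ego_cams, ego_feats = model(ego_imgs)

    # Image-level classification on both views (the weak supervision signal).
    cls_loss = F.cross_entropy(exo_logits, labels.repeat_interleave(n)) \
        + F.cross_entropy(ego_logits, labels)

    # Toy cross-view transfer: pull the egocentric feature toward the mean of
    # the exocentric interaction features of the same sample.
    exo_proto = exo_feats.mean(dim=(2, 3)).view(b, n, -1).mean(dim=1)
    ego_vec = ego_feats.mean(dim=(2, 3))
    align_loss = 1.0 - F.cosine_similarity(exo_proto, ego_vec).mean()

    return cls_loss + align_weight * align_loss


if __name__ == "__main__":
    model = CrossViewAffordanceNet(num_affordances=36)
    exo = torch.randn(2, 3, 3, 224, 224)    # 3 exocentric images per sample
    ego = torch.randn(2, 3, 224, 224)
    labels = torch.tensor([5, 17])
    loss = training_step(model, exo, ego, labels)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

At test time, a grounding heatmap for an egocentric image would be read off the CAM channel of the predicted (or given) affordance class, upsampled to image resolution and min-max normalized. The cosine alignment term here is only a stand-in for the affordance-specific feature extraction and affordance co-relation preservation described in the paper.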





Author information

Corresponding author

Correspondence to Yang Cao.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Luo, H., Zhai, W., Zhang, J. et al. Grounded Affordance from Exocentric View. Int J Comput Vis 132, 1945–1969 (2024). https://doi.org/10.1007/s11263-023-01962-z

