
Grounded Affordance from Exocentric View

International Journal of Computer Vision

Abstract

Affordance grounding aims to locate the regions of objects that afford "action possibilities", an essential step toward embodied intelligence. Because interactive affordance is diverse, i.e., different individuals' habits lead to different interactions with the same object, it is difficult to establish an explicit link between object parts and affordance labels. Humans, however, can transform varied exocentric interactions into invariant egocentric affordance, countering the impact of this interactive diversity. To empower an agent with such an ability, this paper proposes the task of affordance grounding from the exocentric view: given exocentric human-object interaction images and an egocentric object image, learn the affordance knowledge of the object and transfer it to the egocentric image using only the affordance label as supervision. However, there is an "interaction bias" between individuals, mainly regarding interaction regions and viewpoints. To this end, we devise a cross-view affordance knowledge transfer framework that extracts affordance-specific features from exocentric interactions and transfers them to the egocentric view. Furthermore, the perception of affordance regions is enhanced by preserving affordance co-relations. In addition, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images covering 36 affordance categories. Experimental results demonstrate that our method outperforms representative models in both objective metrics and visual quality. The code is available at github.com/lhc1224/Cross-View-AG.
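To make the weakly supervised setup concrete, the sketch below shows one way such a cross-view pipeline could be wired up: a shared backbone classifies both exocentric and egocentric images from the affordance label alone, an auxiliary term pulls egocentric features toward aggregated exocentric interaction features, and class activation maps provide the grounded regions at test time. This is a minimal, hypothetical illustration under stated assumptions, not the authors' Cross-View-AG architecture; the ResNet-18 backbone, the cosine alignment term, and all hyperparameters are assumptions for the sake of the example.

```python
# Minimal, hypothetical sketch of weakly supervised cross-view affordance
# grounding. This is NOT the authors' Cross-View-AG implementation: the
# ResNet-18 backbone, the cosine alignment term, and the loss weighting are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class CrossViewAffordanceNet(nn.Module):
    def __init__(self, num_affordances: int = 36):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep the convolutional trunk only; drop average pooling and the fc head.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # A 1x1 conv acts as a per-affordance classifier over spatial locations,
        # so its outputs double as class activation maps (CAMs).
        self.classifier = nn.Conv2d(512, num_affordances, kernel_size=1)

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)        # (B, 512, H/32, W/32)
        cams = self.classifier(feats)       # (B, num_affordances, H/32, W/32)
        logits = cams.mean(dim=(2, 3))      # global average pooling -> image-level logits
        return logits, cams, feats


def training_step(model, exo_imgs, ego_imgs, labels, align_weight=0.5):
    """One step using only affordance labels as supervision.

    exo_imgs: (B, N, 3, H, W) exocentric human-object interaction images
    ego_imgs: (B, 3, H, W)    egocentric object images
    labels:   (B,)            affordance class indices
    """
    b, n = exo_imgs.shape[:2]
    exo_logits, _, exo_feats = model(exo_imgs.flatten(0, 1))
    ego_logits, ego_cams, ego_feats = model(ego_imgs)

    # Image-level classification on both views (the weak supervision signal).
    cls_loss = F.cross_entropy(exo_logits, labels.repeat_interleave(n)) \
        + F.cross_entropy(ego_logits, labels)

    # Toy cross-view transfer: pull the egocentric feature toward the mean of
    # the exocentric interaction features of the same sample.
    exo_proto = exo_feats.mean(dim=(2, 3)).view(b, n, -1).mean(dim=1)
    ego_vec = ego_feats.mean(dim=(2, 3))
    align_loss = 1.0 - F.cosine_similarity(exo_proto, ego_vec).mean()

    return cls_loss + align_weight * align_loss


if __name__ == "__main__":
    model = CrossViewAffordanceNet(num_affordances=36)
    exo = torch.randn(2, 3, 3, 224, 224)    # 3 exocentric images per sample
    ego = torch.randn(2, 3, 224, 224)
    labels = torch.tensor([5, 17])
    loss = training_step(model, exo, ego, labels)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

At test time, a grounding heatmap for an egocentric image would be read off the CAM channel of the predicted (or given) affordance class, upsampled to image resolution and min-max normalized. The cosine alignment term here is only a stand-in for the affordance-specific feature extraction and affordance co-relation preservation described in the paper.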





Author information

Corresponding author

Correspondence to Yang Cao.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Luo, H., Zhai, W., Zhang, J. et al. Grounded Affordance from Exocentric View. Int J Comput Vis 132, 1945–1969 (2024). https://doi.org/10.1007/s11263-023-01962-z

