From Scan to Action:
Leveraging Realistic Scans for Embodied Scene Understanding

Anna-Maria Halacheva, Jan-Nico Zaech, Sombit Dey, Luc Van Gool, Danda Pani Paudel

INSAIT, Sofia University “St. Kliment Ohridski”

Acknowledgments

This research was partially funded by the Ministry of Education and Science of Bulgaria (support for INSAIT, part of the Bulgarian National Roadmap for Research Infrastructure).
