
Indoor Obstacle Discovery on Reflective Ground via Monocular Camera

Published in: International Journal of Computer Vision

Abstract

Visual obstacle discovery is a key step towards the autonomous navigation of indoor mobile robots. Existing solutions succeed in many scenes, but the reflective ground remains a notable exception: reflections on the floor resemble the real world, which confuses obstacle discovery and causes navigation to fail. We argue that the key to this problem lies in obtaining features that discriminate reflections from obstacles. Note that obstacles and reflections are separated by the ground plane in 3D space. With this observation, we first introduce a pre-calibration-based ground detection scheme that uses robot motion to predict the ground plane. Because robot motion is immune to reflection, this scheme avoids the failed ground detection caused by reflection. Given the detected ground, we design a ground-pixel parallax to describe the location of a pixel relative to the ground. Based on this, a unified appearance-geometry feature representation is proposed to describe objects inside rectangular boxes. Finally, within a segmenting-by-detection framework, an appearance-geometry fusion regressor is designed to exploit the proposed feature and discover the obstacles; it also prevents our model from concentrating on parts of obstacles instead of whole obstacles. For evaluation, we introduce a new dataset, Obstacle on Reflective Ground (ORG), which comprises 15 scenes with various ground reflections and a total of more than 200 image sequences and 3,400 RGB images. Pixel-wise annotations of ground and obstacles enable comparison between our method and others. By reducing misdetections of reflections, the proposed approach outperforms the alternatives. The source code and the dataset are available at https://github.com/xuefeng-cvr/IndoorObstacleDiscovery-RG


Data Availability

The MATLAB implementation and the datasets generated during the current study are available in the GitHub repository: https://github.com/XuefengBUPT/IndoorObstacleDiscovery-RG.

References

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5), 898–916.


  • Bian, J. W., Zhan, H., Wang, N., Li, Z., Zhang, L., Shen, C., Cheng, M. M., & Reid, I. (2021). Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision (IJCV), 129(9), 2548–2564.


  • Broggi, A., Buzzoni, M., Felisa, M., & Zani, P. (2011). Stereo obstacle detection in challenging environments: The VIAC experience. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV).

  • Chen, T., Vemuri, B. C., Rangarajan, A., & Eisenschenk, S. J. (2009). Group-wise point-set registration using a novel CDF-based Havrda-Charvát divergence. International Journal of Computer Vision (IJCV), 86(1), 111.


  • Conrad, D., & DeSouza, G. N. (2010). Homography-based ground plane detection for mobile robot navigation using a modified em algorithm. In IEEE International Conference on Robotics and Automation (ICRA).

  • Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Now Publishers Inc.


  • Dong, Z., Xu, K., Yang, Y., Bao, H., Xu, W., & Lau, R. W. (2021). Location-aware single image reflection removal. In IEEE/CVF International Conference on Computer Vision (ICCV).

  • Dongfu, Z., & Zheng, C. The code of CamOdomCalibraTool. https://github.com/MegviiRobot/CamOdomCalibraTool

  • Ghodrati, A., Diba, A., Pedersoli, M., Tuytelaars, T., & Van Gool, L. (2017). DeepProposals: Hunting objects and actions by cascading deep convolutional layers. International Journal of Computer Vision (IJCV), 124(2), 115–131.


  • Gupta, K., Javed, S. A., Gandhi, V., & Krishna, K. M. (2018). MergeNet: A deep net architecture for small obstacle discovery. In IEEE International Conference on Robotics and Automation (ICRA).

  • Hartley, R., & Zisserman, A. (2003). Multiple View Geometry in Computer Vision. Cambridge University Press.


  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Heng, L., Bo, L., & Pollefeys, M. (2013). Camodocal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision (IJCV), 91(3), 328–346.


  • Hua, M., Nan, Y., & Lian, S. (2019). Small obstacle avoidance based on RGB-D semantic segmentation. In IEEE International Conference on Computer Vision Workshop (ICCVW).

  • Jia, J. (2007). Single image motion deblurring using transparency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Kalal, Z., Mikolajczyk, K., & Matas, J. (2010). Forward-backward error: Automatic detection of tracking failures. In IEEE International Conference on Pattern Recognition (ICPR).

  • Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).

  • Kumar, S., Karthik, M. S., & Krishna, K. M. (2014). Markov random field based small obstacle discovery over images. In IEEE International Conference on Robotics and Automation (ICRA).

  • Li, H., Liu, Y., Ouyang, W., & Wang, X. (2019). Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision (IJCV), 127(3), 225–238.


  • Lin, C., Jiang, S., Pu, Y.-J., & Song, K. (2010). Robust ground plane detection for obstacle avoidance of mobile robots using a monocular camera. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Lindeberg, T. (1998). Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision (IJCV), 30(2), 117–156.


  • Lis, K., Nakka, K. K., Fua, P., & Salzmann, M. (2019). Detecting the unexpected via image resynthesis. In IEEE/CVF International Conference on Computer Vision (ICCV).

  • Lu, R., Xue, F., Zhou, M., Ming, A., & Zhou, Y. (2019). Occlusion-shared and feature-separated network for occlusion relationship reasoning. In IEEE/CVF International Conference on Computer Vision (ICCV).

  • Lucas, B.D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI).

  • Ma, J., Ming, A., Huang, Z., Wang, X., & Zhou, Y. (2017). Object-level proposals. In IEEE International Conference on Computer Vision (ICCV).

  • Malis, E., & Vargas, M. (2007). Deeper understanding of the homography decomposition for vision-based control. Research Report RR-6303, INRIA.

  • Mancini, M., Costante, G., Valigi, P., & Ciarfuglia, T. A. (2018). J-MOD2: Joint monocular obstacle detection and depth estimation. IEEE Robotics and Automation Letters (RA-L), 3(3), 1490–1497.

  • Ming, A., Wu, T., Ma, J., Sun, F., & Zhou, Y. (2016). Monocular depth-ordering reasoning with occlusion edge detection and couple layers inference. IEEE Intelligent Systems (IS), 31, 54–65.

  • Ming, A., Xun, B., Ni, J., Gao, M., & Zhou, Y. (2015). Learning discriminative occlusion feature for depth ordering inference on monocular image. In IEEE International Conference on Image Processing (ICIP).

  • Nam, S., Brubaker, M. A., & Brown, M. S. (2022). Neural image representations for multi-image fusion and layer separation. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), European Conference on Computer Vision (ECCV).

  • Panahandeh, G., & Jansson, M. (2014). Vision-aided inertial navigation based on ground plane feature detection. IEEE/ASME Transactions on Mechatronics (TMECH), 19(4), 1206–1215.


  • Paszke, A., Chaurasia, A., Kim, S., & Culurciello, E. (2017). ENet: A deep neural network architecture for real-time semantic segmentation. In International Conference on Learning Representations (ICLR).

  • Pinggera, P., Ramos, S., Gehrig, S., Franke, U., Rother, C., & Mester, R. (2016). Lost and found: detecting small road hazards for self-driving vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Ramos, S., Gehrig, S., Pinggera, P., Franke, U., & Rother, C. (2017). Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling. In IEEE Intelligent Vehicles Symposium (IV).

  • Saxena, A., Chung, S. H., & Ng, A. Y. (2008). 3-d depth reconstruction from a single still image. International Journal of Computer Vision (IJCV), 76(1), 53–69.


  • Sharp, G., Lee, S., & Wehe, D. (2002). ICP registration using invariant features. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 24(1), 90–102.


  • Shelhamer, E., Long, J., & Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(4), 640–651.


  • Singh, A., Kamireddypalli, A., Gandhi, V., & Krishna, K. M. (2020). Lidar guided small obstacle segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Sun, L., Yang, K., Hu, X., Hu, W., & Wang, K. (2020). Real-time fusion network for RGB-D semantic segmentation incorporating unexpected obstacle detection for road-driving images. IEEE Robotics and Automation Letters (RA-L).

  • Xie, S., & Tu, Z. (2017). Holistically-nested edge detection. International Journal of Computer Vision (IJCV), 125(1), 3–18.


  • Xue, F., Cao, J., Zhou, Y., Sheng, F., Wang, Y., & Ming, A. (2021). Boundary-induced and scene-aggregated network for monocular depth prediction. Pattern Recognition (PR), 115, 107901.

  • Xue, F., Ming, A., Zhou, M., & Zhou, Y. (2019). A novel multi-layer framework for tiny obstacle discovery. In IEEE International Conference on Robotics and Automation (ICRA).

  • Xue, F., Ming, A., & Zhou, Y. (2020). Tiny obstacle discovery by occlusion-aware multilayer regression. IEEE Transactions on Image Processing (TIP), 29, 9373–9386.


  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018). BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision (ECCV).

  • Zhou, J., & Li, B. (2006a). Homography-based ground detection for a mobile robot platform using a single camera. In IEEE International Conference on Robotics and Automation (ICRA).

  • Zhou, J., & Li, B. (2006b). Robust ground plane detection with normalized homography in monocular sequences from a robot platform. In IEEE International Conference on Image Processing (ICIP).

  • Zhou, M., Ma, J., Ming, A., & Zhou, Y. (2018). Objectness-aware tracking via double-layer model. In IEEE International Conference on Image Processing (ICIP).

  • Zhou, Y., Bai, X., Liu, W., & Latecki, L. (2016). Similarity fusion for visual tracking. International Journal of Computer Vision (IJCV), 118(3), 337–363.



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62176098 and 61703049, the Natural Science Foundation of Hubei Province of China under Grant 2019CFA022, the National Key R&D Program Intergovernmental International Science and Technology Innovation Cooperation Project under Grant No. 2021YFE0101600, and the Beijing University of Posts and Telecommunications (BUPT) Excellent Ph.D. Students Foundation under Grant CX2020114.

Author information


Corresponding author

Correspondence to Yu Zhou.

Additional information

Communicated by Slobodan Ilic.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Principle of Ground-Pixel Parallax

In this section, we establish the correctness of Eq. 4 in the main manuscript and prove that this equation determines the relationship between an observed point and the ground.

Fig. 18 The ground-pixel parallax in two-view geometry

First of all, we give the proof of Eq. 4 in the main manuscript. Suppose \(I^t\) and \(I^{t-1}\) denote two consecutive images from the robot's view, \(\pi \) denotes the ground plane, and \(e^t, e^{t-1}\) denote the epipoles on the two images, i.e., the intersection points of the two image planes with the line connecting the two camera optical centers, shown as the yellow points in Fig. 18. \(x^t_i=\{u^t_i,v^t_i,1\}\) denotes an occlusion edge point of image \(I^t\) in homogeneous form, and it is also the 2D projection of a 3D point \(X'\) on image \(I^t\). A ray emitted from \(I^t\)’s optical center, denoted as \(\textbf{l}\), passes through \(X'\) and \(x^t_i\), and intersects the ground \(\pi \) at a 3D point \(X\). In addition, following the notation of the main manuscript, the geometric and appearance corresponding points are denoted as \(g^{t-1}_i\) and \(a^{t-1}_i\). According to the epipolar constraint, the 3D line \(\textbf{l}\) projects onto image \(I^{t-1}\) as a 2D line, denoted as \(\overrightarrow{g^{t-1}_i e^{t-1}}\). Since the 3D point \(X'\) lies on the 3D line \(\textbf{l}\), its projection on image \(I^{t-1}\), namely \(a^{t-1}_i\), lies on the projected 2D line \(\overrightarrow{g^{t-1}_i e^{t-1}}\). Hence, the 2D points \(a^{t-1}_i\), \(g^{t-1}_i\), and \(e^{t-1}\) are collinear. Based on this, the 2D point \(a^{t-1}_i\) can be written as:

$$\begin{aligned} a^{t-1}_i = g^{t-1}_i + \rho (g^{t-1}_i-e^{t-1}) \end{aligned}$$
(14)

where \(\rho \) is a scalar. This equation can be reformulated as Eq. 4 in the main manuscript.

Then, we discuss why Eq. 4 in the main manuscript can be used to distinguish points above the ground from points below it. Notably, since \(g^{t-1}_i\) uniquely represents the projection of the 3D ground point \(X\), \(g^{t-1}_i\) can be regarded as the dividing point: the points on either side of it along this 2D line fall into two spaces, above the ground and below the ground, as shown in Fig. 18. Hence, the space in which point \(x^{t}_i\) lies can be determined by comparing \(a^{t-1}_i\) and \(g^{t-1}_i\).
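To make this test concrete, below is a minimal NumPy sketch of the parallax computation implied by Eq. 14: \(\rho \) is recovered by projecting \(a^{t-1}_i-g^{t-1}_i\) onto the direction \(g^{t-1}_i-e^{t-1}\), and its sign indicates the side of the ground on which the observed point lies. The function names, the tolerance tau, and the mapping from the sign of \(\rho \) to "above" versus "below" are illustrative assumptions rather than the paper's exact implementation; \(g^{t-1}_i\) and \(a^{t-1}_i\) are assumed to be supplied by earlier stages of the pipeline.

```python
import numpy as np

def ground_pixel_parallax(a_prev, g_prev, e_prev, eps=1e-6):
    """Recover the scalar rho of Eq. 14,  a = g + rho * (g - e).

    a_prev : appearance corresponding point a_i^{t-1}, shape (2,), pixels
    g_prev : geometric corresponding point  g_i^{t-1}, shape (2,), pixels
    e_prev : epipole e^{t-1} of image I^{t-1},          shape (2,), pixels

    Because a, g and e are (ideally) collinear, rho is obtained by
    projecting the residual (a - g) onto the direction (g - e).
    """
    a, g, e = (np.asarray(p, dtype=float) for p in (a_prev, g_prev, e_prev))
    d = g - e                        # direction of the projected 2D line
    denom = float(d @ d)
    if denom < eps:                  # degenerate case: g coincides with the epipole
        return 0.0
    return float((a - g) @ d) / denom

def side_of_ground(a_prev, g_prev, e_prev, tau=0.05):
    """Illustrative three-way test: 'on' the ground, or one of its two sides.
    Which sign corresponds to 'above' depends on the camera geometry; tau is a
    hypothetical tolerance, not the value used in the paper."""
    rho = ground_pixel_parallax(a_prev, g_prev, e_prev)
    if abs(rho) <= tau:
        return "on"
    return "side+" if rho > 0 else "side-"
```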

Fig. 19 Example results of the single-frame reflection removal algorithm (Dong et al., 2021) and the multi-frame reflection removal algorithm (Nam et al., 2022) on multiple training scenes

Appendix B: Feasibility of Reflection Removal Methods

In scenes with reflective ground, it is intuitive to incorporate reflection removal approaches into the feature extraction stage. Therefore, we employ the state-of-the-art single-frame reflection removal algorithm (Dong et al., 2021) (ICCV 2021) and the multi-frame reflection removal algorithm (Nam et al., 2022) (ECCV 2022), both of which have publicly available code. Their results on our dataset are visualized in Fig. 19. Note that since the real-world content interferes with the reconstruction of multi-frame-based methods, we only conduct reflection removal on the bottom half of the image, i.e., the ground area.

As can be observed, both reflection removal methods are ineffective at removing reflections in our benchmark scenarios. In fact, they even damage the information of the obstacles that need to be detected. The single-frame-based method (Dong et al., 2021) produces confidence maps that fail to highlight the reflection, and its reflection-free images are almost identical to the original RGB images. Despite being shielded from the interference of the real-world content, the multi-frame-based method (Nam et al., 2022) still fails to eliminate reflections. The reason is that both algorithms require strong textures on the main object for adequate reconstruction, but the ground texture is too weak to be perceived. In contrast, the reflection has a stronger texture than the ground, making it appear as the main object. Overall, existing reflection removal algorithms cannot be directly applied to scenes with reflective ground, and they may even damage obstacle information.

Appendix C: Feasibility of Depth Sensors

In recent years, multi-modal sensors have become increasingly popular in autonomous driving. Thus, we evaluate the usability of depth sensors in reflective ground environments. To this end, we collect depth data of several reflective scenes with two classical sensors, i.e., a structured-light camera (Kinect v1, released in 2010, priced at $150) and a stereo camera (RealSense D455, released in 2019, priced at $249). Exemplar RGB-D data are visualized in Fig. 20. The depth data obtained by these cameras in reflective environments are of such low quality that they cannot be applied in reflective scenes. Specifically, the structured-light camera generates many void areas on the ground plane, while the stereo camera matches corresponding pixels between its two views erroneously and produces completely incorrect depth data. Clearly, both types of cameras are unsuitable for reflective ground scenes.

Fig. 20 RGB-D data captured by Kinect v1 and RealSense D455

Fig. 21 Capability tests of the laser sensor. a Side view of scanning the ground with a single-beam LiDAR. b Laser scans obtained at different angles

Furthermore, we conduct capability tests of a laser sensor using a single-beam 360-degree LiDAR, as illustrated in Fig. 21. The results show that the single-beam LiDAR is unaffected by reflections at all angles, which suggests that a 3D LiDAR can obtain reliable depth information in reflective scenes. Unfortunately, although 3D LiDARs largely avoid the issues caused by reflective ground, they are too expensive to deploy on a robot compared with other depth sensors.

Table 8 Formulation of each feature representing a bounding box

Appendix D: Detailed Formulation of Feature Vector

To clearly present the features used in this paper, Table 8 gives the formulation of each feature channel. Note that, according to Sect. 3.4, \(b_j^t\) denotes the j-th bounding box in the t-th image, and its feature vector is denoted as \(v_j^t\), which consists of 19 channels grouped into five categories. To simplify the notation, we use \(\textsf{b}\) to represent the bounding box \(b_j^t\), specified by its top-left pixel coordinates \(\left( \textsf{u},\textsf{v}\right) \) and its width and height \(\left( \textsf{w},\textsf{h}\right) \). In Table 8, the notation \(\check{\textsf{b}}\) refers to the inner ring of the bounding box \(\textsf{b}\), while \(\hat{\textsf{b}}\) represents the outer ring.

In the 2nd channel, the notation \(\left[ .\right] \) represents an indicator function that outputs 1 if its argument holds and 0 otherwise. In the 5th channel, \(\left( \textsf{W},\textsf{H}\right) \) denotes the width and height of the input image. For the 12th to 17th channels, the variables \(\mathcal {H}\), \(\mathcal {S}\), and \(\mathcal {V}\) correspond to the HSV channels of the input image, and \(\mathcal {H}\left( p\right) \) denotes the p-th pixel of channel \(\mathcal {H}\). Note that the color contrast formulation involves the normalized histogram of box \(\textsf{b}\)’s H channel, represented as \(hist_\textsf{b}^\mathcal {H}\), which is discretized into 18 bins denoted by \(\{h_k^\mathcal {H},k\in \left[ 1,18\right] \}\). The value of \(h_k^\mathcal {H}\) counts the pixels of box \(\textsf{b}\) that fall into the k-th bin, i.e., \(h_k^\mathcal {H}=\sum _{p\in \textsf{b}}\left[ \left\lfloor \frac{\mathcal {H}\left( p\right) }{360/18}\right\rfloor =k\right] \). Additionally, the histograms \(hist_\textsf{b}^\mathcal {S}\), \(hist_\textsf{b}^\mathcal {V}\), \(hist_{\hat{\textsf{b}}}^\mathcal {H}\), \(hist_{\hat{\textsf{b}}}^\mathcal {S}\), and \(hist_{\hat{\textsf{b}}}^\mathcal {V}\) are computed in the same way.
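As a concrete illustration of the color channels, the following is a minimal NumPy sketch of the 18-bin hue histogram defined above. The function names are hypothetical, and the chi-squared distance used as the contrast measure at the end is only one plausible way of comparing the histogram of box \(\textsf{b}\) with that of its outer ring \(\hat{\textsf{b}}\); it is not necessarily the exact formulation of Table 8.

```python
import numpy as np

def hue_histogram(hue, box, n_bins=18, hue_max=360.0):
    """Normalized hue histogram hist_b^H of a bounding box.

    hue : (H, W) array of hue values in [0, hue_max)
    box : (u, v, w, h) = top-left pixel coordinates plus width and height

    Implements  h_k = sum_{p in b} [ floor(H(p) / (hue_max / n_bins)) == k ],
    then normalizes so that the bins sum to 1 (bins are 0-indexed here).
    """
    u, v, w, h = box
    patch = hue[v:v + h, u:u + w]
    idx = np.floor(patch / (hue_max / n_bins)).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)              # guard against H(p) == hue_max
    hist = np.bincount(idx.ravel(), minlength=n_bins).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def color_contrast(hist_box, hist_ring, eps=1e-12):
    """Chi-squared distance between the histograms of box b and its outer
    ring -- an illustrative contrast measure, not Table 8's exact one."""
    num = (hist_box - hist_ring) ** 2
    den = hist_box + hist_ring + eps
    return 0.5 * float(np.sum(num / den))
```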

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xue, F., Chang, Y., Wang, T. et al. Indoor Obstacle Discovery on Reflective Ground via Monocular Camera. Int J Comput Vis 132, 987–1007 (2024). https://doi.org/10.1007/s11263-023-01925-4

