Abstract
A popular approach to scene flow estimation is to use point cloud data from LiDAR scans; by contrast, learning 3D motion from camera images has received comparatively little attention. Estimating scene flow from a monocular camera remains challenging due to the ill-posedness of the problem and the lack of annotated data. Self-supervised methods have demonstrated that scene flow can be learned from unlabeled data, yet their accuracy still lags behind (semi-)supervised methods. In this paper, we introduce a self-supervised monocular scene flow method that substantially improves accuracy over previous approaches. Based on RAFT, a state-of-the-art optical flow model, we design a new decoder that iteratively updates the 3D motion field and the disparity map simultaneously. Furthermore, we propose an enhanced upsampling layer and a disparity initialization technique, which together further improve accuracy by up to 7.2%. Our method achieves state-of-the-art accuracy among all self-supervised monocular scene flow methods, improving over the previous best by 34.2%. Our fine-tuned model outperforms the best previous semi-supervised method while running 228 times faster. Code will be publicly available to ensure reproducibility.
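To make the recurrent design concrete, below is a minimal PyTorch-style sketch of a joint iterative decoder of this kind. All names (ConvGRUCell, IterativeSceneFlowDecoder, corr_fn, the channel sizes) are hypothetical illustrations rather than the authors' implementation, and the correlation lookup and the upsampling layer are abstracted away:

```python
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Convolutional GRU cell, a common recurrent unit in RAFT-style decoders."""

    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                        # update gate
        r = torch.sigmoid(self.convr(hx))                        # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q


class IterativeSceneFlowDecoder(nn.Module):
    """Jointly refines a 3D motion field (3 channels) and a disparity map (1 channel)."""

    def __init__(self, hidden_dim=128, context_dim=128, corr_dim=196):
        super().__init__()
        # GRU input: correlation + context features + current estimates (3 + 1 channels).
        self.gru = ConvGRUCell(hidden_dim, corr_dim + context_dim + 4)
        self.motion_head = nn.Conv2d(hidden_dim, 3, 3, padding=1)  # residual 3D motion
        self.disp_head = nn.Conv2d(hidden_dim, 1, 3, padding=1)    # residual disparity

    def forward(self, h, context, corr_fn, motion, disp, iters=12):
        predictions = []
        for _ in range(iters):
            corr = corr_fn(motion, disp)  # look up matching costs at the current estimate
            x = torch.cat([corr, context, motion, disp], dim=1)
            h = self.gru(h, x)
            motion = motion + self.motion_head(h)  # residual update of the 3D motion field
            disp = disp + self.disp_head(h)        # residual update of the disparity map
            predictions.append((motion, disp))
        return predictions
```

As in RAFT, a loss (here, self-supervised photometric and consistency terms) can be applied to every intermediate prediction, encouraging the recurrent unit to behave as a learned optimizer over both fields.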
Data Availability
The datasets generated and/or analyzed during the current study are available in The KITTI Vision Benchmark Suite, https://www.cvlibs.net/datasets/kitti/.
Acknowledgements
This work was supported by NSFC, China (No. 62176155) and the Shanghai Municipal Science and Technology Major Project, China (2021SHZDZX0102).
About this article
Cite this article
Bayramli, B., Hur, J. & Lu, H. RAFT-MSF: Self-Supervised Monocular Scene Flow Using Recurrent Optimizer. Int J Comput Vis 131, 2757–2769 (2023). https://doi.org/10.1007/s11263-023-01828-4