
RAFT-MSF: Self-Supervised Monocular Scene Flow Using Recurrent Optimizer

Published in: International Journal of Computer Vision

Abstract

A popular approach to scene flow estimation is to use point cloud data from LiDAR scans, but little attention has been paid to learning 3D motion from camera images. Learning scene flow from a monocular camera remains challenging due to its ill-posedness and the lack of annotated data. Self-supervised methods can learn scene flow from unlabeled data, yet their accuracy lags behind that of (semi-)supervised methods. In this paper, we introduce a self-supervised monocular scene flow method that substantially improves accuracy over previous approaches. Building on RAFT, a state-of-the-art optical flow model, we design a new decoder that iteratively updates 3D motion fields and disparity maps simultaneously. Furthermore, we propose an enhanced upsampling layer and a disparity initialization technique, which together improve accuracy by up to 7.2%. Our method achieves state-of-the-art accuracy among self-supervised monocular scene flow methods, improving over the previous best by 34.2%. Our fine-tuned model outperforms the best previous semi-supervised method while running 228 times faster. Code will be made publicly available to ensure reproducibility.
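The abstract describes a RAFT-style recurrent decoder that jointly refines a 3D motion field and a disparity map over several update iterations. The following is a minimal, hypothetical sketch of that idea; the class name, layer sizes, and update heads are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RecurrentSceneFlowDecoder(nn.Module):
    """Hypothetical sketch of a RAFT-style recurrent optimizer that jointly
    refines a per-pixel 3D motion field and a disparity map. All names and
    dimensions here are illustrative, not the paper's design."""

    def __init__(self, hidden_dim=64, context_dim=64):
        super().__init__()
        # A GRU cell carries the optimizer state across iterations.
        # Input = context features + 3 motion channels + 1 disparity channel.
        self.gru = nn.GRUCell(context_dim + 4, hidden_dim)
        # Small heads predict residual updates from the hidden state.
        self.motion_head = nn.Linear(hidden_dim, 3)
        self.disp_head = nn.Linear(hidden_dim, 1)

    def forward(self, context, motion, disparity, iters=8):
        # context: (N, context_dim) per-pixel features
        # motion: (N, 3) 3D motion field, disparity: (N, 1)
        hidden = torch.zeros(motion.shape[0], self.gru.hidden_size)
        for _ in range(iters):
            inp = torch.cat([context, motion, disparity], dim=1)
            hidden = self.gru(inp, hidden)
            motion = motion + self.motion_head(hidden)      # residual motion update
            disparity = disparity + self.disp_head(hidden)  # residual disparity update
        return motion, disparity

# Toy usage on 5 "pixels"
decoder = RecurrentSceneFlowDecoder()
ctx = torch.randn(5, 64)
motion0 = torch.zeros(5, 3)
disp0 = torch.ones(5, 1)
motion, disp = decoder(ctx, motion0, disp0, iters=4)
print(motion.shape, disp.shape)  # torch.Size([5, 3]) torch.Size([5, 1])
```

The key design point mirrored here is that both quantities are refined by additive residuals from a shared recurrent state, so each iteration can trade off motion and geometry jointly rather than estimating them in separate passes.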


Figures 1–6 are available in the full article.

Data Availability

The datasets generated and/or analyzed during the current study are available from the KITTI Vision Benchmark Suite, https://www.cvlibs.net/datasets/kitti/.


Acknowledgements

This paper is supported by NSFC, China (No. 62176155) and Shanghai Municipal Science and Technology Major Project, China (2021SHZDZX0102).

Author information

Corresponding author

Correspondence to Hongtao Lu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bayramli, B., Hur, J. & Lu, H. RAFT-MSF: Self-Supervised Monocular Scene Flow Using Recurrent Optimizer. Int J Comput Vis 131, 2757–2769 (2023). https://doi.org/10.1007/s11263-023-01828-4
