RRT-MVS: Recurrent Regularization Transformer for Multi-View Stereo
Authors: Jianfei Jiang, Liyong Wang, Haochen Yu, Tianyu Hu, Jiansheng Chen, Huimin Ma*
Institute: University of Science and Technology Beijing
AAAI 2025
Learning-based multi-view stereo methods aim to predict depth maps for reconstructing dense point clouds. These methods rely on regularization to reduce redundancy in the cost volume. However, existing methods have limitations: CNN-based regularization is restricted to local receptive fields, while Transformer-based regularization struggles with handling depth discontinuities. These limitations often result in inaccurate depth maps with significant noise, particularly noticeable in the boundary and background regions. In this paper, we propose a Recurrent Regularization Transformer for Multi-View Stereo (RRT-MVS), which addresses these limitations by regularizing the cost volume separately for depth and spatial dimensions. Specifically, we introduce Recurrent Self-Attention (R-SA) to aggregate global matching costs within and across the cost maps and filter out noisy feature correlations. Additionally, we present Depth Residual Attention (DRA) to aggregate depth correlations within the cost volume and a Positional Adapter (PA) to enhance 3D positional awareness in each 2D cost map, further augmenting the effectiveness of R-SA. Experimental results demonstrate that RRT-MVS achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets. Notably, RRT-MVS ranks first on both the Tanks-and-Temples intermediate and advanced benchmarks among all published methods.
conda create -n rrtmvsnet python=3.10.8
conda activate rrtmvsnet
pip install -r requirements.txt
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 -f https://download.pytorch.org/whl/torch_stable.html
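To verify that PyTorch and CUDA are set up correctly (an optional sanity check, not part of the original setup steps), you can run:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"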
Training data. We use the same DTU training data as MVSNet and CasMVSNet. Download the DTU training data and Depths raw data, then unzip and organize them as:
dtu_training
├── Cameras
├── Depths
├── Depths_raw
└── Rectified
Testing data. Download the DTU testing data and unzip it as:
dtu_testing
├── scan1
├── scan4
├── ...
Download BlendedMVS and unzip it as:
blendedmvs
├── 5a0271884e62597cdee0d0eb
├── 5a3ca9cb270f0e3f14d0eddb
├── ...
├── training_list.txt
├── ...
Download the Tanks and Temples data processed by ET-MVSNet and MVSFormer++, and unzip it as:
tanksandtemples_1
├── advanced
│ ├── Auditorium
│ ├── ...
└── intermediate
├── Family
├── ...
To train the model on DTU, specify DTU_TRAINING in ./scripts/train_dtu.sh first, then run:
bash scripts/train_dtu.sh exp_name
After training, the model checkpoints will be saved in ./checkpoints/dtu.
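For reference, a hypothetical configuration inside ./scripts/train_dtu.sh might look like this (the path is a placeholder for the dtu_training folder prepared above):
# inside ./scripts/train_dtu.sh (illustrative only)
DTU_TRAINING="/path/to/dtu_training/"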
To fine-tune the model on BlendedMVS, specify BLD_TRAINING and BLD_CKPT_FILE in ./scripts/train_bld.sh first, then run:
bash scripts/train_bld.sh exp_name
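A hypothetical configuration inside ./scripts/train_bld.sh (paths and checkpoint name are placeholders):
# inside ./scripts/train_bld.sh (illustrative only)
BLD_TRAINING="/path/to/blendedmvs/"                              # BlendedMVS root prepared above
BLD_CKPT_FILE="./checkpoints/dtu/exp_name/<dtu_checkpoint>.ckpt" # DTU checkpoint to fine-tune from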
For DTU testing, we use the model trained on the DTU training set (pretrained model). Specify DTU_TESTPATH and DTU_CKPT_FILE in ./scripts/test_dtu.sh first, then run the following command to generate the point cloud results:
bash scripts/test_dtu.sh exp_name
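A hypothetical configuration inside ./scripts/test_dtu.sh (paths are placeholders):
# inside ./scripts/test_dtu.sh (illustrative only)
DTU_TESTPATH="/path/to/dtu_testing/"               # folder containing scan1, scan4, ...
DTU_CKPT_FILE="/path/to/dtu_pretrained_model.ckpt" # DTU-trained checkpoint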
For quantitative evaluation, download SampleSet and Points from DTU's website. Unzip them and place the Points folder in SampleSet/MVS Data/. The structure should look like:
SampleSet
├──MVS Data
└──Points
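If both archives were unzipped into the current directory, the Points folder can be moved into place with a single command (a minimal sketch; adjust the paths to wherever you unzipped the files):
mv Points "SampleSet/MVS Data/"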
Specify datapath, plyPath, and resultsPath in evaluations/dtu/BaseEvalMain_web.m, and datapath and resultsPath in evaluations/dtu/ComputeStat_web.m, then run the following commands to obtain the quantitative metrics.
cd evaluations/dtu
matlab -nodisplay
BaseEvalMain_web
ComputeStat_web
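Alternatively, both evaluation scripts can be run non-interactively in one MATLAB call (this one-liner is our suggestion, not part of the original instructions):
matlab -nodisplay -nosplash -r "BaseEvalMain_web; ComputeStat_web; quit"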
The MATLAB evaluation code is slow, so we recommend Fast DTU Evaluation Using GPU with Python for quick validation; the final results reported in the paper are obtained with the official MATLAB code.
We recommend using the fine-tuned model (pretrained model) to test on the Tanks and Temples benchmark. Similarly, specify TNT_TESTPATH and TNT_CKPT_FILE in scripts/test_tnt_inter.sh and scripts/test_tnt_adv.sh first. To generate the point cloud results, run:
bash scripts/test_tnt_inter.sh exp_name
bash scripts/test_tnt_adv.sh exp_name
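A hypothetical configuration for the two test scripts (paths are placeholders):
# inside scripts/test_tnt_inter.sh and scripts/test_tnt_adv.sh (illustrative only)
TNT_TESTPATH="/path/to/tanksandtemples_1/"               # root containing intermediate/ and advanced/
TNT_CKPT_FILE="/path/to/blendedmvs_finetuned_model.ckpt" # BlendedMVS fine-tuned checkpoint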
For quantitative evaluation, you can upload your point clouds to the Tanks and Temples benchmark.
Our results on the DTU and Tanks and Temples (T&T) datasets are listed in the tables below.
DTU | Acc. ↓ | Comp. ↓ | Overall ↓ |
---|---|---|---|
Ours | 0.309 | 0.261 | 0.285 |
T&T (Inter.) | Mean ↑ | Family | Francis | Horse | Lighthouse | M60 | Panther | Playground | Train |
---|---|---|---|---|---|---|---|---|---|
Ours | 68.16 | 82.54 | 72.31 | 61.44 | 69.89 | 65.35 | 68.88 | 64.45 | 60.48 |
T&T (Adv.) | Mean ↑ | Auditorium | Ballroom | Courtroom | Museum | Palace | Temple |
---|---|---|---|---|---|---|---|
Ours | 43.29 | 30.95 | 46.42 | 41.13 | 55.46 | 37.63 | 48.12 |
You can download the reconstructed point clouds for both DTU and T&T here.
If you find this work useful in your research, please consider citing the following paper:
@inproceedings{jiang2025rrt,
title={RRT-MVS: Recurrent Regularization Transformer for Multi-View Stereo},
author={Jiang, Jianfei and Wang, Liyong and Yu, Haochen and Hu, Tianyu and Chen, Jiansheng and Ma, Huimin},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={4},
pages={3994--4002},
year={2025}
}
Our work is partially based on the following open-source works:
We appreciate their contributions to the MVS community.