
CN116258817A - A method and system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction - Google Patents

A method and system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction

Info

Publication number
CN116258817A
CN116258817A (application CN202310123079.6A)
Authority
CN
China
Prior art keywords
image
scene
point cloud
dimensional
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310123079.6A
Other languages
Chinese (zh)
Other versions
CN116258817B (en)
Inventor
李涛
李睿航
潘之杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310123079.6A priority Critical patent/CN116258817B/en
Publication of CN116258817A publication Critical patent/CN116258817A/en
Application granted granted Critical
Publication of CN116258817B publication Critical patent/CN116258817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and system for constructing an autonomous driving digital twin scene based on multi-view three-dimensional (3D) reconstruction. The invention first provides a construction system comprising six modules: data acquisition and processing, camera pose estimation, multi-view 3D reconstruction, point cloud model scale correction, aerial-ground point cloud model fusion, and quantitative model accuracy evaluation. A multi-view 3D reconstruction method for autonomous driving scenes is also provided; it addresses shortcomings of the prior art such as insufficient feature extraction in texture-complex regions and sensitivity to noisy data, thereby improving the reconstruction accuracy of the 3D scene model while reducing GPU memory consumption. In addition, the invention provides a UAV-based scheme for automatically acquiring outdoor scene images, which efficiently and comprehensively collects the image data required by the multi-view 3D reconstruction method.

Description

A method and system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction

Technical Field

The invention relates to the fields of computer vision and autonomous driving technology, and in particular to a method and system for constructing an autonomous driving digital twin scene based on multi-view three-dimensional reconstruction.

Background

In recent years, limited by the perception range and capability of a single vehicle, autonomous driving has gradually evolved toward intelligent, connected systems with digital twins at their core. Autonomous driving digital twin technology builds high-precision traffic scenes in a three-dimensional virtual environment from sensor data captured in real scenes. The reconstructed 3D virtual scene not only provides complete, sufficient, and editable scenarios for autonomous driving simulation tests, but also supplies large amounts of driving data for optimizing autonomous driving algorithms and offers an efficient route to the automated production of high-definition maps. By mapping elements of real traffic scenes into virtual space, digital twin technology establishes a link between the physical and virtual worlds. In the virtual world, every stage of autonomous driving can be controlled across the whole process and all elements, and time and space can be switched freely according to defined logic and rules, so that key autonomous driving technologies can be studied at low cost and high efficiency, which in turn drives their development in the real world.

The essence of 3D virtual scene reconstruction for autonomous driving is perceiving and recovering the 3D geometric structure of the traffic scene. Compared with lidar scanning, vision-based methods analyze multi-view geometric relationships from 2D projections and offer low cost, convenient data acquisition, dense reconstructed models, and rich texture. The traditional Colmap multi-view 3D reconstruction algorithm is based on classical computational geometry: it measures photometric consistency between views with normalized cross-correlation and propagates depth with PatchMatch. Colmap is highly interpretable, but its depth map computation is slow, making it difficult to apply to large-scale outdoor scenes such as autonomous driving. In recent years, with the development of deep learning, deep neural networks have shown a strong ability to extract image features. Many studies feed multiple views and the corresponding camera poses into deep neural networks to achieve end-to-end multi-view stereo matching and depth map estimation. MVSNet introduces a differentiable homography to construct a cost volume, applies 3D convolutions with a multi-scale U-Net-like structure to the aggregated cost volume to obtain a smoothed probability volume, and estimates a depth map for each image accordingly. CVP-MVSNet refines depth map quality from coarse to fine with a pyramid structure, and Fast-MVSNet accelerates depth map estimation with a sparse cost volume construction. Compared with traditional Colmap, these algorithms greatly reduce depth map estimation time.

Although existing methods achieve good results on indoor scene datasets, they face major challenges in large-scale outdoor autonomous driving scenes, mainly including:

(1) Existing methods extract features insufficiently in structurally complex autonomous driving scenes, leading to inadequate reconstruction accuracy. Outdoor autonomous driving scenes cover a wide reconstruction range, have complex structure and large illumination changes, and contain both low-texture and texture-rich regions. Existing methods extract features with a fixed convolution kernel size and therefore overlook the rich scene features of texture-complex regions; moreover, they assign equal weights to all feature channels and fail to filter out noisy data, which degrades model reconstruction accuracy.

(2) Existing methods consume too much GPU memory when reconstructing large-scale outdoor autonomous driving scenes. In such scenes the distances between objects and the camera vary greatly, so the hypothesized depth range of the depth map must be set large; in addition, the image resolution is set large to obtain better reconstruction, which directly enlarges the constructed cost volume and causes the network model to occupy a large amount of GPU memory at inference time. For example, with an image width of 1200 pixels, a height of 800 pixels, and a hypothesized depth range of 1024, the inference stage of MVSNet occupies about 29 GB of GPU memory, which is difficult to run on an ordinary consumer-grade graphics card.

(3) The quality of the input multi-view images directly affects the reconstructed scene. Most datasets used by existing methods are collected around a single object or building, and this acquisition style is unsuitable for large-scale outdoor autonomous driving scenes. A dedicated outdoor image acquisition and data processing scheme for autonomous driving scenes is therefore needed.

Summary of the Invention

To solve the above problems in the prior art, the present invention provides a method and system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction. The feature extraction module of the proposed multi-view 3D reconstruction method further extracts features in texture-complex regions by varying the convolution kernel size and assigns different weights to different feature channels. The extracted feature maps are warped by homography into multiple feature volumes and aggregated into a single cost volume; the method slices the cost volume along the depth dimension and regularizes the slice sequence with a spatio-temporal recurrent neural network, which reduces the GPU memory required at inference time for large-scale outdoor scenes while preserving the correlation between slices. In addition, the invention provides a UAV-based image acquisition scheme for outdoor autonomous driving scenes that efficiently and comprehensively collects the image data required by the multi-view 3D reconstruction method. Finally, the invention proposes a system for constructing autonomous driving digital twin scenes based on multi-view 3D reconstruction, comprising six modules: data acquisition and processing, camera pose estimation, multi-view 3D reconstruction, point cloud model scale correction, aerial-ground point cloud model fusion, and quantitative model accuracy evaluation.

The purpose of the present invention is achieved by the following technical solution: a system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction, the system comprising:

a data acquisition and processing module (M100) for acquiring and preprocessing multi-view images of the autonomous driving scene and dividing the processed image data into several groups;

a camera pose estimation module (M200) that takes the acquired multi-view images as input and outputs the position and orientation of the camera for each image, thereby obtaining the sequence of camera intrinsic and extrinsic parameters;

a multi-view 3D reconstruction module (M300) for building the network model, extracting a sequence of feature maps from the multi-view images with the network model, constructing a cost volume from the feature maps and the camera parameter sequence, slicing the cost volume along the depth dimension, processing the slices to obtain a probability volume, estimating depth maps for the multi-view images from the probability volume, and finally fusing the depth maps into a dense 3D point cloud of the scene;

a point cloud model scale correction module (M400) that takes the three feature points obtained in module M100 and the side lengths of the triangle they form as input parameters, constructs a proportionally accurate triangle patch in virtual 3D space, locates the three corresponding feature points in the dense 3D point cloud produced by the multi-view 3D reconstruction module (M300), registers the three feature points of the point cloud model to the corresponding three vertices of the triangle patch, and applies a scale transformation to the dense 3D point cloud;

an aerial-ground point cloud model fusion module (M500) that treats the dense 3D point cloud reconstructed from the UAV images acquired by the data acquisition and processing module (M100) as the aerial point cloud model and the dense 3D point clouds reconstructed from the other image groups as ground point cloud models; this module registers the ground point cloud models to the aerial point cloud model to form the final autonomous driving digital twin scene model;

a quantitative model accuracy evaluation module (M600) for quantitatively evaluating the accuracy of the 3D model of the autonomous driving scene and verifying that it meets the requirements of subsequent autonomous driving tasks.

As a preferred solution of the present invention, the data acquisition and processing module (M100) acquires and processes data through the following steps:

S201: delineate the extent of the autonomous driving scene to be reconstructed;

S202: preset the data acquisition route, fly the UAV along the preset S-shaped route at a fixed altitude, and capture scene images at the shooting points;

S203: lower the UAV altitude and capture images around the buildings in the scene using a figure-of-eight flight pattern;

S204: for buildings beside the road and road sections completely occluded by trees, collect data with a handheld camera in an orbiting pattern;

S205: preprocess all collected images: crop each image to its central region to obtain a size of 3000 x 2250 pixels, then downsample it to 1600 x 1200 pixels (a sketch of this preprocessing follows this list of steps);

S206: divide the preprocessed images into several groups, where the images collected in steps S202 and S203 form the first group, and the images captured in step S204 for each building or road section form separate groups;

S207: in the real scene covered by each group of images, select the three most distinctive feature points and record their positions and the side lengths, measured to millimeter precision, of the triangle they form.
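A minimal sketch of the preprocessing in step S205 (center crop to 3000 x 2250, then downsample to 1600 x 1200), assuming Pillow is available and the source images are at least 3000 x 2250 pixels; function and path names are illustrative.

```python
from PIL import Image

def preprocess(path: str, out_path: str,
               crop_size=(3000, 2250), target_size=(1600, 1200)) -> None:
    """Center-crop an image to crop_size, then downsample to target_size."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = crop_size
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))   # keep the central region
    img = img.resize(target_size, Image.LANCZOS)       # downsample
    img.save(out_path)
```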

As a preferred solution of the present invention, the camera pose estimation module (M200) comprises a retrieval-and-matching unit and an incremental reconstruction unit. The retrieval-and-matching unit takes the multi-view images as input, finds geometrically verified image pairs with overlapping regions, and computes the projections of the same spatial point onto the two images of each pair. The incremental reconstruction unit outputs the position and orientation of the camera for each image.

More preferably, the incremental reconstruction unit outputs the camera position and orientation for each image as follows: first, an initial image pair is selected at a densely covered position of the multi-view images and registered; then the image with the largest number of registration points with respect to the already registered images is selected; the newly added view is registered against the set of images whose poses have been determined, and the pose of the camera that captured it is estimated by solving the PnP problem; next, for the spatial points covered by the newly registered image but not yet reconstructed, the image is triangulated and new spatial points are added to the reconstructed point set; finally, all currently estimated 3D points and camera poses are refined with one round of bundle adjustment.

The present invention also provides a scene construction method for the above autonomous driving digital twin scene construction system, comprising the following steps:

S501: use the data acquisition and processing module (M100) to perceive the autonomous driving scene from all directions and process the acquired multi-view image data;

S502: input the acquired multi-view images into the camera pose estimation module (M200), estimate the position and orientation of the camera for each image by retrieval-and-matching and incremental reconstruction, and obtain the camera intrinsic and extrinsic parameter sequence $\{K_i, [R_i\,|\,t_i]\}_{i=1}^{N}$;

S503: input the camera intrinsic and extrinsic parameter sequence $\{K_i, [R_i\,|\,t_i]\}_{i=1}^{N}$ and the image data acquired by the data acquisition and processing module (M100) into the network model built by the multi-view 3D reconstruction module (M300); use the network model to extract the feature map sequence $\{F_i\}_{i=1}^{N}$ of the image sequence $\{I_i\}_{i=1}^{N}$, construct a sequence of feature volumes from the camera parameter sequence and the feature map sequence, and aggregate them into a single cost volume; slice the cost volume along the depth dimension and process each slice together with its two adjacent slices in the network model to obtain a probability volume that describes, for every pixel, the probability distribution over depth;

S504: convert the valid depth values of the ground-truth depth map into a ground-truth volume by one-hot encoding, which serves as the label for supervised learning; feed the probability volume and the ground-truth volume into the initial network model and train it over multiple epochs to minimize the cross-entropy loss between the probability volume and the ground-truth volume, obtaining the trained network model;

S505: for the input multi-view image sequence, process each image with the trained network model to obtain a probability volume and convert it into a depth map; then filter and fuse the depth map sequence to obtain the reconstructed dense 3D point cloud of the scene;

S506: use the point cloud model scale correction module (M400) to construct a proportionally accurate triangle patch in virtual 3D space, register three feature points of the reconstructed dense 3D point cloud to the corresponding vertices of the triangle patch, and apply a scale transformation to the dense point cloud of the scene;

S507: use the aerial-ground point cloud model fusion module (M500) to divide the dense 3D point clouds into an aerial point cloud model and ground point cloud models; register the ground point cloud models to the aerial point cloud model to form the final autonomous driving digital twin scene model; and quantitatively evaluate the accuracy of the reconstructed 3D scene model to ensure that it meets the requirements of subsequent autonomous driving tasks.

As a preferred solution of the present invention, the feature map sequence $\{F_i\}_{i=1}^{N}$ in step S503 is obtained as follows: the network model learns offsets for the convolution kernel sampling directions, so that the kernel adapts to regions with different texture structures and extracts finer features; next, the feature maps at different scales are upsampled to the original input image size and concatenated into a single 32-channel feature map; then the two-dimensional information $u_c(i,j)$ of each feature channel is compressed into a one-dimensional real number $z_c$, which passes through two fully connected layers; finally, a sigmoid function limits each real number $z_c$ to the range [0, 1], so that every channel of the feature map carries a different weight, suppressing noisy data and irrelevant features during matching. These steps are repeated for every input image to obtain the feature map sequence $\{F_i\}_{i=1}^{N}$.
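A minimal sketch, in PyTorch, of the channel-weighting step described above (squeeze each channel to a scalar, two fully connected layers, sigmoid). The reduction ratio and module name are assumptions, and the learned-offset part of the feature extractor is not shown here.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Squeeze-and-excitation style channel reweighting as described above:
    global pooling -> two fully connected layers -> sigmoid gate."""
    def __init__(self, channels: int = 32, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # first FC (W1)
        self.fc2 = nn.Linear(channels // reduction, channels)  # second FC (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); squeeze each channel u_c(i, j) to a scalar z_c
        z = x.mean(dim=(2, 3))                                   # (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))     # weights in [0, 1]
        return x * s.view(*s.shape, 1, 1)                        # rescale channels

# Example: reweight a 32-channel feature map
feat = torch.randn(1, 32, 1200, 1600)
weighted = ChannelWeighting(32)(feat)
```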

As a preferred solution of the present invention, the cost volume in step S503 is aggregated as follows: one image of the feature map sequence is selected as the reference feature map $F_1$, and the remaining feature maps serve as source feature maps $\{F_i\}_{i=2}^{N}$; according to the camera intrinsic and extrinsic parameter sequence, all feature maps are projected by homography onto several parallel planes in front of the reference image, forming $N-1$ feature volumes $\{V_i\}_{i=2}^{N}$; finally, the feature volumes are aggregated into a single cost volume by an aggregation function.
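A minimal sketch of aggregating warped feature volumes into a single cost volume. The text only states that an aggregation function is used; the variance-style metric below follows common multi-view-stereo practice and is an assumption, not necessarily the patent's exact function.

```python
import torch

def aggregate_cost_volume(feature_volumes: torch.Tensor) -> torch.Tensor:
    """feature_volumes: (N, F, D, H, W), the N warped feature volumes.
    Returns one cost volume (F, D, H, W) measuring how much the views
    disagree at every (channel, depth, pixel) hypothesis."""
    mean = feature_volumes.mean(dim=0, keepdim=True)
    return ((feature_volumes - mean) ** 2).mean(dim=0)

# Toy example: 5 views, 32 channels, 64 depth planes, 30 x 40 pixels
vols = torch.randn(5, 32, 64, 30, 40)
cost = aggregate_cost_volume(vols)   # shape (32, 64, 30, 40)
```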

As a preferred solution of the present invention, the probability volume in step S503 is formed as follows: the cost volume is cut into D slices, where D is the depth prior, meaning that the depth value may take any value in 0 to D; the slice sequence is then treated as a time series and fed into the spatio-temporal recurrent neural network of the network model for regularization. The network uses ST-LSTM units to propagate memory states along both the temporal (horizontal) and spatial (vertical) directions, preserving the relationship between slices and reducing multi-peak artifacts in the probability volume. In the horizontal direction, the first-layer unit at each time step receives the hidden and memory states of the last-layer unit at the previous time step, and the states are passed layer by layer in the vertical direction. Finally, a softmax normalization outputs the probability of each pixel at every depth d in [0, D], forming the probability volume.

As a preferred solution of the present invention, step S505 is specifically: apply an argmax operation to the probability volume inferred by the trained network model for each image to obtain the depth map sequence, filter out low-confidence depth maps according to the photometric consistency and geometric consistency criteria, and finally fuse the depth maps into a dense 3D point cloud via $P = d\,M^{-1}K^{-1}p$, where $p$ is the pixel coordinate, $d$ is the depth value inferred by the network model, $K$ and $M$ are the camera intrinsic and extrinsic matrices, and $P$ is the 3D coordinate in the world frame.
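A minimal NumPy sketch of the back-projection $P = d\,M^{-1}K^{-1}p$ used for depth map fusion, assuming $M$ is a 4x4 world-to-camera extrinsic matrix and pixels are taken in homogeneous coordinates; variable names are illustrative.

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Lift a depth map (H, W) to world-frame 3D points (H*W, 3).
    K: 3x3 intrinsics, M: 4x4 world-to-camera extrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x (H*W)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                # camera frame
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])               # homogeneous
    world = (np.linalg.inv(M) @ cam_h)[:3]                             # world frame
    return world.T
```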

Compared with the prior art, the present invention has the following beneficial technical effects:

(1) The present invention proposes a multi-view 3D reconstruction method for large-scale outdoor autonomous driving scenes. The method learns offsets of the convolution kernel sampling directions through a sub-network, so that different image regions are processed with effectively different kernel sizes and receptive fields, which adapts better to texture-complex regions. By assigning different weights to different channels of the extracted feature maps, the method strengthens the features that matter for matching, suppresses noisy data and irrelevant features, and improves the accuracy and robustness of feature extraction. Targeting the coexistence of complex structure, low-texture, and texture-rich regions in outdoor autonomous driving scenes, the invention proposes a new feature extraction module that overcomes the insufficient feature extraction in texture-complex regions and the sensitivity to noisy data of the prior art, thereby improving the reconstruction accuracy of the autonomous driving scene model.

(2) The proposed multi-view 3D reconstruction method slices the cost volume along the depth dimension and feeds the slice sequence into a spatio-temporal recurrent neural network, so that the GPU memory required at inference time is independent of the hypothesized depth range d, and the network can therefore reconstruct large-scale outdoor autonomous driving scenes with a wide depth range. By using ST-LSTM units to propagate memory states in the horizontal and vertical directions, the invention reduces GPU memory consumption while preserving the relationship between slices, reduces the multi-peak artifacts that occur in the probability volumes of prior methods, and improves the accuracy of depth prediction.

(3) The present invention proposes an image data collection and processing scheme for autonomous driving scenes. The scheme presets the UAV flight route and shooting points by computing the image overlap rate, and combines S-shaped and figure-of-eight flight patterns to guarantee the quality of the collected multi-view images. In addition, to handle lanes occluded by trees in the aerial view, the scheme also collects image data near the ground with a handheld camera and groups the captured images, providing a new way to acquire and process outdoor image data for autonomous driving scenes effectively and comprehensively.

(4) The present invention proposes a system for constructing autonomous driving digital twin scenes based on multi-view 3D reconstruction. The system covers the entire process of reconstructing and evaluating an autonomous driving digital twin scene, including data acquisition and processing, camera pose estimation, multi-view 3D reconstruction, point cloud model scale correction, aerial-ground point cloud model fusion, and quantitative model accuracy evaluation. It can efficiently and comprehensively collect image data of outdoor autonomous driving scenes and estimate the positions and orientations of the cameras that captured them. Moreover, the system reconstructs the 3D model of the autonomous driving scene with a smaller GPU memory footprint and stronger feature extraction in texture-complex regions, improves the completeness of the scene model through aerial-ground point cloud fusion, and provides a quantitative method for evaluating the accuracy of the reconstructed scene model.

Brief Description of the Drawings

Fig. 1 shows the deep neural network model structure of the proposed multi-view 3D reconstruction method for large-scale outdoor autonomous driving scenes.

Fig. 2 shows the feature extraction module of Fig. 1.

Fig. 3 is a flow chart of the proposed data acquisition method for autonomous driving scenes.

Fig. 4 is a schematic diagram of the proposed data acquisition method for autonomous driving scenes; S202, S203, and S204 in the figure correspond to the respective steps of the embodiment.

Fig. 5 is a schematic diagram of the image data processing and grouping in steps S205 and S206 of the embodiment.

Fig. 6 is the module structure diagram of the proposed autonomous driving digital twin scene construction system.

Fig. 7 is a schematic diagram of adding new spatial points by triangulation in the camera pose estimation module M200 of the system.

Fig. 8 is a schematic diagram of the spatial triangle patch constructed in the point cloud model scale correction module M400 of the system.

Fig. 9 shows the model fusion result of the aerial-ground point cloud model fusion module M500 of the system.

Fig. 10 shows the quantitative accuracy evaluation results of the model accuracy evaluation module M600 of the system.

Fig. 11 is a rendering of a scene reconstructed by the proposed autonomous driving digital twin scene construction system.

Fig. 12 is another rendering of a scene reconstructed by the proposed autonomous driving digital twin scene construction system.

Detailed Description

The present invention is further described below with reference to specific embodiments. The embodiments are merely illustrative of the disclosure and do not limit its scope. The technical features of the various embodiments may be combined with one another provided they do not conflict.

The structure of the proposed autonomous driving digital twin scene construction system based on multi-view 3D reconstruction is shown in Fig. 6 and consists of six modules. In this embodiment the autonomous driving digital twin scene is constructed with these six modules:

Data acquisition and processing module (M100): acquires and preprocesses multi-view images of the autonomous driving scene and divides the processed image data into several groups.

Camera pose estimation module M200: in this embodiment, the images produced by module M100 are fed into this module, and the poses of the cameras that captured them are obtained through the retrieval-and-matching unit M201 and the incremental reconstruction unit M202. Specifically, the images are first input into the retrieval-and-matching unit, which detects feature points in every view and extracts their SIFT descriptors, then randomly selects a set of candidate matching image pairs $C=\{(I_a,I_b)\mid I_a,I_b\in I,\ a<b\}$ and computes the correspondence matrix $M_{ab}\in F_a\times F_b$ between the features of each pair. The eight-point algorithm is used to estimate the fundamental matrix, and RANSAC removes the correspondences judged to be outliers during this estimation, yielding the scene graph. The scene graph is then fed into the incremental reconstruction unit: this embodiment selects the image with the largest number of registration points with respect to the currently registered images and registers the newly added view against the set of images whose poses have already been determined. Optionally, this embodiment solves the PnP problem with the DLT algorithm: the world coordinates of n points and their pixel coordinates in the image give 2n constraint equations, and the overdetermined system is solved with SVD to estimate the camera pose matrix. As shown in Fig. 7, since the same spatial point appears as a projection in several views, this embodiment finds the unreconstructed spatial points covered by the newly registered image, triangulates them, and adds them to the set of reconstructed spatial points. Finally, one round of bundle adjustment is applied to all currently estimated 3D points and camera poses; specifically, the camera poses and estimated scene point positions are adjusted so as to minimize the reprojection error of Formula 1.

$$E(R, t, P) = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\left\lVert e_{ij}\right\rVert^{2} \qquad (1)$$

where $P_i$ is the i-th 3D point in space, m is the number of 3D points, n is the number of cameras, R and t are the rotation and translation matrices of the n cameras, and $e_{ij}=\pi(P_i,R_j,t_j)-p_{ij}$ is the projection error of the i-th 3D point in the j-th camera.
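A minimal sketch of the reprojection residual $e_{ij}$ minimized by the bundle adjustment step, assuming a pinhole projection with intrinsics $K$; in practice such residuals would be handed to a nonlinear least-squares solver. All names are illustrative.

```python
import numpy as np

def project(P: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection pi(P, R, t): world point -> pixel coordinates."""
    Pc = R @ P + t                     # transform into the camera frame
    return (K @ Pc)[:2] / Pc[2]        # perspective division

def reprojection_error(P, R, t, K, p_obs):
    """e_ij = pi(P_i, R_j, t_j) - p_ij for one point/camera pair."""
    return project(P, R, t, K) - p_obs

# Bundle adjustment sums ||e_ij||^2 over all visible point/camera pairs and
# minimizes it over the poses (R_j, t_j) and the structure P_i.
```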

Multi-view 3D reconstruction module M300: builds the network model, extracts the feature map sequence of the multi-view images with the network model, constructs the cost volume together with the camera parameter sequence, slices the cost volume along the depth dimension, processes the slices to obtain the probability volume, estimates the depth maps of the multi-view images from the probability volume, and finally fuses the depth maps into a dense 3D point cloud of the scene.

Point cloud model scale correction module M400: as shown in Fig. 8, this module takes the three feature points obtained in module M100 and the side lengths of the triangle they form as input parameters, constructs a proportionally accurate triangle patch in virtual 3D space, locates the three corresponding feature points in the dense 3D point cloud produced by the multi-view 3D reconstruction module (M300), registers the three feature points of the point cloud model to the corresponding three vertices of the triangle patch, and applies a scale transformation to the dense 3D point cloud. In this embodiment, the three feature points obtained in module M100 and the side lengths of their triangle are used to construct the proportionally accurate triangle patch R0R1T2 in virtual 3D space. The embodiment then locates the three corresponding feature points in the reconstructed 3D point cloud of the scene, registers them simultaneously to the corresponding vertices of the triangle patch, and rescales the reconstructed point cloud. Specifically, in this embodiment the scale of the reconstructed 3D point cloud model is enlarged by a factor of 1.1.
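A minimal sketch of deriving a scale factor from the measured triangle side lengths and the corresponding points located in the reconstructed cloud; averaging the three per-edge ratios and scaling about the centroid are assumptions, not the patent's exact procedure.

```python
import numpy as np

def scale_factor(measured_sides, cloud_pts) -> float:
    """measured_sides: real-world lengths (|P0P1|, |P1P2|, |P2P0|) of the triangle
    edges; cloud_pts: 3x3 array of the corresponding feature points found in the
    reconstructed point cloud."""
    P0, P1, P2 = cloud_pts
    recon = np.array([np.linalg.norm(P1 - P0),
                      np.linalg.norm(P2 - P1),
                      np.linalg.norm(P0 - P2)])
    return float(np.mean(np.asarray(measured_sides) / recon))

def rescale(cloud: np.ndarray, s: float) -> np.ndarray:
    """Scale every point of the cloud about its centroid by factor s."""
    c = cloud.mean(axis=0)
    return c + s * (cloud - c)
```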

Aerial-ground point cloud model fusion module M500: treats the dense 3D point cloud reconstructed from the UAV images acquired by the data acquisition and processing module (M100) as the aerial point cloud model and the dense 3D point clouds reconstructed from the other image groups as ground point cloud models, and registers the ground point cloud models to the aerial point cloud model to form the final autonomous driving digital twin scene model. In this embodiment, three feature points that also exist in the aerial point cloud model are identified in each of the two ground point cloud models; these feature points lie on building edges and at the intersections of three facades. Based on the matched feature points, the embodiment applies a translation and rotation to each ground point cloud model and fuses it into the aerial point cloud. The fusion result is shown in Fig. 9, where the darker parts (the buildings and the tree-occluded road at the bottom of the figure) are the ground point cloud models before fusion.
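A minimal sketch of estimating the rotation and translation that align the matched feature points of a ground point cloud to their counterparts in the aerial point cloud; the SVD-based (Kabsch) solution shown here is one common choice and an assumption, not necessarily the patent's exact procedure.

```python
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Find R, t with dst ~= R @ src + t from matched points (Nx3, N >= 3)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def apply(cloud: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Register a ground cloud into the aerial frame."""
    return cloud @ R.T + t
```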

Model accuracy quantitative evaluation module M600: quantitatively evaluates the accuracy of the 3D model of the autonomous driving scene and verifies that it meets the requirements of subsequent autonomous driving tasks. As shown in Fig. 10, this embodiment uses the positions and distances of 40 lane-related feature point pairs recorded by module M100, finds the corresponding feature point pairs in the reconstructed 3D point cloud of the scene, measures the distance of each corresponding pair, and compares the distance of each pair in the virtual 3D point cloud model with the distance in the real scene in terms of absolute error and error percentage. The evaluation results are shown in Fig. 10: over the 40 feature point pairs, the maximum absolute error of the constructed digital twin point cloud model is 6.08 cm, the mean absolute error is 2.5 cm, the maximum percentage error is 2.361%, and the mean percentage error is 0.549%, where the percentage is the ratio of the error to the point-pair distance in the real scene. The autonomous driving digital twin scene constructed in this embodiment is shown in Figs. 11 and 12.
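A minimal sketch of the point-pair distance evaluation described above, assuming the real and reconstructed distances of the feature point pairs are available as arrays; names are illustrative.

```python
import numpy as np

def evaluate(real_dist: np.ndarray, recon_dist: np.ndarray) -> dict:
    """real_dist / recon_dist: distances of the same feature point pairs
    measured in the real scene and in the reconstructed model (same units)."""
    abs_err = np.abs(recon_dist - real_dist)
    pct_err = 100.0 * abs_err / real_dist     # error relative to the real distance
    return {
        "max_abs_error": float(abs_err.max()),
        "mean_abs_error": float(abs_err.mean()),
        "max_pct_error": float(pct_err.max()),
        "mean_pct_error": float(pct_err.mean()),
    }
```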

According to the multi-view 3D reconstruction method for large-scale outdoor autonomous driving scenes provided by the first aspect of the present invention, this embodiment comprises the following steps:

S1: use the data acquisition and processing module (M100) to perceive the autonomous driving scene from all directions and process the acquired multi-view image data;

S2: input the acquired multi-view images into the camera pose estimation module (M200), estimate the position and orientation of the camera for each image by retrieval-and-matching and incremental reconstruction, and obtain the camera intrinsic and extrinsic parameter sequence $\{K_i, [R_i\,|\,t_i]\}_{i=1}^{N}$;

S3: input the camera intrinsic and extrinsic parameter sequence $\{K_i, [R_i\,|\,t_i]\}_{i=1}^{N}$ and the image data acquired by the data acquisition and processing module (M100) into the network model built by the multi-view 3D reconstruction module (M300); use the network model to extract the feature map sequence $\{F_i\}_{i=1}^{N}$ of the image sequence $\{I_i\}_{i=1}^{N}$, construct a sequence of feature volumes from the camera parameter sequence and the feature map sequence, and aggregate them into a single cost volume; slice the cost volume along the depth dimension and process each slice together with its two adjacent slices in the network model to obtain a probability volume that describes, for every pixel, the probability distribution over depth;

S4: convert the valid depth values of the ground-truth depth map into a ground-truth volume by one-hot encoding, which serves as the label for supervised learning; feed the probability volume and the ground-truth volume into the initial network model and train it over multiple epochs to minimize the cross-entropy loss between the probability volume and the ground-truth volume, obtaining the trained network model;

S5: for the input multi-view image sequence, process each image with the trained network model to obtain a probability volume and convert it into a depth map; then filter and fuse the depth map sequence to obtain the reconstructed dense 3D point cloud of the scene;

S6: use the point cloud model scale correction module (M400) to construct a proportionally accurate triangle patch in virtual 3D space, register three feature points of the reconstructed dense 3D point cloud to the corresponding vertices of the triangle patch, and apply a scale transformation to the dense point cloud of the scene;

S7: use the aerial-ground point cloud model fusion module (M500) to divide the dense 3D point clouds into an aerial point cloud model and ground point cloud models; register the ground point cloud models to the aerial point cloud model to form the final autonomous driving digital twin scene model; and quantitatively evaluate the accuracy of the reconstructed 3D scene model to ensure that it meets the requirements of subsequent autonomous driving tasks.

In this embodiment, step 3 is specifically as follows: the input image size is W = 1600 pixels wide and H = 1200 pixels high. As shown in Fig. 2, three conventional convolution operations are applied to each input image, producing three feature maps of different sizes. Because a conventional convolution has a fixed kernel size and a uniform receptive field, it often misses many important features in texture-complex regions. This embodiment therefore learns offsets of the convolution kernel sampling directions through a sub-network and further extracts features in texture-complex regions. Formula 2 processes the extracted feature map X into a feature map X' whose channels carry different weights. Specifically, the feature maps of different sizes are upsampled to the W x H resolution by interpolation and concatenated into a 32-channel feature map; according to Formula 3, the two-dimensional feature $u_c(i,j)$ of each channel is compressed into a one-dimensional real number $z_c$; according to Formula 4, the 32 real numbers $z_c$ pass through two fully connected layers $W_1$ and $W_2$, and finally a sigmoid function limits each real number to the range [0, 1], which strengthens the important features for matching, suppresses irrelevant features, and improves the accuracy of feature extraction.

$$X' = F_{scale}\big(F_{sq}\big[f(\mathrm{upsample}(X_C))\big],\ s_c\big) \qquad (2)$$

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} u_c(i,j) \qquad (3)$$

$$s_c = F_{ex}(z, W) = \sigma\big(g(z, W)\big) = \sigma\big(W_2\,\delta(W_1 z)\big) \qquad (4)$$

The cost volume is then constructed from the input camera intrinsic and extrinsic parameter sequence $\{K_i, [R_i\,|\,t_i]\}_{i=1}^{N}$ and the feature map sequence $\{F_i\}_{i=1}^{N}$. This embodiment selects one of the input feature maps as the reference feature map, treats the others as source feature maps, and assumes a depth range of d values. The N-1 source feature maps are therefore mapped by differentiable homographies onto the d parallel planes in front of the reference feature map, yielding N-1 feature volumes $\{V_i\}_{i=2}^{N}$; each feature volume describes the homography relationship between the i-th feature map $F_i$ and the reference feature map $F_1$ on the plane at depth value d. For details of the differentiable homography see Yao Y, et al. MVSNet: depth inference for unstructured multi-view stereo [C], Springer, European Conference on Computer Vision (ECCV), Munich, 2018: 785-801. The feature volumes are processed according to Formula 5 and aggregated into the cost volume C, where $\overline{V_i}$ is the mean of the volume sequence.

$$C = \frac{1}{N}\sum_{i=1}^{N}\left(V_i - \overline{V_i}\right)^{2} \qquad (5)$$

The resulting cost volume of size V = W · H · D · F is cut into D slices of size W · H · F, where W and H are the feature map dimensions, F is the number of feature channels, and D is the depth prior. The slices are not isolated from one another: this embodiment treats the slice sequence as a time series and feeds it into the spatio-temporal recurrent neural network for regularization. The network consists of several ST-LSTM memory cells; horizontally it spans three time steps, corresponding to the cost volume slices input at the previous, current, and next step, and at each time step it has four layers in the vertical direction, with adjacent layers connected by an ST-LSTM cell and the first-layer cell at the current step receiving the hidden and memory states of the last-layer cell at the previous step. For the structure of the ST-LSTM cell see Y. Wang et al., "PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2022.3165153. In this way the embodiment preserves the relationship between cost volume slices, reduces multi-peak artifacts in the resulting probability volume, and improves the accuracy of depth prediction. Moreover, the network processes only three slices at a time, which lowers the GPU memory occupied during inference. Specifically, this embodiment uses the PyTorch deep learning framework and performs model inference on an NVIDIA GeForce GTX 3060 graphics card with 6 GB of video memory.
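A minimal sketch of slicing a cost volume along the depth dimension and visiting it in windows of three consecutive slices, as described above; the ST-LSTM regularization itself is only stubbed out, since its internals are defined in the cited PredRNN paper rather than here, and zero padding at the two ends of the depth range is an assumption.

```python
import torch

def depth_slices(cost: torch.Tensor):
    """cost: (F, D, H, W). Yield (previous, current, next) depth slices of size
    (F, H, W), zero-padded at the two ends of the depth range."""
    D = cost.shape[1]
    zero = torch.zeros_like(cost[:, 0])
    for d in range(D):
        prev = cost[:, d - 1] if d > 0 else zero
        nxt = cost[:, d + 1] if d < D - 1 else zero
        yield prev, cost[:, d], nxt

# Only three (F, H, W) slices need to be resident at once, so peak memory does
# not grow with the hypothesized depth range D.
for window in depth_slices(torch.randn(32, 192, 30, 40)):
    pass  # each window would be fed to the recurrent regularization network
```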

The valid depth values in the ground-truth depth map are converted into a ground-truth volume by one-hot encoding. The optimization objective adopted during model training in this embodiment is formula 6, where G^{(d)}(p) and P^{(d)}(p) are the ground-truth and estimated values at depth hypothesis d and pixel p, respectively. Since the depth values in the ground-truth depth map are incomplete, only the pixels p \in \mathbf{p}_{\mathrm{valid}} of the depth map that carry valid depth values are used for supervision in this embodiment.

\mathrm{Loss} = \sum_{p \in \mathbf{p}_{\mathrm{valid}}} \sum_{d=1}^{D} -G^{(d)}(p)\,\log P^{(d)}(p)    (formula 6)
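A minimal PyTorch sketch of this masked cross-entropy objective follows. It assumes the ground truth is supplied as the index of the correct depth hypothesis per pixel plus a validity mask; all names and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def masked_depth_cross_entropy(prob_volume: torch.Tensor,
                               gt_depth_index: torch.Tensor,
                               valid_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the probability volume and the one-hot ground-truth
    volume, evaluated only at pixels with a valid ground-truth depth.

    prob_volume:    (B, D, H, W) probabilities over the D depth hypotheses
    gt_depth_index: (B, H, W) index of the true depth hypothesis per pixel
    valid_mask:     (B, H, W) boolean mask of pixels with valid ground truth
    """
    log_prob = torch.log(prob_volume.clamp(min=1e-7))
    gt_onehot = F.one_hot(gt_depth_index, num_classes=prob_volume.shape[1])
    gt_onehot = gt_onehot.permute(0, 3, 1, 2).float()         # shape it like the probability volume
    per_pixel = -(gt_onehot * log_prob).sum(dim=1)            # (B, H, W) cross-entropy per pixel
    return per_pixel[valid_mask].mean()                       # supervise valid pixels only

# Illustrative usage
prob = torch.softmax(torch.randn(1, 16, 30, 40), dim=1)
gt_idx = torch.randint(0, 16, (1, 30, 40))
mask = torch.rand(1, 30, 40) > 0.3
loss = masked_depth_cross_entropy(prob, gt_idx, mask)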

Filter and fuse the depth maps. In this embodiment, the depth maps are filtered according to a photometric consistency criterion and a geometric consistency criterion. Specifically, points whose probability value at the estimated depth is lower than 0.8 are regarded as outliers, and valid pixels are required to satisfy the inequalities described in formula 7,

\lVert p_{\mathrm{reproj}} - p_1 \rVert < \tau_{1}, \qquad \frac{\lvert d_{\mathrm{reproj}} - d_1 \rvert}{d_1} < \tau_{2}    (formula 7)

where p_i is the projection of the depth value d_1 at pixel p_1 of the reference image onto point p_i of a neighbouring view, with corresponding depth value d_i; p_reproj is the reprojection of the depth value d_i at point p_i back onto the reference image, with corresponding depth value d_reproj; and \tau_1 and \tau_2 are the pixel-position and relative-depth reprojection-error thresholds.
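The following sketch applies both checks to one depth map, assuming the cross-view reprojected coordinates and depths have already been computed from the camera parameters. The 0.8 probability threshold comes from the text above; the pixel and relative-depth thresholds are illustrative assumptions, not values stated in the patent.

import numpy as np

def filter_depth_map(prob_map: np.ndarray,
                     ref_depth: np.ndarray,
                     reproj_xy: np.ndarray,
                     reproj_depth: np.ndarray,
                     prob_thresh: float = 0.8,
                     pix_thresh: float = 1.0,
                     depth_thresh: float = 0.01) -> np.ndarray:
    """Return a boolean mask of reference-image pixels kept after filtering.

    prob_map:     (H, W) probability of the estimated depth (photometric confidence)
    ref_depth:    (H, W) estimated reference depth d1
    reproj_xy:    (H, W, 2) pixel coordinates p_reproj after projecting each point into a
                  neighbouring view and back into the reference image
    reproj_depth: (H, W) depth d_reproj associated with that reprojection
    """
    h, w = ref_depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    p1 = np.stack([xs, ys], axis=-1).astype(np.float64)

    photometric_ok = prob_map >= prob_thresh                       # drop low-confidence estimates
    pixel_err = np.linalg.norm(reproj_xy - p1, axis=-1)            # ||p_reproj - p1||
    depth_err = np.abs(reproj_depth - ref_depth) / np.maximum(ref_depth, 1e-6)
    geometric_ok = (pixel_err < pix_thresh) & (depth_err < depth_thresh)
    return photometric_ok & geometric_ok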

As shown in FIG. 3, the autonomous driving scene data collection and data processing method provided by the second aspect of the present invention comprises, in this embodiment, the following steps:

S201: Demarcate the range of the autonomous driving scene to be reconstructed.

S202: Preset the data collection route, fly the unmanned aerial vehicle along the preset route at a fixed flight height, and capture scene images at the shooting points.

S203: Lower the flight height of the drone and photograph the buildings in the scene by flying around them in a figure-of-eight pattern.

S204: For buildings beside the road and road sections where the road is completely occluded by trees, use a handheld shooting device to collect data in a surround-shooting manner.

S205: Preprocess all of the collected image data.

S206: Divide the image data collected and processed in the above steps into several groups.

S207: Select feature points in the real scene covered by each group of images.

According to step S201, this embodiment selects an area of about 40,000 square meters in which consumer-grade drones are allowed to fly and which contains no large amount of strongly reflective material, and takes the drone photographs during the noon period, when the lighting is sufficient and pedestrian and vehicle traffic is light.

As shown in FIG. 4, according to step S202, this embodiment fixes the flight height of the UAV at 90 meters; the onboard camera has a 35 mm-equivalent focal length of 24 mm and a frame size of 24×36 mm. Both the heading overlap rate and the side overlap rate are set to 85%, and the shooting point positions are calculated according to formulas 8 and 9.

D_{\mathrm{forward}} = \frac{H \cdot frame_{l}}{focal}\,(1 - fw_{\mathrm{overlap}})    (formula 8)

D_{\mathrm{side}} = \frac{H \cdot frame_{w}}{focal}\,(1 - side_{\mathrm{overlap}})    (formula 9)

Here fw_overlap is the heading overlap rate, side_overlap is the side overlap rate, frame_w and frame_l are the width and height of the frame, focal is the equivalent focal length, and H is the flight height. In this embodiment, the calculation yields one shooting point every 13.5 meters along the flight direction of the drone, with a lateral spacing of 20 meters between the legs of the S-shaped flight route. At each shooting point, one image is captured in each of five directions (obliquely forward, obliquely backward, obliquely left, obliquely right, and straight down), and a total of 380 valid images are collected.
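As a quick arithmetic check of formulas 8 and 9 under these settings (assuming the formulas scale the ground footprint of the frame by one minus the overlap rate), the spacings reported above can be reproduced as follows.

# Shooting-point spacing from the flight parameters of this embodiment
flight_height_m = 90.0                  # fixed flight height H
focal_mm = 24.0                         # 35 mm-equivalent focal length
frame_w_mm, frame_l_mm = 36.0, 24.0     # frame width and height
fw_overlap = side_overlap = 0.85        # heading and side overlap rates

forward_spacing = flight_height_m * frame_l_mm / focal_mm * (1 - fw_overlap)
side_spacing = flight_height_m * frame_w_mm / focal_mm * (1 - side_overlap)

print(forward_spacing)  # 13.5 m between shooting points along the flight direction
print(side_spacing)     # 20.25 m, i.e. roughly the 20 m lateral spacing of the S-shaped route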

According to step S203, this embodiment adjusts the flight height of the drone to 35 meters and photographs the prominent buildings in the scene by flying around them in the figure-of-eight pattern shown in FIG. 4; each point on the figure-of-eight route in FIG. 4 is a shooting point, and a total of 307 valid images are collected.

According to step S204, this embodiment uses a shooting device with a single camera to collect data, in the manner shown in FIG. 4, on the buildings beside the road and on the road sections where the road is completely occluded by trees; a total of 181 valid images are collected.

According to step S205, this embodiment batch-processes the image data: as shown on the left side of FIG. 5, the central region of each original image is retained, the image is adjusted to a width of 3000 pixels and a height of 2250 pixels, and it is then downsampled to a width of 1600 pixels and a height of 1200 pixels. According to step S206, the processed image data are divided into three groups, as shown on the right side of FIG. 5: the images collected by the drone form one group, the road sections completely occluded by trees form one group, and the buildings form one group. According to step S207, after the above steps are completed, in order to meet the requirements of the autonomous driving task, this embodiment selects the vertices of three cuboid stone piers and lane-line corner points in the scene as feature points for the point cloud scale correction module of the system, and uniformly selects 40 pairs of feature points in the scene for the quantitative evaluation of the point cloud model accuracy in the system.
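A minimal Pillow sketch of the cropping and downsampling in step S205 is given below; only the image sizes come from the text above, and the file paths are hypothetical.

from PIL import Image

def preprocess(path: str, out_path: str,
               crop_size=(3000, 2250), target_size=(1600, 1200)) -> None:
    """Keep the central region of the original image, then downsample it."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = crop_size
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))     # central 3000x2250 region
    img = img.resize(target_size, Image.LANCZOS)         # downsample to 1600x1200
    img.save(out_path)

# Hypothetical file names for illustration
# preprocess("DJI_0001.JPG", "group1/DJI_0001_1600x1200.jpg")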

The above-described embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not therefore be construed as limiting the scope of the patent. For those of ordinary skill in the art, several modifications and improvements may be made without departing from the concept of the present invention, and all of these fall within the protection scope of the present invention.

Claims (9)

1. An autopilot digital twin scene construction system based on multi-view three-dimensional reconstruction, characterized in that the autopilot digital twin scene construction system comprises:
the data acquisition and processing module (M100) is used for acquiring and preprocessing the multi-view images of the automatic driving scene and dividing the processed image data into a plurality of groups;
a camera pose estimation module (M200) for taking the collected multi-view images as input, and outputting the corresponding position and pose of a camera for shooting each image, thereby obtaining the internal and external parameter sequences of the camera;
the multi-view three-dimensional reconstruction module (M300) is used for constructing a network model, extracting the feature map sequence of the multi-view images through the network model, constructing a cost volume by combining the internal and external parameter sequence of the camera, slicing the cost volume along the depth dimension, processing the sliced cost volume to obtain a probability volume, estimating the depth maps of the multi-view images according to the probability volume, and finally fusing the depth maps to obtain a three-dimensional dense point cloud of the scene;
the point cloud model scale correction module (M400) is used for constructing a true-to-scale triangle patch in a virtual three-dimensional space, taking as input parameters the three feature points obtained by processing in the module (M100) and the side lengths of the triangle they form, finding the positions of the three corresponding feature points in the scene three-dimensional dense point cloud obtained by the multi-view three-dimensional reconstruction module (M300), simultaneously registering the three feature points in the point cloud model with the corresponding three points of the triangle patch, and performing scale transformation on the three-dimensional dense point cloud;
the air-ground point cloud model fusion module (M500) is used for taking the three-dimensional dense point cloud reconstructed from the images acquired by the unmanned aerial vehicle in the data acquisition and processing module (M100) as the aerial point cloud model and taking the three-dimensional dense point clouds reconstructed from the other groups of images as ground point cloud models; the module registers the plurality of ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model;
and the model precision quantitative evaluation module (M600) is used for quantitatively evaluating the precision of the three-dimensional model of the automatic driving scene and judging whether the precision of the three-dimensional model meets the requirements of subsequent automatic driving tasks.
2. The autopilot digital twin scene construction system according to claim 1, wherein the method of data acquisition and processing by the data acquisition and processing module (M100) comprises the following steps:
S201, demarcating the range of the automatic driving scene to be reconstructed;
S202, presetting a data acquisition route, flying an unmanned aerial vehicle along the preset S-shaped route at a fixed flight height, and shooting scene images at the shooting points;
S203, lowering the flight height of the unmanned aerial vehicle, and shooting the buildings in the scene by flying around them in a figure-of-eight manner;
S204, for buildings beside the road and road sections where the road is completely occluded by trees, acquiring data in a surround-shooting manner using a handheld shooting device;
S205, preprocessing all collected image data: retaining the most central area of each image to adjust the image size to 3000 pixels wide and 2250 pixels high, and then downsampling the image to 1600 pixels wide and 1200 pixels high;
S206, dividing the preprocessed image data into a plurality of groups, wherein the images acquired in step S202 and step S203 are divided into one group serving as the first group of images, and the images photographed for each building or road section in step S204 are grouped individually;
S207, selecting the three most obvious feature points in the real scene covered by each group of images, and recording the positions of the feature points and the millimeter-precision side lengths of the triangle they form.
3. The automated driving digital twin scene building system of claim 1, wherein the camera pose estimation module (M200) comprises a search matching unit and an incremental reconstruction unit; the searching and matching unit takes multi-view images as input, searches for image pairs which are geometrically verified and have overlapping areas, and calculates the projection of the same point in space on two images in the image pairs; the incremental reconstruction unit is used for outputting the corresponding position and posture of the camera for shooting each image.
4. The automated driving digital twin scene building system according to claim 3, wherein the specific process by which the incremental reconstruction unit outputs the position and pose corresponding to the camera capturing each image is: first selecting and registering an initial image pair at a location where the multi-view images are dense; then selecting the image that shares the largest number of registered points with the currently registered images; registering the newly added view against the image set whose poses have been determined, and estimating the pose of the camera that captured the image by solving the PnP problem; then, for the unreconstructed spatial points covered by the newly registered image, triangulating them and adding the new spatial points to the reconstructed spatial point set; and finally performing one bundle adjustment optimization on all currently estimated three-dimensional spatial points and camera poses.
5. A scene construction method of an autopilot digital twin scene construction system according to claim 1, comprising the steps of:
S501: using the data acquisition and processing module (M100) to perform all-round perception of the automatic driving scene and to process the acquired multi-view image data;
S502: inputting the acquired multi-view image data into the camera pose estimation module (M200), estimating the position and pose of the camera that captured each image through search matching and incremental reconstruction, and obtaining the internal and external parameter sequence of the camera;
S503: inputting the internal and external parameter sequence of the camera and the image data acquired by the data acquisition and processing module (M100) into the network model constructed by the multi-view three-dimensional reconstruction module (M300); extracting the feature map sequence of the image sequence with the network model, constructing feature volumes from the internal and external parameter sequence of the camera and the feature map sequence, and aggregating them into a cost volume; slicing the cost volume along the depth dimension, and processing each slice together with its two adjacent slices through the network model to obtain a probability volume describing the probability distribution of each pixel over different depths;
S504: converting the valid depth values in the ground-truth depth map into a ground-truth volume by one-hot encoding, the ground-truth volume serving as the label for supervised learning; inputting the probability volume and the ground-truth volume into the original network model, and obtaining the trained network model through multiple rounds of training that minimize the cross-entropy loss between the probability volume and the ground-truth volume;
S505: processing each image of the input multi-view image sequence through the trained network model to obtain a probability volume, and converting the probability volume into a depth map; then filtering and fusing the depth map sequence to obtain the reconstructed three-dimensional dense point cloud of the scene;
S506: constructing a true-to-scale triangle patch in a virtual three-dimensional space through the point cloud model scale correction module (M400), registering the three feature points in the reconstructed three-dimensional dense point cloud with the corresponding points of the triangle patch, and performing scale transformation on the three-dimensional dense point cloud of the scene;
S507: dividing the three-dimensional dense point cloud into an aerial point cloud model and ground point cloud models through the air-ground point cloud model fusion module (M500); registering the ground point cloud models to the aerial point cloud model to form the final automatic driving digital twin scene model; and quantitatively evaluating the precision of the reconstructed three-dimensional model of the automatic driving scene to ensure that the precision meets the requirements of subsequent automatic driving tasks.
6. The scene construction method according to claim 5, wherein the feature map sequence in step S503 is obtained as follows: the offsets of the convolution kernel direction vectors are learned through the network model, so that the convolution kernel can adapt to regions with different texture structures and extract finer features; next, the feature maps of different sizes are upsampled to the original input image size and concatenated into a feature map with 32 channels; then, the two-dimensional information u_c(i, j) of each feature channel is compressed into a one-dimensional real number z_c, and two stages of full connection are applied; finally, a sigmoid function limits each real number z_c to the range [0, 1], so that each channel of the feature map carries a different weight, weakening noise data and irrelevant features in the matching process; the above steps are repeated for each input image to obtain the feature map sequence.
7. The scene construction method according to claim 5, wherein the aggregation of the cost volume in step S503 is specifically: selecting one image of the feature map sequence as the reference feature map F_1 and taking the remaining feature maps as source feature maps; then, according to the internal and external parameter sequence of the camera, projecting all feature maps onto a plurality of parallel planes under the reference image through homography transformation to form N-1 feature volumes; and finally aggregating the feature volumes into a cost volume through an aggregation function.
8. The scene construction method according to claim 5, wherein the probability volume in step S503 is formed as follows: the cost volume is cut into D slices, where D is the depth prior and the depth value can be any value from 0 to D; the cost volume slice sequence is then regarded as a time series and fed into the spatio-temporal recurrent neural network in the network model for regularization; the spatio-temporal recurrent neural network uses ST-LSTM cells to transfer the memory state in the temporal (horizontal) direction and the spatial (vertical) direction, preserving the relationship between the slices in the sequence and reducing the occurrence of multiple peaks in the probability volume; the first-layer cell at a given moment receives the hidden state and memory state of the last-layer cell at the previous moment in the horizontal direction and passes them layer by layer in the vertical direction; finally, a softmax normalization operation outputs the probability value of each pixel at each depth d ∈ [0, D] to form the probability volume.
9. The scene construction method according to claim 5, wherein step S505 is specifically: performing an argmax operation on the probability volume inferred by the trained network model for each image to obtain the depth map sequence, filtering out low-confidence depth estimates based on the photometric consistency criterion and the geometric consistency criterion, and finally fusing the depth maps into a three-dimensional dense point cloud through the formula P = d·M⁻¹·K⁻¹·p, where p is the pixel coordinate, d is the depth value inferred by the network model, K and M are the internal and external parameter matrices of the camera, and P is the three-dimensional coordinate in the world coordinate system.
CN202310123079.6A 2023-02-16 2023-02-16 A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction Active CN116258817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310123079.6A CN116258817B (en) 2023-02-16 2023-02-16 A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310123079.6A CN116258817B (en) 2023-02-16 2023-02-16 A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN116258817A true CN116258817A (en) 2023-06-13
CN116258817B CN116258817B (en) 2024-01-30

Family

ID=86682139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310123079.6A Active CN116258817B (en) 2023-02-16 2023-02-16 A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN116258817B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036448A (en) * 2023-10-10 2023-11-10 深圳纷来智能有限公司 Scene construction method and system of multi-view camera
CN117333627A (en) * 2023-12-01 2024-01-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117690095A (en) * 2024-02-03 2024-03-12 成都坤舆空间科技有限公司 Intelligent community management system based on three-dimensional scene
CN117830538A (en) * 2024-03-05 2024-04-05 南京中网卫星通信股份有限公司 A 3D reconstruction method based on multi-view stereo matching of cross-scale Transformer
CN118115685A (en) * 2024-03-14 2024-05-31 重庆赛力斯凤凰智创科技有限公司 Simulation scenario generation, test methods, devices, equipment and media
CN118172507A (en) * 2024-05-13 2024-06-11 国网山东省电力公司济宁市任城区供电公司 Substation scene fusion 3D reconstruction method and system based on digital twin
CN118504433A (en) * 2024-07-18 2024-08-16 嘉兴市超联新能源技术有限公司 Method and system for establishing data twin based on photovoltaic flexible support
CN118521720A (en) * 2024-07-23 2024-08-20 浙江核新同花顺网络信息股份有限公司 Virtual person three-dimensional model determining method and device based on sparse view angle image
CN119783536A (en) * 2024-12-30 2025-04-08 哈尔滨工业大学 Multi-resolution neural radiation field method for multi-scale efficient and high-precision digital twins of building complexes
CN119784989A (en) * 2025-03-12 2025-04-08 中国计量大学 Automatic driving multi-view image GAN optimization fusion and reconstruction method and system
WO2025148114A1 (en) * 2024-01-08 2025-07-17 清华大学 Linear cultural heritage data collection method and apparatus based on unmanned aerial vehicle remote sensing

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 A kind of three-dimensional rebuilding method of the indoor dynamic scene based on RGBD camera
US20210110599A1 (en) * 2018-03-05 2021-04-15 Tsinghua University Depth camera-based three-dimensional reconstruction method and apparatus, device, and storage medium
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
US20210215481A1 (en) * 2018-11-09 2021-07-15 Wuyi University Method for measuring antenna downtilt angle based on multi-scale deep semantic segmentation network
US20210358206A1 (en) * 2020-05-14 2021-11-18 Star Institute Of Intelligent Systems Unmanned aerial vehicle navigation map construction system and method based on three-dimensional image reconstruction technology
CN114155353A (en) * 2021-12-03 2022-03-08 武汉工程大学 Method and device for 3D reconstruction of point cloud based on liquid crystal microlens array
CN114359509A (en) * 2021-12-03 2022-04-15 三峡大学 Multi-view natural scene reconstruction method based on deep learning
CN114527676A (en) * 2022-01-13 2022-05-24 浙江大学 Automatic driving networking multi-vehicle testing method and system based on digital twinning
WO2022165876A1 (en) * 2021-02-06 2022-08-11 湖南大学 Wgan-based unsupervised multi-view three-dimensional point cloud joint registration method
CN115205489A (en) * 2022-06-06 2022-10-18 广州中思人工智能科技有限公司 Three-dimensional reconstruction method, system and device in large scene
US20220414911A1 (en) * 2020-03-04 2022-12-29 Huawei Technologies Co., Ltd. Three-dimensional reconstruction method and three-dimensional reconstruction apparatus

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110599A1 (en) * 2018-03-05 2021-04-15 Tsinghua University Depth camera-based three-dimensional reconstruction method and apparatus, device, and storage medium
US20210215481A1 (en) * 2018-11-09 2021-07-15 Wuyi University Method for measuring antenna downtilt angle based on multi-scale deep semantic segmentation network
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 A kind of three-dimensional rebuilding method of the indoor dynamic scene based on RGBD camera
US20220414911A1 (en) * 2020-03-04 2022-12-29 Huawei Technologies Co., Ltd. Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
US20210358206A1 (en) * 2020-05-14 2021-11-18 Star Institute Of Intelligent Systems Unmanned aerial vehicle navigation map construction system and method based on three-dimensional image reconstruction technology
WO2022165876A1 (en) * 2021-02-06 2022-08-11 湖南大学 Wgan-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN114155353A (en) * 2021-12-03 2022-03-08 武汉工程大学 Method and device for 3D reconstruction of point cloud based on liquid crystal microlens array
CN114359509A (en) * 2021-12-03 2022-04-15 三峡大学 Multi-view natural scene reconstruction method based on deep learning
CN114527676A (en) * 2022-01-13 2022-05-24 浙江大学 Automatic driving networking multi-vehicle testing method and system based on digital twinning
CN115205489A (en) * 2022-06-06 2022-10-18 广州中思人工智能科技有限公司 Three-dimensional reconstruction method, system and device in large scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG SHENGBI et al.: "Three Dimensional Light Detection and Ranging Decoder Design", 2021 3RD INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND THE INTERNET (ICCCI) *
WU QIANQIAN et al.: "Three-dimensional depth image reconstruction algorithm based on Gaussian process regression and Markov random field", Applied Laser, no. 06 *
SONG SHIDE et al.: "Denoising method for plant three-dimensional point clouds obtained by multi-view stereo matching", Journal of Computer Applications, no. 2 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036448B (en) * 2023-10-10 2024-04-02 深圳纷来智能有限公司 Scene construction method and system of multi-view camera
CN117036448A (en) * 2023-10-10 2023-11-10 深圳纷来智能有限公司 Scene construction method and system of multi-view camera
CN117333627A (en) * 2023-12-01 2024-01-02 南方科技大学 Reconstruction and complement method, system and storage medium for automatic driving scene
CN117333627B (en) * 2023-12-01 2024-04-02 南方科技大学 A method, system and storage medium for reconstruction and completion of autonomous driving scenes
WO2025148114A1 (en) * 2024-01-08 2025-07-17 清华大学 Linear cultural heritage data collection method and apparatus based on unmanned aerial vehicle remote sensing
CN117690095A (en) * 2024-02-03 2024-03-12 成都坤舆空间科技有限公司 Intelligent community management system based on three-dimensional scene
CN117690095B (en) * 2024-02-03 2024-05-03 成都坤舆空间科技有限公司 Intelligent community management system based on three-dimensional scene
CN117830538A (en) * 2024-03-05 2024-04-05 南京中网卫星通信股份有限公司 A 3D reconstruction method based on multi-view stereo matching of cross-scale Transformer
CN118115685A (en) * 2024-03-14 2024-05-31 重庆赛力斯凤凰智创科技有限公司 Simulation scenario generation, test methods, devices, equipment and media
CN118172507A (en) * 2024-05-13 2024-06-11 国网山东省电力公司济宁市任城区供电公司 Substation scene fusion 3D reconstruction method and system based on digital twin
CN118172507B (en) * 2024-05-13 2024-08-02 国网山东省电力公司济宁市任城区供电公司 Digital twinning-based three-dimensional reconstruction method and system for fusion of transformer substation scenes
CN118504433A (en) * 2024-07-18 2024-08-16 嘉兴市超联新能源技术有限公司 Method and system for establishing data twin based on photovoltaic flexible support
CN118504433B (en) * 2024-07-18 2024-10-25 嘉兴市超联新能源技术有限公司 Method and system for establishing data twin based on photovoltaic flexible support
CN118521720A (en) * 2024-07-23 2024-08-20 浙江核新同花顺网络信息股份有限公司 Virtual person three-dimensional model determining method and device based on sparse view angle image
CN119783536A (en) * 2024-12-30 2025-04-08 哈尔滨工业大学 Multi-resolution neural radiation field method for multi-scale efficient and high-precision digital twins of building complexes
CN119784989A (en) * 2025-03-12 2025-04-08 中国计量大学 Automatic driving multi-view image GAN optimization fusion and reconstruction method and system

Also Published As

Publication number Publication date
CN116258817B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN116258817B (en) A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN111599001B (en) Unmanned aerial vehicle navigation map construction system and method based on image three-dimensional reconstruction technology
US20210390329A1 (en) Image processing method, device, movable platform, unmanned aerial vehicle, and storage medium
CN111832655B (en) A Multi-scale 3D Object Detection Method Based on Feature Pyramid Network
CN109190508B (en) Multi-camera data fusion method based on space coordinate system
CN109034018B (en) Low-altitude small unmanned aerial vehicle obstacle sensing method based on binocular vision
CN108647655B (en) Low-altitude aerial image power line foreign object detection method based on light convolutional neural network
CN111126184B (en) Post-earthquake building damage detection method based on unmanned aerial vehicle video
CN112435325A (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN110648389A (en) 3D reconstruction method and system for city street view based on cooperation of unmanned aerial vehicle and edge vehicle
CN113936139A (en) A method and system for scene bird's-eye view reconstruction combining visual depth information and semantic segmentation
CN114266891A (en) Anomaly identification method of railway operating environment based on image and laser data fusion
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
CN117496312A (en) Three-dimensional multi-target detection method based on multi-mode fusion algorithm
CN109063549A (en) High-resolution based on deep neural network is taken photo by plane video moving object detection method
CN119516195B (en) Orchard 3D reconstruction and fruit tree semantic segmentation method based on neural radiation field
CN114038193A (en) Intelligent traffic flow data statistical method and system based on unmanned aerial vehicle and multi-target tracking
CN111914615A (en) Fire-fighting area passability analysis system based on stereoscopic vision
CN115019208A (en) A three-dimensional reconstruction method and system of road surface for dynamic traffic scene
CN116402870A (en) A Target Localization Method Based on Monocular Depth Estimation and Scale Restoration
CN116051758B (en) A method for constructing a terrain map containing height information for outdoor robots
CN112946679A (en) Unmanned aerial vehicle surveying and mapping jelly effect detection method and system based on artificial intelligence
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载