CN115731530A - Model training method and device
- Publication number: CN115731530A
- Application number: CN202110976217.6A
- Authority: CN (China)
- Prior art keywords: feature, feature map, image block, image, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a model training method and a related device.
Background
Computer vision is an integral part of intelligent/autonomous systems across application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It is the study of how to use cameras/video cameras and computers to obtain the data and information we need about a photographed subject. Figuratively speaking, it means equipping a computer with eyes (a camera or video camera) and a brain (algorithms) so that it can identify, track, and measure targets in place of the human eye, thereby enabling the computer to perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of making artificial systems "perceive" from images or multidimensional data. In general, computer vision replaces the visual organs with various imaging systems to obtain input information, and then replaces the brain with a computer to process and interpret that information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision the way humans do, with the ability to adapt to the environment autonomously.
A perception network is a neural network model that processes and analyzes images to produce processing results. Perception networks can now perform more and more functions, for example image classification, 2D detection, semantic segmentation, keypoint detection, linear object detection (such as lane line or stop line detection in autonomous driving), and drivable area detection. In addition, visual perception systems are low-cost, contactless, small, and information-rich. As the accuracy of visual perception algorithms keeps improving, they have become a key technology in many of today's artificial intelligence systems and are increasingly widely used, for example: recognizing dynamic obstacles (people or vehicles) and static objects (traffic lights, traffic signs, or traffic cones) on the road in advanced driving assistant systems (ADAS) and autonomous driving systems (ADS), or achieving body-slimming effects in a phone's beauty-camera function by recognizing human body masks and keypoints.
In recent years, self-supervised representation learning based on contrastive learning has achieved major technical breakthroughs. Its basic assumption is that image blocks from the same image describe the same semantic information, so their features should be as close as possible, while image blocks from different images describe different semantic information, so their features should be as dissimilar as possible. Although this approach requires no manual annotation, it still relies on an additional single-instance assumption: each image contains only one instance, and that instance occupies the central, dominant part of the image. Only then can image blocks cropped from the same image be treated as positive samples and image blocks cropped from different images as negative samples. In autonomous driving scenarios, however, street-view images often contain multiple instances such as pedestrians and vehicles; they exhibit strong global inconsistency and do not satisfy the single-instance assumption. Existing methods are therefore often limited to datasets that satisfy the single-instance assumption and struggle to fully exploit more realistic multi-instance autonomous driving datasets.
Summary of the Invention
In a first aspect, the present application provides a model training method, the method comprising:
acquiring a sample image. The sample image may be a multi-instance image from the autonomous driving scenario described above (for example, a street-view image). In one possible implementation, the sample image may include multiple objects, and the objects include at least one of a person, a vehicle, a traffic sign, a lane line, and a plant. Exemplarily, the objects may include, but are not limited to, at least one of the following: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign).
Sampling the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks on the sample image.
After the sample image is acquired, it can be sampled to obtain multiple local images (for example, the first image block and the second image block). The first image block and the second image block can be small images obtained by sampling the sample image ("small" here means that, relative to the size of the sample image, the image areas of the first image block and the second image block are smaller).

The above sampling can be understood as cropping: in one sampling pass, a rectangular region can be randomly determined on the sample image and the image inside that region taken as the first image block; in another sampling pass, a rectangular region can be randomly determined on the sample image and the image inside that region taken as the second image block.

The above sampling may be random sampling; so-called random sampling means that the sampling position and/or the sampling size is determined randomly.

The first image block and the second image block may be rectangular images.
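For illustration, a minimal Python sketch of this random rectangular sampling (the function name `sample_crop` and the image size are assumptions, not taken from the patent):

```python
import random

def sample_crop(img_w, img_h, min_size=64):
    """Randomly pick a rectangular region (x1, y1, x2, y2) on a W x H image."""
    w = random.randint(min_size, img_w)   # random size
    h = random.randint(min_size, img_h)
    x1 = random.randint(0, img_w - w)     # random position
    y1 = random.randint(0, img_h - h)
    return (x1, y1, x1 + w, y1 + h)

# Two independent samples play the roles of the first and second image blocks.
box1 = sample_crop(1920, 1080)
box2 = sample_crop(1920, 1080)
```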
Based on the intersection-over-union between the regions of the sample image where the first image block and the second image block are located being greater than a threshold, performing feature extraction on the first image block and the second image block respectively through a feature extraction network, to obtain a first feature map and a second feature map.

There may be a common region (an overlapping region) between the region of the sample image where the first image block is located and the region where the second image block is located. The ratio between the area of this common region and the total area occupied by the first image block and the second image block on the sample image is the intersection over union (IoU) between the regions of the sample image where the first image block and the second image block are located.

In one possible implementation, the threshold may be a value greater than or equal to 0.4; for example, the threshold may be 0.4, 0.45, 0.5, 0.55, and so on. It should be understood that the IoU threshold clearly cannot be set too low, but higher is not always better either: the image blocks should not be completely unrelated, yet they should not be identical. The choice of IoU threshold therefore effectively controls the balance between data noise and data complexity in multi-instance self-supervised learning.
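A minimal sketch of the IoU computation and the threshold filter, reusing `sample_crop` from the sketch above (the threshold value 0.4 follows the text; the helper names are illustrative):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)        # overlap / union

def sample_crop_pair(img_w, img_h, threshold=0.4):
    """Resample until the pair satisfies the local-consistency constraint."""
    while True:
        a, b = sample_crop(img_w, img_h), sample_crop(img_w, img_h)
        if box_iou(a, b) > threshold:
            return a, b
```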
Determining a loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network.

This application uses IoU to measure the relative distance between two image blocks. Requiring the IoU of the image blocks to exceed a given threshold guarantees semantic consistency within a local region and restores the basic assumption of contrastive learning: the image blocks are confined to a single local region, converting global inconsistency into local consistency and thereby improving the data-processing accuracy of the updated feature extraction network. At the same time, the choice of threshold can effectively control the balance between data noise and data complexity in different scenarios.
In one possible implementation, the threshold is a value greater than or equal to 0.4.
In one possible implementation, before the loss is determined according to the difference between the first feature map and the second feature map, the method further includes: aligning the feature maps of the first image block and the second image block to obtain an aligned first feature map and an aligned second feature map. Determining the loss according to the difference between the first feature map and the second feature map then includes: determining the loss according to the difference between the aligned first feature map and the aligned second feature map.

In one possible implementation, the sample image includes a target region, where the target region is the overlapping region of the sample image in which both the first image block and the second image block lie. The aligning includes: determining, according to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map; upsampling the first sub-feature map to obtain the aligned first feature map; and upsampling the second sub-feature map to obtain the aligned second feature map, where the aligned first feature map and the aligned second feature map have the same size.

To distinguish multi-instance features, the global pooling layer after the backbone network must be dropped so as to preserve the two-dimensional structure and positional information of the feature map. This, however, introduces an additional feature misalignment problem: positions at the same relative location on the two two-dimensional feature maps no longer correspond one-to-one. Embodiments of this application provide two different ways to perform feature alignment. Region-of-interest alignment treats the overlapping part of the image blocks as the region of interest of each of the two image blocks; for example, RoI Align can be used to extract only the features of the overlapping part for subsequent computation. This approach is intuitive but does not make full use of the information in the non-overlapping parts.
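A minimal sketch of the RoI-alignment variant using torchvision's RoI Align (the fixed feature stride and all names here are assumptions for illustration; the patent does not fix these details):

```python
import torch
from torchvision.ops import roi_align

def align_overlap(feat, crop, overlap, out_size=7, stride=32.0):
    """Extract the overlap-region features from one crop's feature map.

    feat: (1, C, Hf, Wf) backbone features of the crop (stride 32 assumed);
    crop, overlap: (x1, y1, x2, y2) boxes in sample-image coordinates.
    """
    x1, y1 = crop[0], crop[1]
    # Express the overlap box in the crop's local pixel coordinates.
    local = [overlap[0] - x1, overlap[1] - y1, overlap[2] - x1, overlap[3] - y1]
    rois = torch.tensor([[0.0] + local])  # first column is the batch index
    return roi_align(feat, rois, output_size=out_size,
                     spatial_scale=1.0 / stride, aligned=True)

# Applying align_overlap to both feature maps yields two equally sized,
# spatially corresponding feature maps for the subsequent loss.
```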
In one possible implementation, the first feature map and the second feature map have the same size; the first feature map includes M first feature points and the second feature map includes M second feature points; the M first feature points correspond to M first pixels in the sample image and the M second feature points correspond to M second pixels in the sample image, with a one-to-one correspondence between the M first pixels and the M second pixels. The method further includes: obtaining a third feature map from the M first pixels and the M second pixels, where the third feature map has the same size as the second feature map and includes M third feature points, each third feature point being obtained from the pixel-position difference between a corresponding pair of first and second pixels; and fusing the third feature map with the first feature map to obtain the aligned first feature map, the second feature map serving as the aligned second feature map.

In one possible implementation, fusing the third feature map with the first feature map includes:

concatenating the third feature map with the first feature map along the depth direction.

In displacement alignment, for each pair of feature points at the same relative position, their coordinate displacement in the original image is computed and concatenated with the feature map (spliced along the depth direction). The displacement is provided to the prediction network as additional side information for implicit feature alignment, helping the subsequent feature prediction. This makes full use of the feature information in the non-overlapping regions while making the subsequent similarity measurement more flexible.
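A minimal sketch of the displacement-alignment variant, under the simplifying assumption that a feature-grid location maps linearly back to sample-image coordinates (names and details are illustrative):

```python
import torch

def grid_coords(crop, hf, wf):
    """Sample-image (x, y) coordinates of each feature-grid location of a crop."""
    x1, y1, x2, y2 = crop
    xs = torch.linspace(float(x1), float(x2), wf)
    ys = torch.linspace(float(y1), float(y2), hf)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy])                              # (2, Hf, Wf)

def displacement_align(feat1, crop1, crop2):
    """Concatenate per-point coordinate displacements onto the first feature map."""
    _, c, hf, wf = feat1.shape
    disp = grid_coords(crop1, hf, wf) - grid_coords(crop2, hf, wf)
    return torch.cat([feat1, disp.unsqueeze(0)], dim=1)       # (1, C+2, Hf, Wf)
```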
In one possible implementation, determining the loss according to the difference between the first feature map and the second feature map includes: processing the M first feature points of the first feature map through a target prediction network to obtain a predicted value for each first feature point; clustering the M second feature points of the second feature map based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster it belongs to; and determining the loss according to the difference between the predicted value of each first feature point and the updated feature value of each second feature point.

Embodiments of this application exploit within-image clustering. On the one hand, the network gains the ability to distinguish the features of different instances; on the other hand, by taking into account the overall information of the two-dimensional feature map, the regression targets provided by the target branch become more robust, while the online branch introduces a global view by deploying a self-attention module to obtain more accurate predictions.
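A minimal sketch of the clustering-based loss, using k-means as a stand-in for the unspecified target clustering algorithm and a mean-squared error as the difference measure (the patent does not fix the algorithm, cluster count, or loss form):

```python
import torch
import torch.nn.functional as F

def cluster_targets(feat2, k=8, iters=10):
    """Replace each target-branch feature vector with its k-means cluster center."""
    b, c, h, w = feat2.shape
    x = feat2.permute(0, 2, 3, 1).reshape(-1, c)        # (M, C) feature points
    centers = x[torch.randperm(x.size(0))[:k]].clone()  # random initialization
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)  # nearest center per point
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(dim=0)
    return centers[assign].reshape(b, h, w, c).permute(0, 3, 1, 2)

def prediction_loss(pred, feat2):
    """Regress the online branch's predictions onto the clustered targets."""
    target = cluster_targets(feat2.detach())            # no gradient into the target branch
    return F.mse_loss(pred, target)
```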
In one possible implementation, the method further includes:

updating the target prediction network according to the loss.

In one possible implementation, the sample image includes multiple objects, and the objects include at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.

In one possible implementation, the method further includes:

obtaining a target network and an image to be processed, where the target network includes the updated feature extraction network and a downstream task network; and processing the image to be processed through the target network to obtain a processing result.
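A minimal sketch of assembling the target network (the head shown is a placeholder classification head; the patent leaves the downstream task network open):

```python
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Updated feature extraction network followed by a downstream task network."""
    def __init__(self, backbone, num_classes=10):
        super().__init__()
        self.backbone = backbone                 # the updated feature extraction network
        self.head = nn.Sequential(               # placeholder downstream task network
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.LazyLinear(num_classes))

    def forward(self, x):
        return self.head(self.backbone(x))       # processing result
```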
In a second aspect, the present application provides a model training apparatus, the apparatus comprising:

an acquisition module, configured to acquire a sample image;

a sampling module, configured to sample the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks on the sample image;

a feature extraction module, configured to, based on the intersection-over-union between the regions of the sample image where the first image block and the second image block are located being greater than a threshold, perform feature extraction on the first image block and the second image block respectively through a feature extraction network, to obtain a first feature map and a second feature map; and

a model updating module, configured to determine a loss according to the difference between the first feature map and the second feature map, and update the feature extraction network based on the loss to obtain an updated feature extraction network.
In one possible implementation, the threshold is a value greater than or equal to 0.4.

In one possible implementation, the apparatus further includes:

an alignment module, configured to, before the loss is determined according to the difference between the first feature map and the second feature map, align the feature maps of the first image block and the second image block to obtain an aligned first feature map and an aligned second feature map;

the model updating module being specifically configured to:

determine the loss according to the difference between the aligned first feature map and the aligned second feature map.
In one possible implementation, the sample image includes a target region, where the target region is the overlapping region of the sample image in which both the first image block and the second image block lie, and the alignment module is specifically configured to:

determine, according to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map;

upsample the first sub-feature map to obtain the aligned first feature map; and

upsample the second sub-feature map to obtain the aligned second feature map, where the aligned first feature map and the aligned second feature map have the same size.
In one possible implementation, the first feature map and the second feature map have the same size; the first feature map includes M first feature points and the second feature map includes M second feature points; the M first feature points correspond to M first pixels in the sample image and the M second feature points correspond to M second pixels in the sample image, with a one-to-one correspondence between the M first pixels and the M second pixels; and the alignment module is specifically configured to:

obtain a third feature map from the M first pixels and the M second pixels, where the third feature map has the same size as the second feature map and includes M third feature points, each third feature point being obtained from the pixel-position difference between a corresponding pair of first and second pixels; and

fuse the third feature map with the first feature map to obtain the aligned first feature map, the second feature map serving as the aligned second feature map.

In one possible implementation, the alignment module is specifically configured to:

concatenate the third feature map with the first feature map along the depth direction.
In one possible implementation, the model updating module is specifically configured to:

process the M first feature points of the first feature map through a target prediction network to obtain a predicted value for each first feature point;

cluster the M second feature points of the second feature map based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster it belongs to; and

determine the loss according to the difference between the predicted value of each first feature point and the updated feature value of each second feature point.

In one possible implementation, the model updating module is further configured to:

update the target prediction network according to the loss.
In one possible implementation, the sample image includes multiple objects, and the objects include at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.

In one possible implementation, the apparatus further includes:

a data processing module, configured to obtain a target network and an image to be processed, where the target network includes the updated feature extraction network and a downstream task network; and

process the image to be processed through the target network to obtain a processing result.
In a third aspect, an embodiment of the present application provides a model training apparatus that may include a memory, a processor, and a bus system, where the memory is configured to store a program and the processor is configured to execute the program in the memory so as to perform the method of the first aspect or any optional implementation of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method of the first aspect or any optional implementation of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program that, when run on a computer, causes the computer to perform the method of the first aspect or any optional implementation of the first aspect.

In a sixth aspect, the present application provides a chip system. The chip system includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In one possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.
An embodiment of the present application provides a model training method, the method including: acquiring a sample image; sampling the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks on the sample image; based on the intersection-over-union between the regions of the sample image where the first image block and the second image block are located being greater than a threshold, performing feature extraction on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map; and determining a loss according to the difference between the first feature map and the second feature map and updating the feature extraction network based on the loss to obtain an updated feature extraction network. IoU is used to measure the relative distance between two image blocks: requiring the IoU of the image blocks to exceed a given threshold guarantees semantic consistency within a local region and restores the basic assumption of contrastive learning, confining the image blocks to a single local region and thereby converting global inconsistency into local consistency, which in turn improves the data-processing accuracy of the updated feature extraction network. At the same time, the choice of threshold can effectively control the balance between data noise and data complexity in different scenarios.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of the main framework of artificial intelligence;

Fig. 2 shows an application scenario of an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a convolutional neural network used in an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a convolutional neural network used in an embodiment of the present application;

Fig. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of a model training method provided by an embodiment of the present application;

Fig. 7 is a schematic diagram of an intersection-over-union computation provided by an embodiment of the present application;

Fig. 8 is a schematic diagram of image block sampling provided by an embodiment of the present application;

Fig. 9 is a schematic diagram of image block sampling provided by an embodiment of the present application;

Fig. 10 is a schematic diagram of image block sampling provided by an embodiment of the present application;

Fig. 11 is a schematic diagram of image block sampling provided by an embodiment of the present application;

Fig. 12 is a schematic structural diagram of a backbone network;

Fig. 13 is a schematic diagram of image block alignment in an embodiment of the present application;

Fig. 14 is a schematic diagram of image block alignment in an embodiment of the present application;

Fig. 15 is a schematic diagram of clustering in an embodiment of the present application;

Fig. 16 is a schematic diagram of a similarity computation in an embodiment of the present application;

Fig. 17 is a schematic diagram of a model training method provided by an embodiment of the present application;

Fig. 18 is a schematic diagram of a model training method provided by an embodiment of the present application;

Fig. 19 is a schematic structural diagram of a downstream task network;

Fig. 20 is a schematic structural diagram of a head;

Fig. 21 is a schematic diagram of a clustering result;

Fig. 22 is a schematic diagram of a model training apparatus provided by an embodiment of the present application;

Fig. 23 is a schematic structural diagram of an execution device provided by an embodiment of the present application;

Fig. 24 is a schematic structural diagram of a training device provided by an embodiment of the present application;

Fig. 25 is a schematic structural diagram of a chip provided by an embodiment of the present application.
Detailed Description
Embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention. The terms used in the embodiments of the present invention are intended only to explain specific embodiments of the present invention and are not intended to limit the present invention.

Embodiments of the present application are described below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

The terms "first", "second", and so on in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way the embodiments of the present application distinguish objects of the same attribute when describing them. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units is not necessarily limited to those units but may include other units not expressly listed or inherent to such a process, method, product, or device.
The overall workflow of an artificial intelligence system is described first. Referring to Fig. 1, which shows a schematic structural diagram of the main framework of artificial intelligence, the framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the sequence of processes from data acquisition to data processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a distillation of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecology of the system.
(1) Infrastructure

The infrastructure provides computing-power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes distributed computing frameworks, networks, and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and these data are provided to the intelligent chips in the distributed computing system provided by the base platform for computation.

(2) Data

The data at the layer above the infrastructure represent the data sources of the artificial intelligence field. The data involve graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.

Among these, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system: following a reasoning control strategy, formalized information is used for machine thinking and problem solving; typical functions are search and matching.

Decision-making refers to the process of making decisions after reasoning over intelligent information, and usually provides functions such as classification, ranking, and prediction.

(4) General capabilities

After the data have undergone the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example an algorithm or a general-purpose system, such as translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.

(5) Intelligent products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and so on.
The application scenarios of the ADAS/ADS visual perception system, phone beautification, image classification, commodity classification, and the like are briefly introduced below.

Application scenario 1: ADAS/ADS visual perception system

As shown in Fig. 2, in ADAS and ADS, multiple types of 2D object detection need to be performed in real time, including: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign). In addition, to accurately obtain the region that a dynamic obstacle occupies in 3D space, 3D estimation must also be performed on the dynamic obstacle to output a 3D box. To fuse with lidar data, the mask of a dynamic obstacle must be obtained so that the laser point cloud hitting the dynamic obstacle can be filtered out. For accurate parking, the 4 keypoints of a parking slot must be detected simultaneously; for mapping-based localization, the keypoints of static objects must be detected. A model trained with the technical solution provided by the embodiments of this application (for example, the target network) can perform all or some of the above functions within the target network.

Application scenario 2: phone beautification function

In a mobile phone, a model trained with the technical solution provided by the embodiments of this application (for example, the target network) detects the mask and keypoints of the human body, so the corresponding parts of the human body can be enlarged or reduced, for example waist-slimming and hip-shaping operations, to output a beautified image.

Application scenario 3: image classification

After an image to be classified is obtained, a model trained with the technical solution provided by the embodiments of this application (for example, the target network) can determine the categories of the objects in the image, and the image can then be classified according to those categories. A photographer takes many photos every day: of animals, of people, of plants. With the method of this application, photos can be quickly sorted by content into photos containing animals, photos containing people, and photos containing plants.

Application scenario 4: commodity classification

After an image containing a commodity is obtained, a model trained with the technical solution provided by the embodiments of this application (for example, the target network) can determine the category of the commodity in the image, and the commodity can then be classified according to its category. This is useful for the wide variety of commodities in large shopping malls or supermarkets.
Since the embodiments of the present application involve extensive application of neural networks, for ease of understanding, the related terms and concepts of neural networks involved in the embodiments of the present application are introduced first.

(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ (the input data) and an intercept of 1 as inputs, and the output of the operation unit may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce nonlinearity into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field; the local receptive field may be a region composed of several neural units.
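A worked illustration of the formula above (a minimal sketch; sigmoid is chosen as the activation per the text):

```python
import numpy as np

def neural_unit(x, w, b):
    """Compute f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

y = neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), b=0.3)
```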
(2) Object detection: using image processing, machine learning, computer graphics, and other related methods, object detection can determine the category of an object in an image and determine a detection box for locating the object.

(3) A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers; the feature extractor can be regarded as a filter. A convolutional layer is a layer of neurons in a convolutional neural network that convolves the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way features are extracted is independent of position. A convolution kernel can be initialized in the form of a matrix of random size, and during training of the convolutional neural network the kernel can learn reasonable weights. In addition, a direct benefit of weight sharing is reducing the number of connections between the layers of the convolutional neural network while also lowering the risk of overfitting.
A CNN is a very common neural network; the structure of a CNN is described in detail below with reference to Fig. 3. As stated in the introduction to basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image fed into it.

As shown in Fig. 3, a convolutional neural network (CNN) 200 may include an input layer 210, convolutional/pooling layers 220 (the pooling layers being optional), and a fully connected layer 230.

Convolutional/pooling layers 220:

Convolutional layers:

As shown in Fig. 3, the convolutional/pooling layers 220 may include, for example, layers 221-226. In one implementation, layer 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, and 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer can serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

The following takes convolutional layer 221 as an example to describe the inner workings of a single convolutional layer.

Convolutional layer 221 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing like a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically processed along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), thereby completing the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as determined by the "multiple" just mentioned. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), the feature maps extracted by these equally sized weight matrices also have the same size, and the extracted equally sized feature maps are then combined to form the output of the convolution operation.
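A minimal PyTorch sketch of the stacking behavior just described (all sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)   # kernel depth matches the input depth (3)
y = conv(x)                       # 16 weight matrices -> output depth 16
print(y.shape)                    # torch.Size([1, 16, 224, 224])
```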
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. The weight matrices formed by the trained weight values can be used to extract information from the input image, enabling the convolutional neural network 200 to make correct predictions.

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (for example 221) often extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example 226) become more and more complex, such as high-level semantic features; features with higher-level semantics are better suited to the problem to be solved.
Pooling layers:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In layers 221-226 illustrated at 220 in Fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of a pooling layer is to reduce the spatial size of the image. A pooling layer may include an average-pooling operator and/or a max-pooling operator for sampling the input image to obtain an image of smaller size. The average-pooling operator computes the average of the pixel values within a specific window of the image as the result of average pooling. The max-pooling operator takes the pixel with the maximum value within a specific window as the result of max pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in a pooling layer should also be related to the size of the image. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer; each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
Fully connected layer 230:

After processing by the convolutional/pooling layers 220, the convolutional neural network 200 is not yet able to output the required output information. As described above, the convolutional/pooling layers 220 only extract features and reduce the number of parameters brought by the input image. However, to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in Fig. 3), and the parameters contained in these hidden layers may be obtained by pre-training on training data relevant to the specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the fully connected layer 230, that is, as the final layer of the entire convolutional neural network 200, comes the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy, specifically used to compute the prediction error. Once the forward propagation of the entire convolutional neural network 200 is completed (in Fig. 3, propagation in the direction from 210 to 240 is forward propagation), back propagation (in Fig. 3, propagation in the direction from 240 to 210 is back propagation) begins to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in Fig. 3 is only an example of a convolutional neural network. In specific applications, a convolutional neural network may also exist in the form of other network models, for example, including only part of the network structure shown in Fig. 3; for instance, the convolutional neural network used in the embodiments of the present application may include only the input layer 210, the convolutional/pooling layers 220, and the output layer 240.

It should be noted that the convolutional neural network 100 shown in Fig. 3 is only an example of a convolutional neural network. In specific applications, a convolutional neural network may also exist in the form of other network models, for example with multiple convolutional/pooling layers in parallel as shown in Fig. 4, with the separately extracted features all fed to the fully connected layer 230 for processing.
(4) Deep neural network

A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no particular metric for "many" here. Divided by the positions of the different layers, the neural network inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the $i$-th layer must be connected to any neuron in the $(i+1)$-th layer. Although the DNN looks complicated, the work of each layer is actually not complicated; it is simply the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply applies this simple operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficients $W$ and bias vectors $\vec{b}$. These parameters are defined in the DNN as follows. Taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer index of the coefficient $W$, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the $k$-th neuron of layer $L-1$ to the $j$-th neuron of layer $L$ is defined as $W^L_{jk}$. Note that the input layer has no $W$ parameters. In a deep neural network, more hidden layers make the network better able to model complex situations in the real world. In theory, a model with more parameters has higher complexity and greater "capacity", which means it can accomplish more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (weight matrices formed by the vectors $W$ of many layers).
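A minimal sketch of the layer-wise expression $\vec{y} = \alpha(W\vec{x} + \vec{b})$, using ReLU as the activation (layer sizes are arbitrary):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Apply y = alpha(W x + b) layer by layer, with ReLU as alpha."""
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((2, 8))]
biases = [np.zeros(8), np.zeros(2)]
y = dnn_forward(rng.standard_normal(4), weights, biases)
```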
(5) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vectors of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, "how to compare the difference between the predicted value and the target value" needs to be defined in advance; this is the loss function (loss function) or objective function (objective function), which are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(6) Back propagation algorithm
A convolutional neural network can use the error back propagation (back propagation, BP) algorithm to correct the values of the parameters of the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward-passing the input signal up to the output produces an error loss, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward movement dominated by the error loss, and aims to obtain the optimal parameters of the super-resolution model, for example the weight matrices.
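The loss computation of (5) and the back propagation of (6) can be sketched together as a generic PyTorch training step (the two-layer network and the MSE loss below are illustrative stand-ins, not the networks of this application):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()                       # measures predicted vs. target difference
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, target = torch.randn(32, 8), torch.randn(32, 1)

pred = model(x)                              # forward propagation
loss = loss_fn(pred, target)                 # the higher the loss, the larger the gap
optimizer.zero_grad()
loss.backward()                              # back-propagate the error loss
optimizer.step()                             # update weights and biases to reduce the loss
```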
(7) Self-supervised learning (self-supervised learning): refers to using certain attributes of the data itself (such as the rotation angle or the arrangement order of image blocks) as "labels" of the data, so as to perform unsupervised pre-training in a supervised manner; finally, the backbone network parameters are taken as the initialization of the network parameters for various downstream tasks. Here, upstream refers to the pre-training process, while downstream refers to various practical vision problems. Among these methods, contrastive learning (contrastive learning) is a recent research hotspot. It constructs contrastive tasks by mining the consistency of the data itself, holding that different image blocks from the same image should have similar features; feature representations learned in this way even achieve transfer results that surpass supervised pre-training.
(8) In-domain pre-training (domain-specific pre-training): refers to pre-training for a downstream task in a specific domain using an upstream dataset from the same data domain, so as to narrow the data gap between upstream and downstream and reduce the optimization difficulty of downstream fine-tuning. In the present invention, the datasets used upstream and downstream are both multi-instance autonomous driving datasets.
(9) Backbone network (backbone): the basic network structure that performs feature extraction on data, and also the main object learned during pre-training. On top of the backbone network, different task-specific network structures can be deployed for different tasks; the backbone network is shareable and transferable, whereas the task-specific networks are often not. For images, the commonly used backbone network is a convolutional neural network, whose final feature representation is a two-dimensional feature map (2D feature map). Existing contrastive learning models often additionally deploy a global pooling layer (global pooling layer), for example averaging over the spatial dimensions, to process the two-dimensional feature map into a one-dimensional feature vector (1D feature vector). The present invention discards the global pooling layer and models the two-dimensional feature map directly.
Next, a more detailed architecture of the execution entity that performs the model training method in the embodiments of this application is introduced.
The system architecture provided by the embodiments of this application is described in detail below with reference to FIG. 5. FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of this application. As shown in FIG. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513 and a preprocessing module 514. The computing module 511 may include the target model/rule 501; the preprocessing module 513 and the preprocessing module 514 are optional.
The data collection device 560 is used to collect training samples. The training samples may be image data and the like; in the embodiments of this application, the training samples may be sample images. After collecting the training samples, the data collection device 560 stores them in the database 530.
The training device 520 may train the feature extraction network based on the training samples maintained in the database 530 to obtain the target model/rule 501. In the embodiments of this application, the target model/rule 501 may be the updated feature extraction network. It should be understood that the above process of training the feature extraction network may be a pre-training process; after the updated feature extraction network is obtained, it may be fine-tuned for a target task in combination with a downstream task dataset.
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all collected by the data collection device 560; they may also be received from other devices. It should also be noted that the training device 520 does not necessarily train the target model/rule 501 entirely based on the training samples maintained in the database 530; it may also obtain training samples from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of this application.
The target model/rule 501 obtained by training by the training device 520 can be applied in different systems or devices, for example, in the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device or an in-vehicle terminal, or it may be a server, a cloud, or the like.
Specifically, the training device 520 may transmit the target model/rule 501 to the execution device 510.
In FIG. 5, the execution device 510 is configured with an input/output (input/output, I/O) interface 512 for data interaction with external devices. A user may input data (for example, the data to be processed in the embodiments of this application) to the I/O interface 512 through the client device 540.
The preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data received by the I/O interface 512. It should be understood that the preprocessing module 513 and the preprocessing module 514 may be absent, or there may be only one preprocessing module. When the preprocessing modules 513 and 514 are absent, the computing module 511 may be used directly to process the input data.
When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs computation or other related processing, the execution device 510 may call data, code, and the like in the data storage system 550 for the corresponding processing, and may also store the data, instructions, and the like obtained from the corresponding processing in the data storage system 550.
Finally, the I/O interface 512 presents the processing result (for example, the processing result in the embodiments of this application) to the client device 540, thereby providing it to the user.
In the case shown in FIG. 5, the user may manually give the input data, and this "manually giving the input data" may be operated through an interface provided by the I/O interface 512. In another case, the client device 540 may automatically send input data to the I/O interface 512; if requiring the client device 540 to automatically send input data needs the user's authorization, the user may set the corresponding permission in the client device 540. The user may view the result output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, or another specific manner. The client device 540 may also serve as a data collection end, collecting the input data fed into the I/O interface 512 and the output result of the I/O interface 512 as shown in the figure as new sample data, and storing them in the database 530. Of course, the collection may also be done without the client device 540; instead, the I/O interface 512 directly stores the input data fed into the I/O interface 512 and the output result of the I/O interface 512 as shown in the figure into the database 530 as new sample data.
It is worth noting that FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 5 the data storage system 550 is external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed inside the execution device 510. It should be understood that the above execution device 510 may be deployed in the client device 540.
From the inference side of the model:
In the embodiments of this application, the computing module 511 of the above execution device 510 may obtain the code stored in the data storage system 550 to implement the model feed-forward process in the embodiments of this application.

In the embodiments of this application, the computing module 511 of the execution device 510 may include hardware circuits (such as an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general-purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, a microcontroller, or the like), or a combination of these hardware circuits. For example, the computing module 511 may be a hardware system with an instruction-execution function, such as a CPU or DSP; or a hardware system without an instruction-execution function, such as an ASIC or FPGA; or a combination of a hardware system without an instruction-execution function and a hardware system with an instruction-execution function.

Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction-execution function, the data processing method provided in the embodiments of this application may be software code stored in memory, and the computing module 511 of the execution device 510 may obtain the software code from the memory and execute the obtained software code to implement the model feed-forward process in the embodiments of this application.

It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system without an instruction-execution function and a hardware system with an instruction-execution function; some steps of the data processing method provided in the embodiments of this application may also be implemented by the hardware system without an instruction-execution function in the computing module 511 of the execution device 510, which is not limited here.
From the training side of the model:
In the embodiments of this application, the above training device 520 may obtain the code stored in a memory (not shown in FIG. 5; it may be integrated in the training device 520 or deployed separately from the training device 520) to implement the model training method in the embodiments of this application.

In the embodiments of this application, the training device 520 may include hardware circuits (such as an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general-purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, a microcontroller, or the like), or a combination of these hardware circuits. For example, the training device 520 may be a hardware system with an instruction-execution function, such as a CPU or DSP; or a hardware system without an instruction-execution function, such as an ASIC or FPGA; or a combination of a hardware system without an instruction-execution function and a hardware system with an instruction-execution function.

Specifically, the training device 520 may be a hardware system with an instruction-execution function, the model training method provided in the embodiments of this application may be software code stored in memory, and the training device 520 may obtain the software code from the memory and execute the obtained software code to implement the model training method provided in the embodiments of this application.

It should be understood that the training device 520 may be a combination of a hardware system without an instruction-execution function and a hardware system with an instruction-execution function; some steps of the model training method provided in the embodiments of this application may also be implemented by the hardware system in the training device 520 without an instruction-execution function, which is not limited here.
Referring to FIG. 6, FIG. 6 is a schematic diagram of an embodiment of a model training method provided by an embodiment of this application. The model training method provided by the embodiments of this application can be applied to a training device, and the training device may be a terminal device such as a mobile phone, a tablet, a notebook computer, or a smart wearable device; the training device may also be a device with data processing capabilities, such as a server or a chip. As shown in FIG. 6, a model training method provided by an embodiment of this application may include:
601. Acquire a sample image.
In autonomous driving scenarios, street-view images often contain multiple instances (also called objects) such as pedestrians and vehicles, and thus exhibit strong global inconsistency; they do not satisfy the single-instance assumption (image blocks from the same image describe the same semantic information, so their features should be as close as possible, while image blocks from different images describe different semantic information, so their features should be as different as possible). Consequently, existing methods are often limited to datasets that satisfy the single-instance assumption, and it is difficult for them to make full use of the more realistic multi-instance autonomous driving datasets.
In the embodiments of this application, the sample image may be a multi-instance image in the above autonomous driving scenario (for example, a street-view image).
In one possible implementation, the sample image may include multiple objects, and the objects include at least one of a person, a vehicle, a traffic sign, a lane line, and a plant. Exemplarily, the objects may include, but are not limited to, at least one of the following examples: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motorcycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign).
602. Sample the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks on the sample image.
After the sample image is acquired, the sample image may be sampled to obtain multiple local images (for example, the first image block and the second image block). The first image block and the second image block may be small images obtained by sampling the sample image ("small" here means that, relative to the size of the sample image, the image areas of the first image block and the second image block are smaller).

The above sampling can be understood as cropping. In one sampling pass, a rectangular region may be randomly determined on the sample image and the image inside this rectangular region is taken as the first image block; in another sampling pass, a rectangular region may be randomly determined on the sample image and the image inside this rectangular region is taken as the second image block.

The above sampling may be random sampling; so-called random sampling can be understood as meaning that the sampling position and/or sampling size are determined randomly.

The first image block and the second image block may be rectangular images.
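A minimal sketch of the random sampling in step 602 (the helper name `sample_patch`, the minimum patch size, and the use of NumPy are our illustrative assumptions):

```python
import numpy as np

def sample_patch(image, rng, min_size=64):
    """Randomly pick a rectangular region of the sample image; return it with its box."""
    img_h, img_w = image.shape[:2]
    h = rng.integers(min_size, img_h + 1)   # random size
    w = rng.integers(min_size, img_w + 1)
    y = rng.integers(0, img_h - h + 1)      # random position
    x = rng.integers(0, img_w - w + 1)
    return image[y:y + h, x:x + w], (x, y, x + w, y + h)

rng = np.random.default_rng(0)
image = np.zeros((1280, 1920, 3), dtype=np.uint8)   # placeholder sample image
patch1, box1 = sample_patch(image, rng)             # first image block
patch2, box2 = sample_patch(image, rng)             # second image block
```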
603. Based on the intersection-over-union between the regions where the first image block and the second image block are located on the sample image being greater than a threshold, perform feature extraction on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map.
A common region (overlapping region) may exist between the region where the first image block is located on the sample image and the region where the second image block is located on the sample image. The ratio of the area of this common region to the total area occupied by the first image block and the second image block on the sample image is the intersection-over-union (intersection over union, IoU) between the regions where the first image block and the second image block are located on the sample image (see, for example, FIG. 7).
In one possible implementation, the threshold may be a value greater than or equal to 0.4.
The root cause of global inconsistency in multi-instance scenarios is that randomly cropped image blocks may represent completely different scenes and semantic information. However, if the two image blocks can be kept sufficiently "close" to each other, then, by image continuity within a local region, the two image blocks can be considered to describe similar semantic information, thereby restoring the basic assumption of contrastive learning. In the embodiments of this application, the intersection-over-union between two image blocks can be used as the distance measure, where the intersection-over-union refers to the ratio of the area of the overlapping region of the two image blocks to their total covered area (see the left of FIG. 7). It is required that subsequent feature extraction and similarity computation are carried out only when the intersection-over-union of two randomly generated image blocks is greater than a given threshold (see the right of FIG. 7).

It should be understood that the IoU threshold obviously cannot be set too low, but at the same time higher is not always better: we do not want the two image blocks to be completely unrelated, but neither do we want them to be identical. The choice of the IoU threshold therefore actually controls the balance between data noise and data complexity in multi-instance self-supervised learning.
Exemplarily, referring to FIG. 8, FIG. 8 shows an example of a sample image. Referring to FIG. 9, FIG. 9 is an illustration of a first image block and a second image block whose IoU is 0.3. Referring to FIG. 10, FIG. 10 is an illustration of a first image block and a second image block whose IoU is 0.5. Referring to FIG. 11, FIG. 11 is an illustration of a first image block and a second image block whose IoU is 0.7.
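The IoU test of step 603 can be sketched as follows, with boxes given as (x1, y1, x2, y2) tuples like those returned by the `sample_patch` sketch above:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes on the sample image."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))   # width of the common region
    ih = max(0, min(ay2, by2) - max(ay1, by1))   # height of the common region
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Re-sample until the two image blocks are close enough, e.g. with threshold 0.4:
# while iou(box1, box2) <= 0.4:
#     patch1, box1 = sample_patch(image, rng)
#     patch2, box2 = sample_patch(image, rng)
```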
In the embodiments of this application, based on the intersection-over-union between the regions where the first image block and the second image block are located on the sample image being greater than the threshold, feature extraction may be performed on the first image block and the second image block respectively through the feature extraction network to obtain the first feature map and the second feature map.
The feature extraction network in the embodiments of this application is introduced next:
In one possible implementation, the feature extraction network may be a backbone network. The backbone network is used to receive input images (for example, the first image block and the second image block in the embodiments of this application) and perform convolution processing on the input images to generate multiple feature maps.

It should be noted that "performing convolution processing on the input image" here should not be understood as performing only convolution processing on the input image; in some implementations, convolution processing as well as other processing may be performed on the input image.

It should also be noted that "performing convolution processing on the image to generate multiple feature maps" should not be understood as meaning that each feature map is obtained by convolving the image directly. Rather, viewed as a whole, the image is the source of the multiple feature maps. In one implementation, the image may be convolved to obtain one feature map, the generated feature map may then be convolved to obtain another feature map, and so on, so that multiple feature maps are obtained.

In other words, a series of convolution operations may be performed on the input image; specifically, each convolution operation may be performed on the feature map obtained by the previous convolution operation, thereby producing one feature map, and multiple feature maps can be obtained in this way.

It should be noted that the multiple feature maps may be feature maps with multi-scale resolutions, that is, the multiple feature maps do not all have the same resolution. In one optional implementation, the multiple feature maps may form a feature pyramid.
Referring to FIG. 12, FIG. 12 is a schematic structural diagram of a feature extraction network provided by an embodiment of this application. As shown in FIG. 12, the backbone network is used to receive an input image, perform convolution processing on it, and output feature maps with different resolutions corresponding to the image (feature map C1, feature map C2, feature map C3, feature map C4); in other words, it outputs feature maps of different sizes corresponding to the image. The backbone network completes the extraction of basic features and provides the corresponding features for subsequent detection.
Specifically, the backbone network may perform a series of convolution operations on the input image to obtain feature maps (feature maps) at different scales (with different resolutions). These feature maps will provide the basic features for subsequent detection modules. The backbone network may take many forms, such as the visual geometry group (visual geometry group, VGG), a residual neural network (residual neural network, resnet), or the core structure of GoogLeNet (Inception-net).

The backbone network can perform convolution processing on the input image to generate several convolutional feature maps of different scales. Each feature map is an H*W*C matrix, where H is the height of the feature map, W is the width of the feature map, and C is the number of channels of the feature map.
The backbone can adopt a variety of existing convolutional network frameworks, such as VGG16, Resnet50 and Inception-Net. The following uses Resnet18 as the backbone as an example for description.
Assume that the resolution of the input image is H*W*3 (height H, width W, and 3 channels, namely the three RGB channels). The input image may undergo a convolution operation through a convolutional layer Res18-Conv1 of Resnet18 to generate feature map (Featuremap) C1. This feature map is downsampled twice relative to the input image and the number of channels is expanded to 64, so the resolution of C1 is H/4*W/4*64. C1 may undergo a convolution operation through Res18-Conv2 of Resnet18 to obtain Featuremap C2, whose resolution is the same as C1's. C2 then undergoes a convolution operation through Res18-Conv3 to generate Featuremap C3; this feature map is further downsampled relative to C2 and the number of channels is doubled, giving a resolution of H/8*W/8*128. Finally, C3 undergoes a convolution operation through Res18-Conv4 to generate Featuremap C4, with a resolution of H/16*W/16*256.
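The resolutions above can be reproduced with a rough PyTorch sketch; the layers below merely stand in for Res18-Conv1..Conv4 to show how the strides produce the C1–C4 scales, and are not the exact Resnet18 blocks:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 1280, 1920)                   # H*W*3 input image

conv1 = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3),    # stride-2 convolution
                      nn.MaxPool2d(3, 2, 1))        # plus stride-2 pooling: two downsamplings
conv2 = nn.Conv2d(64, 64, 3, 1, 1)                  # keeps C1's resolution
conv3 = nn.Conv2d(64, 128, 3, 2, 1)                 # halves resolution, doubles channels
conv4 = nn.Conv2d(128, 256, 3, 2, 1)

c1 = conv1(x)   # (1,  64, H/4,  W/4)
c2 = conv2(c1)  # (1,  64, H/4,  W/4)
c3 = conv3(c2)  # (1, 128, H/8,  W/8)
c4 = conv4(c3)  # (1, 256, H/16, W/16)
```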
It should be noted that the backbone network in the embodiments of this application may also be referred to by other equivalent names such as trunk network, which is not limited here.

It should be noted that the backbone network shown in FIG. 12 is only one implementation and does not constitute a limitation on this application.
In the embodiments of this application, feature extraction is performed on the first image block and the second image block respectively through the feature extraction network to obtain the first feature map and the second feature map. The first feature map may be the multiple feature maps (feature pyramid) obtained by processing the first image block through the feature extraction network, or may be a single feature map obtained after a certain convolution operation. Similarly, the second feature map may be the multiple feature maps (feature pyramid) obtained by processing the second image block through the feature extraction network, or may be a single feature map obtained after a certain convolution operation.

The first feature map and the second feature map have the same size.
604. Determine a loss according to the difference between the first feature map and the second feature map, and update the feature extraction network based on the loss to obtain an updated feature extraction network.
In the embodiments of this application, after feature extraction is performed on the first image block and the second image block respectively through the feature extraction network (based on the intersection-over-union between the regions where the first image block and the second image block are located on the sample image being greater than the threshold) to obtain the first feature map and the second feature map, a loss may be determined according to the difference between the first feature map and the second feature map, and the feature extraction network may be updated based on the loss to obtain the updated feature extraction network.
In one possible implementation, the first feature map and the second feature map may first be aligned to obtain an aligned first feature map and an aligned second feature map;
In one implementation, the sample image may include a target region, where the target region is the overlapping region where the first image block and the second image block are located on the sample image. According to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map may be determined; the first sub-feature map is upsampled to obtain the aligned first feature map, and the second sub-feature map is upsampled to obtain the aligned second feature map, where the aligned first feature map and the aligned second feature map have the same size.
In order to distinguish multi-instance features, the global pooling layer after the backbone network needs to be discarded to preserve the two-dimensional structure and position information of the feature maps, but this also brings an additional feature misalignment problem: the two feature maps no longer have a one-to-one correspondence at the same relative positions. The embodiments of this application provide two different ways to perform feature alignment. Region-of-interest (region of interest) alignment takes the overlapping part of the two image blocks as the region of interest of each image block; for example, RoI Align may be used to extract only the features of the overlapping part for subsequent computation. This approach is intuitive but does not make full use of the information in the non-overlapping parts (see, for example, FIG. 13).
In another implementation, the first feature map and the second feature map have the same size, the first feature map includes M first feature points, and the second feature map includes M second feature points. The M first feature points correspond to M first pixels in the sample image, the M second feature points correspond to M second pixels in the sample image, and the M first pixels are in one-to-one correspondence with the M second pixels. A third feature map may be obtained according to the M first pixels and the M second pixels; the third feature map has the same size as the second feature map and includes M third feature points, each third feature point being obtained based on the pixel-position difference between a first pixel and a second pixel that correspond to each other. The third feature map is fused with the first feature map to obtain the aligned first feature map, and the second feature map is used as the aligned second feature map. Optionally, the third feature map and the first feature map may be concatenated along the depth dimension.
In displacement alignment, for each pair of pixels located at the same relative position, their coordinate displacement in the original image is computed and concatenated with the feature map (spliced along the depth dimension), and is provided to the prediction network as additional side information for implicit feature alignment to help subsequent feature prediction (see, for example, FIG. 14). This makes full use of the feature information of the non-overlapping regions and at the same time makes the subsequent similarity measurement more flexible.
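A minimal sketch of the displacement-alignment idea (the concrete tensor layout and the helper name `displacement_map` are our assumptions): for each feature-map location, the displacement, in original-image pixels, between the two crops is computed and concatenated with the features along the channel (depth) dimension.

```python
import torch

def displacement_map(box_a, box_b, feat_h, feat_w):
    """Per-location (dx, dy) offsets, in original-image pixels, between two crops.

    box_* = (x1, y1, x2, y2) of each image block on the sample image; both
    feature maps share the same spatial size feat_h x feat_w.
    """
    ys, xs = torch.meshgrid(torch.arange(feat_h), torch.arange(feat_w), indexing="ij")

    def to_image_coords(box):
        x1, y1, x2, y2 = box
        px = x1 + (xs.float() + 0.5) / feat_w * (x2 - x1)   # map grid cell to image x
        py = y1 + (ys.float() + 0.5) / feat_h * (y2 - y1)   # map grid cell to image y
        return px, py

    ax, ay = to_image_coords(box_a)
    bx, by = to_image_coords(box_b)
    return torch.stack([bx - ax, by - ay], dim=0)           # (2, feat_h, feat_w)

feat = torch.randn(1, 256, 7, 7)                            # first feature map
disp = displacement_map((0, 0, 224, 224), (96, 64, 320, 288), 7, 7)
aligned = torch.cat([feat, disp.unsqueeze(0)], dim=1)       # depth concat: (1, 258, 7, 7)
```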
After the aligned first feature map and the aligned second feature map are obtained, the loss may be determined according to the difference between the aligned first feature map and the aligned second feature map.
How to obtain the difference between the aligned first feature map and the aligned second feature map is described next:
In one possible implementation, the M first feature points of the first feature map may be processed through a target prediction network to obtain a predicted value for each first feature point, where the target prediction network may include a convolution operation (for example, a 1*1 convolution kernel).
The above step of processing the M first feature points of the first feature map through the target prediction network may be called the online branch. The online branch predicts the clustering-center result of the target branch through the prediction network. Optionally, the online branch may additionally deploy a self-attention module over the spatial dimensions, fully taking contextual information into account to make more accurate predictions. Specifically, denoting the online-branch input for the image as R, the final prediction result Q of the online branch is given by an expression in which H and W are the height and width of R and R′, q_θ(·) is the prediction network, and sim(·,·) is defined as:

sim(R_{i,j}, R_{i',j'}) = (max(cos(R_{i,j}, R_{i',j'}), 0))^2;
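The sim(·,·) above is a squared, zero-clipped cosine similarity between two feature vectors; a direct sketch:

```python
import torch
import torch.nn.functional as F

def sim(r_a, r_b):
    """sim(R_ij, R_i'j') = (max(cos(R_ij, R_i'j'), 0))^2 for feature vectors along the last dim."""
    cos = F.cosine_similarity(r_a, r_b, dim=-1)
    return cos.clamp(min=0.0) ** 2
```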
In one possible implementation, the M second feature points of the second feature map may be clustered based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster category it belongs to.
The step of clustering the M second feature points of the second feature map based on the target clustering algorithm may be called the target branch. The target clustering algorithm may be, but is not limited to, K-means (k-means), mean-shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection (graph community detection), Gaussian mixture model k-means (gaussian mixture model kmeans, GMM k-means), or another clustering method.
In multi-instance scenarios, clustering has a hierarchical structure, comprising, from top to bottom, category clustering, instance clustering, and pixel clustering. Meanwhile, the definition of a cluster is a relative concept: the same pixel may belong to different clusters in image blocks with different contexts (see the two image blocks shown at the lower right of FIG. 15, where pixel P belongs to the "person" category cluster in the left image block but to the "man" category cluster in the right image block). Based on this observation, the features of different instances can be distinguished through intra-image clustering, making the features of pixels in the same cluster as close as possible and the features of pixels in different clusters as different as possible, which promotes consistency between the cluster-analysis results of the two image blocks. A specific approach may be as shown in FIG. 16; optionally, K-means clustering may be used on the target branch to obtain the cluster-center (mean, smoothed) feature of each point.
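A sketch of the target-branch clustering under these assumptions, using scikit-learn's KMeans as a stand-in clustering algorithm (the helper name and the number of clusters k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_center_features(feature_map, k=8):
    """Replace each feature point's value with the center of the cluster it belongs to.

    feature_map: (H, W, C) second feature map from the target branch.
    """
    h, w, c = feature_map.shape
    points = feature_map.reshape(-1, c)             # M = H*W feature points
    km = KMeans(n_clusters=k, n_init=10).fit(points)
    centers = km.cluster_centers_[km.labels_]       # per-point cluster-center value
    return centers.reshape(h, w, c)

targets = cluster_center_features(np.random.randn(7, 7, 128))
```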
By using intra-image clustering, the embodiments of this application gain two benefits: on the one hand, the network has the ability to distinguish the features of different instances; on the other hand, by considering the overall information of the two-dimensional feature map, the regression targets provided by the target branch are more robust, while the online branch introduces a global view by deploying a self-attention module so as to obtain more accurate predictions.
After the predicted value of each first feature point and the updated feature value of each second feature point are obtained, the loss may be determined according to the difference between the predicted value of each first feature point and the updated feature value of each second feature point; here the loss may represent the difference (or, described another way, the similarity) between the predicted value of each first feature point and the updated feature value of each second feature point. The two branches make maximum use of the overall information of the two-dimensional feature maps in different ways, so that the similarity between the two two-dimensional feature maps can be better measured.
In one implementation, the similarity may be defined over all spatial positions, where the inputs of the online branch and the target branch are R and R′ respectively, Q_{i,j} is the output of the online branch, and Kmeans(R′_{i,j}) is the output of the target branch.
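The exact similarity expression is not reproduced above; as one plausible reading under our assumptions, the 2D similarity averages a per-location similarity between the online prediction Q and the target-branch cluster centers, with the target branch treated as a constant (gradient truncation):

```python
import torch
import torch.nn.functional as F

def similarity_2d(q, targets):
    """Assumed 2D similarity: mean cosine similarity between Q_ij and Kmeans(R'_ij).

    q, targets: (H, W, C) online-branch predictions and target-branch cluster
    centers; detach() realizes the gradient truncation on the target branch.
    """
    return F.cosine_similarity(q, targets.detach(), dim=-1).mean()
```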
After the loss is obtained, the online branch (for example, including the target prediction network and the feature extraction network) may be updated according to the loss. For example, gradient descent may be used to update the online-branch parameters, while the target-branch parameters are updated as a moving average of the online-branch parameters; the above training process is iterated until the network converges.
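The moving-average update of the target branch can be sketched as follows (the momentum value is illustrative):

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, momentum=0.99):
    """Target parameters track a moving average of the online-branch parameters."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(momentum).add_(o, alpha=1.0 - momentum)
```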
Referring to FIG. 17 and FIG. 18, FIG. 17 and FIG. 18 show the data augmentation, feature extraction and network training flow of a multi-instance self-supervised learning framework (MultiSiam) provided by an embodiment of this application:
For each input image, two image blocks are obtained by random cropping, and whether their intersection-over-union is greater than a given threshold is checked; if not, cropping is repeated until the IoU threshold is exceeded, and if so, scale adjustment and color/texture augmentation may follow. The backbone network is used to extract the two-dimensional feature maps of the two image blocks separately; a projection network is used to reduce the dimensionality of the two feature maps so as to lower the subsequent computation cost. Through the feature alignment module, region-of-interest alignment or displacement alignment is used to restore the one-to-one correspondence at the same relative positions of the two-dimensional feature maps; note that the feature alignment module does not change the spatial scale of the input feature maps. The K-means clustering algorithm is run on the target-branch (left) feature map to obtain the clustering result of each pixel and the corresponding cluster-center features; meanwhile, to prevent network degradation, a gradient truncation operation is added at the end of the target branch. The online branch (right) predicts the clustering result of the target branch through the self-attention module and the prediction network; the two-dimensional clustering similarity is computed according to the definition and combined with the one-dimensional feature similarity by a weighted sum (for example, both weights may default to 0.5) to obtain the final feature-similarity measure. Gradient descent is used to update the online-branch parameters, while the target-branch parameters are updated as a moving average of the online-branch parameters; this is iterated until the network converges.
It should be understood that the above process of training the feature extraction network may be a pre-training process; after the updated feature extraction network is obtained, the updated feature extraction network may be fine-tuned for a target task in combination with a downstream task dataset.
For example, a downstream task network may be connected to the above updated feature extraction network, and the updated feature extraction network together with the downstream task network may be fine-tuned based on the downstream task dataset to obtain a target network; the target network may be deployed on the execution device to perform the feed-forward process.
Exemplarily, taking object detection as the downstream task, the downstream task network may include one or more heads, as shown in FIG. 19. The one or more heads are used to detect the task objects of a task according to the feature maps output by the feature extraction network, and to output the 2D boxes of the regions where the task objects are located together with the confidence corresponding to each 2D box. Optionally, multiple heads may be arranged in parallel, with each head completing the detection of different task objects. Here, a task object is an object that needs to be detected in the task; the higher the confidence, the greater the probability that the object corresponding to the task exists in the 2D box corresponding to that confidence.
In the embodiments of this application, different heads can complete different 2D detection tasks. For example, one of the multiple heads can complete vehicle detection and output the 2D boxes and confidences of Car/Truck/Bus; head1 among the multiple heads can complete person detection and output the 2D boxes and confidences of Pedestrian/Cyclist/Tricycle; another head among the multiple heads can complete traffic-light detection and output the 2D boxes and confidences of TrafficLight_Red/TrafficLight_Green/TrafficLight_Yellow/TrafficLight_Black.
In the embodiments of this application, the downstream task network may include multiple serial heads, where a serial head is connected to a parallel head. It should be emphasized here that serial heads are in fact not required: for scenarios that only require detecting 2D boxes, no serial head needs to be included.
The serial head may be used to: using the 2D box of the task object of its task provided by the parallel head to which it is connected, extract the features of the region where the 2D box is located on one or more feature maps of the feature extraction network, and predict the 3D information, Mask information, or Keypoint information of the task object of its task according to the features of the region where the 2D box is located. A serial head is optionally connected in series after a parallel head; on the basis of the detected 2D box of the task, it completes 3D/Mask/Keypoint detection of the object inside the 2D box. For example, serial 3D_head0 estimates the orientation, centroid, and length/width/height of a vehicle, thereby outputting the vehicle's 3D box; serial Mask_head0 predicts the fine mask of the vehicle, thereby segmenting the vehicle out; serial Keypoint_head0 estimates the key points of the vehicle. Serial heads are not required: some tasks do not need 3D/Mask/Keypoint detection, so no serial head needs to be connected; for example, traffic-light detection only needs 2D boxes, so no serial head is required. In addition, some tasks may connect one or more serial heads according to their specific needs; for example, parking-slot (Parkingslot) detection requires the key points of the parking space in addition to the 2D box, so in this task only one serial Keypoint_head needs to be connected, and no 3D or Mask head is needed.
In the embodiments of this application, a header is connected to the feature extraction network. According to the feature maps provided by the feature extraction network, the header can complete 2D box detection for a task and output the 2D boxes of the objects of this task together with the corresponding confidences, and so on. A structural illustration of a header is described next; referring to FIG. 20, FIG. 20 is a schematic diagram of a header. As shown in FIG. 20, the head includes three modules: a region proposal network (Region Proposal Network, RPN), ROI-ALIGN, and RCNN.
The RPN module may be used to predict the regions where the task objects are located on one or more feature maps provided by the feature extraction network, and to output candidate 2D boxes matching those regions. This can also be understood as follows: the RPN predicts, on one or more feature maps output by the feature extraction network, the regions where objects of the task may exist, and gives the boxes of these regions, which are called candidate regions (Proposal). For example, when a head is responsible for detecting vehicles, its RPN layer predicts candidate boxes that may contain a vehicle; when a head is responsible for detecting persons, its RPN layer predicts candidate boxes that may contain a person. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain the objects of the task, and on the other hand these boxes are not tight.
The 2D candidate-region prediction flow may be implemented by the RPN module of the head, which, according to the feature maps provided by the feature extraction network, predicts the regions where the task objects may exist and gives the candidate boxes (also called candidate regions, Proposal) of these regions. In this embodiment, if the head is responsible for detecting vehicles, its RPN layer predicts candidate boxes that may contain a vehicle.
The RPN layer may generate a feature map RPN Hidden on a feature map provided by the feature extraction network through, for example, a 3*3 convolution. The RPN layer of the head then predicts Proposals from RPN Hidden. Specifically, the RPN layer of the head predicts the coordinates and confidence of the Proposal at each position of RPN Hidden through a 1*1 convolution. The higher this confidence, the greater the probability that this Proposal contains an object of the task; for example, the larger the score of a Proposal in the head, the greater the probability that a vehicle exists there. The Proposals predicted by each RPN layer need to pass through a Proposal merging module, which removes redundant Proposals according to the degree of overlap between them (this process may adopt, but is not limited to, the NMS algorithm) and selects, from the remaining K Proposals, the N (N<K) Proposals with the largest scores as candidate regions where objects may exist. These Proposals are inaccurate: on the one hand they do not necessarily contain the objects of the task, and on the other hand the boxes are not tight. Therefore, the RPN module is only a coarse detection process and requires the subsequent RCNN module for refinement. When the RPN module regresses the coordinates of a Proposal, it does not directly regress the absolute values of the coordinates but regresses coordinates relative to an Anchor; the better these Anchors match actual objects, the greater the probability that the RPN can detect the objects.
The ROI-ALIGN module is used to extract, from a feature map provided by the feature extraction network, the features of the region where each candidate 2D box is located according to the regions predicted by the RPN module. That is, based on the Proposals provided by the RPN module, the ROI-ALIGN module extracts, on a certain feature map, the features of the region where each Proposal is located, and resizes them to a fixed size to obtain the features of each Proposal. It can be understood that the ROI-ALIGN module may use, but is not limited to, feature extraction methods such as ROI-POOLING (region-of-interest pooling), ROI-ALIGN (region-of-interest extraction), PS-ROIPOOLING (position-sensitive region-of-interest pooling), or PS-ROIALIGN (position-sensitive region-of-interest extraction).
The RCNN module is used to: perform convolution processing, through a neural network, on the features of the region where a candidate 2D box is located, so as to obtain the confidence that the candidate 2D box belongs to each object category; adjust, through a neural network, the coordinates of the candidate 2D box, so that the adjusted 2D candidate box matches the shape of the actual object better than the candidate 2D box; and select adjusted 2D candidate boxes whose confidence is greater than a preset threshold as the 2D boxes of the region. That is, the RCNN module mainly refines the features of each Proposal extracted by the ROI-ALIGN module to obtain the confidence of each Proposal belonging to each category (for example, for the vehicle task, four scores Background/Car/Truck/Bus are given), and at the same time adjusts the coordinates of the Proposal's 2D box to output a tighter 2D box. These 2D boxes are merged through non-maximum suppression (non-maximum suppression, NMS) and output as the final 2D boxes.
2D candidate-region fine classification is mainly implemented by the RCNN module of the head in FIG. 20. Based on the features of each Proposal extracted by the ROI-ALIGN module, it further regresses tighter 2D box coordinates and at the same time classifies the Proposal, outputting the confidence that it belongs to each category. RCNN can be realized in many forms. The feature size output by the ROI-ALIGN module may be N*14*14*256 (Feature of proposals). In the RCNN module, it is first processed by convolution module 4 of Resnet18 (Res18-Conv5), and the output feature size is N*7*7*512; it is then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features within each channel of the input features to obtain N*512 features, where each 1*512-dimensional feature vector represents the features of one Proposal. Next, two fully connected layers (FC) respectively regress the precise coordinates of the box (outputting an N*4 vector, where the 4 values respectively represent the x/y coordinates of the box center and the width and height of the box) and the confidence of the box category (in head0, the scores that the box is Background/Car/Truck/Bus need to be given). Finally, through a box merging operation, the several boxes with the largest scores are selected, and duplicate boxes are removed through the NMS operation to obtain tight box outputs.
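The NMS merging mentioned above keeps the highest-scoring box and drops boxes that overlap it too much; a plain-Python sketch using the `iou(·)` helper sketched earlier (the threshold value is illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2), scores in [0, 1]."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]  # drop duplicates
    return keep                             # indices of the final 2D box output
```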
In some practical application scenarios, the perception network may further include other heads, which can further perform 3D/Mask/Keypoint detection on the basis of the detected 2D boxes. Exemplarily, taking 3D as an example, according to the accurate 2D boxes provided by the head, the ROI-ALIGN module extracts, on the feature map output by the feature extraction network, the features of the region where each 2D box is located. Assuming the number of 2D boxes is M, the feature size output by the ROI-ALIGN module is M*14*14*256; it is first processed by Res18-Conv5 of Resnet18, and the output feature size is M*7*7*512; it is then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel in the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box. Next, three fully connected layers (FC) respectively regress the orientation angle (orientation, an M*1 vector), the centroid coordinates (centroid, an M*2 vector, the 2 values representing the x/y coordinates of the centroid), and the length/width/height (dimension) of the object in the box.
It should be noted that the headers shown in FIG. 19 and FIG. 20 are only one implementation and do not constitute a limitation on this application.
Taking two large-scale multi-instance autonomous driving datasets as the upstream pre-training datasets as an example: for the experimental data, upstream self-supervised pre-training may use multi-instance autonomous driving datasets including the public Waymo dataset and the SODA-5M dataset. The Waymo open dataset contains nearly 790,000 unlabeled images with sizes ranging from (1920, 968) to (1920, 1280) and is currently the largest open-sourced autonomous driving dataset; the SODA-5M dataset contains 5 million high-quality autonomous driving images, all of size (1920, 1280). For downstream transfer, two widely used autonomous driving semantic segmentation benchmarks are considered, Cityscapes and BDD100K. Cityscapes provides annotations for 8 semantic categories and contains 2,975 training images and 500 validation images collected from 27 cities; BDD100K provides annotations for 9 semantic categories and contains 70,000 training images and 10,000 validation images collected from 4 cities.
Since the upstream datasets lack manual annotations, algorithm performance cannot be evaluated on them directly. Instead, the self-supervised pre-trained model is used to initialize the backbone network of a downstream task, which is then fine-tuned; the mean intersection-over-union (mIoU) of the predictions on the downstream validation set serves as the metric for judging the quality of the pre-trained model, where a higher mIoU is better. Specifically, self-supervised pre-training is first performed on the upstream autonomous driving dataset according to the proposed framework. After the network converges, upstream task-specific modules such as the projection network and the prediction network are discarded, and only the backbone parameters are retained and transferred to the downstream task. The network parameters are fine-tuned on the downstream training set, the fine-tuned model predicts on the validation set, and the mIoU of the final predictions is reported as the evaluation standard.
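For concreteness, a minimal sketch of the mIoU metric used here as the evaluation standard (the confusion-matrix formulation is standard; the handling of empty classes is an assumption):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape; returns mIoU in [0, 1]."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1                                   # rows: ground truth, cols: prediction
    inter = np.diag(cm).astype(np.float64)              # per-class intersection
    union = cm.sum(0) + cm.sum(1) - inter               # pred + gt - intersection
    return float(np.nanmean(inter / np.where(union == 0, np.nan, union)))
```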
ResNet-50 is used as the default backbone network. To better accommodate large-batch training, the LARS optimizer and cosine learning rate decay are used to stabilize the training process. To compare pre-training results across datasets more fairly, the GPU time spent training on each dataset is kept the same; specifically, 325 epochs of pre-training on Waymo and 55 epochs on SODA-5M are matched against 200 epochs of ImageNet pre-training.
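The cosine learning rate decay mentioned here can be sketched as follows (the warmup length and base rate are illustrative assumptions; the LARS optimizer itself is a published algorithm and is not reproduced here):

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine decay from base_lr to 0, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```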
By pre-training on the public autonomous driving dataset Waymo, the method proposed in the embodiments of this application achieves significant performance improvements over the baseline model BYOL, 4.7% on Cityscapes and 3.9% on BDD100K, reaching the current state-of-the-art result for Waymo pre-training; this shows that the embodiments of this application can make better use of multi-instance datasets for feature learning. However, Waymo has fewer images than ImageNet (0.79M vs. 1.28M) and exhibits a strong foreground-background imbalance; since Waymo is inferior to ImageNet in both image quantity and quality, it is difficult for its pre-training results to surpass ImageNet pre-training. Therefore, the autonomous driving dataset SODA-5M is additionally used for pre-training, which successfully surpasses the transfer results of ImageNet pre-training and demonstrates the effectiveness of in-domain pre-training.
It is worth noting that, because of the single-instance assumption, collecting more images similar to the ImageNet dataset necessarily requires manual data screening and cleaning. Multi-instance scenes are free of the single-instance assumption, so ultra-large-scale pre-training datasets can be collected at very low cost; multi-instance self-supervised learning is therefore more practical for industrial-scale datasets.
Referring to Table 1, Table 1 shows the pre-training results of the embodiments of this application on multi-instance autonomous driving datasets. Compared with the baseline model, the embodiments of this application achieve a clear performance improvement; when an ultra-large-scale autonomous driving dataset is used for pre-training, the results successfully surpass ImageNet pre-training, demonstrating the effectiveness of in-domain pre-training.
Table 1
Illustratively, a visual analysis is performed on the finally learned features: the two-dimensional feature map from the last layer of the pre-trained backbone network is subjected to K-means clustering, with the clustering results shown in FIG. 21. Compared with the prior art, the embodiments of this application not only distinguish foreground from background well but also accurately separate the features of different instances; moreover, the smoother feature clustering results of the embodiments of this application indicate that translation invariance and feature robustness are enhanced in the learned feature representation.
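The K-means analysis of the last-layer feature map can be reproduced along the following lines (a sketch; the cluster count and the use of scikit-learn are assumptions made for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_feature_map(feat, k=8):
    """feat: C x H x W feature map from the last backbone layer (as a NumPy array).
    Returns an H x W map of cluster labels for visualization."""
    c, h, w = feat.shape
    pixels = feat.reshape(c, h * w).T                 # (H*W) x C feature vectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pixels)
    return labels.reshape(h, w)
```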
In addition, the solution provided by the embodiments of this application is deployed on single-instance data for self-supervised learning to verify the scalability of the framework. Except for the experimental data, all settings remain consistent with Technical Solution 1 described above. For the experimental data, the upstream dataset may be the ImageNet dataset, which satisfies the single-instance assumption, contains natural images of 1,000 categories, and has 1.28 million training images in total; no manual annotations are used here, and the image data are used directly for self-supervised representation learning. For downstream tasks, two general-purpose detection datasets, VOC and COCO, are additionally considered: the VOC dataset contains annotations for 20 semantic categories, with 10,000 training images and 4,900 test images; the COCO dataset contains annotations for 80 semantic categories, with about 118,000 training images and 5,000 validation images.
It has been verified that, although the embodiments of this application are designed for multi-instance scenes, they can still effectively use single-instance scene images to obtain high-quality feature representations and achieve the current best transfer results on multiple downstream tasks (see Table 2), which shows the generality and scalability of the framework of the embodiments of this application.
Table 2
An embodiment of the present application provides a model training method. The method includes: acquiring a sample image; sampling the sample image to obtain a first image block and a second image block, the first image block and the second image block being different image blocks of the sample image; based on the intersection-over-union (IoU) between the regions of the sample image in which the first image block and the second image block are located being greater than a threshold, performing feature extraction on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map; and determining a loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network. IoU is used to measure the relative distance between two image blocks: by requiring the IoU of the image blocks to be greater than a given threshold, semantic consistency within a local region is guaranteed, restoring the basic assumption of contrastive learning, and the image blocks are confined to one local region, so that global inconsistency is converted into local consistency; this in turn improves the data processing accuracy of the updated feature extraction network. Meanwhile, the choice of threshold effectively controls the trade-off between data noise and data complexity in different scenarios.
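A minimal sketch of one such training step is given below. The IoU gate on the two image blocks follows the method above, while the uniform crop sampler, the cosine feature loss, and the optimizer are stand-ins (assumptions) for the concrete choices described elsewhere in this application:

```python
import random
import torch.nn.functional as F

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area - inter)

def sample_block(w, h, size):
    x, y = random.randint(0, w - size), random.randint(0, h - size)
    return (x, y, x + size, y + size)

def training_step(image, backbone, optimizer, iou_thresh=0.4, crop=224):
    """image: C x H x W tensor. Resample the second block until the IoU gate
    passes, extract the two feature maps, and pull them together."""
    _, h, w = image.shape
    b1, b2 = sample_block(w, h, crop), sample_block(w, h, crop)
    while box_iou(b1, b2) <= iou_thresh:              # keep the blocks in one local region
        b2 = sample_block(w, h, crop)
    v1 = image[:, b1[1]:b1[3], b1[0]:b1[2]].unsqueeze(0)
    v2 = image[:, b2[1]:b2[3], b2[0]:b2[2]].unsqueeze(0)
    f1, f2 = backbone(v1), backbone(v2)               # first / second feature maps
    loss = (1 - F.cosine_similarity(f1.flatten(1), f2.flatten(1))).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```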
Referring to FIG. 22, FIG. 22 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. As shown in FIG. 22, the apparatus 2200 provided by this embodiment includes:
an acquisition module 2201, configured to acquire a sample image;
For a specific description of the acquisition module 2201, reference may be made to the description of step 601 in the foregoing embodiments; details are not repeated here.
a sampling module 2202, configured to sample the sample image to obtain a first image block and a second image block, the first image block and the second image block being different image blocks of the sample image;
For a specific description of the sampling module 2202, reference may be made to the description of step 602 in the foregoing embodiments; details are not repeated here.
a feature extraction module 2203, configured to, based on the intersection-over-union between the regions of the sample image in which the first image block and the second image block are located being greater than a threshold, perform feature extraction on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map;
For a specific description of the feature extraction module 2203, reference may be made to the description of step 603 in the foregoing embodiments; details are not repeated here.
a model update module 2204, configured to determine a loss according to the difference between the first feature map and the second feature map, and update the feature extraction network based on the loss to obtain an updated feature extraction network.
For a specific description of the model update module 2204, reference may be made to the description of step 604 in the foregoing embodiments; details are not repeated here.
In a possible implementation, the threshold is a value greater than or equal to 0.4.
In a possible implementation, the apparatus further includes:
an alignment module, configured to, before the loss is determined according to the difference between the first feature map and the second feature map, align the first feature map and the second feature map to obtain an aligned first feature map and an aligned second feature map;
the model update module being specifically configured to:
determine the loss according to the difference between the aligned first feature map and the aligned second feature map.
In a possible implementation, the sample image includes a target region, the target region being the overlapping region of the sample image in which both the first image block and the second image block are located, and the alignment module is specifically configured to:
determine, according to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map;
upsample the first sub-feature map to obtain the aligned first feature map; and
upsample the second sub-feature map to obtain the aligned second feature map, where the aligned first feature map and the aligned second feature map have the same size.
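One way to realize this overlap-based alignment, sketched under the assumption that feature coordinates scale linearly with image coordinates within each block:

```python
import torch.nn.functional as F

def align_by_overlap(feat, block_box, overlap_box, out_size=(7, 7)):
    """feat: 1 x C x Hf x Wf feature map of one image block; boxes are
    (x1, y1, x2, y2) in sample-image coordinates. Crops the sub-feature map
    covering the overlap (target) region and upsamples it to a common size."""
    bx1, by1, bx2, by2 = block_box
    ox1, oy1, ox2, oy2 = overlap_box
    _, _, hf, wf = feat.shape
    sx, sy = wf / (bx2 - bx1), hf / (by2 - by1)       # image-to-feature scale
    x1, x2 = int((ox1 - bx1) * sx), int((ox2 - bx1) * sx)
    y1, y2 = int((oy1 - by1) * sy), int((oy2 - by1) * sy)
    sub = feat[:, :, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    return F.interpolate(sub, size=out_size, mode='bilinear', align_corners=False)
```

Applying the same routine to both feature maps yields two aligned feature maps of identical size, as required above.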
In a possible implementation, the first feature map and the second feature map have the same size, the first feature map includes M first feature points, and the second feature map includes M second feature points; the M first feature points correspond to M first pixels in the sample image, the M second feature points correspond to M second pixels in the sample image, and the M first pixels are in one-to-one correspondence with the M second pixels; the alignment module is specifically configured to:
obtain a third feature map according to the M first pixels and the M second pixels, the third feature map having the same size as the second feature map and including M third feature points, each third feature point being obtained based on the pixel position difference between a first pixel and its corresponding second pixel; and
fuse the third feature map with the first feature map to obtain the aligned first feature map, the second feature map being used as the aligned second feature map.
In a possible implementation, the alignment module is specifically configured to:
concatenate the third feature map with the first feature map in the depth direction.
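A sketch of this fusion step, where the third feature map holds the per-point pixel position differences and is concatenated with the first feature map along the channel (depth) dimension; any normalization of the offsets is left out and would be an implementation choice:

```python
import torch

def fuse_with_offsets(feat1, pix1, pix2):
    """feat1: 1 x C x H x W first feature map. pix1, pix2: H x W x 2 tensors of
    the sample-image pixel coordinates corresponding to each feature point.
    Returns the fused (aligned) first feature map of shape 1 x (C + 2) x H x W."""
    offsets = (pix2 - pix1).permute(2, 0, 1).unsqueeze(0).float()  # 1 x 2 x H x W
    return torch.cat([feat1, offsets], dim=1)         # depth-direction concatenation
```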
In a possible implementation, the model update module is specifically configured to:
process the M first feature points of the first feature map through a target prediction network to obtain a predicted value for each first feature point;
cluster the M second feature points of the second feature map based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster to which it belongs; and
determine the loss according to the difference between the predicted value of each first feature point and the updated feature value of each second feature point.
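A sketch of this prediction-plus-clustering loss; the two-layer MLP predictor and the use of K-means stand in (as assumptions) for the unspecified target prediction network and target clustering algorithm, and the channel width of 512 is likewise assumed:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# A plausible target prediction network (assumed two-layer MLP over C = 512 channels):
predictor = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

def prediction_cluster_loss(feat1, feat2, predictor, k=8):
    """feat1, feat2: 1 x C x H x W aligned feature maps; each spatial location is
    one feature point. Second feature points are replaced by their cluster-center
    values; first feature points go through the predictor; the loss is their MSE."""
    p1 = feat1.flatten(2).squeeze(0).T                # M x C first feature points
    p2 = feat2.flatten(2).squeeze(0).T                # M x C second feature points
    pred = predictor(p1)                              # predicted value per first point
    km = KMeans(n_clusters=k, n_init=10).fit(p2.detach().cpu().numpy())
    centers = torch.as_tensor(km.cluster_centers_[km.labels_],
                              dtype=pred.dtype, device=pred.device)
    return ((pred - centers) ** 2).mean()
```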
In a possible implementation, the model update module is further configured to:
update the target prediction network according to the loss.
In a possible implementation, the sample image includes a plurality of objects, the objects including at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.
In a possible implementation, the apparatus further includes:
a data processing module, configured to acquire a target network and an image to be processed, the target network including the updated feature extraction network and a downstream task network; and
process the image to be processed through the target network to obtain a processing result.
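A sketch of composing the target network from the updated backbone and a downstream task network; the 1x1 segmentation head and the ResNet-50 output width of 2048 channels are assumptions for illustration:

```python
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Updated feature extraction network plus a downstream task network."""
    def __init__(self, backbone, num_classes=19):
        super().__init__()
        self.backbone = backbone                          # updated feature extraction network
        self.task_head = nn.Conv2d(2048, num_classes, 1)  # assumed 1x1 segmentation head

    def forward(self, image):
        return self.task_head(self.backbone(image))      # processing result
```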
The following describes an execution device provided by an embodiment of the present application. Referring to FIG. 23, FIG. 23 is a schematic structural diagram of the execution device provided by the embodiment of the present application. The execution device 2300 may specifically be embodied as a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, a monitoring data processing device, a server, or the like, which is not limited here. Specifically, the execution device 2300 includes a receiver 2301, a transmitter 2302, a processor 2303, and a memory 2304 (the number of processors 2303 in the execution device 2300 may be one or more; one processor is taken as an example in FIG. 23), where the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of the present application, the receiver 2301, the transmitter 2302, the processor 2303, and the memory 2304 may be connected by a bus or in another manner.
The memory 2304 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2303. A part of the memory 2304 may further include a non-volatile random access memory (NVRAM). The memory 2304 stores operation instructions executable by the processor, executable modules, or data structures, or subsets thereof, or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations.
The processor 2303 controls the operation of the execution device. In a specific application, the components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of description, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of the present application may be applied to the processor 2303 or implemented by the processor 2303. The processor 2303 may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing methods may be completed by integrated logic circuits of hardware in the processor 2303 or by instructions in the form of software. The processor 2303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2303 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2304; the processor 2303 reads the information in the memory 2304 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 2301 may be configured to receive input digital or character information and to generate signal inputs related to the settings and function control of the execution device. The transmitter 2302 may be configured to output digital or character information through a first interface; the transmitter 2302 may also be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 2302 may further include a display device such as a display screen.
In the embodiments of the present application, in one case, the processor 2303 is configured to run the model obtained through the model training method of FIG. 6 described above.
An embodiment of the present application further provides a training device. Referring to FIG. 24, FIG. 24 is a schematic structural diagram of the training device provided by the embodiment of the present application. The model training apparatus described in the embodiment corresponding to FIG. 22 may be deployed on the training device 2400 to implement the functions of that apparatus. Specifically, the training device 2400 is implemented by one or more servers and may differ considerably depending on configuration or performance; it may include one or more central processing units (CPUs) 2424 (for example, one or more processors), a memory 2432, and one or more storage media 2430 (for example, one or more mass storage devices) storing application programs 2442 or data 2444. The memory 2432 and the storage medium 2430 may be transient or persistent storage. The programs stored in the storage medium 2430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Furthermore, the central processing unit 2424 may be configured to communicate with the storage medium 2430 and execute, on the training device 2400, the series of instruction operations in the storage medium 2430.
The training device 2400 may further include one or more power supplies 2426, one or more wired or wireless network interfaces 2450, one or more input/output interfaces 2458, and/or one or more operating systems 2441, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In the embodiments of the present application, the central processing unit 2424 is configured to execute the model training method of FIG. 6 described above.
An embodiment of the present application further provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
An embodiment of the present application further provides a computer-readable storage medium storing a program for signal processing that, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, training device, or terminal device provided by the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that a chip in the execution device performs the data processing methods described in the foregoing embodiments, or so that a chip in the training device performs the data processing methods described in the foregoing embodiments. Optionally, the storage unit is an on-chip storage unit such as a register or a cache; the storage unit may also be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
Specifically, referring to FIG. 25, FIG. 25 is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip may be embodied as a neural-network processing unit (NPU) 2500. The NPU 2500 is mounted to a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the operation circuit 2503, which is controlled by the controller 2504 to extract matrix data from memory and perform multiplication operations.
In some implementations, the operation circuit 2503 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 2503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2503 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 2502 and caches it on each PE in the operation circuit. The operation circuit then fetches the data of matrix A from the input memory 2501, performs the matrix operation with matrix B, and stores the partial or final results of the resulting matrix in the accumulator 2508.
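This dataflow (weights cached on the PEs, activations streamed in, partial sums collected in the accumulator) can be mimicked in plain Python for intuition; this is a conceptual sketch only, not the NPU's actual microarchitecture:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """C = A @ B computed one K-tile at a time, accumulating partial results
    the way the accumulator collects them."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))                        # plays the role of the accumulator
    for k0 in range(0, k, tile):                # one cached tile of B per pass
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C
```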
The unified memory 2506 is used to store input data and output data. Weight data is transferred to the weight memory 2502 directly through the direct memory access controller (DMAC) 2505, and input data is likewise transferred to the unified memory 2506 through the DMAC.
The BIU, i.e., the bus interface unit 2510, is used for the interaction between the AXI bus on the one hand and the DMAC and the instruction fetch buffer (IFB) 2509 on the other.
The bus interface unit 2510 is used by the instruction fetch buffer 2509 to fetch instructions from external memory, and by the storage unit access controller 2505 to fetch the original data of the input matrix A or the weight matrix B from external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 2506, to transfer weight data to the weight memory 2502, or to transfer input data to the input memory 2501.
The vector calculation unit 2507 includes multiple operation processing units that, when needed, further process the output of the operation circuit, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, or magnitude comparison. It is mainly used for non-convolutional/fully-connected layer computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 2507 can store processed output vectors to the unified memory 2506. For example, the vector calculation unit 2507 may apply a linear or nonlinear function to the output of the operation circuit 2503, for example performing linear interpolation on feature planes extracted by a convolutional layer, or applying the function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 2507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 2503, for example for use in a subsequent layer of the neural network.
The instruction fetch buffer 2509 connected to the controller 2504 is used to store instructions used by the controller 2504.
The unified memory 2506, the input memory 2501, the weight memory 2502, and the instruction fetch buffer 2509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs described above.
In addition, it should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Moreover, in the drawings of the apparatus embodiments provided in this application, the connection relationships between modules indicate that they have communication connections, which may specifically be implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and of course can also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods described in the embodiments of the present application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Claims (23)