CN110135248A - A deep learning-based text detection method in natural scenes

Info

Publication number: CN110135248A
Application number: CN201910270269.4A
Authority: CN
Inventors: 刘发贵; 陈成
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2019-08-16

Abstract

The invention discloses a deep learning-based method for detecting text in natural scenes. The method uses a CNN to extract multi-scale text features and then encodes these features with an RNN to make full use of the contextual properties of text. The feature map is then fed into an ROI pooling layer, which outputs a series of text proposals. After non-maximum suppression, the text proposals are connected by a text connector, achieving flexible and efficient detection of multi-scale, multi-oriented text. The invention improves the accuracy and recall of text detection in natural scenes where text orientation and scale vary.

Description

A deep learning-based text detection method in natural scenes

Technical field

The invention belongs to the technical field of image processing, and in particular relates to a deep learning-based method for detecting text in natural scenes.

Background

Scene text detection is an important prerequisite for text recognition and is widely used in image retrieval, machine translation, autonomous driving and other fields. However, text detection still faces many difficulties under complex backgrounds, multiple scales, multiple languages, uneven illumination, blur and similar conditions.

Diversity and variability of natural scene text: compared with text in documents, text in natural scenes may be multi-scale and multilingual, and its shape, orientation, aspect ratio and color may all vary. These variations pose many challenges for text detection.

Complex backgrounds: scene text may appear against any background, including signs, bricks, grass or fences. Such backgrounds may have features very similar to text and act as noise that interferes with deciding what is text. In addition, occlusion by foreign objects can hide parts of the text, leading to potential detection errors.

Uneven imaging quality: because the means of collection are uncontrolled, imaging quality cannot be guaranteed. Images used for detection may be distorted or out of focus because of varying shooting angles or distances, or may contain noise and shadows because of varying illumination at capture time.

Methods for natural scene text detection fall into two categories: traditional methods and deep learning-based methods. Traditional methods include texture-based methods, which use local intensity, filter responses, wavelet coefficients and the like, and region-based methods, such as the Stroke Width Transform (SWT), Maximally Stable Extremal Regions (MSER) and the Stroke Feature Transform (SFT). In recent years, with the development of deep neural networks, deep learning has shown growing advantages in computer vision. At present, the most popular deep learning methods are those based on Convolutional Neural Networks (CNN). Deep learning has greatly improved the accuracy of text detection and freed practitioners from complex hand-crafted feature design. Common deep learning-based models for natural scene text detection are usually built on general object detection models such as RCNN, YOLO and SSD. The basic structure of these models is to extract features with several convolutional and pooling layers and finally use fully connected layers to classify and regress detection boxes.

Summary of the invention

To detect text in natural scenes more accurately and efficiently, and to address the detection of multi-oriented, variable-scale text in natural scenes, the present invention proposes a deep learning-based natural scene text detection method.

The object of the present invention is achieved by at least one of the following technical solutions.

A deep learning-based natural scene text detection method, comprising the following steps:

(1) Build and train a neural network-based natural scene text detection model, comprising the following sub-steps:

(1.1) Build a feature extractor based on a Feature Pyramid Network (FPN);

(1.2) Use a Recurrent Neural Network (RNN) to encode the features extracted by the feature extractor;

(1.3) Use an ROI pooling layer to further improve detection accuracy;

(1.4) Finally, use fully connected layers to classify and regress detection boxes, forming the text detection model;

(1.5) Input annotated training images into the model, and compute the loss with a multi-task loss function comprising a classification loss and a regression loss to train the model;

(2) Use the trained natural scene text detection model to detect natural scene text in a given image, comprising the following sub-steps:

(2.1) Input the image to be detected, run text detection on it with the trained model, and output the scores and coordinates of a series of text proposal detection boxes;

(2.2) Apply non-maximum suppression to the text proposals to remove some of the redundant detection boxes;

(2.3) Use a text connector to connect the series of text proposals and generate the final detection result.

Compared with the prior art, the present invention has the following advantages and technical effects:

(1) For variable-scale text detection, the invention uses a Feature Pyramid Network (FPN), which efficiently exploits information from convolutional layers of different sizes at the same time. Compared with methods that use only the last feature map, it uses both the strong semantic information of the high layers and the high-resolution information of the low layers, achieving higher recall and accuracy; compared with image pyramid-based methods, it greatly reduces the amount of computation.

(2) For multi-oriented text detection, the method outputs a series of text proposals and finally connects them with a text connector. Compared with methods that use arbitrary quadrilaterals or rotated rectangles, it uses fewer parameters, making the detection of multi-oriented text more flexible and efficient.

Description of the drawings

FIG. 1 is a flowchart of natural scene text detection in the embodiment.

FIG. 2 is an architecture diagram of the natural scene text detection model used in the embodiment.

FIG. 3 shows actual detection results obtained in different scenes using the text detection method of the present invention in the embodiment.

Detailed description

To make the technical solutions and advantages of the present invention clearer, a further detailed description is given below with reference to the accompanying drawings; the implementation and protection of the present invention are not limited thereto.

First, the terms used in the present invention are explained:

Feature Pyramid Network (FPN): FPN modifies the original backbone network directly. The feature map at each resolution is added, element by element, to the feature map of the next (coarser) resolution after that map has been upsampled by a factor of two. Through these connections, the feature map used for prediction at each level fuses features of different resolutions and different semantic strengths, and each fused feature map is used to detect objects of the corresponding size. This ensures that every level has both a suitable resolution and strong semantic features.
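The element-wise fusion described for FPN can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as nested Python lists and nearest-neighbor upsampling; the function names `upsample2x` and `merge_top_down` are hypothetical, and the convolutions that a full FPN applies before and after the addition are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbor upsampling of a 2-D map (list of rows) by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def merge_top_down(fine, coarse):
    """Add the 2x-upsampled coarse map to the fine map, element by element."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[f + u for f, u in zip(frow, urow)] for frow, urow in zip(fine, up)]


# Hypothetical 2x2 coarse level and 4x4 fine level.
coarse = [[1.0, 2.0],
          [3.0, 4.0]]
fine = [[0.5] * 4 for _ in range(4)]
merged = merge_top_down(fine, coarse)
```

In this toy case every cell of `merged` holds the fine-level value plus the value of the coarse cell that covers it, which is the sense in which each level carries both resolution and semantics.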

Residual network (ResNet): a deep convolutional network model proposed by Kaiming He in 2015. Depending on the number of layers used, its variants are named ResNet-34, ResNet-50, ResNet-101, ResNet-152 and so on.

Non-Maximum Suppression (NMS): suppresses elements that are not maxima, and can be understood as a local maximum search. Each output detection box has a score, and the boxes may contain or overlap one another. NMS selects the highest-scoring detection box in each neighborhood and suppresses the lower-scoring ones.

As shown in FIG. 1, the deep learning-based natural scene text detection method of the present invention comprises the following steps:

(1) Build and train a neural network-based natural scene text detection model, as shown in FIG. 2, comprising the following sub-steps:

(1.1) Build a feature extractor based on a Feature Pyramid Network (FPN). ResNet-101 is used as the backbone network to generate the feature pyramid, and the features of levels P2 to P5 are used.

(1.2) Use a Recurrent Neural Network (RNN) to encode the extracted features. A Bi-directional Long Short-Term Memory (Bi-LSTM) network with 512 hidden layers serves as the RNN that encodes the extracted features.

(1.3) Use an ROI pooling layer to further improve detection accuracy. ROI pooling proceeds as follows:

(1.3.1) Map the ROI from the input image to the corresponding location on the feature map;

(1.3.2) Divide the mapped region into equally sized sections, the number of sections matching the dimensions of the output;

(1.3.3) Apply max pooling to each section.
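The three ROI pooling sub-steps can be sketched as follows. This is a minimal illustration under stated assumptions: a single-channel feature map held as nested lists, an ROI given as inclusive pixel corners `(x1, y1, x2, y2)`, and a hypothetical `stride` parameter for the ratio between image and feature-map resolution. When the ROI size is not divisible by the output size, the sections are only approximately equal.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w, stride=1):
    """Max-pool one ROI of a 2-D feature map into an out_h x out_w grid."""
    rows, cols = len(feature_map), len(feature_map[0])

    # (1.3.1) Map the ROI from image coordinates onto the feature map.
    x1, y1, x2, y2 = (int(c // stride) for c in roi)
    x1 = min(max(x1, 0), cols - 1)
    y1 = min(max(y1, 0), rows - 1)
    x2 = min(max(x2, x1), cols - 1)
    y2 = min(max(y2, y1), rows - 1)
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1

    pooled = []
    for i in range(out_h):
        # (1.3.2) Section boundaries: floor for the start, ceil for the end,
        # so every section contains at least one cell.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            # (1.3.3) Max pooling inside the section.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(roi_max_pool(fmap, (0, 0, 3, 3), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the ROI, the output always has `out_h x out_w` entries, which is what lets the fully connected layers that follow accept proposals of any shape.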

(1.4) Finally, use fully connected layers to classify and regress detection boxes. The ROI-pooled features pass through two fully connected layers, one for classification and one for regression. If k detection boxes are output, the classification layer has output dimension 2k, corresponding to text and background, and the regression layer has output dimension 4k, corresponding to the two corner coordinates (top-left and bottom-right) of each detection box.
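To make the 2k and 4k output dimensions concrete, the sketch below splits two flat output vectors into k scored boxes. The memory layout is an assumption: values are taken to be interleaved per box, with the pair ordered as (background, text) and turned into a probability by a softmax, and the four regression values are read directly as `(x1, y1, x2, y2)`. The function name `decode_outputs` is hypothetical.

```python
import math


def decode_outputs(cls_out, reg_out):
    """Split flat head outputs into k (text_probability, (x1, y1, x2, y2)) pairs."""
    if len(cls_out) % 2 or len(reg_out) != 2 * len(cls_out):
        raise ValueError("expected 2k classification values and 4k regression values")
    k = len(cls_out) // 2
    boxes = []
    for i in range(k):
        bg, text = cls_out[2 * i], cls_out[2 * i + 1]  # assumed order: background, text
        m = max(bg, text)                              # subtract the max for stability
        e_bg, e_text = math.exp(bg - m), math.exp(text - m)
        p_text = e_text / (e_bg + e_text)              # two-class softmax
        x1, y1, x2, y2 = reg_out[4 * i: 4 * i + 4]
        boxes.append((p_text, (x1, y1, x2, y2)))
    return boxes


# k = 2 boxes: 4 classification values and 8 regression values.
cls_out = [0.0, 0.0, 0.0, 2.0]
reg_out = [0, 0, 16, 20, 16, 0, 32, 20]
boxes = decode_outputs(cls_out, reg_out)
```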

(1.5) Input annotated training images to train the model. The training images may be annotated with either quadrilaterals or rectangles. Before being fed to the model, each annotation must be split into slices of a given width: if the annotation is a quadrilateral, the minimum bounding rectangle of each slice is taken; if it is a rectangle, it is split directly.
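The annotation-slicing step can be sketched as below. The sketch assumes convex annotations given as four `(x, y)` corners in order, strips aligned to the annotation's left edge, and an axis-aligned bounding rectangle for each slice; the names `clip_polygon_x` and `slice_annotation` are hypothetical. A rectangle is handled by the same code path, since clipping it to a strip yields the strip itself.

```python
def clip_polygon_x(points, x_min, x_max):
    """Clip a convex polygon to the strip x_min <= x <= x_max (Sutherland-Hodgman)."""
    def clip(pts, inside, x_edge):
        out = []
        for idx, cur in enumerate(pts):
            prev = pts[idx - 1]
            cur_in, prev_in = inside(cur), inside(prev)
            if cur_in != prev_in:
                # The edge crosses the boundary: keep the intersection point.
                t = (x_edge - prev[0]) / (cur[0] - prev[0])
                out.append((x_edge, prev[1] + t * (cur[1] - prev[1])))
            if cur_in:
                out.append(cur)
        return out

    pts = clip(points, lambda p: p[0] >= x_min, x_min)
    if pts:
        pts = clip(pts, lambda p: p[0] <= x_max, x_max)
    return pts


def slice_annotation(points, width):
    """Split an annotation into boxes (x1, y1, x2, y2), one per strip of the given width."""
    if width <= 0:
        raise ValueError("width must be positive")
    xs = [p[0] for p in points]
    left, right = min(xs), max(xs)
    boxes = []
    x = left
    while x < right:
        x_end = min(x + width, right)
        piece = clip_polygon_x(points, x, x_end)
        if len(piece) >= 3:
            px = [p[0] for p in piece]
            py = [p[1] for p in piece]
            boxes.append((min(px), min(py), max(px), max(py)))
        x = x_end
    return boxes


rect = [(0, 0), (40, 0), (40, 10), (0, 10)]
quad = [(0, 0), (32, 8), (32, 18), (0, 10)]
print(slice_annotation(rect, 16))  # three boxes; the last one is narrower
print(slice_annotation(quad, 16))  # two boxes that follow the slant of the text
```

For the slanted quadrilateral, each slice receives its own vertical extent, so a sequence of narrow upright boxes can trace a tilted text line without any angle parameter.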

Design a multi-task loss function comprising a classification loss and a regression loss, and use it to compute the loss:

where L, L_cls and L_reg are the total loss, the classification loss and the regression loss, respectively, and λ is the weight coefficient that balances the classification loss against the regression loss. p_i is the predicted class of the i-th detection box and p_i* is its ground-truth class; t_i is the predicted coordinates of the i-th detection box and t_i* is its ground-truth coordinates.
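A loss with the structure described above (a classification term plus λ times a regression term) might be computed as in the following sketch. The concrete choices are assumptions borrowed from common object-detection practice rather than taken from this text: binary cross-entropy for L_cls, smooth L1 for L_reg, averaging the classification term over all boxes, and counting the regression term only for boxes whose ground-truth class is text. The function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty (assumed choice for the regression term)."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def multitask_loss(p, p_star, t, t_star, lam=1.0):
    """Return (L, L_cls, L_reg).

    p      : predicted text probabilities, one per detection box
    p_star : ground-truth classes (1 = text, 0 = background)
    t      : predicted box coordinates, one 4-tuple per box
    t_star : ground-truth box coordinates, one 4-tuple per box
    lam    : weight balancing classification and regression
    """
    eps = 1e-12  # guards against log(0)
    n = len(p)
    l_cls = -sum(ps * math.log(max(pi, eps)) + (1 - ps) * math.log(max(1 - pi, eps))
                 for pi, ps in zip(p, p_star)) / n

    n_pos = sum(p_star)
    l_reg = 0.0
    if n_pos:
        # Only boxes labeled as text contribute to the regression term.
        l_reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
                    for ps, ti, tsi in zip(p_star, t, t_star)) / n_pos

    return l_cls + lam * l_reg, l_cls, l_reg


total, l_cls, l_reg = multitask_loss(
    p=[0.5, 0.5],
    p_star=[1, 0],
    t=[(0, 0, 0, 0), (0, 0, 0, 0)],
    t_star=[(0.5, 0, 0, 2.0), (9, 9, 9, 9)],
    lam=2.0,
)
```

Here the second box is background, so its large coordinate error does not affect `l_reg`; only `lam` controls how strongly localization is traded against classification.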

(2.1) Input the image to be detected, run text detection on it with the trained model, and output the scores and coordinates of a series of text proposal detection boxes.

(2.2) Apply non-maximum suppression to the text proposals to remove some of the redundant detection boxes. The procedure is as follows:

Given the list B of text proposal detection boxes and their corresponding scores S, select the detection box M with the highest score, remove it from B and add it to the final detection result D. Then remove from B every remaining box whose IoU with M exceeds a threshold. Repeat this process until B is empty.
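The suppression loop over B, S, M and D can be written as the following minimal sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the default threshold of 0.5 is an illustrative assumption, since the text does not state a value.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, in order of decreasing score."""
    remaining = list(range(len(boxes)))  # B, stored as indices into boxes
    kept = []                            # D
    while remaining:
        m = max(remaining, key=lambda i: scores[i])  # box M with the highest score
        remaining.remove(m)
        kept.append(m)
        # Drop every remaining box that overlaps M too strongly.
        remaining = [i for i in remaining if iou(boxes[i], boxes[m]) <= iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap anything and survives.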

(2.3) Use a text connector to connect the series of text proposals and generate the final detection result. The text proposals are connected using the following steps:

If proposal P_j and proposal P_i (where i and j denote different proposals) satisfy the following two conditions, P_j is defined as a neighbor of P_i:

(1) P_j is the proposal closest to P_i, and the distance between them is less than w_j + w_i;

(2) P_j and P_i have a vertical overlap greater than 0.5.

where w_i and w_j are the widths of proposals P_i and P_j, respectively. If P_i is a neighbor of P_j and P_j is a neighbor of P_i, the two proposals are connected into the same detection box. These steps are repeated until all proposals have been connected, and the resulting detection boxes are the final output. The detection results of the present invention in natural scenes, shown in FIG. 2 and FIG. 3, indicate that the invention detects variable-scale, multi-oriented text in natural scenes well.
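The neighbor rule and the repeated connection step can be sketched as follows. Several details are not fixed by the text and are assumptions here: distance is measured between horizontal box centers, vertical overlap is the intersection over union of the two vertical extents, the nearest proposal is chosen among those that already satisfy condition (2), and a connected pair is replaced by its axis-aligned bounding box before the search is repeated. All function names are hypothetical.

```python
def vertical_overlap(a, b):
    """Intersection over union of the vertical extents of two boxes (x1, y1, x2, y2)."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    union = max(a[3], b[3]) - min(a[1], b[1])
    return max(0.0, inter) / union if union > 0 else 0.0


def center_x(box):
    return (box[0] + box[2]) / 2.0


def neighbor(i, boxes, min_overlap=0.5):
    """Index of the neighbor of boxes[i], or None if it has none."""
    candidates = [j for j in range(len(boxes))
                  if j != i and vertical_overlap(boxes[i], boxes[j]) > min_overlap]
    if not candidates:
        return None
    j = min(candidates, key=lambda k: abs(center_x(boxes[k]) - center_x(boxes[i])))
    w_i = boxes[i][2] - boxes[i][0]
    w_j = boxes[j][2] - boxes[j][0]
    if abs(center_x(boxes[j]) - center_x(boxes[i])) < w_i + w_j:
        return j
    return None


def connect_proposals(proposals):
    """Repeatedly merge mutual neighbors until no such pair remains."""
    boxes = [tuple(p) for p in proposals]
    while True:
        nbr = [neighbor(i, boxes) for i in range(len(boxes))]
        pair = next(((i, j) for i, j in enumerate(nbr)
                     if j is not None and nbr[j] == i), None)
        if pair is None:
            return boxes
        i, j = pair
        a, b = boxes[i], boxes[j]
        merged = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
        boxes = [bx for k, bx in enumerate(boxes) if k not in (i, j)] + [merged]


proposals = [
    (0, 0, 16, 20), (16, 1, 32, 21), (32, 2, 48, 22),  # first text line
    (0, 100, 16, 120), (16, 100, 32, 120),             # second text line
]
print(sorted(connect_proposals(proposals)))
# [(0, 0, 48, 22), (0, 100, 32, 120)]
```

Each pass removes one box, so the loop always terminates; proposals on different lines never merge because their vertical overlap is zero.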

Claims

1. A deep learning-based natural scene text detection method, characterized by comprising the following steps:

(1) building and training a neural network-based natural scene text detection model, comprising:

(1.1) building a feature extractor based on a Feature Pyramid Network (FPN);

(1.2) encoding the features extracted by the feature extractor using a Recurrent Neural Network (RNN);

(1.3) further improving detection accuracy using an ROI pooling layer;

(1.4) finally classifying and regressing detection boxes using fully connected layers to form the text detection model;

(1.5) inputting annotated training images into the model;

(1.6) computing the loss with a multi-task loss function comprising a classification loss and a regression loss to train the model;

(2) detecting natural scene text in a given image using the trained natural scene text detection model, comprising the following sub-steps:

(2.1) inputting the image to be detected, performing text detection on the given image using the trained natural scene text detection model, and outputting the scores and coordinates of a series of text proposal detection boxes;

(2.2) applying non-maximum suppression to the obtained text proposals to remove some of the redundant detection boxes;

(2.3) connecting the series of text proposals using a text connector to generate the final detection result.

2. The deep learning-based natural scene text detection method according to claim 1, characterized in that, in building the neural network-based natural scene text detection model, only levels P2 to P5 of the Feature Pyramid Network (FPN) are used.

3. The deep learning-based natural scene text detection method according to claim 1, characterized in that, in building the neural network-based natural scene text detection model, the Feature Pyramid Network (FPN) uses ResNet-101 as its backbone network.

4. The deep learning-based natural scene text detection method according to claim 1, characterized in that, in building the neural network-based natural scene text detection model, a Bi-directional Long Short-Term Memory (Bi-LSTM) network with 512 hidden layers is used as the Recurrent Neural Network (RNN) to encode the extracted features.

5. The deep learning-based natural scene text detection method according to claim 1, characterized in that, in building the neural network-based natural scene text detection model, the loss is computed using the following loss function:

where L, L_cls and L_reg are the total loss, the classification loss and the regression loss, respectively, λ is the weight coefficient balancing the classification loss and the regression loss, and p_i* is the true class of the i-th detection box.

6. The deep learning-based natural scene text detection method according to claim 5, characterized in that the classification loss is defined as follows:

where p_i is the predicted class of the i-th detection box and p_i* is the true class of the i-th detection box.

7. The deep learning-based natural scene text detection method according to claim 5, characterized in that the regression loss is defined as follows:

where t_i is the predicted coordinates of the i-th detection box and t_i* is the true coordinates of the i-th detection box.

8. The deep learning-based natural scene text detection method according to claim 1, characterized in that, when detecting natural scene text in a given image, the text proposals are connected using the following steps:

If proposal P_j and proposal P_i satisfy the following two conditions, P_j is defined as a neighbor of P_i:

(1) P_j is the proposal closest to P_i, and the distance between them is less than w_j + w_i;

(2) P_j and P_i have a vertical overlap greater than 0.5;

where w_i and w_j are the widths of proposals P_i and P_j, respectively; if P_i is a neighbor of P_j and P_j is a neighbor of P_i, the two proposals are connected into the same detection box.