CN113095313A - Text string recognition method and device and server

Info

Publication number
CN113095313A
CN113095313A (application number CN202110370697.1A)
Authority
CN
China
Prior art keywords
target
target image
preset
image
character string
Prior art date
Legal status
Granted
Application number
CN202110370697.1A
Other languages
Chinese (zh)
Other versions
CN113095313B (en)
Inventor
周静玲 (Zhou Jingling)
江子扬 (Jiang Ziyang)
刘强 (Liu Qiang)
罗伟 (Luo Wei)
Current Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Original Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202110370697.1A
Publication of CN113095313A
Application granted
Publication of CN113095313B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

This specification provides a text string recognition method, apparatus, and server. Before the method is put into practice, the model structure of the recognition model used to recognize and extract text strings from images is improved in a targeted way: a preset hole convolution layer replaces the usual combination of a convolutional network layer and a pooling layer as the feature extraction structure for extracting image features related to the text string, yielding an improved preset recognition model with better performance. In a specific implementation, after the acquired target image is preprocessed, the preset recognition model is called to process the preprocessed target image. The method therefore handles well the case in which the characters of the string in the image are small, the resolution is low, and relatively few related image features can be extracted; it accurately recognizes and determines the target character string contained in the target image, improves the accuracy and efficiency of text string recognition, and reduces recognition errors.

Description

Text string recognition method and device and server
Technical Field
This specification belongs to the technical field of artificial intelligence, and particularly relates to a text string recognition method, apparatus, and server.
Background
In many data processing scenarios, it is often necessary to first recognize and extract the text character string contained in an image, and then use the extracted string for the next step of business data processing.
However, when the characters of the string in the image to be recognized are small and the resolution is low, a conventional recognition model can extract relatively few image features related to the string. Under these conditions, existing recognition and extraction methods are error-prone, and the accuracy of the extracted character string is relatively poor.
No effective solution to these problems has yet been proposed.
Disclosure of Invention
This specification provides a text string recognition method, apparatus, and server that handle well the case in which the characters of the string in the image are small, the resolution is low, and relatively few related image features can be extracted, so that the target character string contained in the target image can be accurately recognized and determined.
The present specification provides a text string recognition method, including:
acquiring a target image containing a target character string to be recognized;
preprocessing the target image to obtain a preprocessed target image;
calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result; the preset recognition model at least comprises a preset hole convolution layer; the preset hole convolution layer is used for replacing the combination of a convolutional network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image;
and determining a target character string in the target image according to the target processing result.
In one embodiment, the preset recognition model further comprises a positioning sub-model; the positioning sub-model is connected with the preset hole convolution layer and is used for generating, for each text character in the target character string, a plurality of corresponding candidate boxes through anchor regression according to the target image features and preset anchor box parameters, and for screening out, for each text character, one candidate box meeting the requirements from the corresponding candidate boxes to serve as the bounding box of that text character; the bounding box carries position information of the contained text character.
In one embodiment, the preset anchor box parameters are obtained as follows:
acquiring a sample image containing a sample character string;
according to a preset labeling rule, labeling a corresponding bounding box for each sample character in the sample character string in the sample image, and collecting the bounding box parameters of the sample characters; the overlap area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold;
and clustering the bounding box parameters of the sample characters to obtain the preset anchor box parameters.
In one embodiment, the step of screening out, for each text character, one candidate box meeting the requirements from the corresponding plurality of candidate boxes to serve as the bounding box of that text character comprises the following:
selecting, for the current text character in the target character string, one candidate box meeting the requirements from the plurality of corresponding candidate boxes to serve as its bounding box, in the following way:
calling a preset softened non-maximum suppression algorithm to process the candidate boxes, so as to screen out the candidate box whose confidence meets the requirements as the bounding box of the current text character; and filtering out the other candidate boxes except the bounding box.
In one embodiment, the preset recognition model further comprises a classification sub-model; the classification sub-model is connected with the hole convolution layer and is used for recognizing and determining, according to the target image features, the category value of each text character in the target character string to be recognized.
In one embodiment, the pre-processing of the target image comprises:
detecting a target image, and determining a target image area containing a target character string to be recognized in the target image;
and cutting out the target image area from the target image to be used as the preprocessed target image.
In one embodiment, preprocessing the target image further comprises:
carrying out image correction processing on the target image; and/or carrying out noise reduction processing on the target image.
In one embodiment, the target character string to be recognized includes at least one of: the crown word number (serial number) on a target currency, the account number of the drawer on a target check, and the logistics number on a target express bill.
In one embodiment, in a case where the target character string to be recognized includes a crown word number on the target currency, after determining the target character string in the target image, the method further includes:
determining the target character string as a crown word number on the target currency;
tracking and determining a transaction circulation path of the target currency according to the crown word number on the target currency;
and determining whether a transaction risk exists according to the transaction circulation path of the target currency.
In one embodiment, the method further comprises:
replacing the combination of a convolutional network layer and a pooling layer with a preset hole convolution layer as the image feature extraction structure in the network model, so as to construct an initial recognition model;
acquiring a sample image containing a sample character string to be identified; labeling the sample image to obtain a labeled sample image;
and training the initial recognition model by using the labeled sample image to obtain a preset recognition model.
In one embodiment, the labeling processing on the sample image to obtain a labeled sample image includes:
according to a preset labeling rule, labeling a corresponding bounding box for each sample character in the sample character string in the sample image; the overlap area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold;
and labeling the corresponding character category value according to the sample character contained in each bounding box, to obtain the labeled sample image.
This specification also provides a text string recognition apparatus, including:
the acquisition module is used for acquiring a target image containing a target character string to be recognized;
the preprocessing module is used for preprocessing the target image to obtain a preprocessed target image;
the calling module is used for calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result; the preset recognition model at least comprises a preset hole convolution layer; the preset hole convolution layer is used for replacing the combination of a convolutional network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image;
and the determining module is used for determining the target character string in the target image according to the target processing result.
The present specification also provides a text string recognition method, including:
acquiring a target image containing a target character string to be recognized;
calling a preset recognition model to process the target image to obtain a corresponding target processing result; the preset recognition model at least comprises a preset hole convolution layer; the preset hole convolution layer is used for replacing the combination of a convolutional network layer and a pooling layer to extract target image features related to the target character string from the target image;
and determining a target character string in the target image according to the target processing result.
The specification also provides a server comprising a processor and a memory for storing processor-executable instructions, wherein the processor executes the instructions to implement the relevant steps of the text string recognition method.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which, when executed, implement the relevant steps of the method of recognition of a text string.
Before the text string recognition method, apparatus, and server provided by this specification are put into practice, the model structure of the recognition model used to recognize and extract text strings from images is improved in a targeted way: a preset hole convolution layer replaces the combination of convolutional network layer and pooling layer as the feature extraction structure for extracting image features related to the text string, yielding an improved preset recognition model with better performance. In a specific implementation, after the acquired target image is preprocessed, the preset recognition model is called to process the preprocessed target image; the method thus handles well the case in which the characters of the string in the image are small, the resolution is low, relatively few related image features can be extracted, and recognition is difficult, so that the target character string contained in the target image is accurately recognized and determined, the accuracy and efficiency of text string recognition are improved, and recognition errors are reduced.
Drawings
In order to illustrate the embodiments of this specification more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are only some of the embodiments of this specification; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a system to which the text string recognition method provided by an embodiment of this specification is applied;
FIG. 2 is a diagram illustrating an embodiment of a method for recognizing a text string according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating an embodiment of a method for recognizing a text string according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a method for recognizing text strings according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an embodiment of a method for recognizing a text string according to an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a method for recognizing text strings according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a server according to an embodiment of the present disclosure;
fig. 8 is a schematic structural component diagram of a text string recognition apparatus according to an embodiment of the present specification;
fig. 9 is a schematic diagram of an embodiment of a method for recognizing a text string provided by an embodiment of the present specification, in a scenario example.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
The embodiment of the specification provides a text character string identification method, which can be particularly applied to a system comprising a server and terminal equipment. Specifically, as shown in fig. 1, the server and the terminal device may be connected in a wired or wireless manner to perform specific data interaction.
In this embodiment, the server may specifically include a back-end cloud server, deployed on the network platform side, that implements functions such as data transmission and data processing. Specifically, the server may be an electronic device with data computation, storage, and network interaction functions, or a software program running in an electronic device that supports data processing, storage, and network interaction. The number of servers is not limited in this embodiment: there may be one server, several servers, or a server cluster formed by several servers.
In this embodiment, the terminal device may specifically include a front-end electronic device that is disposed at a user side, is configured with or connected to a camera, and can implement functions such as picture data acquisition and data transmission. Specifically, the terminal device may be, for example, a monitoring camera, a desktop computer, a tablet computer, a notebook computer, a smart phone, and the like. Alternatively, the terminal device may be a software application capable of running in the electronic device. For example, it may be some APP running on a smartphone, etc.
In a specific implementation, a user may use a smartphone as the terminal device and photograph the target character string to be recognized (for example, the crown word number on a target banknote) on a target object, obtaining a photo containing the target character string as the target image, as shown in fig. 2. The target character string may include one or more text characters.
After acquiring the target image, the terminal device can send the target image to a server in a wired or wireless mode. Correspondingly, the server receives and acquires the target image from the terminal equipment.
The server can preprocess the target image to obtain a preprocessed target image better suited to the subsequent recognition and extraction of the text character string.
Specifically, during preprocessing, the server may first perform an initial denoising of the target image to filter out image noise and obtain a relatively clean target image.
Then, the server may call a pre-trained preset text character region recognition model to process the target image and find the image region containing the target character string to be recognized, as the target image region. The server crops the target image region out of the target image, obtaining a preprocessed target image of relatively small data size, as shown in fig. 2.
It should be noted that for a target character string such as the crown word number on a banknote, the text characters are often smaller than those of an ordinary character string, so relatively few features can be extracted and the resolution is poorer. Moreover, because banknotes circulate frequently, the image region containing the target character string usually carries image noise caused by creases, stains, and similar factors, which interferes with recognition.
If a conventional recognition model were used directly to recognize and extract the target character string under these conditions, the result would contain large errors and have relatively poor accuracy.
For this reason, when recognizing and extracting the target character string, the server uses an improved preset recognition model whose structure differs from that of a conventional recognition model.
The server may feed the preprocessed target image into the preset recognition model as model input and run the model, obtaining the corresponding model output as the target processing result.
The preset recognition model differs from a conventional recognition model in that the combination of convolutional network layer and pooling layer used by conventional models is replaced with a preset hole convolution layer.
When the model runs, the preset hole convolution layer can extract, finely and comprehensively, target image features that are related to the target character string and have a large receptive field, while avoiding the feature loss caused by pooling.
The preset recognition model may integrate both a positioning sub-model and a classification sub-model, each connected to the preset hole convolution layer.
When the model runs, the positioning sub-model carries out the positioning process for each text character in the target character string. Specifically, the positioning sub-model receives the target image features output by the preset hole convolution layer; according to these features and the preset anchor box parameters, it generates a plurality of candidate boxes for each text character through anchor regression; it then screens out, for each text character, one candidate box meeting the requirements as that character's bounding box and deletes the remaining redundant candidate boxes. The bounding box may carry position information of the contained text character, for example its ordinal position within the target character string.
While the positioning sub-model performs positioning as above, the classification sub-model performs the classification process for each text character in the target character string. Specifically, the classification sub-model receives the target image features output by the preset hole convolution layer and, according to these features, recognizes and determines the category value of each text character through logistic regression.
Because the positioning sub-model and the classification sub-model are integrated in the same preset recognition model and connected to the same preset hole convolution layer, the positioning and classification processes run simultaneously when the model executes.
On one hand, this avoids the longer processing time and lower efficiency caused by running the positioning process (including delineating the bounding boxes) and the classification process separately and sequentially, as existing methods and models do; on the other hand, it also avoids the layer-by-layer accumulation of precision loss that occurs when the two processes are executed separately and in sequence, which would degrade the precision of the final result.
Operating in this way, with the preset hole convolution layer, the positioning sub-model, and the classification sub-model, the preset recognition model finally outputs, as the target processing result, the sequence of character category values ordered by the position information carried by the bounding boxes, as shown in fig. 3.
The server can identify and extract the target character string with higher precision and smaller error from the target image according to the target processing result.
Furthermore, the server may use the extracted target character string for further data processing, in combination with the specific application scenario.
For example, in a transaction risk detection scenario, the server may track and determine the transaction circulation path of the target currency according to the recognized and extracted crown word number, and then analyze, based on that path, whether the transactions involving the target currency carry risks such as money laundering or gambling. Transaction risks can thus be detected more efficiently and intelligently.
Through the above system, the improved preset recognition model can be applied effectively to the situation in which the characters of the string in the image are small, the resolution is low, and relatively few related image features can be extracted, so that the target character string contained in the target image is accurately recognized and determined, the accuracy and efficiency of text string recognition are improved, and recognition errors are reduced.
Referring to fig. 4, an embodiment of the present disclosure provides a method for recognizing a text character string. The method is particularly applied to the server side. In particular implementations, the method may include the following.
S401: acquiring a target image containing a target character string to be recognized;
S402: preprocessing the target image to obtain a preprocessed target image;
S403: calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result; the preset recognition model at least comprises a preset hole convolution layer; the preset hole convolution layer is used for replacing the combination of a convolutional network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image;
S404: and determining a target character string in the target image according to the target processing result.
Through this embodiment, a preset recognition model in which the combination of convolutional network layer and pooling layer found in conventional recognition models is replaced by a preset hole convolution layer can be used, so that the method applies effectively to the situation in which the characters of the string in the image are small, the resolution is low, and relatively few related image features can be extracted; the target character string contained in the target image is accurately recognized and determined, the accuracy and efficiency of text string recognition are improved, and recognition errors are reduced. A sketch of the overall flow follows.
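To make the flow of S401-S404 concrete, the following is a minimal sketch in Python, assuming PyTorch; preprocess, model, and decode_result are hypothetical stand-ins for the preprocessing routine, the improved preset recognition model, and the decoding of the target processing result described in this specification, not code taken from the patent.

```python
import torch

def recognize_text_string(image, model, preprocess, decode_result):
    # S401/S402: acquire the target image and preprocess it
    # (region detection, cropping, denoising).
    cropped = preprocess(image)

    # S403: run the preset recognition model; its feature extractor
    # uses a hole (dilated) convolution layer instead of conv+pool.
    with torch.no_grad():
        target_result = model(cropped.unsqueeze(0))  # add a batch dimension

    # S404: decode the model output (per-character bounding boxes and
    # category values) into the final target character string.
    return decode_result(target_result)
```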
In some embodiments, the target image may be any image containing the target character string to be recognized. Specifically, a photograph containing the target character string may be taken as the target image, or a frame containing the target character string may be captured from a video.
The target character string may be a text character string to be recognized and extracted. The target character string may only contain one text character, or may contain a plurality of text characters ordered in sequence.
In some embodiments, the target character string may be a text string of low recognition difficulty, one that a conventional recognition model can already extract accurately (which may be denoted a first-type text string), for example a string whose characters are large, high-resolution, and widely spaced in the image.
The target character string may also be a text string of high recognition difficulty, one that a conventional recognition model cannot extract accurately (which may be denoted a second-type string), for example a string whose characters are small, low-resolution, and closely spaced in the image. For such target strings, recognition accuracy is poor because a conventional model can extract relatively few image features, loses features in pooling, and has a relatively limited receptive field. In addition, the small spacing between adjacent characters further increases the difficulty: recognition with a conventional model easily drops characters from the string or misaligns character recognition.
In some embodiments, the target character string to be recognized may specifically include at least one of: the crown word number on a target banknote, the drawer's account number on a target check, the logistics number on a target express bill, and the like. The crown word number is a character string composed of digits and letters printed on a piece of currency (for example, a banknote); typically, a crown word number corresponds one-to-one with the banknote on which it appears.
The crown word number, drawer's account number, and logistics number listed above all belong to the second type of string, whose recognition difficulty is high; a commonly used conventional recognition model has difficulty recognizing and extracting such target strings from a target image accurately and quickly.
Of course, the above listed target strings are only a schematic illustration. In specific implementation, according to a specific application scenario and a processing requirement, other types of text strings can be introduced as the target string to be recognized. The present specification is not limited to these.
Through this embodiment, the text string recognition method provided by this specification can be applied to a variety of business scenarios to accurately recognize text strings of high recognition difficulty, such as the crown word number on a banknote, the drawer's account number on a check, and the logistics number on an express bill.
In some embodiments, the preset recognition model may be understood as a pre-trained neural network model capable of recognizing and extracting the target character string from an image relatively accurately. It differs from a conventional recognition model in having an improved model structure.
Specifically, the preset recognition model contains at least a preset hole convolution layer, which replaces the conventional combination of convolutional network layer and pooling layer. Through the preset hole convolution layer, target image features with a larger receptive field can be extracted while avoiding the loss of feature information, yielding relatively comprehensive, complete, and more useful target image features.
It should be noted that a conventional recognition model, in order to obtain image features of relatively good effect, first extracts the corresponding image features from the image with a convolutional network layer, and then pools the extracted features with a pooling layer to enlarge the receptive field.
However, while pooling the image features, the pooling layer simultaneously filters out part of the feature information in the image, so detail features between characters and the like are lost, and the final image features are incomplete. When the characters of the string in the image are small, the resolution is low, and relatively few related image features can be extracted in the first place, this conventional processing leaves even fewer usable features, and recognition accuracy for the string becomes worse.
In this embodiment, by introducing the preset hole convolution layer into the preset recognition model in place of the conventional combination of convolutional network layer and pooling layer, image features with a large receptive field and good effect can be extracted while feature loss is avoided and the integrity of the extracted image features is ensured.
In some embodiments, the hole convolution layer (dilated convolution, also called atrous convolution) works by inserting holes into a standard convolution kernel to enlarge the receptive field. Compared with an ordinary convolution layer, a hole convolution layer has one additional hyperparameter, usually called the dilation rate, which specifies the spacing between kernel elements.
In some embodiments, the preset hole convolution layer may be configured with a preset dilation coefficient and a corresponding convolution kernel. The preset dilation coefficient and the kernel size can be set flexibly according to the proportion, resolution, and other properties of the target character string in the target image.
Specifically, when the preset recognition model processes the preprocessed target image, the preset dilation coefficient may be used to expand the initial image feature matrix obtained after the convolution kernel performs an ordinary convolution on the preprocessed image, with the expanded matrix region filled with the value 0. This yields an expanded image feature matrix covering a relatively large field of view as the target image features, so the receptive field of the obtained features is effectively enlarged without losing the feature information originally extracted by the convolution kernel.
For example, as shown in fig. 5, suppose the preset hole convolution layer is configured with a preset dilation coefficient of 1 and a 3 x 3 convolution kernel. The preprocessed target image is first convolved with the 3 x 3 kernel, extracting the 3 x 3 initial image feature matrix shown on the left. The initial matrix is then expanded using the dilation coefficient (1) into a 5 x 5 matrix, and the newly added matrix region is filled with the value 0, yielding the 5 x 5 expanded image feature matrix shown on the right as the target image features. A code sketch of this substitution follows.
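In standard deep-learning frameworks, the 3 x 3 to 5 x 5 receptive-field expansion described above corresponds to a 3 x 3 kernel with one hole inserted between its elements. Below is a minimal sketch, assuming PyTorch, contrasting a conventional conv+pool block with a hole (dilated) convolution block of the kind this specification substitutes for it; the channel counts and input size are illustrative assumptions, and PyTorch expresses "one inserted hole" as dilation=2.

```python
import torch
import torch.nn as nn

# Conventional block: convolution followed by pooling. Pooling enlarges
# the receptive field but discards part of the feature information.
conv_pool = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),  # halves the resolution, losing detail
)

# Hole (dilated) convolution block: a 3 x 3 kernel with dilation=2
# covers a 5 x 5 window -- the 3 x 3 -> 5 x 5 expansion of fig. 5 --
# while keeping full resolution and all extracted features.
hole_conv = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(),
)

x = torch.randn(1, 3, 64, 256)  # e.g. a cropped crown-word-number strip
print(conv_pool(x).shape)       # torch.Size([1, 32, 32, 128])
print(hole_conv(x).shape)       # torch.Size([1, 32, 64, 256])
```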
In some embodiments, the preset recognition model may further comprise a positioning sub-model. The positioning sub-model is connected to the preset hole convolution layer and is used to generate, for each text character in the target character string, a plurality of corresponding candidate boxes through anchor regression, according to the target image features and the preset anchor box parameters; and then to screen out, for each text character, one candidate box meeting the requirements as the bounding box of that character. The bounding box carries position information of the contained text character.
Through this embodiment, when the preset recognition model runs, the integrated positioning sub-model can accurately locate the bounding box of each text character in the target character string, and each character can be accurately segmented based on its bounding box.
In some embodiments, the preset anchor frame parameter may be specifically obtained as follows:
S1: acquiring a sample image containing a sample character string;
S2: according to a preset labeling rule, labeling a corresponding bounding box for each sample character in the sample character string in the sample image, and collecting the bounding box parameters of the sample characters; the overlap area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold;
S3: and clustering the bounding box parameters of the sample characters to obtain the preset anchor box parameters.
Through this embodiment, more suitable and effective preset anchor box parameters can be obtained for anchor regression, which speeds up the convergence of the network model, improves its computational efficiency, and allows the plurality of candidate boxes for each text character to be generated more accurately and efficiently.
In some embodiments, the preset anchor box parameters may be understood as the parameter values of the anchors (Anchor).
In some embodiments, clustering the bounding box parameters of the sample characters may specifically comprise: clustering the bounding box parameters of the sample characters with a K-means clustering algorithm, as sketched below.
It should be noted that existing methods usually rely on manually fixed anchor box parameters. When the anchor box parameters used in anchor regression do not match the actual target character string, the anchor regression results suffer.
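Steps S1-S3 amount to clustering the size parameters of the labeled character boxes. A minimal sketch, assuming scikit-learn; the (width, height) values are hypothetical sample data, not measurements from the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical step-S2 output: (width, height), in pixels, of the
# bounding box labeled for each sample character.
box_sizes = np.array([
    [18, 30], [17, 29], [19, 31], [10, 30], [11, 29], [18, 14],
    [17, 15], [12, 31], [19, 30], [11, 28], [18, 13], [17, 14],
])

# Step S3: cluster the bounding box parameters; the cluster centers
# serve as the preset anchor box parameters for anchor regression.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(box_sizes)
preset_anchor_boxes = kmeans.cluster_centers_
print(preset_anchor_boxes)  # three (width, height) anchor shapes
```

A common refinement in detection models of this family is to cluster with an IoU-based distance instead of the Euclidean distance; the specification only states that the parameters are clustered (e.g., with K-means), so plain K-means is shown.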
In some embodiments, screening out, for each text character, one candidate box meeting the requirements from the corresponding plurality of candidate boxes as the bounding box of that character may comprise the following: selecting, for the current text character in the target character string, one candidate box meeting the requirements as its bounding box, in the following way: calling a preset softened non-maximum suppression (Soft-NMS) algorithm to process the candidate boxes, screening out the candidate box whose confidence meets the requirements as the bounding box of the current text character, and filtering out the other candidate boxes.
Through this embodiment, the softened non-maximum suppression algorithm fits well the scenario of recognizing text strings with small character spacing, and effectively avoids mistakenly deleting the candidate boxes of adjacent text characters, which would leave the finally extracted target character string incomplete.
In some embodiments, it should be noted that existing methods often call a non-maximum suppression algorithm to determine the bounding box from the plurality of candidate boxes. In that algorithm, the candidate box with the highest confidence is selected as the reference box; for each candidate box overlapping the reference box, the ratio of the overlap area to the total area is computed, and if the ratio exceeds a set threshold, the confidence of that candidate box is set directly to 0.
In this embodiment, calling the preset softened non-maximum suppression algorithm to process the candidate boxes may comprise: selecting the candidate box with the highest confidence as the reference box; for each candidate box overlapping the reference box, computing the ratio of the overlap area to the total area; and, if the ratio exceeds the set threshold, adjusting the confidence of that candidate box with a preset linear function instead of setting it directly to 0 as the non-maximum suppression algorithm does. This effectively avoids mistakenly deleting the candidate boxes of adjacent text characters when the spacing between characters is small.
The preset linear function may be specifically expressed in the following form:
S_i = S_i,                if IoU_i < T
S_i = S_i * (1 - IoU_i),  if IoU_i >= T
wherein S_i denotes the confidence of the i-th candidate box, IoU_i denotes the ratio of the overlap area between that candidate box and the reference box to the total area, and T denotes the set threshold. The specific value of the threshold T may be determined according to the minimum character spacing between adjacent characters in the target character string.
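A minimal sketch of this linear softening rule, assuming numpy; the corner-format boxes and the IoU helper are illustrative assumptions.

```python
import numpy as np

def iou(ref, boxes):
    # Boxes are (x1, y1, x2, y2); IoU of `ref` with each row of `boxes`.
    x1 = np.maximum(ref[0], boxes[:, 0]); y1 = np.maximum(ref[1], boxes[:, 1])
    x2 = np.minimum(ref[2], boxes[:, 2]); y2 = np.minimum(ref[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(ref) + area(boxes) - inter)

def soft_nms_linear(boxes, scores, T=0.3, score_min=0.05):
    """Keep S_i if IoU_i < T; otherwise scale it by (1 - IoU_i)."""
    boxes, scores, kept = boxes.copy(), scores.copy(), []
    while len(boxes) > 0:
        ref = scores.argmax()                 # reference box: highest confidence
        kept.append((boxes[ref], scores[ref]))
        boxes = np.delete(boxes, ref, axis=0)
        scores = np.delete(scores, ref)
        if len(boxes) == 0:
            break
        overlap = iou(kept[-1][0], boxes)
        # Linear penalty instead of hard suppression to 0.
        scores = np.where(overlap >= T, scores * (1.0 - overlap), scores)
        alive = scores > score_min            # drop boxes whose score decays away
        boxes, scores = boxes[alive], scores[alive]
    return kept
```

Because a neighbor's confidence is only attenuated, never zeroed outright, a closely spaced adjacent character keeps its candidate box, which is exactly the behavior this embodiment relies on.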
In some embodiments, the preset recognition model may further comprise a classification sub-model; the classification sub-model is connected to the hole convolution layer and is used to recognize and determine, according to the target image features, the category value of each text character in the target character string to be recognized.
Through this embodiment, when the preset recognition model runs, the integrated classification sub-model can accurately recognize the category value of each text character in the target character string.
In some embodiments, in practical implementation, the classification submodel may identify and determine the category value of each text character through logistic regression according to the target image feature.
In some embodiments, where the target character string includes the crown word number on a target banknote, the character category values may specifically include: the digits 0-9 and/or the letters A-Z, and the like.
In some embodiments, the preset recognition model may integrate both the positioning sub-model and the classification sub-model. Accordingly, when the model runs, the positioning process and the classification process can be executed simultaneously, both using the target image features extracted by the preset hole convolution layer: the target character string is efficiently divided into a sequence of bounding boxes, each containing one text character, while the category value of the character in each bounding box is recognized at the same time. The category values of the text characters, ordered by the position information carried by the bounding boxes, are then obtained as the target processing result output by the preset recognition model.
By using a preset recognition model that integrates both the positioning sub-model and the classification sub-model, the two processes run simultaneously. This avoids the layer-by-layer accumulation of precision loss incurred when the two processes run separately and sequentially, improving the precision of the target processing result, and at the same time improves the processing efficiency of the model. A sketch of this shared-backbone arrangement follows.
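A minimal sketch, assuming PyTorch, of one way to realize the arrangement described above: a shared hole-convolution feature extractor feeding a positioning head and a classification head that run in the same forward pass. The channel counts, anchor count, and grid layout are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class PresetRecognitionModel(nn.Module):
    """Illustrative sketch: shared dilated backbone, two parallel heads."""
    def __init__(self, num_anchors=3, num_classes=36):  # 0-9 and A-Z
        super().__init__()
        # Feature extraction: hole (dilated) convolutions, no pooling.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.ReLU(),
        )
        # Positioning sub-model: per cell and anchor, a box offset
        # (dx, dy, dw, dh) plus a confidence, refined by anchor regression.
        self.loc_head = nn.Conv2d(64, num_anchors * 5, 1)
        # Classification sub-model: per cell and anchor, category scores.
        self.cls_head = nn.Conv2d(64, num_anchors * num_classes, 1)

    def forward(self, x):
        feats = self.backbone(x)  # shared target image features
        return self.loc_head(feats), self.cls_head(feats)  # run together

model = PresetRecognitionModel()
loc, cls = model(torch.randn(1, 3, 64, 256))
print(loc.shape, cls.shape)  # [1, 15, 64, 256] and [1, 108, 64, 256]
```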
In some embodiments, preprocessing the target image may be implemented as follows: detecting the target image and determining the target image region that contains the target character string to be recognized; and cropping the target image region out of the target image as the preprocessed target image.
Through this embodiment, a relatively small target image region containing the target character string can be determined preliminarily, and an image containing only that region can be cropped from the target image and input to the preset recognition model as the preprocessed target image for subsequent recognition. This reduces the amount of data the preset recognition model must process and improves recognition efficiency and precision.
In some embodiments, in implementation, the server may call a pre-trained preset text character region recognition model to process the target image, so as to detect and find a target image region containing the target character string in the target image more quickly and accurately.
In some embodiments, preprocessing the target image may further comprise: performing image correction processing on the target image; and/or performing noise reduction processing on the target image.
Through this embodiment, the influence of interference factors such as image noise in the target image on subsequent character string recognition can be reduced, improving recognition accuracy. A sketch of such a preprocessing chain follows.
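A minimal sketch of the preprocessing chain described above (noise reduction, correction, region detection, cropping), assuming OpenCV; detect_text_region is a hypothetical stand-in for the pre-trained text character region recognition model, and the denoising parameters are illustrative.

```python
import cv2

def preprocess(image_bgr, detect_text_region):
    # Noise reduction: suppress crease/stain noise before recognition.
    denoised = cv2.fastNlMeansDenoisingColored(image_bgr, None, 10, 10, 7, 21)

    # Image correction: here, grayscale conversion plus contrast
    # normalization; a fuller correction might also deskew the image.
    gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
    corrected = cv2.equalizeHist(gray)

    # Region detection: locate the target image region containing the
    # target character string (detect_text_region is a stand-in for
    # the pre-trained region recognition model mentioned above).
    x, y, w, h = detect_text_region(corrected)

    # Cropping: keep only the target image region.
    return corrected[y:y + h, x:x + w]
```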
In some embodiments, when the target character string to be recognized includes a crown word number on the target currency, and after the target character string in the target image is determined, the method may further include the following steps:
S1: determining the target character string as a crown word number on the target currency;
S2: tracking and determining a transaction circulation path of the target currency according to the crown word number on the target currency;
S3: and determining whether a transaction risk exists according to the transaction circulation path of the target currency.
Through the embodiment, the identified and extracted target character string can be used for subsequent specific data processing according to specific conditions and processing requirements.
In some embodiments, after determining that the target character string is the crown word number on the target currency, the method may further comprise: determining the authenticity of the target banknote according to the crown word number.
In some embodiments, when the target character string to be recognized includes the logistics number on a target express bill, after the target character string in the target image is determined, the method may further comprise: determining that the target character string is the logistics number on the target express bill; and, according to the logistics number, tracking the package or mail carrying the target express bill and feeding the latest logistics information back to the user in time.
In some embodiments, before the method is implemented, the following may further be included:
S1: replacing the combination of a convolutional network layer and a pooling layer with a preset hole convolution layer as the image feature extraction structure in the network model, so as to construct an initial recognition model;
S2: acquiring a sample image containing a sample character string to be recognized, and labeling the sample image to obtain a labeled sample image;
S3: and training the initial recognition model with the labeled sample image to obtain the preset recognition model.
Through this embodiment, a preset recognition model can be constructed and trained in advance that suits the high-difficulty case in which the characters of the string are small, the resolution is low, and relatively few related image features can be extracted.
In some embodiments, the initial recognition model may be obtained by taking the Tiny-YOLOv2 network structure as a basis and using a preset hole convolution layer, in place of the combination of convolutional network layer and pooling layer, as the feature extraction structure of the model.
In some embodiments, labeling the sample image to obtain the labeled sample image may comprise the following: according to a preset labeling rule, labeling a corresponding bounding box for each sample character in the sample character string in the sample image, where the overlap area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold; and labeling the corresponding character category value according to the sample character contained in each bounding box, to obtain the labeled sample image.
Through this embodiment, the sample image can be labeled more effectively and accurately, yielding labeled sample images with a relatively good training effect. A training-loop sketch follows.
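A minimal training-loop sketch for steps S1-S3 above, assuming PyTorch and reusing the hypothetical PresetRecognitionModel from the earlier sketch; the data loader, target encoding, loss combination, and hyperparameters are all illustrative assumptions (a real detection loss would also mask background cells and weight the two terms).

```python
import torch
import torch.nn as nn

def train_preset_model(model, loader, num_anchors=3, num_classes=36,
                       epochs=50, lr=1e-3):
    """`loader` is assumed to yield (image, loc_target, cls_target)
    built from the labeled sample images: per-character bounding boxes
    (with bounded overlap between neighbors) and category values."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loc_loss = nn.SmoothL1Loss()      # bounding box regression
    cls_loss = nn.CrossEntropyLoss()  # character category values
    for _ in range(epochs):
        for image, loc_t, cls_t in loader:
            loc_p, cls_p = model(image)  # both heads in one forward pass
            n, _, h, w = cls_p.shape
            # Separate anchors from class scores: [N, A*C, H, W] ->
            # [N, C, A*H*W], matching cls_t of shape [N, A*H*W].
            cls_p = (cls_p.view(n, num_anchors, num_classes, h * w)
                          .transpose(1, 2).reshape(n, num_classes, -1))
            loss = loc_loss(loc_p, loc_t) + cls_loss(cls_p, cls_t)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```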
As can be seen from the above, before the text string recognition method provided by the embodiments of this specification is put into practice, the model structure of the recognition model used to recognize and extract text strings from images is improved in a targeted way: a preset hole convolution layer replaces the combination of convolutional network layer and pooling layer as the feature extraction structure for extracting image features related to the text string, yielding an improved preset recognition model with better performance. In a specific implementation, after the acquired target image is preprocessed, the preset recognition model is called to process the preprocessed image; the method thus applies effectively to the case in which the characters of the string are small and the resolution low, so that relatively few related image features can be extracted, accurately recognizes and determines the target character string contained in the target image, improves the accuracy and efficiency of text string recognition, and reduces recognition errors. Furthermore, anchor regression uses preset anchor box parameters obtained in advance through clustering, so the generated candidate boxes are relatively more accurate and reasonable, improving the precision of the subsequent bounding box determination. And the candidate boxes of each text character are processed with a preset softened non-maximum suppression algorithm to screen out the bounding boxes, which effectively avoids mistakenly deleting the candidate boxes of adjacent characters when the character spacing is small, a mistake that would leave the finally extracted target character string incomplete.
Referring to fig. 6, another text string recognition method is further provided in the embodiments of the present disclosure. When the method is implemented, the following contents can be included:
S601: acquiring a target image containing a target character string to be recognized;
S602: calling a preset recognition model to process the target image to obtain a corresponding target processing result; the preset recognition model at least comprises a preset hole convolution layer; the preset hole convolution layer is used for replacing the combination of a convolutional network layer and a pooling layer to extract target image features related to the target character string from the target image;
S603: and determining a target character string in the target image according to the target processing result.
Through this embodiment, the target character string can be recognized directly from the target image more efficiently by using the preset recognition model.
The present specification also provides a text character recognition method, including: acquiring a target image containing target characters to be recognized; calling a preset recognition model to process the target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target characters from the target image; and determining the target characters in the target image according to the target processing result.
Embodiments of the present specification further provide a server, including a processor and a memory for storing processor-executable instructions, where the processor, when executing the instructions, may perform the following steps: acquiring a target image containing a target character string to be recognized; preprocessing the target image to obtain a preprocessed target image; calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image; and determining the target character string in the target image according to the target processing result.
In order to execute the above instructions more accurately, referring to fig. 7, the embodiments of the present specification provide another specific server, which includes a network communication port 701, a processor 702, and a memory 703, connected by an internal bus so that these structures can perform specific data interaction.
The network communication port 701 may be specifically configured to acquire a target image including a target character string to be recognized.
The processor 702 may be specifically configured to preprocess the target image to obtain a preprocessed target image; call a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image; and determine the target character string in the target image according to the target processing result.
The memory 703 may be specifically configured to store a corresponding instruction program.
In this embodiment, the network communication port 701 may be a virtual port bound to different communication protocols so as to send or receive different data. For example, it may be a port responsible for web data communication, FTP data communication, or mail data communication. The network communication port may also be a physical communication interface or communication chip, for example a wireless mobile network communication chip such as GSM or CDMA, a Wi-Fi chip, or a Bluetooth chip.
In this embodiment, the processor 702 may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The description is not intended to be limiting.
In this embodiment, the memory 703 may include multiple levels. In a digital system, any device capable of storing binary data may serve as a memory; in an integrated circuit, a circuit with a storage function but without physical form is also called a memory, such as a RAM or a FIFO; in a system, a physical storage device is also called a memory, such as a memory bank or a TF card.
Based on the above text character string recognition method, the present specification further provides a computer storage medium storing computer program instructions which, when executed, implement: acquiring a target image containing a target character string to be recognized; preprocessing the target image to obtain a preprocessed target image; calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image; and determining the target character string in the target image according to the target processing result.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
Based on the above text character string recognition method, the present specification further provides another computer storage medium storing computer program instructions which, when executed, implement: acquiring a target image containing a target character string to be recognized; calling a preset recognition model to process the target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the target image; and determining the target character string in the target image according to the target processing result.
Referring to fig. 8, in a software level, an embodiment of the present specification further provides a text string recognition apparatus, which may specifically include the following structural modules:
the obtaining module 801 may be specifically configured to obtain a target image including a target character string to be recognized;
the preprocessing module 802 may be specifically configured to preprocess the target image to obtain a preprocessed target image;
the calling module 803 may be specifically configured to call a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, where the preset recognition model at least includes a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image;
the determining module 804 may be specifically configured to determine a target character string in the target image according to the target processing result.
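For intuition only, the following minimal Python sketch shows one way the four modules above could be wired together; the class name, method names, and detection format are hypothetical and not taken from the patent, and the preset recognition model is treated as an opaque callable.

```python
from typing import Callable, Dict, List

class TextStringRecognizer:
    """Hypothetical sketch mirroring modules 801-804; names are illustrative only."""

    def __init__(self, model: Callable, preprocess: Callable):
        self.model = model            # preset recognition model with a hole convolution layer
        self.preprocess = preprocess  # e.g. rectify + crop + resize (module 802)

    def recognize(self, image) -> str:
        prepped = self.preprocess(image)                 # module 802: preprocessing
        result: List[Dict] = self.model(prepped)         # module 803: boxes + classes
        # Module 804: order characters left-to-right and join them into the string.
        chars = sorted(result, key=lambda r: r["x"])
        return "".join(r["char"] for r in chars)
```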
It should be noted that, the units, devices, modules, etc. illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. It is to be understood that, in implementing the present specification, functions of each module may be implemented in one or more pieces of software and/or hardware, or a module that implements the same function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Therefore, the text character string recognition apparatus provided in the embodiments of the present specification is well suited to the situation in which the characters of the character string in the image are small, the resolution is low, and relatively few related image features can be extracted; it accurately recognizes and determines the target character string contained in the target image, improves the recognition accuracy and efficiency of text character strings, and reduces recognition errors.
In a specific scenario example, the text character string recognition method provided in the present specification can be applied to accurately recognize the crown word number (serial number) on a banknote. The specific implementation process is as follows.
In the present scenario example, since the position of the crown word number (the target character string to be recognized) on a banknote is approximately fixed, the crown word number region can be roughly determined from known prior information. An existing recognition method generally requires three steps (corresponding to a positioning process and a classification process): accurate positioning, single-character segmentation, and character recognition. Because each step is an independent, isolated process, a certain loss of precision is inevitable in executing the steps in sequence; the errors caused by precision loss in the three steps accumulate layer by layer and ultimately degrade the recognition accuracy. To reduce the error propagation between the different tasks, the three steps are integrated, and an end-to-end crown word number recognition network (corresponding to the preset recognition model) is constructed and trained.
Further, the training data sets used by most existing text detection network models (that is, conventional recognition models) consist largely of high-definition sample images with many pixels, little noise, and regular layouts. During circulation, however, banknotes acquire many creases and stains, and this noise is randomly distributed over the crown word number region, which affects the accuracy of crown word number recognition. Meanwhile, the crown word number characters occupy a small area, so the character images are characterized by high noise and low resolution, making recognition difficult.
In addition, a deep learning model with a deeper network hierarchy can satisfy the accuracy requirement, but often at the cost of time or by depending on high-performance hardware. In the banknote crown word number recognition scenario, however, no high-performance equipment is available and the time requirement is strict. Therefore, most existing network structures cannot meet the actual service requirements.
Based on the above considerations, in order to overcome the limitations of the conventional three-step character recognition process and the inability of existing deep learning network structures to satisfy precision and time efficiency simultaneously, and to obtain the correct crown word number character sequence, an end-to-end crown word number character recognition method for banknotes based on Tiny-YOLOv2 can further be provided in combination with the text character string recognition method of the present specification.
The method can be implemented by the following steps. First, a preprocessed image that roughly locates the crown word number region (for example, the preprocessed target image) is fed into the crown word number recognition network for feature extraction; the traditional pooling operations of part of the network are removed, and the corresponding convolution layers are replaced with hole convolution (for example, the preset hole convolution layer).
Then, based on the extracted features, a plurality of possible candidate character boxes (for example, candidate boxes) are generated, and the characters are initially classified into specific categories. Redundant boxes are removed by a softened non-maximum suppression algorithm (Soft-NMS), so that only the one prediction box with the highest confidence (for example, the bounding box) is ultimately output for each character. The specific coordinate information of each individual character is then determined by anchor regression.
Finally, the crown word number characters are arranged in sequence according to the arrangement characteristics of the crown word number and the obtained coordinate information of each character, yielding the final recognition result (for example, the target processing result).
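For example, assuming the network outputs each character together with its confidence and horizontal center coordinate (a hypothetical format, not specified by the patent), the final arrangement step can be a simple left-to-right sort:

```python
# Hypothetical detections: (character, confidence, x_center) from the network.
detections = [("F", 0.97, 52.0), ("2", 0.95, 118.0), ("A", 0.98, 18.0),
              ("0", 0.91, 151.0), ("7", 0.93, 85.0)]

# Arrange by horizontal position, since a crown word number is a single row.
crown_number = "".join(ch for ch, _, _ in sorted(detections, key=lambda d: d[2]))
print(crown_number)  # -> "AF720"
```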
In an image containing a banknote crown word number, the resolution of the crown word number characters is low, few features can be extracted from them, and the requirements on recognition speed and accuracy are relatively high. The overall recognition network therefore needs to adopt a relatively lightweight network as the base network of the algorithm and build algorithmic improvements on top of it, so as to strike a balance between recognition speed and accuracy.
The image containing the banknote may first be preprocessed, specifically by rectification, cropping, and resizing, to obtain a coarse-positioning image containing the crown word number region (for example, the preprocessed target image). The coarse-positioning image is then fed as input into the crown word number recognition network for feature extraction. The specific category of each character, namely a character in 0-9 or A-Z, is predicted; the specific coordinates of each character are obtained through anchor regression; redundant boxes are removed by the softened non-maximum suppression algorithm so that only the one prediction box with the highest confidence is output for each character; and finally the crown word number characters are arranged in sequence according to the arrangement characteristics of the crown word number and the predicted coordinate information of each character, and the crown word number recognition result is output. This pipeline can be seen in fig. 9.
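A minimal sketch of this preprocessing stage using OpenCV is shown below; the corner coordinates, crop ratios, and output size are illustrative placeholders, not values from the patent.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Rectify, crop, and resize a banknote image to a coarse crown-number region."""
    h, w = image.shape[:2]
    # Rectification: warp the (hypothetically detected) banknote corners to a rectangle.
    src = np.float32([[12, 30], [w - 8, 22], [w - 5, h - 18], [6, h - 25]])  # placeholder corners
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    rectified = cv2.warpPerspective(image, cv2.getPerspectiveTransform(src, dst), (w, h))
    # Crop: the crown word number region is approximately fixed (known prior information).
    region = rectified[int(0.72 * h):int(0.92 * h), int(0.05 * w):int(0.55 * w)]  # placeholder ratios
    # Resize: fixed input size expected by the recognition network (placeholder size).
    return cv2.resize(region, (256, 64), interpolation=cv2.INTER_LINEAR)
```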
The specific treatment may include the following steps.
Step 101: and extracting image features by the hole convolution operation.
In a general convolutional neural network structure, convolution layers extract features and pooling layers screen out part of the features while increasing the receptive field (reception field). However, because the crown word number character images are small, few features can be extracted from them; adopting a pooling layer then does not provide useful feature filtering but instead loses the detailed features that distinguish the characters. Yet if the pooling layer is simply removed, the same receptive field can no longer be guaranteed. Hole convolution is therefore adopted: without reducing the character feature information, it preserves a receptive field equivalent to that obtained with pooling.
In this scenario example, hole convolution expands the convolution kernel to the scale set by an expansion coefficient, with the inserted positions filled with 0, so that each convolution kernel can extract feature information over a larger range than before. Fig. 5 shows a 3 × 3 convolution kernel and the convolution it actually performs after an expansion operation with an expansion coefficient of 1.
Hole convolution consumes no more time than ordinary convolution, and since it does not increase the number of parameters, the previous effect can be achieved with a smaller convolution kernel. Meanwhile, as the receptive field grows, pooling layers can be reduced accordingly, which reduces the loss of information.
The usual convolution-plus-pooling operation can thus be replaced by hole convolution, which speeds up the convolutional neural network without extra time consumption. For crown word number recognition, pooling is a feature-reducing process, but the resolution of the crown word number characters is already low. If pooling is performed several times, the already scarce feature information of the characters is further lost, so that local character features are insufficiently extracted, similar-looking characters cannot be distinguished clearly, and the accuracy of crown word number recognition drops. Reducing the pooling layers therefore improves the recognition accuracy.
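A minimal PyTorch sketch of this substitution is given below; the channel counts and feature-map size are illustrative. PyTorch's dilation=2 inserts one zero between adjacent kernel weights, which matches the expansion described above under the assumption that the expansion coefficient counts inserted zeros: a 3 × 3 kernel then covers a 5 × 5 receptive field with the same nine weights, so the receptive field grows without pooling away detail.

```python
import torch
import torch.nn as nn

# Conventional block: convolution followed by pooling halves the feature map,
# discarding fine detail that low-resolution crown-number characters cannot spare.
conv_pool = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Hole (dilated) convolution block: dilation=2 expands the 3x3 kernel to a 5x5
# receptive field with the same nine weights, and no resolution is lost.
dilated = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 32, 64, 256)  # batch of feature maps (illustrative size)
print(conv_pool(x).shape)        # torch.Size([1, 64, 32, 128]) -- downsampled
print(dilated(x).shape)          # torch.Size([1, 64, 64, 256]) -- full resolution
```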
Step 102: and presetting anchor points through k-means clustering.
An object detection framework usually presets boxes of different sizes and aspect ratios on the image in advance; these boxes are called anchors. For an object detection network, setting the anchor boxes to reasonable values both accelerates network convergence and ensures the final detection effect.
The values of the anchors are generally set manually; the well-known Faster R-CNN, for example, designs 9 different anchors. Manually designed anchors, however, cannot be well adapted to the banknote crown word number character data set. Therefore, a K-means clustering algorithm is used to automatically generate anchors suited to the data set.
Drawing on the K-means clustering algorithm, the adopted clustering method replaces the original Euclidean distance with an Intersection-over-Union (IoU) based distance to generate the initial anchors, which ensures that the clustering error is independent of the absolute size of the ground-truth boxes.
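The following NumPy sketch illustrates this IoU-based k-means under the assumption that each labeled character box is reduced to its (width, height) pair and that boxes and cluster centers are compared as if co-centered; the sample sizes and the number of clusters k are illustrative.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IoU between (w, h) pairs, treating boxes and centroids as co-centered."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """Cluster box sizes with distance d = 1 - IoU instead of Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min d == max IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Illustrative (w, h) sizes of labeled character boxes, in pixels.
sizes = np.array([[14, 22], [15, 21], [13, 23], [16, 20], [14, 24], [15, 22]])
print(kmeans_anchors(sizes, k=2))
```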
Step 103: filtering out redundant bounding boxes with softened non-maxima suppression.
After bounding boxes are obtained by anchor regression, multiple detections may exist for the same character class, so each character may have several overlapping boxes. Object detection networks usually screen for the higher-confidence bounding box by non-maximum suppression, whose core idea is to keep only one optimal box per character: the box with the highest confidence is selected as the reference; for any candidate box overlapping it, the ratio of the overlapping area to the total area of the two boxes is computed; if the ratio exceeds a set threshold, the candidate box's confidence is set to 0, that is, the box is rejected; if the ratio is below the threshold, the box may belong to another character, so its confidence is left unchanged and the box is kept. This works well when a picture contains multiple targets spaced far apart.
However, the characters of a banknote crown word number are densely distributed with small intervals, and under noise interference the overlap between the candidate boxes of two characters easily becomes large. When one character is selected as the reference and another character overlaps it heavily, that character's detection is easily lost. As can be seen in fig. 3, when character 4 (the 4th character from left to right) is selected as the reference, the candidate box of character 0 (the 5th character from left to right) is deleted.
To cope with this, a softened non-maximum suppression method is adopted in the present scenario example: for candidate boxes whose overlap ratio exceeds the threshold, the confidence is no longer set directly to 0 but is attenuated by a linear function. The attenuation functions are expressed by the following formulas, where formula (1) represents ordinary non-maximum suppression and formula (2) represents its softened version:
$$S_i = \begin{cases} S_i, & \mathrm{IoU}_i < T \\ 0, & \mathrm{IoU}_i \ge T \end{cases} \qquad (1)$$

$$S_i = \begin{cases} S_i, & \mathrm{IoU}_i < T \\ S_i \, (1 - \mathrm{IoU}_i), & \mathrm{IoU}_i \ge T \end{cases} \qquad (2)$$
where $S_i$ denotes the confidence of the $i$-th candidate box, $\mathrm{IoU}_i$ denotes the ratio of the overlapping area between the candidate box and the reference box to their total area, and $T$ denotes the set threshold. With softened non-maximum suppression, heavily overlapping candidate boxes of other characters are not deleted outright. The method suits images such as banknote crown word numbers, which have few pixels and small spacing between objects; it increases the detection rate and thereby improves the final recognition accuracy.
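A NumPy sketch of the greedy procedure with the linear decay of formula (2) follows; the box format and threshold are illustrative assumptions, and formula (1) would simply zero the decayed scores instead of attenuating them.

```python
import numpy as np

def iou(ref: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU of one reference box against an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(ref[0], boxes[:, 0])
    y1 = np.maximum(ref[1], boxes[:, 1])
    x2 = np.minimum(ref[2], boxes[:, 2])
    y2 = np.minimum(ref[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a_ref = (ref[2] - ref[0]) * (ref[3] - ref[1])
    a_box = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a_ref + a_box - inter)

def soft_nms_linear(boxes: np.ndarray, scores: np.ndarray, T: float = 0.3):
    """Greedy selection with the linear confidence decay of formula (2).

    Hard NMS (formula (1)) would set the decayed scores to 0 instead.
    """
    scores = scores.astype(float)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        i = max(remaining, key=lambda j: scores[j])   # reference: highest confidence
        keep.append(i)
        remaining.remove(i)
        if remaining:
            ov = iou(boxes[i], boxes[np.array(remaining)])
            decay = np.where(ov >= T, 1.0 - ov, 1.0)  # S_i *= (1 - IoU_i) when IoU_i >= T
            scores[remaining] *= decay
    return keep, scores
```

Because overlapped neighbors are only attenuated rather than removed, a candidate box belonging to an adjacent, densely packed character survives with a reduced score instead of being deleted by mistake.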
This scenario example verifies that the text character string recognition method provided in the present specification can integrate the character segmentation and classification problems in one network model, thereby solving the feature loss caused by separating segmentation from classification; that extracting the image features of the crown word number characters with hole convolution alleviates the feature loss that pooling inflicts on low-pixel images such as banknote crown word number characters; and that smoothing the original evaluation function with the softened non-maximum suppression algorithm solves the problem that, because the crown word number characters are densely distributed and their detection boxes overlap, candidate boxes of adjacent characters are easily deleted by mistake.
Although the present specification provides method steps as described in the examples or flowcharts, additional or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. The terms first, second, etc. are used to denote names, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus necessary general hardware platform. With this understanding, the technical solutions in the present specification may be essentially embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments in the present specification.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
While the specification has been described by way of examples, those skilled in the art will appreciate that there are numerous variations and permutations that do not depart from the spirit of the specification, and it is intended that the appended claims cover such variations and modifications.

Claims (15)

1. A text character string recognition method, comprising: acquiring a target image containing a target character string to be recognized; preprocessing the target image to obtain a preprocessed target image; calling a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, wherein the preset recognition model at least comprises a preset hole convolution layer, and the preset hole convolution layer is used to replace a combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image; and determining the target character string in the target image according to the target processing result.

2. The method according to claim 1, wherein the preset recognition model further comprises a positioning sub-model; the positioning sub-model is connected with the preset hole convolution layer and is used to generate a plurality of corresponding candidate boxes for each text character in the target character string according to the target image features and preset anchor box parameters, and to screen out, for each text character, one candidate box meeting requirements from the corresponding plurality of candidate boxes as the bounding box of the text character, the bounding box carrying position information of the text character it contains.

3. The method according to claim 2, wherein the preset anchor box parameters are obtained in the following manner: acquiring a sample image containing a sample character string; according to a preset labeling rule, labeling a corresponding bounding box for each sample character of the sample character string in the sample image, and collecting bounding box parameters of the sample characters, wherein the overlapping area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold; and clustering the bounding box parameters of the sample characters to obtain the preset anchor box parameters.

4. The method according to claim 2, wherein screening out, for each text character, one candidate box meeting requirements from the corresponding plurality of candidate boxes as the bounding box of the text character comprises: screening out, for a current text character in the target character string, one candidate box meeting requirements from the corresponding plurality of candidate boxes as the bounding box in the following manner: calling a preset softened non-maximum suppression algorithm to process the plurality of candidate boxes, so as to screen out from them one candidate box whose confidence meets requirements as the bounding box of the current text character, and filtering out the candidate boxes other than the bounding box.

5. The method according to claim 2, wherein the preset recognition model further comprises a classification sub-model; the classification sub-model is connected with the hole convolution layer and is used to recognize and determine a category value of each text character in the target character string to be recognized according to the target image features.

6. The method according to claim 1, wherein preprocessing the target image comprises: detecting the target image, and determining in it a target image region containing the target character string to be recognized; and cropping the target image region from the target image as the preprocessed target image.

7. The method according to claim 6, wherein preprocessing the target image further comprises: performing image correction processing on the target image; and/or performing noise reduction processing on the target image.

8. The method according to claim 1, wherein the target character string to be recognized comprises at least one of: a crown word number on a target currency, a drawer account number on a target check, and a logistics number on a target express waybill.

9. The method according to claim 8, wherein, in a case where the target character string to be recognized comprises the crown word number on the target currency, after determining the target character string in the target image, the method further comprises: determining the target character string as the crown word number on the target currency; tracking and determining a transaction circulation path of the target currency according to the crown word number on the target currency; and determining whether a transaction risk exists according to the transaction circulation path of the target currency.

10. The method according to claim 1, further comprising: using a preset hole convolution layer to replace the combination of a convolution network layer and a pooling layer as the image feature extraction structure in a network model, so as to construct an initial recognition model; acquiring a sample image containing a sample character string to be recognized, and labeling the sample image to obtain a labeled sample image; and training the initial recognition model with the labeled sample image to obtain the preset recognition model.

11. The method according to claim 10, wherein labeling the sample image to obtain the labeled sample image comprises: according to a preset labeling rule, labeling a corresponding bounding box for each sample character of the sample character string in the sample image, wherein the overlapping area between the bounding boxes of two adjacent sample characters is smaller than a preset area threshold; and labeling a corresponding category value according to the sample character contained in each bounding box, so as to obtain the labeled sample image.

12. A text character string recognition apparatus, comprising: an acquisition module configured to acquire a target image containing a target character string to be recognized; a preprocessing module configured to preprocess the target image to obtain a preprocessed target image; a calling module configured to call a preset recognition model to process the preprocessed target image to obtain a corresponding target processing result, wherein the preset recognition model at least comprises a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the preprocessed target image; and a determining module configured to determine the target character string in the target image according to the target processing result.

13. A text character string recognition method, comprising: acquiring a target image containing a target character string to be recognized; calling a preset recognition model to process the target image to obtain a corresponding target processing result, wherein the preset recognition model at least comprises a preset hole convolution layer, and the preset hole convolution layer is used to replace the combination of a convolution network layer and a pooling layer to extract target image features related to the target character string from the target image; and determining the target character string in the target image according to the target processing result.

14. A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method according to any one of claims 1 to 11.

15. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 11.
CN202110370697.1A 2021-04-07 2021-04-07 Text character string recognition method, device and server Active CN113095313B (en)

Publications (2)

CN113095313A (application publication): 2021-07-09
CN113095313B (granted publication): 2025-07-18

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载