Background
The handwritten mathematical formula identification can identify and analyze the mathematical formula from the handwritten material, and has important application in many fields such as education, scientific research, engineering design, data analysis and the like. The technology can help us to quickly and accurately convert and process complex mathematical formulas, reduce repeated labor and provide convenience for life and work of people. For example, in the educational field, handwriting recognition can help teachers and students to quickly convert formulas in a blackboard or note into electronic documents, improving teaching and learning efficiency. In the scientific research field, researchers can utilize the technology to extract and analyze mathematical formulas from manuscripts rapidly, and the scientific research process is accelerated. In addition, handwriting formula recognition can also help engineers and data analysts to quickly convert and apply various complex formulas in design and data processing, thereby improving work efficiency and accuracy. Through carrying out efficient recognition and digital processing on the handwriting formula, a user can integrate and utilize the information better, and the development of the related field is promoted.
Handwriting mathematical formula recognition is classified into online recognition and offline recognition, which differ in the manner of input, in that online input is dynamic in nature and offline input is static. The on-line input is a handwriting stroke track, the track comprises the sequence, curvature and the like of the strokes, and the off-line input is an image containing a handwriting mathematical formula. The recognition method of the invention belongs to an offline handwriting mathematical formula recognition method.
The mathematical formula consists of symbols and symbol relations, wherein the symbols comprise Arabic numerals, english letters, greek letters, operators, other special characters and the like, the symbol relations are different from the character sequence relations of the conventional linear arrangement, the characteristics of multilevel and high structuring are presented, and mathematical symbols such as scores, root numbers, integrals and the like not only comprise basic characters, but also involve complex structures such as upper and lower marks, nested symbols and the like, and the recognition difficulty is greatly increased. Therefore, in the task of recognition of handwritten mathematical formulas, a core challenge is how to efficiently model the sign and structural relationships of the mathematical formulas. The recognition method is not only required to accurately recognize characters in the image, but also is more important to construct complex hierarchical relationships among the characters, and in addition, the recognition difficulty is greatly increased due to the diversity of handwriting, the similarity of handwriting symbols, the difference of handwriting styles, the complex structure of mathematical formulas and the dependency of the context.
The handwriting mathematical formula recognition technology uses a rule-based method, and with the continuous progress of the deep learning technology, the encoder-decoder-based recognition method can tightly combine an input image with recognition contents, so that a model can fully understand the contents in the image and transcribe into an accurate character sequence according to the image contents, has stronger modeling capability, and shows more excellent performance compared with the traditional method.
In a computer system, the representation method of the mathematical formula is mainly divided into two types of representation based on a sequence and representation based on a tree structure, and two corresponding decoding models, namely a model based on sequence decoding and a model based on tree decoding, are also developed by the identification method based on an encoder-decoder in handwriting mathematical formula identification. The method based on the sequence decoding model takes the LaTeX expression as a decoding target, has the advantages that the LaTeX expression is a widely adopted mathematical formula expression, can be well adapted to the existing research results in the fields of natural language processing, optical character recognition and the like, and has stronger universality. The method based on the tree decoding model regards a mathematical formula as a tree structure to carry out predictive decoding. The model design fully integrates the characteristic of the tree structure, which enhances the ability of the model to understand the mathematical formula in theory, and ensures that the model can always decode the legal mathematical formula in the predictive reasoning process.
Methods based on sequence decoding models have certain limitations. First, the model does not fully take into account the tree-like structure characteristics of the mathematical formulas during the training process, which may lead to insufficient understanding of the formula structure by the model, thereby affecting its generalization ability when processing complex structural formulas. Secondly, as a target of sequence prediction, the LaTeX expression cannot guarantee the grammar normalization of the model decoding result. For example, in processing complex mathematical formulas, the model may produce left and right brackets that do not match, resulting in predictions that do not conform to LaTeX's grammar specification.
Meanwhile, the prior method based on the tree-shaped decoding model has the defects. First, these models need to be built depending on the recurrent neural network, and cannot be trained in parallel with high efficiency. In addition, these models are relatively complex in decoding manner, lack of versatility and wide applicability as compared with LaTeX expression, and their performance is generally inferior to the sequence decoding model-based method, resulting in failure of the tree decoding method to find wide application.
Disclosure of Invention
Aiming at the respective advantages and disadvantages of the two decoding models, the invention combines the advantages of the sequence decoding model and the tree-shaped decoding model, and provides a handwritten mathematical formula recognition method and a handwritten mathematical formula recognition system based on formula tree structure perception.
The technical scheme adopted by the invention is as follows:
a handwritten mathematical formula recognition method comprising the steps of:
constructing handwritten mathematical formula labeling data;
Constructing a handwritten mathematical formula recognition initial model, training the initial model by using labeling data, and performing joint optimization on a sequence decoding task and a tree structure prediction task to obtain the handwritten mathematical formula recognition model;
And carrying out handwriting mathematical formula recognition by adopting a handwriting mathematical formula recognition model obtained through training.
Further, the constructing handwritten mathematical formula annotation data includes converting the mathematical formula into a tree structure, the converting the mathematical formula into a tree structure including representing the tree structure of the mathematical formula as a series of tuples, wherein each tuple (c, p) includes a child node c of each subtree and its parent node p.
Further, the model structure of the initial model includes:
The visual coding model is used for extracting high-level visual characteristics from the input handwritten mathematical formula sheet;
the image position coding module and the character position coding module are used for respectively providing explicit position information for the visual characteristics and the word embedding vectors;
The decoding model is used for outputting a character semantic feature sequence;
the formula tree structure perception module is used for constructing the feature vector output by the decoding model into a formula structure tree;
and the projection layer is used for decoding the mathematical formula sequence in an autoregressive mode by utilizing the feature vector output by the decoding model.
Further, the decoding model employs a transducer decoding model that introduces an overlay attention mechanism.
Further, the processing procedure of the formula tree structure sensing module comprises the following steps:
Inputting a character semantic feature sequence X extracted by a decoding model;
Performing feature extraction on the character semantic feature sequence X by using a transducer encoder to obtain X';
The feature vector X' output by the transducer encoder is mapped into a child node vector and a father node vector through two linear projection functions F c (-) and F p (-), respectively;
Combining and adding the child node vectors and the father node vectors in pairs to obtain a relation feature matrix M, wherein a vector M i,j at each position in the matrix M represents a relation feature vector of which the character with the index of i is used as a child node and the character with the index of j is used as a father node;
Using function F score (·) to convert the relationship feature matrix M to a relationship score matrix S, function F score (·) consisting of a ReLU activation function and a vector dot product operation;
Each element S i,j in the relation score matrix S output by the formula tree structure perception module represents relation scores with the ith character as a child node and the jth character as a father node, wherein the higher the score is, the greater the possibility of establishing father-son relation is, and for the element S a,b with the highest score in each row, the model prediction is considered to have father-son relation with the a symbol as a child node and the b symbol as a father node, so that the whole mathematical formula tree structure is constructed by establishing the relation among the characters.
Further, the training strategy of the initial model comprises the steps of combining a loss function of a sequence decoding task and a loss function of a tree structure prediction task to train the model;
The loss function of the sequence decoding task is:
Wherein, X t represents the feature vector of the t time step output by the decoding model, W o represents the linear projection parameter matrix, and b o represents the bias vector;
The loss function of the tree structure prediction task is as follows:
s is a relation score matrix output by the formula tree structure sensing module and used for estimating the probability of parent-child relation between each character t and other characters.
Further, the handwriting mathematical formula recognition by using the handwriting mathematical formula recognition model obtained by training comprises:
Acquiring a handwritten mathematical formula image to be identified;
Preprocessing a handwritten mathematical formula image to be recognized;
Inputting the preprocessed handwritten mathematical formula image into a handwritten mathematical formula recognition model obtained through training, predicting each symbol in the formula image by the model, and generating an initial recognition sequence;
generating a group of candidate sequences by adopting a beam search strategy for the initial identification sequence, and calculating a sequence decoding score for each candidate sequence;
Calculating a relation score matrix of the candidate sequences by using a formula tree structure perception module, and further calculating a tree structure prediction score for each candidate sequence;
and adding the sequence decoding score and the tree structure prediction score, and selecting a candidate sequence with the highest comprehensive score as a final output sequence.
A handwritten mathematical formula recognition system, comprising:
the annotation data construction module is used for constructing annotation data of the handwriting mathematical formula;
the model training module is used for constructing a handwritten mathematical formula recognition initial model, training the initial model by using the labeling data, and carrying out joint optimization on a sequence decoding task and a tree structure prediction task to obtain the handwritten mathematical formula recognition model;
And the formula recognition module is used for carrying out handwriting mathematical formula recognition by adopting the handwriting mathematical formula recognition model obtained through training.
The beneficial effects of the invention are as follows:
The invention effectively improves the understanding and generalization capability of the model to the mathematical formula structure by fusing the methods based on the sequence decoding and the tree decoding, and obviously improves the accuracy of handwriting mathematical formula identification on the basis of maintaining the high-efficiency parallel training advantages of the existing encoder-decoder model method. Specifically, the invention can perform joint optimization on the two tasks in the training stage, and simultaneously trains the understanding capability of the model on the two aspects of the sequence arrangement and the tree structure of the mathematical formula. In the reasoning prediction stage, the invention provides that the tree structure prediction scoring mechanism is integrated into the beam search algorithm, so that the model considers the sequence decoding of a mathematical formula and the rationality of the tree structure when generating the LaTeX expression, and the identification accuracy is further improved. In summary, the recognition method provided by the invention improves the recognition capability of the sequence decoding model on the handwriting mathematical formula of the complex structure.
Detailed Description
The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.
The invention combines the advantages of a sequence decoding model and a tree-shaped decoding model, provides a handwritten mathematical formula recognition method based on formula tree structure perception, introduces a formula tree structure perception module on the basis of the flexibility and high-efficiency training efficiency of the sequence decoding model, and improves the understanding and generalization capability of complex formula structures. By carrying out joint optimization on two tasks of sequence prediction and tree structure prediction, the method can decode the LaTeX sequence and consider the rationality of the formula tree structure, so that the recognition result is more complete and accurate in grammar structure. Experimental results on a plurality of data sets show that compared with the traditional sequence decoding model and tree-shaped decoding model, the method provided by the invention has better recognition performance, and the potential of a method for combining sequence and tree-shaped decoding in the field of handwriting mathematical formula recognition is also demonstrated.
The invention relates to a handwritten mathematical formula recognition method based on an encoder-decoder, which completes a recognition flow by constructing a deep neural network model. Firstly, constructing handwritten mathematical formula labeling data, constructing a handwritten mathematical formula identification initial model, training the initial model by using the labeling data, and automatically learning visual characteristics of the handwritten mathematical formula and semantic structures of LaTeX expressions from the data to obtain a final model. And then, carrying out handwriting mathematical formula recognition by adopting a final model obtained through training.
And 1, constructing handwritten mathematical formula labeling data, and converting a mathematical formula based on a LaTeX expression into a tree structure.
In particular, the present invention represents the tree structure of the mathematical formula as a series of tuples, wherein each tuple (c, p) contains the child node c of each subtree and its parent node p. Notably, in order to disambiguate the repeated symbols that may be introduced and ensure compatibility with the LaTeX expression labels used by the sequence decoding model, the present invention chooses to use the index of the mathematical symbol in the LaTeX expression as the node identification in the formula tree, rather than directly using the mathematical symbol itself. For example, for mathematical formula 3 2 -1=8 in fig. 1, its corresponding LaTeX expression is labeled "3 {2} -1=8", and its formula tree structure label may be expressed as "(0, -1), (1, -1), (2, -1), (3, 0), (4, -1), (5, 0), (6, 5), (7, 6), (8, 7)", where-1 indicates that the node has no parent node, i.e., the node does not participate in the tree structure building process. By the method, the sequence labeling and tree structure labeling data of the mathematical formula are unified, and further, the model is allowed to perform end-to-end joint optimization on sequence decoding and tree structure prediction tasks.
And 2, constructing a handwritten mathematical formula to identify an initial model.
The model structure of the invention is shown in figure 1, and mainly comprises four parts, namely 1) a visual coding model for extracting high-level visual features with rich meaning from an input handwritten mathematical formula. The module adopts DenseNet model widely applied in the field of handwriting mathematical formula recognition as backbone network. 2) The image position coding module and the character position coding module respectively provide explicit position information for the visual features and the word embedding vectors. The image position coding module and the character position coding module use the same sine and cosine position coding. 3) The invention adopts a transform decoding model which introduces an overlay attention mechanism. 4) And the formula tree structure perception module is used for constructing the feature vector output by the decoding model into a formula structure tree to help the model to better capture the structural information of the mathematical formula. 5) And the projection layer is used for decoding the LaTeX sequence in an autoregressive mode by utilizing the feature vector output by the decoding model.
The visual coding model can be implemented by adopting DenseNet models and the like.
The decoding model is shown in fig. 1 and comprises a self-attention module, an overlay attention module and a feedforward neural network. The self-attention module is responsible for modeling the information interaction between the current decoding position and the previous position, the overlay attention module is responsible for capturing the information interaction between the decoder and the encoder, guiding the decoder to generate a final output sequence by utilizing the characteristic information generated by the encoder, and the feedforward neural network is responsible for carrying out nonlinear transformation on the representation after attention calculation, so that the expressive power of the model is enhanced and more accurate characteristic characterization is generated.
An architectural overview of the proposed formula tree structure aware module is shown in fig. 2, where the "SOS" flag is used to indicate the start of a sequence. The input of the module is a character semantic feature sequence X epsilon R T×d extracted by a decoding model, wherein T represents the LaTeX sequence length, and d represents the feature vector dimension of the character. In order to map the semantic features of the characters into the semantic space of the formula tree structure, a transducer encoder is used for extracting features of the character semantic feature sequence X to obtain X' E R T×d, so that the tree structure relationship can be conveniently built in the semantic space later.
X′=TransformerEncoder(X),
Then, the feature vector X' output by the transducer encoder is mapped into a child node vector X c∈RT×d and a parent node vector X p∈RT×d, respectively, by two linear projection functions F c (·) and F p (·).
Xc=Fc(X′)=X′Wc,
Xp=Fp(X′)=X′Wp,
Where W c∈Rd×d and W p∈Rd×d are both trainable linear projection parameter matrices, where d represents the feature vector dimensions of the character. In order to construct the formula tree structure relation between characters, the invention combines and adds the child node vector and the father node vector pairwise to obtain the relation characteristic matrix M. In this matrix, the vector M i,j∈Rd for each position represents a relationship feature vector in which the character with index i in the LaTeX expression is used as a child node and the character with index j is used as a parent node.
Finally, the present invention uses a function F score (·) consisting of a ReLU activation function and a vector dot product operation to convert the relationship feature matrix M into a relationship score matrix S ε R T ×T:
S=Fscore(M)=max(0,M)vs.
wherein v s is a weight vector.
Each element S i,j in the relationship score matrix S output by the formula tree structure sensing module represents a relationship score with the ith character as a child node and the jth character as a parent node, where a higher score means a higher likelihood that a parent-child relationship is established. In the mathematical formula tree structure, each node except the root node has only one parent node. Therefore, for the element S a,b with the highest score in each row, the model prediction can be considered to have a parent-child relationship with the a symbol as a child node and the b symbol as a parent node, so that the whole mathematical formula tree structure is constructed by establishing the relationship between characters.
And step 3, training the initial model by using the labeling data to obtain a final model.
1) Training strategy
In order to combine the tasks of sequence decoding and tree decoding and thus enable the model to perform end-to-end joint optimization on both tasks, the present invention combines the sequence decoding and tree structure prediction loss functions to train the model.
Specifically, for the LaTeX expression sequence y 1,…,yT, in the sequence decoding task, the feature vector X e R T×L output by the decoding model can be used to calculate the probability that each character appears at time step t, and then the cross entropy loss function is used to calculate the loss L seq of the sequence decoding task.
Where X t represents the eigenvector of the t-th time step of the decoding model output, W o represents the linear projection parameter matrix, and b o represents the bias vector.
In the tree structure prediction task, the relation score matrix S e R T×T output by the formula tree structure perception module is used for estimating the probability of the parent-child relation between each character t and other characters, and then the cross entropy loss function is also used for calculating the loss L struct of the tree structure prediction task.
And finally, adding the loss of the sequence decoding and tree structure prediction tasks to obtain a loss function L for training the method.
L=Lseq+Lstruct
2) Model training
According to the invention, the PyTorch framework can be used for constructing an initial model, the labeling data is used for performing supervised training, and the model optimization super parameters are adjusted according to experimental conditions to obtain a final model.
And 4, carrying out handwriting mathematical formula recognition by adopting a final model obtained through training.
1) And acquiring a formula image to be identified, and acquiring a handwriting formula image to be identified by using a scanning or mobile equipment photographing method. The sample is shown in fig. 3, for example.
2) Preprocessing a handwriting formula to be identified, performing image enhancement and noise removal on the formula image, and then performing inclination correction and scale normalization operation on the formula image, so that the formula image to be identified and the image in the training image data set have the same scale and angle.
3) And inputting the preprocessed formula image into a final model obtained by training, predicting each symbol in the formula image by the model, and generating an initial recognition sequence.
4) A bundle search strategy is applied to the initial recognition sequence to generate a set of candidate sequences, and a sequence decoding score S seq (y) is calculated for each candidate sequence y using the forward and reverse sequence decoded bundle search scores.
5) And calculating a relation score matrix of the candidate sequences by using a formula tree structure perception module, constructing corresponding tree structure labels by using the candidate sequences y, calculating cross entropy loss of the relation score matrix and the tree structure labels, and calculating a tree structure prediction score S struct (y) for each candidate sequence by using negative cross entropy loss.
6) And adding the sequence decoding score and the tree structure prediction score, namely S seq(y)+Sstruct (y), and selecting the candidate sequence with the highest comprehensive score as a final output sequence.
By considering the scores on sequence decoding and tree structure prediction at the same time, the method ensures the accuracy of the final generated sequence and also considers the rationality of the formula tree structure.
The key point of the invention is a formula tree structure sensing module. The module effectively improves the understanding of the model to the mathematical formula structure and the generalization capability thereof by fusing the methods based on the sequence decoding and the tree decoding. Specifically, the module can perform joint optimization on the two tasks in a training stage, and simultaneously trains the understanding capability of the model on the two aspects of the sequence arrangement and the tree structure of the mathematical formula. In the reasoning prediction stage, the invention provides that the tree structure prediction scoring mechanism is integrated into the beam search algorithm, so that the model considers the sequence decoding of a mathematical formula and the rationality of the tree structure when generating the LaTeX expression, and the identification accuracy is further improved.
The above embodiment takes a mathematical formula expression in a LaTeX format as an example to describe the handwriting mathematical formula recognition method of the present invention. In addition to LaTeX expressions, the method of the present invention is also applicable to other forms of mathematical formula expressions, such as MathML, etc.
The performance of the model of the present invention on formula recognition accuracy (ExpRate) is compared with the existing most advanced model on CROHME 2014/2016/2019 test datasets common to handwritten mathematical formula recognition tasks. None of the methods uses data enhancement to ensure fair comparison. As shown in Table 1, the process of the present invention is always superior to the existing process in all indexes.
TABLE 1 formula identification accuracy comparison of the invention with the prior art
Another embodiment of the present invention provides a handwritten mathematical formula recognition system, comprising:
the annotation data construction module is used for constructing annotation data of the handwriting mathematical formula;
the model training module is used for constructing a handwritten mathematical formula recognition initial model, training the initial model by using the labeling data, and carrying out joint optimization on a sequence decoding task and a tree structure prediction task to obtain the handwritten mathematical formula recognition model;
And the formula recognition module is used for carrying out handwriting mathematical formula recognition by adopting the handwriting mathematical formula recognition model obtained through training.
The above-mentioned division of the modules is only illustrative, and the above-mentioned functions can be distributed by different functional modules according to the needs in practical application to complete all or part of the functions described in the above-mentioned method. The specific working process of each module may refer to the corresponding process in the foregoing method embodiment, which is not described herein.
Another embodiment of the invention provides a computer device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor and a processor, the computer program comprising instructions for performing the steps of the method of the invention.
Another embodiment of the invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The above-disclosed embodiments of the present invention are intended to aid in understanding the contents of the present invention and to enable the same to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the examples of the specification, but rather by the scope of the invention as defined in the claims.