JP2021022367A

JP2021022367A - Image processing method and information processor

Info

Publication number: JP2021022367A
Application number: JP2020092452A
Authority: JP
Inventors: ジャン・ホォイガン; Hui Gang Zhang; 留安汪; Liu An Wang; 俊孫; Shun Son
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-07-29
Filing date: 2020-05-27
Publication date: 2021-02-18
Also published as: CN112381079A

Abstract

PROBLEM TO BE SOLVED: To provide an image processing method and an information processing apparatus. SOLUTION: The image processing method includes inputting an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and inputting the extracted text features into a recurrent neural network model for semantic recognition, which is connected to the convolutional neural network model, to recognize the text in the image to be processed. The convolutional neural network model and the recurrent neural network model are obtained by starting from an initial convolutional neural network model and an initial recurrent neural network model that are connected to each other and each have to-be-determined layers, searching a predefined candidate model space for each to-be-determined layer, and jointly training the connected convolutional neural network model and recurrent neural network model in an end-to-end manner. [Selected drawing] Fig. 1

Description

The present invention relates to the field of image processing, and more particularly to an image processing method for recognizing text contained in an image and an information processing apparatus capable of implementing the image processing method.

Recognition of text in images, for example handwritten character recognition (OCR), is one of the research topics in the field of computer vision.

So far, much of the research on English text has been based on deep convolutional neural networks (DCNNs). These studies treat recognition of text made up of, for example, handwritten characters as an image classification problem and assign one class label to each English word (about 90,000 words in total). The result is a large model trained over a very large number of classes. Because the number of basic combinations of such word sequences exceeds one million, the approach cannot be extended to the characters of other languages, such as Chinese or Japanese text. A DCNN-based English character recognition system therefore cannot be used directly for image-based text sequence recognition. Adapting such a system to the recognition of Chinese character sequences, for example, would require a large amount of manual design effort to redesign and retrain it.

A method is therefore desirable that can recognize text contained in an image using a model that is easier to obtain and that is suited to character recognition in various languages (for example, Chinese character recognition).

In view of the desire, described above, for a method of recognizing text contained in an image using a model that is easier to obtain, an object of the present invention is to provide an image processing method that can perform text recognition using a convolutional neural network (CNN) and a recurrent neural network (RNN) that are connected to each other and obtained through search and end-to-end training, and an information processing apparatus capable of implementing the image processing method.

According to one aspect of the present invention, an image processing method is provided. The image processing method includes:
inputting an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and
inputting the extracted text features into a recurrent neural network model for semantic recognition, which is connected to the convolutional neural network model, to recognize the text in the image to be processed,
wherein the convolutional neural network model and the recurrent neural network model are obtained by starting from an initial convolutional neural network model and an initial recurrent neural network model that are connected to each other and each have to-be-determined layers, searching a predefined candidate model space for each to-be-determined layer, and jointly training the connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

According to another aspect of the present invention, an information processing apparatus is provided. The information processing apparatus includes a processor configured to:
input an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and
input the extracted text features into a recurrent neural network model for semantic recognition, which is connected to the convolutional neural network model, to recognize the text in the image to be processed,
wherein the convolutional neural network model and the recurrent neural network model are obtained by starting from an initial convolutional neural network model and an initial recurrent neural network model that are connected to each other and each have to-be-determined layers, searching a predefined candidate model space for each to-be-determined layer, and jointly training the connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

According to another aspect of the present invention, a program is provided that causes a computer to execute the image processing method described above.

According to another aspect of the present invention, a corresponding storage medium is also provided. The storage medium stores machine-readable instruction code that, when read and executed by a machine (for example, a computer), causes the machine to execute the image processing method described above.

According to the aspects of the present invention described above, at least the following advantage can be obtained. When text in an image to be processed is recognized, a CNN model and an RNN model that are connected to each other and obtained through search and end-to-end training are used. Such a model reduces manual intervention in the early-stage model building process and therefore reduces the cost of obtaining the model.

A flowchart of an exemplary flow of an image processing method according to an embodiment of the present invention.
A diagram showing the general architecture of the connected CNN model and RNN model used in the image processing method shown in FIG. 1.
A diagram illustrating the specific structure of the architecture shown in FIG. 2.
A diagram showing a to-be-determined layer in the CNN model of FIG. 3.
A diagram showing a to-be-determined layer in the RNN model of FIG. 3.
A flowchart of exemplary processing for obtaining the connected CNN model and RNN model used in the image processing method shown in FIG. 1.
A block diagram showing an exemplary structure of an image processing apparatus according to an embodiment of the present invention.
A diagram showing a hardware configuration that can implement the image processing method and apparatus according to an embodiment of the present invention.

Preferred embodiments for carrying out the present invention are described in detail below with reference to the attached drawings. These embodiments are merely examples and do not limit the present invention.

According to one aspect of the present invention, an image processing method is provided. An exemplary flow of the image processing method according to an embodiment of the present invention is described below with reference to FIG. 1.

FIG. 1 is a flowchart of an exemplary flow of the image processing method according to an embodiment of the present invention. As shown in FIG. 1, the exemplary flow 100 of the image processing method may include the following steps.

Step S101: an image to be processed that contains text is input into a convolutional neural network (CNN) model for character recognition, and text features are extracted;
Step S103: the extracted text features are input into a recurrent neural network (RNN) model for semantic recognition, which is connected to the convolutional neural network model, and the text in the image to be processed is recognized.

Here, the convolutional neural network (CNN) model used in step S101 and the recurrent neural network (RNN) model used in step S103 are obtained by starting from an initial convolutional neural network model and an initial recurrent neural network model that are connected to each other and each have to-be-determined layers, searching a predefined candidate model space for each to-be-determined layer, and jointly training the connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

For example, the connected CNN model and RNN model obtained through the search and joint training described above, together with an optional transcription layer (for example, a connectionist temporal classification (CTC) layer), can form an overall convolutional recurrent neural network (CRNN) model. CRNN-based models are currently attracting attention as a technique for recognizing text or character sequences in images. Compared with conventional DCNN models, CRNN models have advantages that include using fewer parameters and occupying less storage space. However, because the CNN part of a conventional CRNN model is designed entirely by hand, it usually requires a large amount of prior knowledge and up-front preparation.

In contrast, the image processing method of this embodiment uses a CRNN model composed of a CNN model and an RNN model that are connected to each other and obtained by searching a predefined candidate model space and training end to end. Such a model greatly reduces manual intervention and therefore lowers the development cost of a text recognition method based on it.

In one preferred embodiment, the search and joint training can be performed with two optimization objectives, namely a loss function representing the probability of correctly recognizing the text and minimization of the overall complexity of the convolutional neural network model and the recurrent neural network model, to obtain the optimized convolutional neural network model and recurrent neural network model. The setting of the optimization objectives and the search and joint training are described in detail, using a concrete example, after the architecture of the connected CNN model and RNN model has been explained.

Because the search and joint training are performed as a multi-objective optimization over both the probability of correctly recognizing the text and the overall complexity of the model, the model adopted by the image processing method of this preferred embodiment accounts for recognition accuracy and for model size or computational cost at the same time. Compared with a typical recognition model that considers only recognition accuracy, the model architecture used in this embodiment, with its optimized overall complexity, has a relatively small computational cost or size and is therefore particularly suitable for mobile platforms and resource-constrained environments.
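As a non-limiting illustration of trading recognition accuracy against complexity, the two objectives can be folded into a single score. The weighted-sum form, the function name `combined_objective`, the `weight` parameter, and the candidate numbers below are all assumptions made for this sketch; the text above does not state how the two objectives are combined.

```python
def combined_objective(recognition_loss: float, complexity: float, weight: float = 0.1) -> float:
    """Scalarise two objectives into one value to minimise (assumed weighted-sum form)."""
    if weight < 0:
        raise ValueError("weight must be non-negative")
    return recognition_loss + weight * complexity


# Hypothetical candidates: made-up numbers, used only to show the trade-off.
candidates = {
    "large": {"loss": 0.20, "complexity": 9.0},
    "small": {"loss": 0.25, "complexity": 3.0},
}

# Pick the candidate with the lowest combined score.
best = min(
    candidates,
    key=lambda name: combined_objective(
        candidates[name]["loss"], candidates[name]["complexity"]
    ),
)
```

With a positive weight the smaller candidate wins despite its slightly higher loss; with `weight=0` the score reduces to the recognition loss alone.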

Next, with reference to FIGS. 2 to 5, the structure of the connected CNN model and RNN model applied in the exemplary method shown in FIG. 1, and the candidate model space and search operation used to obtain that model architecture, are described in more detail.

First, refer to FIG. 2. FIG. 2 is an explanatory diagram of the general architecture of the connected CNN model and RNN model used in the image processing method shown in FIG. 1. As shown in FIG. 2, the overall CRNN model 20 used in the exemplary method shown in FIG. 1 may include a CNN model 21, an RNN model 22, and preferably a CTC layer 23, connected in sequence. Methods that apply an overall CRNN model such as the one shown in FIG. 2 to the recognition of text in images already exist in the prior art. The principle, structure, and specific details of such a CRNN model can be understood by referring, for example, to the paper "An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition" by Baoguang Shi et al., published in IEEE Transactions on Pattern Analysis & Machine Intelligence in November 2017. For the convenience of the description that follows, the principle and structure of the overall model are therefore only briefly described below.

In the exemplary overall CRNN model 20 shown in FIG. 2, the CNN model 21 may have the basic structure of a general convolutional neural network. For example, it may include a plurality of convolutional layers, pooling layers, and the like, but it has no fully connected layer. The CNN model 21 is used for feature extraction. For example, it generates a feature map from an input image containing the text "SUBWAY" and, from the generated feature map, obtains a feature sequence suitable as input to a recurrent neural network, so that the feature sequence can be input to the RNN model 22. Because the feature sequence extracted by the CNN model 21 can represent the text in the image to be processed, the feature sequence is sometimes referred to here as the "text features".
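As a non-limiting illustration of obtaining a feature sequence from a feature map, the sketch below reads the map column by column, so that each column yields one feature vector. The column-wise, left-to-right reading follows the CRNN paper cited above and is an assumption here; the function name `map_to_sequence` and the `[channel][row][column]` indexing are likewise hypothetical.

```python
def map_to_sequence(feature_map):
    """Turn a feature map indexed [channel][row][col] into one vector per column."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        # Concatenate every channel and row value found at column w.
        [feature_map[c][h][w] for c in range(channels) for h in range(height)]
        for w in range(width)
    ]


# Toy feature map: 2 channels, 1 row, 3 columns.
fmap = [[[1, 2, 3]], [[10, 20, 30]]]
sequence = map_to_sequence(fmap)
```

The resulting sequence has one entry per column of the feature map, which is the form a recurrent network can consume step by step.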

Based on the feature sequence input from the CNN model 21 (that is, the "text features" described above), the RNN model 22 recognizes the text sequence and outputs a prediction result. The prediction result output by the RNN model 22 may be a label distribution for each feature vector in the input feature sequence, that is, a list of probabilities over the true results. The output of the RNN model 22 may contain noise, that is, label distributions corresponding to incorrect blanks, repeated characters, and the like. For ease of understanding, the example in FIG. 2 shows a noisy character sequence, "SSUBBWWAYY", corresponding to the sequence label distribution output by the RNN model 22. To remove such noise, a CTC layer 23 is provided after the RNN model 22 as a transcription layer, and the prediction of the RNN model 22 is converted into a final label sequence by operations such as removing repetitions and merging. As shown in FIG. 2, after processing by the CTC layer 23, a recognition result for the text sequence with the noise removed, for example "SUBWAY" (that is, the probability distribution corresponding to that label sequence), is obtained.
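As a non-limiting illustration of the repetition-removing step, the sketch below applies the usual CTC collapsing rule to a per-frame label string: consecutive repeats are merged first, then blank symbols are dropped. It assumes the most likely label has already been chosen for each frame (best-path decoding), and the blank symbol `-` and the function name `ctc_collapse` are assumptions made for this sketch.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol; not specified in the text above


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = (label for label, _ in groupby(frame_labels))
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("SSUBBWWAYY"))  # prints "SUBWAY"
```

A blank between two identical labels keeps them distinct, so `ctc_collapse("HEL-LO")` yields `"HELLO"` rather than `"HELO"`.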

Now that the overall CRNN model 20 shown in FIG. 2, which includes the CNN model 21, the RNN model 22, and the CTC layer 23 connected in sequence, has been described, the specific structure of the overall model, and the candidate model space and search operation used to obtain that structure, are described with reference to FIGS. 3 to 5.

First, refer to FIG. 3. FIG. 3 is an explanatory diagram of the specific structure of the overall CRNN model shown in FIG. 2. As shown in FIG. 3, the overall CRNN model may include a CNN model 21, an RNN model 22, and preferably a CTC layer 23, connected in sequence. In FIG. 3, the first to N-th layers of the CNN model 21 are drawn as unshaded solid-line boxes, the first to L-th layers of the RNN model 22 are drawn as shaded solid-line boxes, and the CTC layer 23 is drawn as a dotted-line box. Here, N and L are both predetermined natural numbers that indicate the number of layers in the CNN model 21 and the RNN model 22, respectively.

In one example, for the application of recognizing character sequences in images, an initial network architecture may be configured with N = 12 layers in the CNN model and L = 2 layers in the RNN model. Based on the content of the present invention, those skilled in the art can reasonably set specific values of N and L in advance through experiments, according to system requirements and factors such as complexity, accuracy, processing load, and computation speed; a detailed description is omitted here. Depending on the needs of the application, other necessary parameters of the overall network architecture may also be set, for example the number of channels in the first layer and in the last layer of the CNN model, which relate to the specific application or dataset (the number of channels in the intermediate layers can be obtained automatically by calculation). The initial CNN model 21 and the initial RNN model 22 can be constructed in this way. Those skilled in the art can likewise set the necessary parameters of the overall network architecture in advance through experiments, according to factors such as system requirements; a detailed description is omitted here.

In this embodiment, the first to N-th layers of the CNN model 21 and the first to L-th layers of the RNN model 22 may all be to-be-determined layers. In other words, for the overall CRNN model shown in FIG. 3, the operation or structure of each layer in the CNN model 21 and each layer in the RNN model 22 can be determined through, for example, search and end-to-end training from the input layer of the entire model architecture (that is, the first layer of the CNN model 21) to its output layer (for example, the CTC layer 23). Alternatively, the operation or structure of some layers in the CNN model 21 and the RNN model 22 may be fixed in advance, with the remaining layers treated as to-be-determined layers whose operation or structure is determined by search. Unlike the CNN model 21 and the RNN model 22, the structure of the CTC layer 23 may be determined manually in advance, for example from prior knowledge or experiments.

In one preferred embodiment, the predefined candidate model space may include, for one to-be-determined layer of the convolutional neural network model, a plurality of candidate convolution operations and candidate pooling operations. The operation of a to-be-determined layer of the convolutional neural network model can then be determined by performing the search and joint training over these candidate convolution operations and candidate pooling operations.

As an example, for one to-be-determined layer in the CNN model 21 shown in FIG. 3, the candidate convolution operations may include dilated convolution operations. A dilated convolution (a convolution "with holes", also known as an atrous convolution) enlarges the convolution kernel by inserting holes between its elements, which yields a larger field of view, or receptive field, and helps preserve the internal data structure of the image. For example, a dilated 3×3 convolution, a dilated 5×5 convolution, and the like can be used as candidate convolution operations.
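As a non-limiting illustration of how holes enlarge the kernel, the sketch below computes the effective side length of a dilated kernel with the standard relation k_eff = k + (k - 1)(d - 1), where d is the dilation rate. The dilation rate of 2 used in the example calls and the function name `effective_kernel_size` are assumptions; the text above does not specify a rate.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length covered by a k x k kernel whose taps are spaced `dilation` apart."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# With an assumed dilation rate of 2:
print(effective_kernel_size(3, 2))  # prints 5
print(effective_kernel_size(5, 2))  # prints 9
```

The number of weights stays at k × k in both cases, so the larger receptive field comes without additional parameters.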

As an example, for one to-be-determined layer in the CNN model 21 shown in FIG. 3, the candidate pooling operations may include 1×n pooling operations, where n is a predetermined natural number of 2 or more. The purpose of using a 1×n pooling operation with a rectangular window is to make the output of the CNN model 21 fit the "long" and "narrow" character of text sequences. A to-be-determined layer of the CNN model 21 therefore preferably uses a rectangular window rather than a conventional square window when pooling. This makes the width of the mapped features smaller, ensures that the final feature sequence is long and narrow, and matches the text sequence recognition performed by the subsequent RNN model 22. For example, 1×2 max pooling or average pooling, 1×3 max pooling or average pooling, and the like can be used as candidate pooling operations.
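As a non-limiting illustration of the difference between a square and a rectangular pooling window, the sketch below applies non-overlapping max pooling to a small map and compares the output shapes. Reading "1×n" as one row by n columns, using a stride equal to the window, and the function name `max_pool2d` are all assumptions made for this sketch.

```python
def max_pool2d(feature_map, window_h, window_w):
    """Non-overlapping max pooling over a 2-D list (stride equals the window)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r + dr][c + dc]
                for dr in range(window_h)
                for dc in range(window_w)
            )
            for c in range(0, cols - window_w + 1, window_w)
        ]
        for r in range(0, rows - window_h + 1, window_h)
    ]


# Toy map with 4 rows and 8 columns.
fmap = [[r * 8 + c for c in range(8)] for r in range(4)]
square = max_pool2d(fmap, 2, 2)  # 2 rows x 4 columns: both axes shrink
rect = max_pool2d(fmap, 1, 2)    # 4 rows x 4 columns: only one axis shrinks
```

The rectangular window downsamples along one axis only, which gives separate control over the two dimensions of the resulting feature map.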

In one example, the plurality of candidate convolution operations and candidate pooling operations that the predefined candidate model space contains for one to-be-determined layer of the CNN model 21 shown in FIG. 2 may include candidate convolution operations of different types and/or sizes, and/or candidate pooling operations of different types and/or sizes. For example, the candidate convolution operations may include a conventional 3×3 convolution, a conventional 5×5 convolution, a separable 3×3 convolution, a separable 5×5 convolution, a dilated 3×3 convolution, and a dilated 5×5 convolution, and the candidate pooling operations may include 1×2 max pooling, 1×2 average pooling, 1×3 max pooling, and 1×3 average pooling.
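As a non-limiting illustration, the candidate operations listed above can be written down as a small search space from which one operation is chosen per to-be-determined layer. The string identifiers and function names are hypothetical, and the uniform random sampling is only a stand-in used to show the data structure; it is not the search procedure of the embodiment.

```python
import random

# Hypothetical identifiers for the candidate operations listed above.
CANDIDATE_OPS = (
    "conv3x3", "conv5x5",
    "sep_conv3x3", "sep_conv5x5",
    "dil_conv3x3", "dil_conv5x5",
    "max_pool1x2", "avg_pool1x2",
    "max_pool1x3", "avg_pool1x3",
)


def sample_architecture(num_layers, rng):
    """Pick one candidate operation for each to-be-determined layer (random stand-in)."""
    return [rng.choice(CANDIDATE_OPS) for _ in range(num_layers)]


def search_space_size(num_layers):
    """Number of distinct operation assignments, ignoring any extra connections."""
    return len(CANDIDATE_OPS) ** num_layers


arch = sample_architecture(12, random.Random(0))
```

With ten candidates and twelve layers there are 10^12 possible operation assignments, which suggests why choosing the operations by hand is costly.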

In one preferred embodiment, in the initially configured convolutional neural network model, the input and the output of a to-be-determined layer may each be connected to an additional convolutional layer that implements a 1×1 convolution operation (in other words, the additional convolutional layers are set in advance and are not determined by search). As in a residual network (Residual Network), where 1×1 convolution operations are applied before and after a convolutional layer, this preferred embodiment uses the additional 1×1 convolutional layers connected to the input and the output of the to-be-determined layer to reduce the number of channels at both its input and its output. Having fewer parameters in turn helps reduce the complexity of the model.
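As a non-limiting illustration of why the 1×1 layers reduce the parameter count, the sketch below compares the number of weights in a single k×k convolution with a reduce, operate, expand arrangement. The channel counts 256 and 64, the expansion back to the original width after the searched layer (as in a residual bottleneck), and the function names are assumptions; biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Weight count of a standard convolution, ignoring biases."""
    return in_channels * out_channels * kernel_size * kernel_size


def direct_weights(channels, kernel_size):
    """A single k x k convolution applied at full channel width."""
    return conv_weights(channels, channels, kernel_size)


def bottleneck_weights(channels, reduced, kernel_size):
    """1x1 reduce, k x k at the reduced width, then 1x1 expand."""
    return (
        conv_weights(channels, reduced, 1)
        + conv_weights(reduced, reduced, kernel_size)
        + conv_weights(reduced, channels, 1)
    )


print(direct_weights(256, 3))          # prints 589824
print(bottleneck_weights(256, 64, 3))  # prints 69632
```

Under these assumed channel counts, the arrangement with 1×1 layers uses roughly one eighth of the weights of the direct 3×3 convolution.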

Optionally, in one example, the additional convolutional layers that implement the 1×1 convolution operation are connected to the to-be-determined layer through a nonlinear activation function, which helps increase the nonlinearity of the network. FIG. 4 shows an example of one to-be-determined layer in the CNN model 21 of FIG. 3 (for example, the i-th layer in the CNN model 21, i = 1, 2, 3, ..., N). The input and the output of the to-be-determined layer 201 are connected, through a batch normalization (BN) layer and a rectified linear unit (ReLU), to the additional convolutional layers 202a and 202b, respectively, which implement the 1×1 convolution operation.

In one preferred embodiment, for one to-be-determined layer of the convolutional neural network model, the predefined candidate model space may include, in addition to the candidate convolution operations and candidate pooling operations described above, candidate connections from that layer to each subsequent to-be-determined layer in the convolutional neural network model. By performing the search and joint training over these candidate connections, at least one connection from that layer to a subsequent to-be-determined layer in the convolutional neural network model can also be determined.

The example of FIG. 3 illustrates the candidate connections from the first and second layers of the CNN model 21 to the subsequent to-be-determined layers. For the i-th layer (i = 1, 2, 3, ..., N-2) as the current to-be-determined layer of the CNN model 21, the connection from that layer to the (i+1)-th layer is generally retained, and search and joint training further determine whether each candidate connection from that layer to the (i+2)-th, (i+3)-th, ..., N-th layers of the CNN model 21 should be retained. Retaining appropriate candidate connections gives a good balance between the processing capability and the processing load of the CNN model.
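As a small illustration of the size of this part of the search space, the sketch below enumerates the candidate connections (i, j) with j >= i + 2 for a CNN with N to-be-determined layers. The function name and the choice N = 5 are hypothetical; the connection from layer i to layer i+1 is assumed to be always kept and is therefore not listed.

```python
def candidate_skip_connections(n_layers: int) -> list[tuple[int, int]]:
    """List candidate connections (i, j) with j >= i + 2, layers numbered 1..N."""
    return [
        (i, j)
        for i in range(1, n_layers - 1)      # i = 1 .. N-2
        for j in range(i + 2, n_layers + 1)  # j = i+2 .. N
    ]

candidates = candidate_skip_connections(5)
print(candidates)
```

Each listed pair is a binary keep-or-drop decision for the search, so the number of decisions grows as (N-1)(N-2)/2.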

The specific structure of the convolutional neural network model, and the search performed for its to-be-determined layers within the predefined candidate model space, have been described above with reference to FIG. 3 and FIG. 4. Next, returning to FIG. 3 and referring also to FIG. 5, the specific structure of the recurrent neural network model, and the search performed for its to-be-determined layers within the predefined candidate model space, are described.

As shown in FIG. 3, the RNN model 22 may include L to-be-determined layers, and the structure and optimized parameters of each of these L layers can be determined by search and joint training.

In one preferred embodiment, the initial recurrent neural network model may have a predetermined number of nodes in each to-be-determined layer, and the predefined candidate model space may include a plurality of candidate activation functions for each to-be-determined layer of the recurrent neural network model. For each such layer, search and joint training among the candidate activation functions determine the activation function of each node in the layer, and also determine the connection relationships between the nodes.

FIG. 5 is a diagram for explaining one to-be-determined layer of the RNN model shown in FIG. 3. It shows a possible structure of such a layer. As shown in FIG. 5, six nodes, node 1 to node 6, are provided in the to-be-determined layer. For each node k, a plurality of candidate activation functions are provided, which may include, for example, the tanh (hyperbolic tangent) function, the ReLU function, the identity function, and the sigmoid function. Candidate connections are also provided from node k to each subsequent node, that is, node k+1, node k+2, ..., node 6 (k = 1, 2, ..., 6). Search and joint training determine the activation function of each node and the connection relationships between the nodes. In other words, search and joint training determine the activation function of each node k and determine which connection or connections, among all those shown in FIG. 5, are retained.
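One point of this candidate space can be represented as a choice of activation per node plus a set of retained connections. The sketch below draws such a point at random; it is a hedged illustration, not the search procedure of this description. The function name `sample_cell`, the random sampling, and the rule that every node except the last keeps at least one outgoing connection are assumptions made only for the example.

```python
import random

ACTIVATIONS = ["tanh", "relu", "identity", "sigmoid"]  # candidates named in the text
NUM_NODES = 6                                          # node 1 .. node 6 as in FIG. 5

def sample_cell(rng: random.Random) -> dict:
    """Draw one candidate structure for a to-be-determined RNN layer."""
    activations = {k: rng.choice(ACTIVATIONS) for k in range(1, NUM_NODES + 1)}
    connections = {}
    for k in range(1, NUM_NODES):                 # node 6 has no successor
        successors = list(range(k + 1, NUM_NODES + 1))
        keep = rng.randint(1, len(successors))    # assumption: keep at least one edge
        connections[k] = sorted(rng.sample(successors, keep))
    return {"activations": activations, "connections": connections}

cell = sample_cell(random.Random(0))
print(cell)
```

A search method would score many such structures and keep the one that best meets the optimization goal.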

Because a recurrent neural network generally uses the same structure in every layer, when the search and/or training described with reference to FIG. 5 is performed for the to-be-determined layers of the RNN model 22, the same structure may be searched for, and the same parameters used, for every layer of the RNN model 22.

An example of the specific structure of the overall model, which includes a convolutional neural network model and a recurrent neural network model connected to each other, and an example of the search operation for obtaining that overall model, have been described above with reference to FIG. 3 to FIG. 5. Building on that example, and referring to FIG. 6, one example of processing that obtains an optimized model through search and end-to-end training is described next.

FIG. 6 shows an exemplary flow 600 of processing for obtaining an optimized model. As shown in FIG. 6, the exemplary flow 600 may include the following steps.

Step S601: for an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers, search for each to-be-determined layer within the predefined candidate model space, thereby obtaining a preferred overall architecture of the convolutional neural network model and the recurrent neural network model connected to each other; and
Step S603: for the preferred overall architecture of the convolutional neural network model and the recurrent neural network model connected to each other, obtain the optimized parameters of that architecture by end-to-end joint training.

In the exemplary flow 600, step S601 optimizes the structure of the overall model and step S603 optimizes its specific parameters. By first optimizing the architecture of the model and then optimizing the parameters of the model under the optimal architecture, the exemplary flow 600 simplifies the processing. A person skilled in the art will nevertheless understand from this description that, given sufficient computing power, the architecture and the parameters of the model could in theory be optimized simultaneously. Accordingly, in the context of this specification, the specific operations, structures, and/or connections of to-be-determined layers that are said to be determined "by search and joint training" can in practice be obtained by the processing of step S601, and the corresponding optimized parameters can be obtained by the processing of the subsequent step S603.

Preferably, the optimization of the overall model structure in step S601 and the optimization of the parameters in step S603 use a unified criterion. That is, steps S601 and S603 can each perform their processing with the same optimization goal: minimizing both a loss function that indicates the probability of correctly recognizing text and the overall complexity of the convolutional neural network model and the recurrent neural network model.

As an example, in the search processing of step S601, for each layer of the CNN model 21 shown in FIG. 3, the operation of that layer is searched for among the plurality of candidate convolution operations and candidate pooling operations, and, optionally, at least one connection is searched for among the candidate connections from that layer to each subsequent to-be-determined layer of the CNN model 21. For each to-be-determined layer of the RNN model 22 shown in FIG. 3, the activation function of each node in that layer is searched for among the plurality of candidate activation functions, and the connection relationships between the nodes are determined.

For each possible overall structure of the interconnected CNN model 21 and RNN model 22 obtained by this search (for example, the overall CRNN structure shown in FIG. 3), the loss function indicating the probability that the overall structure correctly recognizes text, and the overall complexity of the convolutional neural network model and the recurrent neural network model, can be computed under initial or random parameters of that structure. In other words, the search for a preferred structure in step S601 does not change the parameter values of each possible overall structure and uses, for example, randomly determined initial parameter values. The loss function and the overall complexity are computed for each of the possible overall structures obtained by the search, and a structure for which both satisfy predetermined requirements can be selected as the preferred architecture of the overall model determined in step S601.

For example, an overall complexity threshold may be set in advance, and, among the possible structures whose overall complexity is below that threshold, the structure with the best value of the loss function indicating the probability of correctly recognizing text may be selected as the preferred architecture of the overall model determined in step S601. Alternatively, an overall optimization function that expresses the loss function and the overall complexity together may be constructed directly, and the preferred architecture of the overall model may be determined by solving that optimization function.
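The threshold-based selection can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Candidate` record, the function name, and all numeric values are hypothetical, and the loss and complexity of each candidate structure are assumed to have been computed already.

```python
from typing import NamedTuple, Optional

class Candidate(NamedTuple):
    name: str
    loss: float   # loss of the structure under its initial parameters
    flops: float  # overall complexity of the structure

def select_architecture(cands: list[Candidate], flop_threshold: float) -> Optional[Candidate]:
    """Among structures below the complexity threshold, pick the lowest loss."""
    feasible = [c for c in cands if c.flops < flop_threshold]
    return min(feasible, key=lambda c: c.loss) if feasible else None

candidates = [
    Candidate("A", loss=0.9, flops=4.0e8),
    Candidate("B", loss=0.7, flops=9.0e8),  # best loss, but too complex
    Candidate("C", loss=0.8, flops=5.0e8),
]
best = select_architecture(candidates, flop_threshold=6.0e8)
print(best)
```

Here structure B has the lowest loss but exceeds the threshold, so structure C is selected.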

As an example, given input images that contain text sequences and carry ground-truth labels, for an overall model m having the basic architecture shown in FIG. 3, the probability that the CTC layer 23 of FIG. 3 outputs the correct sequence label can be used to construct a loss function on the output of the CTC layer 23. This serves as the loss function LOSS(m) indicating the probability that the overall model correctly recognizes text, as in formula (1) below.

$$\mathrm{LOSS}(m) = -\ln \prod_{(x,z)\in S} p(z \mid x) \tag{1}$$

Here, p(z|x) denotes the probability of outputting the sequence z given the input x, and S is the training set. The loss function of formula (1) is the negative of the logarithm of the product of the probabilities of outputting the correct label for the given samples. The smaller this loss function, the higher the probability that the overall model, for example the one shown in FIG. 3, correctly recognizes the text sequence.
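The relation between the per-sample probabilities and the loss can be checked numerically. The sketch below assumes the natural logarithm and uses made-up probabilities; in practice p(z|x) would come from the CTC layer, which is not modeled here.

```python
import math

def sequence_loss(probs: list[float]) -> float:
    """Negative log of the product of per-sample probabilities p(z|x)."""
    # Summing logs is equivalent to the log of the product and avoids underflow.
    return -sum(math.log(p) for p in probs)

weak_model = [0.5, 0.4, 0.6]    # hypothetical p(z|x) for three training samples
strong_model = [0.9, 0.8, 0.95]

print(sequence_loss(weak_model), sequence_loss(strong_model))
```

The model that assigns higher probability to the correct labels obtains the smaller loss, which matches the statement above.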

The overall complexity of the overall model m, which is built from the CNN model 21, the RNN model 22, and the CTC layer 23, can also be computed by various conventional methods. In this example, because the computational cost of the convolution operations in the CNN model 21 dominates the computational cost of the whole model m, the computational cost FLOP(m) of the convolution operations in the CNN model 21 can be used to represent the overall complexity of the model m.

As an example, when the i-th convolutional layer in the CNN model 21 has C_in input channels and C_out output channels, a convolution kernel of size w*w (w being a natural number), and a feature map of size H*W (H and W being natural numbers), the computational cost of that convolutional layer is given by formula (2).

When the convolution operation of the convolutional layer concerned is a separable convolution, the computational cost of that layer can be modified as in formula (2').

Based on the computational cost of each convolutional layer obtained by formula (2) or (2'), the overall complexity FLOP(m) of the overall model m can be obtained by summing the computational costs of all convolutional layers, as in formula (3) below, where FLOP_i denotes the computational cost of the i-th convolutional layer.

$$\mathrm{FLOP}(m) = \sum_{i} \mathrm{FLOP}_i \tag{3}$$
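Formulas (2) and (2') are not reproduced in this text. The sketch below therefore uses the commonly used multiply-accumulate counts as an assumption: H*W*C_in*C_out*w*w for an ordinary convolution, and a depthwise plus pointwise count for a separable one. The exact expressions of the original formulas may differ (for example by a constant factor), and the layer sizes are hypothetical.

```python
def conv_flops(c_in: int, c_out: int, w: int, h_map: int, w_map: int) -> int:
    """Assumed cost of an ordinary w x w convolution on an H x W feature map."""
    return h_map * w_map * c_in * c_out * w * w

def separable_conv_flops(c_in: int, c_out: int, w: int, h_map: int, w_map: int) -> int:
    """Assumed cost of a depthwise w x w step followed by a pointwise 1 x 1 step."""
    depthwise = h_map * w_map * c_in * w * w
    pointwise = h_map * w_map * c_in * c_out
    return depthwise + pointwise

def total_flops(layers: list[dict]) -> int:
    """Sum the per-layer costs over all convolutional layers."""
    total = 0
    for layer in layers:
        fn = separable_conv_flops if layer["separable"] else conv_flops
        total += fn(layer["c_in"], layer["c_out"], layer["w"], layer["h_map"], layer["w_map"])
    return total

# Hypothetical two-layer CNN on a 32 x 100 feature map.
layers = [
    {"c_in": 64, "c_out": 128, "w": 3, "h_map": 32, "w_map": 100, "separable": False},
    {"c_in": 128, "c_out": 128, "w": 3, "h_map": 32, "w_map": 100, "separable": True},
]
print(total_flops(layers))
```

Under these assumptions the separable layer costs far less than an ordinary layer of the same shape, which is why the description treats the two cases separately.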

In one example, based on the loss function LOSS(m) and the overall complexity FLOP(m) obtained as described above, an overall optimization function such as formula (4) can be constructed.

In formula (4), FLOP₀ and ω are constants set in advance according to the specific application. In one example, FLOP₀ may be set to the target value of the expected complexity of the overall model, and ω may be set to 0.07.
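Formula (4) itself is not reproduced in this text. Purely for illustration, the sketch below assumes a multiplicative form, LOSS(m) * (FLOP(m) / FLOP₀)^ω, which combines the two quantities and uses the two constants named above. This form is an assumption, not the formula of this description, and the numeric values are hypothetical.

```python
def overall_objective(loss: float, flops: float, flop_target: float, omega: float = 0.07) -> float:
    """Assumed combined objective (to be minimized); not the original formula (4)."""
    return loss * (flops / flop_target) ** omega

FLOP_TARGET = 5.0e8  # hypothetical target complexity FLOP_0

on_target = overall_objective(loss=0.8, flops=5.0e8, flop_target=FLOP_TARGET)
too_heavy = overall_objective(loss=0.8, flops=1.0e9, flop_target=FLOP_TARGET)
print(on_target, too_heavy)
```

With this assumed form, a model that meets the complexity target is scored by its loss alone, and a heavier model with the same loss is penalized mildly because ω is small.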

Preferably, a reinforcement learning method is used with the goal of minimizing an overall optimization function such as formula (4), and the search is performed iteratively over each to-be-determined layer of the CNN model 21 and each to-be-determined layer of the RNN model 22 shown in FIG. 3, thereby determining the optimal overall model architecture. Based on this description, a person skilled in the art can find the optimal solution of an overall optimization function such as formula (4) by various conventional optimization methods, for example reinforcement learning, and thereby determine the optimized overall model architecture.

In the preferred embodiment described above, a loss function representing the probability that the overall model correctly recognizes text is used as one part of the optimization goal, instead of an accuracy rate that only represents the correctness of predictions. This helps reinforcement learning determine a better system structure. In addition, an overall complexity based on computational cost, rather than the inference time on a platform, is adopted as the other part of the optimization goal. The optimization process is therefore not affected by the platform on which the model runs, and a model learned on one platform can readily be transferred to another platform.

Of course, a person skilled in the art will understand from this description that, depending on the specific application or task, a different overall optimization function or optimization goal may be constructed by using an accuracy rate ACC(m) indicating the correctness of predictions in place of the loss function LOSS(m), and/or by using the latency T(m) of the overall model in place of the overall complexity FLOP(m) of the convolutional neural network model and the recurrent neural network model. For example, as an alternative, an overall optimization function such as formula (4') may be obtained.

In formula (4'), T₀ and ω are constants set in advance according to the specific application. In one example, T₀ may be set to the target value of the expected latency of the overall model, and ω may be set to 0.07.

In an alternative optimization scheme, no overall optimization function such as formula (4) or (4') is used. Instead, a loss function such as formula (1) is taken directly as the first optimization goal and an overall complexity such as formula (3) as the second optimization goal, and the optimal solution for the two goals is determined by a method for solving multi-objective optimization problems, such as a Pareto-optimal solution algorithm, thereby determining the optimized overall architecture. Given two optimization goals, a person skilled in the art may adopt any of various methods for solving multi-objective optimization problems to determine the optimal solution, and a detailed description is omitted here.
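The idea of Pareto optimality for two goals can be sketched as follows. This is a generic brute-force illustration, not the specific algorithm of this description; the function name and the (loss, complexity) values are hypothetical, and both goals are assumed to be minimized.

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points not dominated by any other (both objectives minimized)."""
    front = []
    for i, (loss_i, flops_i) in enumerate(points):
        dominated = any(
            loss_j <= loss_i and flops_j <= flops_i
            and (loss_j < loss_i or flops_j < flops_i)
            for j, (loss_j, flops_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((loss_i, flops_i))
    return front

# Hypothetical (loss, complexity) pairs for four candidate architectures.
points = [(0.9, 4.0e8), (0.7, 9.0e8), (0.8, 5.0e8), (0.95, 6.0e8)]
front = pareto_front(points)
print(front)
```

The last candidate is worse than another candidate on both goals and is discarded; the remaining three are trade-offs among which a final architecture would be chosen.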

The processing for optimizing the overall model architecture performed in step S601 has been described above. After the optimized architecture of the overall model has been obtained by the processing of step S601, step S603 can use the same optimization function or optimization goal to optimize the specific parameters of the model by end-to-end joint training of that optimized architecture.

Specifically, with the goal of minimizing an overall optimization function such as formula (4), joint training is performed end to end on the whole model, which consists of the CNN model 21 whose layer operations and connections have already been determined and the RNN model 22 whose layer structures have already been determined (together with the CTC layer 23 preferably connected to the RNN model 22), to obtain the optimal value of each parameter of the whole model. When the model architecture has been determined and the overall optimization function is given, a person skilled in the art may implement such a training process by various conventional methods, and a detailed description is omitted here.

Exemplary processing for obtaining the overall model architecture used in, for example, the image processing method shown in FIG. 1 has been described above on the basis of the flowchart of FIG. 6 and with reference to the specific examples of FIG. 3 to FIG. 5. The exemplary processing of FIG. 6 reduces manual intervention and yields a model architecture with good performance. Moreover, because the optimized architecture is obtained by search rather than by fully manual design, a corresponding optimized model can easily be obtained for different tasks and/or data sets. That is, a model obtained by this method can easily be transferred to different tasks and/or data.

In addition, because search and joint training are performed in a multi-objective manner that optimizes both the probability of correctly recognizing text and the overall complexity of the model, the model obtained by the exemplary processing of FIG. 6 takes into account recognition accuracy and model size or computational cost at the same time. Compared with a typical recognition model that considers only recognition accuracy, the model obtained here also has an optimized overall complexity, so its computational cost or size is relatively small, which makes it particularly suitable for mobile platforms and resource-constrained environments.

According to another aspect of the present invention, an image processing apparatus is provided. FIG. 7 is a block diagram showing one exemplary structure of an image processing apparatus according to an embodiment of the present invention.

As shown in FIG. 7, the image processing apparatus 700 may include the following units.

Feature extraction unit 701: inputs an image to be processed that contains text into a convolutional neural network (CNN) model for character recognition and extracts text features; and
Text recognition unit 702: inputs the extracted text features into a recurrent neural network (RNN) model for semantic recognition that is connected to the convolutional neural network model, and recognizes the text in the image to be processed.

Here, the convolutional neural network (CNN) model used by the feature extraction unit 701 and the recurrent neural network (RNN) model used by the text recognition unit 702 are obtained, on the basis of an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, by searching for each to-be-determined layer within a predefined candidate model space and by jointly training, end to end, the convolutional neural network model and the recurrent neural network model connected to each other.

The image processing apparatus and its units described above can perform, for example, the operations and/or processing of the image processing method and its steps described with reference to FIG. 1, and achieve similar effects, so a detailed description is omitted here. The CNN model and the RNN model used by the image processing apparatus can be obtained by the exemplary processing for optimizing the model architecture described with reference to FIG. 6.

According to another aspect of the present invention, an information processing apparatus is provided. The information processing apparatus can implement the image processing method according to the embodiments of the present invention and includes a processor. The processor is configured to input an image to be processed that contains text into a convolutional neural network model for character recognition and extract text features, and to input the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model and recognize the text in the image to be processed. Here, the convolutional neural network model and the recurrent neural network model are obtained, on the basis of an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, by searching for each to-be-determined layer within a predefined candidate model space and by jointly training, end to end, the convolutional neural network model and the recurrent neural network model connected to each other.

The processor of the information processing apparatus may be configured to perform, for example, the operations and/or processing of the image processing method and its steps described with reference to FIG. 1 and to achieve similar effects, and a detailed description is omitted here. The CNN model and the RNN model used by the processor may be obtained by the exemplary processing for optimizing the model architecture described with reference to FIG. 6.

Optionally, the processor may be configured to perform the search and joint training with the optimization goal of minimizing both a loss function that indicates the probability of correctly recognizing text and the overall complexity of the convolutional neural network model and the recurrent neural network model, thereby obtaining the optimized convolutional neural network model and recurrent neural network model.

As an example, the predefined candidate model space may include, for a to-be-determined layer of the convolutional neural network model, a plurality of candidate convolution operations and candidate pooling operations, and the processor may be configured to determine the operation of that layer by performing search and joint training among the plurality of candidate convolution operations and candidate pooling operations.

As an example, the predefined candidate model space may further include, for a to-be-determined layer of the convolutional neural network model, candidate connections from that layer to each subsequent to-be-determined layer in the convolutional neural network model, and the processor may be further configured to perform search and joint training among those candidate connections and determine at least one connection from that layer to a subsequent to-be-determined layer in the convolutional neural network model.

Optionally, in the convolutional neural network model, the input and the output of a to-be-determined layer are each connected to an additional convolutional layer that performs a 1*1 convolution operation.

Optionally, the candidate convolution operations include a dilated (atrous) convolution operation.

Optionally, the candidate pooling operations include a 1*n pooling operation, where n is a predetermined natural number greater than or equal to 2.

Optionally, the plurality of candidate convolution operations and candidate pooling operations include candidate convolution operations of different types and/or different sizes, and/or candidate pooling operations of different types and/or different sizes.

As an example, the initial recurrent neural network model has a predetermined number of nodes in each to-be-determined layer, the predefined candidate model space includes a plurality of candidate activation functions for each to-be-determined layer of the recurrent neural network model, and the processor may be further configured to perform, for each to-be-determined layer of the recurrent neural network model, search and joint training among the plurality of candidate activation functions, determine the activation function of each node in that layer, and also determine the connection relationships between the nodes.

FIG. 8 is a structural diagram of a hardware configuration (general-purpose machine) 800 capable of implementing the information processing method and apparatus according to the embodiments of the present invention.

The general-purpose machine 800 may be, for example, a computer system. The general-purpose machine 800 is merely an example and does not limit the scope of application or the functions of the method and apparatus according to the present invention. Nor does the general-purpose machine 800 depend on any module or assembly of the above method and apparatus, or on any combination thereof.

In FIG. 8, a central processing unit (CPU) 801 performs various processes in accordance with a program stored in a ROM 802 or a program loaded from a storage unit 808 into a RAM 803. The RAM 803 also stores, as needed, data required when the CPU 801 performs the various processes. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output interface 805 is also connected to the bus 804.

The following components are also connected to the input/output interface 805: an input unit 806 including a keyboard and the like; an output unit 807 including a display such as a liquid crystal display (LCD), a speaker, and the like; a storage unit 808 including a hard disk and the like; and a communication unit 809 including a network interface card such as a LAN card, a modem, and the like. The communication unit 809 performs communication processing via a network such as the Internet or a LAN. A drive 810 may be connected to the input/output interface 805 as needed. A removable medium 811, such as a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from the medium can be installed in the storage unit 808.

The present invention further provides a program product containing machine-readable instruction code. When read and executed by a machine, the instruction code can carry out the method according to the embodiments of the present invention described above. Accordingly, the various storage media that carry such a program product, such as magnetic disks (including floppy disks (registered trademark)), optical disks (including CD-ROMs and DVDs), magneto-optical disks (including MD (registered trademark)), and semiconductor memories, are also included in the present invention.

The above storage media may include, but are not limited to, magnetic disks, optical disks, magneto-optical disks, semiconductor memories, and the like.

Each operation (process) in the above method can also be implemented in the form of a computer-executable program stored in various machine-readable storage media.

In addition, regarding the above embodiments and the like, the following appendices are further disclosed.

(Appendix 1)
An image processing method comprising:
inputting an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and
inputting the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model, to recognize the text in the image to be processed,
wherein the convolutional neural network model and the recurrent neural network model are obtained by, based on an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, performing a search in a predefined candidate model space for each to-be-determined layer and performing joint training on the mutually connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

(Appendix 2)
The image processing method according to Appendix 1,
wherein the optimized convolutional neural network model and recurrent neural network model are obtained by performing the search and joint training with, as optimization objectives, a loss function indicating the probability of correctly recognizing the text and minimization of the overall complexity of the convolutional neural network model and the recurrent neural network model.

(Appendix 3)
The image processing method according to Appendix 1 or 2,
wherein the predefined candidate model space includes, for one to-be-determined layer of the convolutional neural network model, a plurality of candidate convolution operations and candidate pooling operations, and
for one to-be-determined layer of the convolutional neural network model, the operation of that layer is determined by performing the search and joint training among the plurality of candidate convolution operations and candidate pooling operations.

(Appendix 4)
The image processing method according to Appendix 3,
wherein the predefined candidate model space further includes, for one to-be-determined layer of the convolutional neural network model, candidate connections from that layer to each subsequent to-be-determined layer in the convolutional neural network model, and
for one to-be-determined layer of the convolutional neural network model, at least one connection from that layer to a subsequent to-be-determined layer in the convolutional neural network model is obtained by performing the search and joint training among the candidate connections.

(Appendix 5)
The image processing method according to Appendix 3 or 4,
wherein, in the convolutional neural network model, the input and the output of one to-be-determined layer are each connected to an additional convolutional layer for implementing a 1*1 convolution operation.

(Appendix 6)
The image processing method according to Appendix 3,
wherein the candidate convolution operations include a dilated convolution operation, and/or
the candidate pooling operations include a 1*n pooling operation, where n is a predetermined natural number greater than or equal to 2.

(Appendix 7)
The image processing method according to Appendix 3,
wherein the plurality of candidate convolution operations and candidate pooling operations include:
candidate convolution operations of different types and/or different sizes; and/or
candidate pooling operations of different types and/or different sizes.

(Appendix 8)
The image processing method according to Appendix 1 or 2,
wherein the initial recurrent neural network model has a predetermined number of nodes for each to-be-determined layer,
the predefined candidate model space includes a plurality of candidate activation functions for each to-be-determined layer of the recurrent neural network model, and
for each to-be-determined layer of the recurrent neural network model, the activation function of each node in the layer and the connection relationships between the nodes are determined by performing the search and joint training among the plurality of candidate activation functions.

(Appendix 9)
An information processing apparatus comprising a processor,
wherein the processor is configured to:
input an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and
input the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model, to recognize the text in the image to be processed, and
the convolutional neural network model and the recurrent neural network model are obtained by, based on an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, performing a search in a predefined candidate model space for each to-be-determined layer and performing joint training on the mutually connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

(Appendix 10)
The information processing apparatus according to Appendix 9,
wherein the processor is further configured to
obtain the optimized convolutional neural network model and recurrent neural network model by performing the search and joint training with, as optimization objectives, a loss function indicating the probability of correctly recognizing the text and minimization of the overall complexity of the convolutional neural network model and the recurrent neural network model.

(Appendix 11)
The information processing apparatus according to Appendix 9 or 10,
wherein the predefined candidate model space includes, for one to-be-determined layer of the convolutional neural network model, a plurality of candidate convolution operations and candidate pooling operations, and
the processor is further configured to
perform, for one to-be-determined layer of the convolutional neural network model, the search and joint training among the plurality of candidate convolution operations and candidate pooling operations to determine the operation of that layer.

(Appendix 12)
The information processing apparatus according to Appendix 11,
wherein the predefined candidate model space further includes, for one to-be-determined layer of the convolutional neural network model, candidate connections from that layer to each subsequent to-be-determined layer in the convolutional neural network model, and
the processor is further configured to
perform, for one to-be-determined layer of the convolutional neural network model, the search and joint training among the candidate connections to determine at least one connection from that layer to a subsequent to-be-determined layer in the convolutional neural network model.

(Appendix 13)
The information processing apparatus according to Appendix 11 or 12,
wherein, in the convolutional neural network model, the input and the output of one to-be-determined layer are each connected to an additional convolutional layer for implementing a 1*1 convolution operation.

(Appendix 14)
The information processing apparatus according to Appendix 11,
wherein the candidate convolution operations include a dilated convolution operation, and/or
the candidate pooling operations include a 1*n pooling operation, where n is a predetermined natural number greater than or equal to 2.

(Appendix 15)
The information processing apparatus according to Appendix 11,
wherein the plurality of candidate convolution operations and candidate pooling operations include:
candidate convolution operations of different types and/or different sizes; and/or
candidate pooling operations of different types and/or different sizes.

(Appendix 16)
The information processing apparatus according to Appendix 9 or 10,
wherein the initial recurrent neural network model has a predetermined number of nodes for each to-be-determined layer,
the predefined candidate model space includes a plurality of candidate activation functions for each to-be-determined layer of the recurrent neural network model, and
the processor is further configured to
determine, for each to-be-determined layer of the recurrent neural network model, the activation function of each node in the layer and the connection relationships between the nodes by performing the search and joint training among the plurality of candidate activation functions.

(Appendix 17)
A storage medium storing machine-readable instruction code,
wherein the instruction code, when read and executed by a machine, causes the machine to execute an image processing method, the image processing method comprising:
inputting an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and
inputting the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model, to recognize the text in the image to be processed,
wherein the convolutional neural network model and the recurrent neural network model are obtained by, based on an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, performing a search in a predefined candidate model space for each to-be-determined layer and performing joint training on the mutually connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

Although preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and any modification that does not depart from the gist of the present invention falls within the technical scope of the present invention.

Claims

An image processing method comprising:
inputting an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features; and inputting the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model, to recognize the text in the image to be processed,
wherein the convolutional neural network model and the recurrent neural network model are obtained by, based on an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, performing a search in a predefined candidate model space for each to-be-determined layer and performing joint training on the mutually connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

The image processing method according to claim 1,
wherein the optimized convolutional neural network model and recurrent neural network model are obtained by performing the search and joint training with, as optimization objectives, a loss function indicating the probability of correctly recognizing the text and minimization of the overall complexity of the convolutional neural network model and the recurrent neural network model.

The image processing method according to claim 1 or 2,
wherein the predefined candidate model space includes, for one to-be-determined layer of the convolutional neural network model, a plurality of candidate convolution operations and candidate pooling operations, and
for one to-be-determined layer of the convolutional neural network model, the operation of that layer is determined by performing the search and joint training among the plurality of candidate convolution operations and candidate pooling operations.

The image processing method according to claim 3,
wherein the predefined candidate model space further includes, for one to-be-determined layer of the convolutional neural network model, candidate connections from that layer to each subsequent to-be-determined layer in the convolutional neural network model, and
for one to-be-determined layer of the convolutional neural network model, at least one connection from that layer to a subsequent to-be-determined layer in the convolutional neural network model is obtained by performing the search and joint training among the candidate connections.

The image processing method according to claim 3,
wherein, in the convolutional neural network model, the input and the output of one to-be-determined layer are each connected to an additional convolutional layer for implementing a 1*1 convolution operation.

The image processing method according to claim 3,
wherein the candidate convolution operations include a dilated convolution operation, and/or the candidate pooling operations include a 1*n pooling operation, where n is a predetermined natural number greater than or equal to 2.

The image processing method according to claim 3,
wherein the plurality of candidate convolution operations and candidate pooling operations include
candidate convolution operations of different types and/or different sizes, and/or candidate pooling operations of different types and/or different sizes.

The image processing method according to claim 1 or 2,
wherein the initial recurrent neural network model has a predetermined number of nodes for each to-be-determined layer,
the predefined candidate model space includes a plurality of candidate activation functions for each to-be-determined layer of the recurrent neural network model, and
for each to-be-determined layer of the recurrent neural network model, the activation function of each node in the layer and the connection relationships between the nodes are determined by performing the search and joint training among the plurality of candidate activation functions.

An information processing apparatus comprising a processor,
wherein the processor is configured to
input an image to be processed that contains text into a convolutional neural network model for character recognition to extract text features, and input the extracted text features into a recurrent neural network model for semantic recognition that is connected to the convolutional neural network model, to recognize the text in the image to be processed, and
the convolutional neural network model and the recurrent neural network model are obtained by, based on an initial convolutional neural network model having to-be-determined layers and an initial recurrent neural network model having to-be-determined layers that are connected to each other, performing a search in a predefined candidate model space for each to-be-determined layer and performing joint training on the mutually connected convolutional neural network model and recurrent neural network model in an end-to-end manner.

A storage medium in which a program for executing the image processing method according to any one of claims 1 to 8 is stored.